OpenStack client hangs forever after simultaneous add/delete controller

Bug #1490523 reported by Andrey Sledzinskiy
Affects              Status        Importance  Assigned to   Milestone
Fuel for OpenStack   Fix Released  Critical    Dmitry Ilyin
7.0.x                Fix Released  Critical    Dmitry Ilyin

Bug Description

Deployment timed out after simultaneous add/delete controller

Steps:
1. Create the following cluster: HA, Ubuntu, Neutron VLAN, 3 controllers, 1 compute
2. Deploy cluster
3. After deployment delete primary controller (node-1) and add 1 controller
4. Start re-deployment

Actual result: deployment failed with a timeout; node-2 was marked as error

{
    "build_id": "257",
    "build_number": "257",
    "release_versions": {
        "2015.1.0-7.0": {
            "VERSION": {
                "build_id": "257",
                "build_number": "257",
                "api": "1.0",
                "fuel-library_sha": "bc04a7092d92400c79e6ea6ede25e7b67c6a6355",
                "nailgun_sha": "3189ccfb8c1dac888e351f535b03bdbc9d392406",
                "feature_groups": [
                    "mirantis"
                ],
                "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd",
                "openstack_version": "2015.1.0-7.0",
                "fuel-agent_sha": "1e8f38bbb864ed99aa8fe862b6367e82afec3263",
                "production": "docker",
                "python-fuelclient_sha": "9643fa07f1290071511066804f962f62fe27b512",
                "astute_sha": "53c86cba593ddbac776ce5a3360240274c20738c",
                "fuel-ostf_sha": "644db51186dc23c9b27e9b5486c120c8363dc87c",
                "release": "7.0",
                "fuelmain_sha": "0e54d68392b359bc122e5bbba9249c729eeaf579"
            }
        }
    },
    "auth_required": true,
    "api": "1.0",
    "fuel-library_sha": "bc04a7092d92400c79e6ea6ede25e7b67c6a6355",
    "nailgun_sha": "3189ccfb8c1dac888e351f535b03bdbc9d392406",
    "feature_groups": [
        "mirantis"
    ],
    "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd",
    "openstack_version": "2015.1.0-7.0",
    "fuel-agent_sha": "1e8f38bbb864ed99aa8fe862b6367e82afec3263",
    "production": "docker",
    "python-fuelclient_sha": "9643fa07f1290071511066804f962f62fe27b512",
    "astute_sha": "53c86cba593ddbac776ce5a3360240274c20738c",
    "fuel-ostf_sha": "644db51186dc23c9b27e9b5486c120c8363dc87c",
    "release": "7.0",
    "fuelmain_sha": "0e54d68392b359bc122e5bbba9249c729eeaf579"
}

Logs are attached

Environment is available for debugging

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
tags: added: life-cycle-management
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Stanislaw Bogatkin (sbogatkin)
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

Also the same problem occurs when adding 2 controllers to the cluster:
1. Deploy Neutron tun cluster with 1 controller
2. After deployment add 2 controllers

Actual result: deployment failed with a timeout

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Changed in fuel:
status: New → Confirmed
Revision history for this message
Dmitry Ilyin (idv1985) wrote :

2015-08-31T04:25:28 info: [655] 6893236e-3408-4940-b582-60377bad81db: run task '---
priority: 4200
type: puppet
uids:
- '2'
parameters:
  puppet_modules: "/etc/puppet/modules"
  puppet_manifest: "/etc/puppet/modules/osnailyfacter/modular/keystone/keystone.pp"
  timeout: 3600
  cwd: "/"' on node 2
2015-08-31T05:25:30 info: [655] Casting message to Nailgun: {"method"=>"deploy_resp", "args"=>{"task_uuid"=>"6893236e-3408-4940-b582-60377bad81db", "nodes"=>[{"uid"=>"2", "status"=>"error", "role"=>"primary-controller", "error_type"=>"deploy"}]}}
2015-08-31T05:25:30 info: [655] Casting message to Nailgun: {"method"=>"deploy_resp", "args"=>{"task_uuid"=>"6893236e-3408-4940-b582-60377bad81db", "status"=>"error", "error"=>"Timeout of deployment is exceeded."}}
2015-08-31T05:25:54 info: [659] Processing RPC call 'dump_environment'

The keystone.pp task has hung on the primary controller for an hour

Dmitry Ilyin (idv1985)
Changed in fuel:
assignee: Stanislaw Bogatkin (sbogatkin) → Dmitry Ilyin (idv1985)
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

Raised to critical because it keeps reproducing
https://bugs.launchpad.net/fuel/+bug/1491319

Changed in fuel:
importance: High → Critical
Revision history for this message
Sergey Kolekonov (skolekonov) wrote :

It seems I've also run into this problem. I restarted the Puppet manifests on the existing controller and the keystone task hung.
I observe segfaults in the apache logs (/var/log/apache2/error.log):

[Wed Sep 02 10:30:57.391828 2015] [core:notice] [pid 11480:tid 139722188437376] AH00052: child pid 6658 exit signal Segmentation fault (11)
[Wed Sep 02 10:30:57.391835 2015] [core:error] [pid 11480:tid 139722188437376] AH00546: no record of generation 0 of exiting child 6658

There's no snapshot because it's a custom ISO

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/219668

Revision history for this message
Vladimir Kuklin (vkuklin) wrote : Re: Keystone admin endpoint hangs under wsgi forever

It seems that the issue is that the openstack command hangs forever when the Keystone admin endpoint runs under Apache mod_wsgi.

The apache log for the admin endpoint contains the following errors:

[Wed Sep 02 12:48:37.402535 2015] [core:error] [pid 13350:tid 140616465893120] [client 10.109.2.7:43021] End of script output before headers: admin
[Wed Sep 02 12:48:37.402995 2015] [core:error] [pid 13350:tid 140616398751488] [client 10.109.2.4:35605] End of script output before headers: admin
[Wed Sep 02 12:48:37.403121 2015] [core:error] [pid 13349:tid 140616432322304] [client 10.109.2.4:35618] End of script output before headers: admin
[Wed Sep 02 12:48:37.403238 2015] [core:error] [pid 13350:tid 140616449107712] [client 10.109.2.4:35632] End of script output before headers: admin
[Wed Sep 02 12:48:37.403352 2015] [core:error] [pid 13349:tid 140616340002560] [client 10.109.2.4:35648] End of script output before headers: admin
[Wed Sep 02 12:48:37.403495 2015] [core:error] [pid 13350:tid 140616440715008] [client 10.109.2.4:35665] End of script output before headers: admin
[Wed Sep 02 12:48:37.403537 2015] [core:error] [pid 13350:tid 140616610293504] [client 10.109.2.4:35681] End of script output before headers: admin
[Wed Sep 02 12:48:37.403576 2015] [core:error] [pid 13349:tid 140616356787968] [client 10.109.2.4:35693] End of script output before headers: admin
[Wed Sep 02 12:48:37.403615 2015] [core:error] [pid 13349:tid 140616491071232] [client 10.109.2.4:35709] End of script output before headers: admin
[Wed Sep 02 12:48:37.403653 2015] [core:error] [pid 13350:tid 140616381966080] [client 10.109.2.7:44125] End of script output before headers: admin
[Wed Sep 02 12:48:37.403767 2015] [core:error] [pid 13349:tid 140616593508096] [client 10.109.2.4:35723] End of script output before headers: admin
[Wed Sep 02 12:48:37.403809 2015] [core:error] [pid 13349:tid 140616610293504] [client ::1:57343] End of script output before headers: admin
[Wed Sep 02 12:48:37.403885 2015] [core:error] [pid 13349:tid 140616381966080] [client 10.109.2.4:35742] End of script output before headers: admin
[Wed Sep 02 12:48:37.404301 2015] [core:error] [pid 13350:tid 140616432322304] [client 10.109.2.7:43085] End of script output before headers: admin
[Wed Sep 02 12:48:37.404367 2015] [core:error] [pid 13350:tid 140616373573376] [client 10.109.2.7:43128] End of script output before headers: admin

summary: - Deployment timed out after simultaneous add/delete controller
+ Keystone admin endpoint hangs under wsgi forever
Changed in fuel:
assignee: Dmitry Ilyin (idv1985) → MOS Keystone (mos-keystone)
Changed in fuel:
assignee: MOS Keystone (mos-keystone) → Alexander Makarov (amakarov)
Revision history for this message
Alexander Makarov (amakarov) wrote :

The notification system hangs if [oslo_messaging_rabbit] rabbit_hosts contains only unreachable hosts

Changed in fuel:
assignee: Alexander Makarov (amakarov) → MOS Oslo (mos-oslo)
Boris Bobrov (bbobrov)
summary: - Keystone admin endpoint hangs under wsgi forever
+ Keystone hangs forever after simultaneous add/delete controller
summary: - Keystone hangs forever after simultaneous add/delete controller
+ OpenStack client hangs forever after simultaneous add/delete controller
Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Dmitry Ilyin (idv1985)
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

RCA:

It seems the issue happened because the openstack client was called while the keystone wsgi application was not yet ready to answer it. This means we need to add timeouts to the puppet provider.

Status:

The fix is on review; we are verifying it and should merge it by 05.09.2015

Changed in fuel:
assignee: Dmitry Ilyin (idv1985) → Matthew Mosesohn (raytrac3r)
status: Confirmed → In Progress
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Here is RCA from the OpenStack point of view:

To reproduce the issue with just OpenStack, without deployment modifications:

1. Install a MOS environment with Ceilometer. The deployment should have only one controller to ease the further steps. Ceilometer is needed because it triggers Fuel to populate the [DEFAULT]/notification_driver parameter in /etc/keystone/keystone.conf
2. Firewall RabbitMQ so that keystone cannot reach it. You can also optionally add invalid hosts to the rabbit_hosts parameter in keystone.conf.
3. Restart apache
4. Try to use any keystone command in the regular way, for instance 'openstack user list'. The command will either hang or return 503 after some time.
5. If you un-firewall the RabbitMQ node, Keystone will become responsive again.

The root cause is that keystone tries to send notifications about its actions, and there is no live RabbitMQ to which they could be sent. oslo.messaging tries in an indefinite loop to reconnect to the RabbitMQ servers in the list and always fails (the errors can be seen in the keystone log). This is valid behaviour from the point of view of oslo.messaging, as it was designed not to lose notifications. Keeping in mind that a deployment fix is already ready, there is no reason to change it.
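The behaviour Dmitry describes can be modeled in a few lines of Python. This is a toy sketch, not oslo.messaging's real code: `send_notification`, `connect`, and the backoff values are all illustrative. The point is that with no upper bound on retries, the caller blocks until a broker becomes reachable, which is exactly what the keystone request does:

```python
import time

def send_notification(connect, backoff=1.0, max_wait=None):
    """Toy model of a notifier that retries the broker connection
    in a loop. With max_wait=None it never gives up, so the calling
    request (e.g. a keystone API call) blocks until some RabbitMQ
    host becomes reachable."""
    waited = 0.0
    while True:
        try:
            return connect()  # attempt to reach one of the rabbit_hosts
        except ConnectionError:
            if max_wait is not None and waited >= max_wait:
                raise  # give up only if a bound was configured
            time.sleep(backoff)  # back off, then retry the host list
            waited += backoff
```

A caller that firewalls the broker and then unblocks it would see the call return only at that moment, matching step 5 of the reproduction above.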

Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Dmitry Ilyin (idv1985)
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Dmitry, this bug is not related to your analysis at all. As I told you before, we had thought these were 2 similar bugs, but deduplicated them later.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/219668
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=5bc271551ada60b0970f2d3bdc69aa4496c41d9f
Submitter: Jenkins
Branch: master

commit 5bc271551ada60b0970f2d3bdc69aa4496c41d9f
Author: Dmitry Ilyin <email address hidden>
Date: Wed Sep 2 15:15:23 2015 +0300

    Add timeout to the openstack command

    * Time out the openstack command after 20 seconds
      in case it's not responding.
    * Retry failed openstack commands

    Change-Id: I63b7d6e89888d9fc97ecf88c46a01ece80ca9c63
    Related-Bug: 1490523
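The approach the merged change describes (a hard timeout plus retries around each openstack CLI call) can be sketched as follows. The real fix lives in the fuel-library Puppet provider and is written in Ruby; this Python version, with hypothetical names like `run_with_retries`, only illustrates the idea:

```python
import subprocess

def run_with_retries(cmd, timeout=20, retries=3):
    """Run a CLI command, but kill it after `timeout` seconds and
    retry on failure instead of letting the deployment hang forever."""
    last_err = None
    for attempt in range(retries):
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True,
                timeout=timeout,  # raises TimeoutExpired if the call hangs
                check=True,       # raises CalledProcessError on non-zero exit
            )
            return result.stdout
        except (subprocess.TimeoutExpired,
                subprocess.CalledProcessError) as err:
            last_err = err  # keystone may not be ready yet; try again
    raise RuntimeError(f"{cmd!r} failed after {retries} attempts: {last_err}")

# e.g.: run_with_retries(["openstack", "user", "list"])
```

With this wrapper a hung `openstack user list` fails fast and gets retried, so Puppet reports an error after a bounded time instead of timing out the whole deployment.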

Changed in fuel:
assignee: Dmitry Ilyin (idv1985) → Matthew Mosesohn (raytrac3r)
Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Dmitry Ilyin (idv1985)
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Vladimir, I agree that the issues should be split, but since we didn't file ours separately, I am writing here that we do not consider such Keystone/oslo.messaging behaviour a bug. That is just to be explicit.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-plugin-detach-keystone (master)

Reviewed: https://review.openstack.org/219724
Committed: https://git.openstack.org/cgit/stackforge/fuel-plugin-detach-keystone/commit/?id=418cbc01cd203982166e217244969c459a4f03a5
Submitter: Jenkins
Branch: master

commit 418cbc01cd203982166e217244969c459a4f03a5
Author: Matthew Mosesohn <email address hidden>
Date: Wed Sep 2 17:36:59 2015 +0300

    Add rabbitmq workaround if rabbitmq plugin enabled

    Keystone admin fails if no RabbitMQ servers are available,
    so AMQP should be correctly configured.

    FIXME. The RabbitMQ plugin cannot notify other plugins of its
    settings because in normal deployments those roles are
    not present in the deployment graph and fail to resolve.

    Change-Id: Ifb03165d44395ac7957180c454b70c148d653df3
    Partial-Bug: #1490523

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

Can we run the test again? We have several partial fixes that should fix the issue in general

Revision history for this message
Alexey Shtokolov (ashtokolov) wrote :

Fix was verified on ISO#281

tags: added: on-verification
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

Issue reproduced with the following scenario:
1. Deploy Neutron tun cluster with 1 controller
2. After deployment add 2 controllers and re-deploy
3. After deployment add 2 controllers and 1 compute and re-deploy

Actual result: deployment failed on the first added controller (node-5):
2015-09-07 19:36:38 ERR [668] Task '{"priority"=>4700, "type"=>"puppet", "uids"=>["5"], "parameters"=>{"puppet_modules"=>"/etc/puppet/modules", "puppet_manifest"=>"/etc/puppet/modules/osnailyfacter/modular/swift/keystone.pp", "timeout"=>1800, "cwd"=>"/"}}' failed on node 5

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

The issue here is actually different: apache mod_wsgi fails due to segfaults:

2015-09-07T18:05:20.858706+00:00 err: [Mon Sep 07 18:05:14.493726 2015] [core:notice] [pid 4163:tid 140140617865088] AH00052: child pid 20360 exit signal Segmentation fault (11)
2015-09-07T18:05:20.858706+00:00 err: [Mon Sep 07 18:05:14.493765 2015] [core:error] [pid 4163:tid 140140617865088] AH00546: no record of generation 0 of exiting child 20360
2015-09-07T18:05:20.858706+00:00 err: [Mon Sep 07 18:05:14.493773 2015] [core:notice] [pid 4163:tid 140140617865088] AH00052: child pid 20361 exit signal Segmentation fault (11)
2015-09-07T18:05:20.858706+00:00 err: [Mon Sep 07 18:05:14.493776 2015] [core:error] [pid 4163:tid 140140617865088] AH00546: no record of generation 0 of exiting child 20361
2015-09-07T18:05:20.858706+00:00 err: [Mon Sep 07 18:05:15.495039 2015] [core:notice] [pid 4163:tid 140140617865088] AH00052: child pid 20386 exit signal Segmentation fault (11)
2015-09-07T18:05:20.858706+00:00 err: [Mon Sep 07 18:05:15.495079 2015] [core:error] [pid 4163:tid 140140617865088] AH00546: no record of generation 1 of exiting child 20386
2015-09-07T18:05:20.858706+00:00 err: [Mon Sep 07 18:05:16.496341 2015] [core:notice] [pid 4163:tid 140140617865088] AH00052: child pid 20390 exit signal Segmentation fault (11)
2015-09-07T18:05:20.858706+00:00 err: [Mon Sep 07 18:05:16.496380 2015] [core:error] [pid 4163:tid 140140617865088] AH00546: no record of generation 0 of exiting child 20390
2015-09-07T18:05:20.858706+00:00 err: [Mon Sep 07 18:05:17.497652 2015] [core:notice] [pid 4163:tid 140140617865088] AH00052: child pid 20400 exit signal Segmentation fault (11)
2015-09-07T18:05:20.858706+00:00 err: [Mon Sep 07 18:05:17.497690 2015] [core:error] [pid 4163:tid 140140617865088] AH00546: no record of generation 0 of exiting child 20400
2015-09-07T18:05:20.858706+00:00 err: [Mon Sep 07 18:05:18.499008 2015] [core:notice] [pid 4163:tid 140140617865088] AH00052: child pid 20439 exit signal Segmentation fault (11)
2015-09-07T18:05:20.858706+00:00 err: [Mon Sep 07 18:05:18.499055 2015] [core:error] [pid 4163:tid 140140617865088] AH00546: no record of generation 0 of exiting child 20439
2015-09-07T18:05:20.858962+00:00 err: [Mon Sep 07 18:05:19.500107 2015] [core:notice] [pid 4163:tid 140140617865088] AH00052: child pid 20465 exit signal Segmentation fault (11)
2015-09-07T18:05:20.858962+00:00 err: [Mon Sep 07 18:05:19.500145 2015] [core:error] [pid 4163:tid 140140617865088] AH00546: no record of generation 0 of exiting child 20465

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

According to Vladimir's investigation, this issue should be closed, and this one should be reopened instead: https://bugs.launchpad.net/mos/+bug/1491576

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

I tested this case on ISO 288 and it passed

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

verified on
{
    "build_id": "288",
    "build_number": "288",
    "release_versions": {
        "2015.1.0-7.0": {
            "VERSION": {
                "build_id": "288",
                "build_number": "288",
                "api": "1.0",
                "fuel-library_sha": "121016a09b0e889994118aa3ea42fa67eabb8f25",
                "nailgun_sha": "93477f9b42c5a5e0506248659f40bebc9ac23943",
                "feature_groups": [
                    "mirantis"
                ],
                "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd",
                "openstack_version": "2015.1.0-7.0",
                "fuel-agent_sha": "082a47bf014002e515001be05f99040437281a2d",
                "production": "docker",
                "python-fuelclient_sha": "1ce8ecd8beb640f2f62f73435f4e18d1469979ac",
                "astute_sha": "a717657232721a7fafc67ff5e1c696c9dbeb0b95",
                "fuel-ostf_sha": "1f08e6e71021179b9881a824d9c999957fcc7045",
                "release": "7.0",
                "fuelmain_sha": "6b83d6a6a75bf7bca3177fcf63b2eebbf1ad0a85"
            }
        }
    },
    "auth_required": true,
    "api": "1.0",
    "fuel-library_sha": "121016a09b0e889994118aa3ea42fa67eabb8f25",
    "nailgun_sha": "93477f9b42c5a5e0506248659f40bebc9ac23943",
    "feature_groups": [
        "mirantis"
    ],
    "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd",
    "openstack_version": "2015.1.0-7.0",
    "fuel-agent_sha": "082a47bf014002e515001be05f99040437281a2d",
    "production": "docker",
    "python-fuelclient_sha": "1ce8ecd8beb640f2f62f73435f4e18d1469979ac",
    "astute_sha": "a717657232721a7fafc67ff5e1c696c9dbeb0b95",
    "fuel-ostf_sha": "1f08e6e71021179b9881a824d9c999957fcc7045",
    "release": "7.0",
    "fuelmain_sha": "6b83d6a6a75bf7bca3177fcf63b2eebbf1ad0a85"
}

Changed in fuel:
status: Fix Committed → Fix Released