Controller replacement fails: RabbitMQ goes down after node deletion

Bug #1541029 reported by Artem Panchenko
This bug affects 1 person
Affects             Status        Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Released  High        Bogdan Dobrelya
8.0.x               Fix Released  High        Bogdan Dobrelya

Bug Description

Environment deployment fails if a controller node is replaced (the old node is removed and a new one is added):

2016-02-02 02:39:16 DEBUG [796] Task time summary: dump_rabbitmq_definitions with status error on node 1 took 00:00:22
2016-02-02 02:39:16 DEBUG [796] Task time summary: dump_rabbitmq_definitions with status error on node 2 took 00:00:22
2016-02-02 02:39:16 DEBUG [796] Task time summary: dump_rabbitmq_definitions with status error on node 6 took 00:00:22
2016-02-02 02:39:16 INFO [796] Casting message to Nailgun:{"method"=>"deploy_resp", "args"=> {"task_uuid"=>"31bd49c1-fb2c-412f-962a-11381a7b781a", "nodes"=> [{"uid"=>"1", "status"=>"error", "error_type"=>"deploy", "role"=>"hook", "hook"=>"dump_rabbitmq_definitions", "error_msg"=>"Puppet run failed. Check puppet logs for details"}, {"uid"=>"2", "status"=>"error", "error_type"=>"deploy", "role"=>"hook", "hook"=>"dump_rabbitmq_definitions", "error_msg"=>"Puppet run failed. Check puppet logs for details"}, {"uid"=>"6", "status"=>"error", "error_type"=>"deploy", "role"=>"hook", "hook"=>"dump_rabbitmq_definitions", "error_msg"=>"Puppet run failed. Check puppet logs for details"}], "error"=> "Failed to execute hook 'dump_rabbitmq_definitions' Puppet run failed. Check puppet logs for details"}}
2016-02-02 02:39:16 INFO [796] Casting message to Nailgun:{"method"=>"deploy_resp", "args"=> {"task_uuid"=>"31bd49c1-fb2c-412f-962a-11381a7b781a", "nodes"=> [{"uid"=>"3", "status"=>"error", "role"=>"hook", "error_type"=>"deploy"}, {"uid"=>"5", "status"=>"error", "role"=>"hook", "error_type"=>"deploy"}]}}
2016-02-02 02:39:16 INFO [796] Casting message to Nailgun:{"method"=>"deploy_resp", "args"=> {"task_uuid"=>"31bd49c1-fb2c-412f-962a-11381a7b781a", "status"=>"error", "error"=> "Method granular_deploy. Failed to execute hook 'dump_rabbitmq_definitions' Puppet run failed. Check puppet logs for details --- uids: - '1' - '2' - '6' parameters: puppet_modules: /etc/puppet/modules puppet_manifest: /etc/puppet/modules/osnailyfacter/modular/astute/dump_rabbitmq_definitions.pp timeout: 180 cwd: / priority: 100 fail_on_error: true type: puppet id: dump_rabbitmq_definitions . Inspect Astute logs for the details"}}

node-2 2016-02-02T02:39:09.797343 err: (/Stage[main]/Main/Exec[rabbitmq-dump-definitions]/returns) change from notrun to 0 failed: curl -u nova:ok4dJUpvLZunACcORtB0kiqX http://localhost:15672/api/definitions -o /etc/rabbitmq/definitions.full returned 7 instead of one of [0]
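
For reference when reading the Puppet error above: curl exit status 7 means that curl failed to connect to the host, i.e. nothing was accepting connections on localhost:15672 at that moment, which is consistent with rabbitmq-server not running on node-2. A minimal sketch of a helper that decodes the exit codes most likely to show up for this task; the function name is hypothetical and is not part of fuel-library:

```shell
# Hypothetical helper (not part of fuel-library): translate a few common
# curl exit codes into a short explanation.
explain_curl_exit() {
    case "$1" in
        0)  echo "success" ;;
        6)  echo "could not resolve host" ;;
        7)  echo "failed to connect to host" ;;
        22) echo "HTTP error returned (only with --fail)" ;;
        28) echo "operation timed out" ;;
        *)  echo "other curl error, see the EXIT CODES section of curl(1)" ;;
    esac
}

explain_curl_exit 7
```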

root@node-2:~# pcs status | grep -A 2 p_rabbitmq-server
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-1.test.domain.local ]
     Slaves: [ node-2.test.domain.local node-6.test.domain.local ]

root@node-2:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@messaging-node-2' ...
Error: unable to connect to node 'rabbit@messaging-node-2': nodedown
Segmentation fault

Steps to reproduce:

1. Deploy environment with 3 controllers, 2 computes and 1 compute+cinder nodes
2. Remove 1 controller node and add 1 controller+cinder node
3. Deploy changes

Expected result: controller is replaced, deployment is successful, environment passes OSTF

Actual result: deployment of controllers fails on dump_rabbitmq_definitions task

Artem Panchenko (apanchenko-8) wrote :
Ryan Moe (rmoe) wrote :

The rabbitmq-server process definitely wasn't running on node-2. It looks like it went down around 1 minute after node-4 was fenced (00:46). I don't see any rabbitmq server logs in the snapshot though so I can't tell what happened to node-2 at that time. lrmd.log (http://paste.openstack.org/show/485778/) on node-2 has some more information.

Dmitry Klenov (dklenov)
Changed in fuel:
status: New → Confirmed
Michael Polenchuk (mpolenchuk) wrote :

Is it OK that Pacemaker uses the node name node-2 while RabbitMQ uses messaging-node-2?

Michael Polenchuk (mpolenchuk) wrote :

Environment deployment fails if controller node is replaced (old node is removed and new one is added).
.....
2. Remove 1 compute+cinder node and add 1 controller+cinder node

What actually happened on the env?

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Dmitry Bilunov (dbilunov)
tags: added: team-bugfix
Vladimir Kuklin (vkuklin) wrote :

Michael, yes, it is OK: we use a different node name to distinguish between admin network IPs and messaging network IPs.

Vladimir Kuklin (vkuklin) wrote :

I suppose it happens because node-2 keeps failing to join the cluster indefinitely. This seems to be an issue in RabbitMQ itself. We need @binarin's help here.

Bogdan Dobrelya (bogdando) wrote :

The segfault points to some issue in rabbitmq-server itself.

Vladimir Kuklin (vkuklin) wrote :
Alexey Lebedeff (alebedev-a) wrote :

There are a lot of "Segmentation fault (core dumped)" messages in the logs, so I need those dumps. But since they always happen at the end of a rabbitmqctl run, they are probably caused by some incorrect cleanup and aren't the root cause of this issue.

Bogdan Dobrelya (bogdando) wrote :

Note: the rabbit fence daemon does not come into play here because of bug https://bugs.launchpad.net/fuel/+bug/1538597

Alexey Lebedeff (alebedev-a) wrote :

Wasted some time because https://bugs.launchpad.net/fuel/+bug/1530296 was not backported to 8.0:

2016-02-02T02:28:48.473500+00:00 node-2 info: INFO: p_rabbitmq-server: join_to_cluster(): Joining to cluster by node

Bogdan Dobrelya (bogdando) wrote :
Bogdan Dobrelya (bogdando) wrote :

@Ryan, the rabbit logs may be missing because of bug #1541397.

Bogdan Dobrelya (bogdando) wrote :

After more root cause analysis, this bug does not seem to be a duplicate: the root cause looks different and appears related to the messaging- prefix of the rabbit nodes.

Changed in fuel:
assignee: Dmitry Bilunov (dbilunov) → Bogdan Dobrelya (bogdando)
status: Confirmed → In Progress
tags: added: feature-network-template
Bogdan Dobrelya (bogdando) wrote :

The root cause may be that the stop action cannot kill an unresponsive "beam.smp" process.
How to reproduce:
# kill -STOP `pidof beam.smp`
# ocf_handler_rabbitmq-server stop
(it throws an error snippet http://pastebin.com/d4Ki8wi5 and rabbitmqctl segfaults)
# ps -f -p `pidof beam.smp`

Expected: the output is empty after the stop action has finished
Actual: beam is left running and the stop action fails with
lrmd: ERROR: RMQ-runtime (beam) couldn't be stopped and will likely became unmanaged. Take care of it manually!
lrmd: INFO: p_rabbitmq-server: stop: action end.
Exit status: Error: Generic (1)

Normally, with fencing enabled, the failed node would be recovered by STONITH. But as we don't use fencing, the only option we have is to ensure that the stop action kills beam.smp and succeeds.
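
The reproduction above relies on the fact that a process suspended with SIGSTOP does not act on SIGTERM until it is continued, while SIGKILL still terminates it. A minimal sketch of a stop routine that escalates accordingly; this is not the code from the OCF script, and the function names and the default grace period are assumptions made for illustration:

```shell
# Illustrative sketch only, not the fuel-library OCF code.

# Return 0 if the PID exists and is not a zombie.
is_running() {
    kill -0 "$1" 2>/dev/null || return 1
    # A killed but not yet reaped child still answers kill -0.
    # On Linux, /proc reports its state as Z; treat that as stopped.
    if [ -r "/proc/$1/status" ] &&
       grep -q '^State:[[:space:]]*Z' "/proc/$1/status" 2>/dev/null; then
        return 1
    fi
    return 0
}

# Wait up to $2 seconds for PID $1 to go away.
wait_gone() {
    _left=$2
    while is_running "$1"; do
        [ "$_left" -le 0 ] && return 1
        sleep 1
        _left=$((_left - 1))
    done
    return 0
}

# stop_with_escalation PID [GRACE_SECONDS]
stop_with_escalation() {
    _pid=$1
    _grace=${2:-10}
    is_running "$_pid" || return 0
    kill -TERM "$_pid" 2>/dev/null || true
    wait_gone "$_pid" "$_grace" && return 0
    # SIGTERM was ignored or is still pending (e.g. the process is stopped).
    kill -KILL "$_pid" 2>/dev/null || true
    wait_gone "$_pid" 5
}
```

For example, `stop_with_escalation 12345 30` would give PID 12345 thirty seconds to exit on SIGTERM before sending SIGKILL, and returns non-zero only if the process is still present afterwards.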

Changed in fuel:
status: In Progress → Triaged
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/276154

Changed in fuel:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/276201

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/276154
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=7fa3440ad3b03c0b6242fc6f892f3a0105a0c85b
Submitter: Jenkins
Branch: master

commit 7fa3440ad3b03c0b6242fc6f892f3a0105a0c85b
Author: Bogdan Dobrelya <email address hidden>
Date: Thu Feb 4 11:57:28 2016 +0100

    Fix action_stop for the rabbit OCF

    The action_stop may sometimes stop the rabbitmq-server gracefully
    by the PID, but leave unresponsive beam.smp processes running and
    spoiling rabbits. Those shall be stopped as well. The solution is:
    - make proc_stop() to accept a pid=none to use a name matching instead
    - make kill_rmq_and_remove_pid() to stop by the beam process matching as well
    - fix stop_server_process() to ensure there is no beam process left running

    Closes-bug: #1541029

    Change-Id: Ib9669d15bb714be8a88fd65d7f1815173da788d3
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/8.0)

Reviewed: https://review.openstack.org/276201
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=d3b77ffa1581145015726255c956d937c2b273e2
Submitter: Jenkins
Branch: stable/8.0

commit d3b77ffa1581145015726255c956d937c2b273e2
Author: Bogdan Dobrelya <email address hidden>
Date: Thu Feb 4 11:57:28 2016 +0100

    Fix action_stop for the rabbit OCF

    The action_stop may sometimes stop the rabbitmq-server gracefully
    by the PID, but leave unresponsive beam.smp processes running and
    spoiling rabbits. Those shall be stopped as well. The solution is:
    - make proc_stop() to accept a pid=none to use a name matching instead
    - make kill_rmq_and_remove_pid() to stop by the beam process matching as well
    - fix stop_server_process() to ensure there is no beam process left running

    Closes-bug: #1541029

    Change-Id: Ib9669d15bb714be8a88fd65d7f1815173da788d3
    Signed-off-by: Bogdan Dobrelya <email address hidden>

tags: added: on-verification
tags: removed: on-verification
tags: added: on-verification
Dmitriy Kruglov (dkruglov) wrote :

Verified on MOS 8.0, build 562.

ISO details:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "562"
  build_id: "562"
  fuel-nailgun_sha: "558ca91a854cf29e395940c232911ffb851899c1"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "33634ec27be77ecfb0b56b7e07497ad86d1fdcd3"
  fuel-ostf_sha: "3bc76a63a9e7d195ff34eadc29552f4235fa6c52"
  fuel-mirror_sha: "fb45b80d7bee5899d931f926e5c9512e2b442749"
  fuelmenu_sha: "78ffc73065a9674b707c081d128cb7eea611474f"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "d605bcbabf315382d56d0ce8143458be67c53434"

tags: removed: on-verification
Alexey Galkin (agalkin) wrote :

Verified as fixed in 9.0-242.

Result: http://paste.openstack.org/show/495442/

Changed in fuel:
status: Fix Committed → Fix Released