Fuel for OpenStack

Controller replacement fails: RabbitMQ goes down after node deletion

Bug #1541029 reported by Artem Panchenko on 2016-02-02

This bug affects 1 person

Affects             Status        Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Released  High        Bogdan Dobrelya  Fuel for OpenStack 9.0
8.0.x               Fix Released  High        Bogdan Dobrelya  Fuel for OpenStack 8.0

Bug Description

Environment deployment fails if a controller node is replaced (the old node is removed and a new one is added):

2016-02-02 02:39:16 DEBUG [796] Task time summary: dump_rabbitmq_definitions with status error on node 1 took 00:00:22
2016-02-02 02:39:16 DEBUG [796] Task time summary: dump_rabbitmq_definitions with status error on node 2 took 00:00:22
2016-02-02 02:39:16 DEBUG [796] Task time summary: dump_rabbitmq_definitions with status error on node 6 took 00:00:22
2016-02-02 02:39:16 INFO [796] Casting message to Nailgun:{"method"=>"deploy_resp", "args"=> {"task_uuid"=>"31bd49c1-fb2c-412f-962a-11381a7b781a", "nodes"=> [{"uid"=>"1", "status"=>"error", "error_type"=>"deploy", "role"=>"hook", "hook"=>"dump_rabbitmq_definitions", "error_msg"=>"Puppet run failed. Check puppet logs for details"}, {"uid"=>"2", "status"=>"error", "error_type"=>"deploy", "role"=>"hook", "hook"=>"dump_rabbitmq_definitions", "error_msg"=>"Puppet run failed. Check puppet logs for details"}, {"uid"=>"6", "status"=>"error", "error_type"=>"deploy", "role"=>"hook", "hook"=>"dump_rabbitmq_definitions", "error_msg"=>"Puppet run failed. Check puppet logs for details"}], "error"=> "Failed to execute hook 'dump_rabbitmq_definitions' Puppet run failed. Check puppet logs for details"}}
2016-02-02 02:39:16 INFO [796] Casting message to Nailgun:{"method"=>"deploy_resp", "args"=> {"task_uuid"=>"31bd49c1-fb2c-412f-962a-11381a7b781a", "nodes"=> [{"uid"=>"3", "status"=>"error", "role"=>"hook", "error_type"=>"deploy"}, {"uid"=>"5", "status"=>"error", "role"=>"hook", "error_type"=>"deploy"}]}}
2016-02-02 02:39:16 INFO [796] Casting message to Nailgun:{"method"=>"deploy_resp", "args"=> {"task_uuid"=>"31bd49c1-fb2c-412f-962a-11381a7b781a", "status"=>"error", "error"=> "Method granular_deploy. Failed to execute hook 'dump_rabbitmq_definitions' Puppet run failed. Check puppet logs for details --- uids: - '1' - '2' - '6' parameters: puppet_modules: /etc/puppet/modules puppet_manifest: /etc/puppet/modules/osnailyfacter/modular/astute/dump_rabbitmq_definitions.pp timeout: 180 cwd: / priority: 100 fail_on_error: true type: puppet id: dump_rabbitmq_definitions . Inspect Astute logs for the details"}}

node-2 2016-02-02T02:39:09.797343 err: (/Stage[main]/Main/Exec[rabbitmq-dump-definitions]/returns) change from notrun to 0 failed: curl -u nova:ok4dJUpvLZunACcORtB0kiqX http://localhost:15672/api/definitions -o /etc/rabbitmq/definitions.full returned 7 instead of one of [0]
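For readers of the log line above: curl exit code 7 means curl failed to connect to the host, which is consistent with nothing listening on localhost:15672 on node-2. A minimal sketch of a helper that translates a few well-known curl exit codes is shown below. The function name explain_curl_rc is a hypothetical illustration and is not part of Fuel or of the dump_rabbitmq_definitions task.

```shell
# Hypothetical helper (not part of Fuel): map a few well-known curl exit
# codes to a short explanation. Codes taken from the curl(1) man page.
explain_curl_rc() {
    case "$1" in
        0)  echo "ok" ;;
        6)  echo "could not resolve host" ;;
        7)  echo "failed to connect to host (is the service listening?)" ;;
        28) echo "operation timed out" ;;
        *)  echo "curl exit code $1 (see curl(1), EXIT CODES)" ;;
    esac
}

# Example use after a curl call (credentials are placeholders):
#   curl -s -u "$user:$pass" http://localhost:15672/api/definitions -o /tmp/definitions.full
#   explain_curl_rc $?
```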

root@node-2:~# pcs status | grep -A 2 p_rabbitmq-server
Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
Masters: [ node-1.test.domain.local ]
Slaves: [ node-2.test.domain.local node-6.test.domain.local ]

root@node-2:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@messaging-node-2' ...
Error: unable to connect to node 'rabbit@messaging-node-2': nodedown
Segmentation fault

Steps to reproduce:

1. Deploy environment with 3 controllers, 2 computes and 1 compute+cinder node
2. Remove 1 controller node and add 1 controller+cinder node
3. Deploy changes

Expected result: controller is replaced, deployment is successful, environment passes OSTF

Actual result: deployment of controllers fails on dump_rabbitmq_definitions task

Artem Panchenko (apanchenko-8) wrote on 2016-02-02:

fail_error_neutron_tun_ha_addremove-fuel-snapshot-2016-02-02_02-39-22.tar.xz (68.9 MiB, application/octet-stream)

Ryan Moe (rmoe) wrote on 2016-02-02:

The rabbitmq-server process definitely wasn't running on node-2. It looks like it went down around 1 minute after node-4 was fenced (00:46). I don't see any rabbitmq server logs in the snapshot though so I can't tell what happened to node-2 at that time. lrmd.log (http://paste.openstack.org/show/485778/) on node-2 has some more information.

Dmitry Klenov (dklenov) on 2016-02-03

Changed in fuel:
status:	New → Confirmed

Michael Polenchuk (mpolenchuk) wrote on 2016-02-03:

Is it OK that pacemaker uses the node name node-2, while rabbitmq uses messaging-node-2?

Michael Polenchuk (mpolenchuk) wrote on 2016-02-03:

Environment deployment fails if controller node is replaced (old node is removed and new one is added).
.....
2. Remove 1 compute+cinder node and add 1 controller+cinder node

What actually happened on the env?

Dmitry Bilunov (dbilunov) on 2016-02-03

Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → Dmitry Bilunov (dbilunov)

Matthew Mosesohn (raytrac3r) on 2016-02-03

tags:

added: team-bugfix

Vladimir Kuklin (vkuklin) wrote on 2016-02-03:

Michael, yes, it is OK: we use a different node name to distinguish between admin network IPs and messaging network IPs.

Vladimir Kuklin (vkuklin) wrote on 2016-02-03:

I suppose it happens because node-2 keeps trying to join the cluster and never succeeds. This seems to be an issue in rabbitmq itself. We need @binarin's help here.

Bogdan Dobrelya (bogdando) wrote on 2016-02-03:

Segfaulting points to some issue in rabbitmq-server itself.

Vladimir Kuklin (vkuklin) wrote on 2016-02-03:

could be a duplicate of https://bugs.launchpad.net/mos/+bug/1523622

Alexey Lebedeff (alebedev-a) wrote on 2016-02-03:

There are a lot of "Segmentation fault (core dumped)" messages in the logs, so I need those dumps. But as they always happen at the end of a rabbitmqctl run, they are probably caused by some incorrect cleanup and aren't the root cause of this issue.

Bogdan Dobrelya (bogdando) wrote on 2016-02-03:

#10

A note: the rabbit fence daemon does not come into play here because of the bug https://bugs.launchpad.net/fuel/+bug/1538597

Alexey Lebedeff (alebedev-a) wrote on 2016-02-03:

#11

Wasted some time because https://bugs.launchpad.net/fuel/+bug/1530296 was not backported to 8.0:

2016-02-02T02:28:48.473500+00:00 node-2 info: INFO: p_rabbitmq-server: join_to_cluster(): Joining to cluster by node

Bogdan Dobrelya (bogdando) wrote on 2016-02-03:

#12

https://bugs.launchpad.net/fuel/+bug/1472230/comments/55

Bogdan Dobrelya (bogdando) wrote on 2016-02-03:

#13

@Ryan, the rabbit logs may be missing because of the bug #1541397

Bogdan Dobrelya (bogdando) wrote on 2016-02-04:

#14

After more RCA, this bug does not seem to be a duplicate: the root cause looks different and is related to the messaging- prefix of the rabbit nodes.

Changed in fuel:
assignee:	Dmitry Bilunov (dbilunov) → Bogdan Dobrelya (bogdando)
status:	Confirmed → In Progress
tags:	added: feature-network-template

Bogdan Dobrelya (bogdando) wrote on 2016-02-04:

#15

The root cause may be that the stop action cannot kill an unresponsive "beam.smp" process.
How to reproduce:
# kill -STOP `pidof beam.smp`
# ocf_handler_rabbitmq-server stop
(it throws an error, see the snippet at http://pastebin.com/d4Ki8wi5, and rabbitmqctl segfaults)
# ps -f -p `pidof beam.smp`

Expected: the ps output is empty once the stop action has finished
Actual: beam.smp is left running and the stop action fails with
lrmd: ERROR: RMQ-runtime (beam) couldn't be stopped and will likely became unmanaged. Take care of it manually!
lrmd: INFO: p_rabbitmq-server: stop: action end.
Exit status: Error: Generic (1)

Normally, with fencing enabled, the failed node would be recovered by STONITH. But as we don't use fencing, the only option we have is to ensure that the stop action kills beam.smp and succeeds.
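
The reproduction above relies on a general signal property: a process stopped with SIGSTOP does not act on SIGTERM until it is resumed, while SIGKILL terminates it regardless. A minimal sketch of a stop helper that escalates accordingly, and falls back to name matching when no PID is known, might look like the following. The function name stop_proc, its arguments and the default timeout are illustrative assumptions; this is not the code from the fuel-library OCF script.

```shell
# Hypothetical helper, not the actual OCF code: stop a process by PID, or by
# exact name when the PID is "none", escalating from SIGTERM to SIGKILL.
stop_proc() {
    pid="$1"             # PID, or "none" to match by process name
    name="$2"            # process name, used only when pid is "none"
    timeout="${3:-10}"   # seconds to wait after SIGTERM before SIGKILL

    if [ "$pid" = "none" ]; then
        pids=$(pgrep -x "$name") || return 0   # nothing matches, nothing to stop
    else
        pids="$pid"
    fi

    for p in $pids; do
        kill -TERM "$p" 2>/dev/null || continue   # already gone
        i=0
        while kill -0 "$p" 2>/dev/null && [ "$i" -lt "$timeout" ]; do
            sleep 1
            i=$((i + 1))
        done
        # SIGTERM stays pending for a SIGSTOPped process; SIGKILL cannot be
        # caught, blocked or ignored, so it also ends a stopped process.
        if kill -0 "$p" 2>/dev/null; then
            kill -KILL "$p" 2>/dev/null
        fi
    done

    # Give the kernel a moment, then report failure if anything survived.
    sleep 1
    for p in $pids; do
        kill -0 "$p" 2>/dev/null && return 1
    done
    return 0
}
```

With a helper like this, `stop_proc none beam.smp` would target every process named exactly beam.smp, which is the kind of name matching needed when the PID file is stale or missing.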

Changed in fuel:
status:	In Progress → Triaged

Artem Panchenko (apanchenko-8) on 2016-02-04

description:

updated

OpenStack Infra (hudson-openstack) wrote on 2016-02-04: Fix proposed to fuel-library (master)

#16

Fix proposed to branch: master
Review: https://review.openstack.org/276154

Changed in fuel:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2016-02-04: Fix proposed to fuel-library (stable/8.0)

#17

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/276201

OpenStack Infra (hudson-openstack) wrote on 2016-02-04: Fix merged to fuel-library (master)

#18

Reviewed: https://review.openstack.org/276154
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=7fa3440ad3b03c0b6242fc6f892f3a0105a0c85b
Submitter: Jenkins
Branch: master

commit 7fa3440ad3b03c0b6242fc6f892f3a0105a0c85b
Author: Bogdan Dobrelya <email address hidden>
Date: Thu Feb 4 11:57:28 2016 +0100

Fix action_stop for the rabbit OCF

    The action_stop may sometimes stop the rabbitmq-server gracefully
    by the PID, but leave unresponsive beam.smp processes running and
    spoiling rabbits. Those shall be stopped as well. The solution is:
    - make proc_stop() to accept a pid=none to use a name matching instead
    - make kill_rmq_and_remove_pid() to stop by the beam process matching as well
    - fix stop_server_process() to ensure there is no beam process left running

Closes-bug: #1541029

Change-Id: Ib9669d15bb714be8a88fd65d7f1815173da788d3
Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status:	In Progress → Fix Committed

OpenStack Infra (hudson-openstack) wrote on 2016-02-04: Fix merged to fuel-library (stable/8.0)

#19

Reviewed: https://review.openstack.org/276201
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=d3b77ffa1581145015726255c956d937c2b273e2
Submitter: Jenkins
Branch: stable/8.0

commit d3b77ffa1581145015726255c956d937c2b273e2
Author: Bogdan Dobrelya <email address hidden>
Date: Thu Feb 4 11:57:28 2016 +0100

Fix action_stop for the rabbit OCF

Closes-bug: #1541029

Change-Id: Ib9669d15bb714be8a88fd65d7f1815173da788d3
Signed-off-by: Bogdan Dobrelya <email address hidden>

Anastasia Palkina (apalkina) on 2016-02-05

tags:

added: on-verification

Anastasia Palkina (apalkina) on 2016-02-08

tags:

removed: on-verification

Dmitriy Kruglov (dkruglov) on 2016-02-12

tags:

added: on-verification

Dmitriy Kruglov (dkruglov) wrote on 2016-02-14:

#20

Verified on MOS 8.0, build 562.

ISO details:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "562"
  build_id: "562"
  fuel-nailgun_sha: "558ca91a854cf29e395940c232911ffb851899c1"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "33634ec27be77ecfb0b56b7e07497ad86d1fdcd3"
  fuel-ostf_sha: "3bc76a63a9e7d195ff34eadc29552f4235fa6c52"
  fuel-mirror_sha: "fb45b80d7bee5899d931f926e5c9512e2b442749"
  fuelmenu_sha: "78ffc73065a9674b707c081d128cb7eea611474f"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "d605bcbabf315382d56d0ce8143458be67c53434"