Fuel for OpenStack

RabbitMQ cluster locks up when a member is removed

Bug #1288831 reported by Bogdan Dobrelya on 2014-03-06

This bug affects 3 people

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Fix Committed	High	Fuel Library (Deprecated)	Fuel for OpenStack 4.1.1

Bug Description

{"build_id": "2014-03-04_12-31-13", "mirantis": "yes", "build_number": "112", "nailgun_sha": "d98b61e073d32c45c98099a11ff263a68b7ba205", "ostf_sha": "dc54d99ddff2f497b131ad1a42362515f2a61afa", "fuelmain_sha": "16637e2ea0ae6fe9a773aceb9d76c6e3a75f6c3b", "astute_sha": "f15f5615249c59c826ea05d26707f062c88db32a", "release": "4.1", "fuellib_sha": "15a55ccff0f59929b32d087679d19e896bde8e0d"}

Steps to reproduce:
* Deploy CentOS HA, nova-network FlatDHCP, tagged interfaces, DEBUG=TRUE: 3 controllers, 1 compute.
* Log on to the 1st controller node and run:
service rabbitmq-server stop
sleep 30; . openrc; heat list
service rabbitmq-server start
rabbitmqctl list_queues
heat list
* If RabbitMQ starts and both list_queues and heat list return normally (marked as (OK) in the console output below):
- repeat the same steps on the other controllers, one by one.
* Otherwise, if any step fails on a controller node (marked as (Hangs) in the console output below):
- reboot that node and check the OpenStack (OS) services and RabbitMQ:
chkconfig | grep openstack | awk '{print $1}' | xargs -n1 -I{} service {} status
rabbitmqctl list_queues
. openrc; heat list
- check the results: all OpenStack services are stopped and RabbitMQ cannot list its queues. This is the subject of this report.
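
Because several of the steps above hang rather than fail, it can help to run them under a time limit so the session is not blocked. The following is a minimal sketch, assuming GNU coreutils `timeout` is available on the controller; `check_hang` and `reproduce_on_controller` are hypothetical helper names, and the 60 and 300 second limits are arbitrary assumptions rather than values from this report.

```shell
#!/bin/sh
# Sketch only: helper names and time limits are assumptions, not part of Fuel.

# Run a command under a time limit and print OK, Hangs, or Failed (<status>),
# mirroring the (OK) / (Hangs) markers used in this report.
check_hang() {
    limit="$1"
    shift
    status=0
    # GNU timeout exits with status 124 when the time limit is reached.
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    if [ "$status" -eq 124 ]; then
        echo "Hangs"
    elif [ "$status" -eq 0 ]; then
        echo "OK"
    else
        echo "Failed ($status)"
    fi
}

# Wraps the manual steps from the report; intended to be run as root
# on one controller node at a time.
reproduce_on_controller() {
    service rabbitmq-server stop
    sleep 30
    . ./openrc   # assumes openrc sits in the current directory
    echo "heat list (RabbitMQ stopped): $(check_hang 60 heat list)"
    echo "rabbitmq-server start: $(check_hang 300 service rabbitmq-server start)"
    echo "rabbitmqctl list_queues: $(check_hang 60 rabbitmqctl list_queues)"
    echo "heat list (RabbitMQ started): $(check_hang 60 heat list)"
}
```

Nothing runs until `reproduce_on_controller` is called explicitly, so the file can be sourced safely before use.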

Issue:
- Once stopped, RabbitMQ is broken and will not start again; after a reboot it starts but remains non-operational.
- None of the OpenStack services start after the controller node is rebooted.
- 'heat list' hangs every time after RabbitMQ has been stopped for the first time.

Console actions and results:

*Pre-patched behavior*
{"build_id": "2014-02-26_13-39-45", "mirantis": "yes", "build_number": "211", "nailgun_sha": "ea08cef3e06a72f47cfaa8cd8fe6d034e2cf722e", "ostf_sha": "8e6681b6d06c7cb20a84c1cc740d5f2492fb9d85", "fuelmain_sha": "baa8bb07393698f1186cb67bb65f1b93907c59bd", "astute_sha": "10cccc87f2ee35510e43c8fa19d2bf916ca1fced", "release": "4.1", "fuellib_sha": "0a2e5bdc01c1e3bb285acb7b39125101e950ac72"}
CentOS HA, nova-network FlatDHCP, tagged interfaces: 3 controllers, 1 compute

[root@node-7 ~]# service rabbitmq-server stop
[root@node-7 ~]# . openrc; heat list
(OK)
[root@node-7 ~]# service rabbitmq-server start
Starting rabbitmq-server: RabbitMQ is going to make 3 attempts to find master node and start.
3 attempts left to start RabbitMQ Server before consider start failed.
SUCCESS
rabbitmq-server.
[root@node-7 ~]# rabbitmqctl list_queues
(OK)

Reboot the node, and check:
[root@node-7 ~]# chkconfig | grep openstack | awk '{print $1}' | xargs -n1 -I{} service {} status
(All OS services are running)

*Patched behavior*
{"build_id": "2014-03-04_12-31-13", "mirantis": "yes", "build_number": "112", "nailgun_sha": "d98b61e073d32c45c98099a11ff263a68b7ba205", "ostf_sha": "dc54d99ddff2f497b131ad1a42362515f2a61afa", "fuelmain_sha": "16637e2ea0ae6fe9a773aceb9d76c6e3a75f6c3b", "astute_sha": "f15f5615249c59c826ea05d26707f062c88db32a", "release": "4.1", "fuellib_sha": "15a55ccff0f59929b32d087679d19e896bde8e0d"}
CentOS HA, nova-network FlatDHCP, tagged interfaces: 3 controllers, 1 compute

[root@node-7 ~]# service rabbitmq-server stop
[root@node-7 ~]# . openrc; heat list
(Hangs)

[root@node-1 ~]# service rabbitmq-server start
Starting rabbitmq-server: RabbitMQ is going to make 3 attempts to find master node and start.
3 attempts left to start RabbitMQ Server before consider start failed.
(Hangs)

If the node is rebooted, RabbitMQ starts, but:
[root@node-1 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-1' ...
[{nodes,[{disc,['rabbit@node-3','rabbit@node-2','rabbit@node-1']}]},
{running_nodes,['rabbit@node-3','rabbit@node-2','rabbit@node-1']}]
...done.
[root@node-1 ~]# rabbitmqctl list_queues
Listing queues ...

=ERROR REPORT==== 6-Mar-2014::14:50:04 ===
Discarding message {'$gen_call',{<0.17752.11>,#Ref<0.0.1.206018>},{info,[name,messages]}} from <0.17752.11> to <0.1694.0> in an old incarnation (1) of this node (2)
(Hangs)

[root@node-1 ~]# rabbitmqctl list_consumers
Listing consumers ...

=ERROR REPORT==== 6-Mar-2014::14:55:37 ===
Discarding message {'$gen_call',{<0.2839.13>,#Ref<0.0.2.95633>},consumers} from <0.2839.13> to <0.1694.0> in an old incarnation (1) of this node (2)
(Hangs)

Reboot the node, and check:
[root@node-7 ~]# chkconfig | grep openstack | awk '{print $1}' | xargs -n1 -I{} service {} status
(All OS services are stopped)
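
The "old incarnation" error reports in the patched-behavior transcript are a convenient marker for this failure. As a rough sketch, captured `rabbitmqctl` output or a log file can be scanned for them; `count_stale_incarnation` is a hypothetical helper name, and the match string is taken from the error text quoted in this report.

```shell
#!/bin/sh
# Sketch only: count_stale_incarnation is a hypothetical helper.
# Reads text on stdin and prints how many lines mention a message being
# discarded because it targets an old incarnation of the node.
count_stale_incarnation() {
    # grep -c prints 0 and returns non-zero when nothing matches,
    # so "|| true" keeps the helper usable under "set -e".
    grep -c 'in an old incarnation' || true
}
```

For example, saved output could be checked with `count_stale_incarnation < saved_output.txt`, where the file name is a placeholder.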

Bogdan Dobrelya (bogdando) on 2014-03-06

description:	updated

Vladimir Kuklin (vkuklin) on 2014-03-06

Changed in fuel:
status:	New → Incomplete

Nastya Urlapova (aurlapova) wrote on 2014-03-06:

Bogdan, sorry, but I can't reproduce the issue. After rebooting the primary controller, RabbitMQ works fine without error reports.
{
build_id: "2014-03-05_07-31-01",
mirantis: "yes",
build_number: "235",
nailgun_sha: "f58aad317829112913f364347b14f1f0518ad371",
ostf_sha: "dc54d99ddff2f497b131ad1a42362515f2a61afa",
fuelmain_sha: "16637e2ea0ae6fe9a773aceb9d76c6e3a75f6c3b",
astute_sha: "f15f5615249c59c826ea05d26707f062c88db32a",
release: "4.1",
fuellib_sha: "73313007c0914e602246ea41fa5e8ca2dfead9f8"
}

Bogdan Dobrelya (bogdando) wrote on 2014-03-06:

logs snapshot (reproduced for node-2) (6.6 MiB, application/x-tar)

Bogdan Dobrelya (bogdando) wrote on 2014-03-06:

Reproduced on node-2, see logs (node-1 was OK, though).

I've also saved the logs in version control (2 commits) here: https://github.com/bogdando/log_snapshots
The 1st commit is the stable state after deployment; the 2nd is the issue reproduced on node-2.
So you can just clone it and run git diff against the node-2 directory to see what happened, e.g.:
git difftool HEAD~1 fuel-snapshot-2014-03-06_18-46-23/localhost/var/log/remote/node-2.test.domain.local
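
For a quick overview before opening a diff tool, the same comparison can be summarised per file. This is a minimal sketch, assuming a git version that supports `-C` and that the repository has been cloned locally; `snapshot_diff_stat` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: snapshot_diff_stat is a hypothetical helper.
# $1 = path to the cloned repository, $2 = directory inside it to compare.
# Prints a per-file summary of changes between the previous commit and HEAD.
snapshot_diff_stat() {
    repo="$1"
    path="$2"
    git -C "$repo" diff --stat HEAD~1 HEAD -- "$path"
}
```

With the repository cloned into `log_snapshots`, the call would be `snapshot_diff_stat log_snapshots fuel-snapshot-2014-03-06_18-46-23/localhost/var/log/remote/node-2.test.domain.local`.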

Ryan Moe (rmoe) on 2014-03-06

Changed in fuel:
status:	Incomplete → Confirmed

Ryan Moe (rmoe) on 2014-03-06

summary:	- RabbitMQ HA regression
	+ RabbitMQ cluster locks up when a member is removed
Changed in fuel:
milestone:	4.1 → 4.1.1
importance:	Critical → High

Ryan Moe (rmoe) wrote on 2014-03-07:

This is only an issue on CentOS. I'm able to reproduce this issue and the only workaround I've found is rebooting all 3 controllers.

Ryan Moe (rmoe) wrote on 2014-03-07:

After upgrading RabbitMQ to 3.2.4 I can't reproduce this issue anymore.
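
To confirm which RabbitMQ version a controller is actually running, the version string can be pulled out of `rabbitmqctl status` and compared against 3.2.4. The sketch below assumes the status output contains an entry of the form `{rabbit,"RabbitMQ","X.Y.Z"}` and that GNU `sort -V` is available; `rabbit_version` and `version_ge` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical and not part of Fuel.

# Reads `rabbitmqctl status` output on stdin and prints the broker version,
# assuming an entry such as {rabbit,"RabbitMQ","3.2.4"} is present.
rabbit_version() {
    sed -n 's/.*{rabbit,"RabbitMQ","\([0-9.]*\)"}.*/\1/p' | head -n 1
}

# Succeeds when version $1 is greater than or equal to version $2.
# Relies on the GNU sort -V (version sort) extension.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

A possible check on a controller: `version_ge "$(rabbitmqctl status | rabbit_version)" 3.2.4 && echo "RabbitMQ is at least 3.2.4"`.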

Dmitry Borodaenko (angdraug) wrote on 2014-03-07:

The fix is to upgrade to RabbitMQ 3, which we have already almost done for 4.1.1.

Changed in fuel:
status:	Confirmed → Triaged

Bogdan Dobrelya (bogdando) on 2014-03-07

description:	updated

Nastya Urlapova (aurlapova) wrote on 2014-03-13:

What is the status of the upgrade to RabbitMQ 3?

Dmitry Borodaenko (angdraug) wrote on 2014-03-13:

We have RabbitMQ packages ready in OSCI-1016; we should test them extensively with 5.0 before backporting them for 4.1.1.

Vladimir Kuklin (vkuklin) on 2014-03-24

Changed in fuel:
milestone:	4.1.1 → 5.0
tags:	added: backports-4.1.1

Dmitry Borodaenko (angdraug) wrote on 2014-03-24:

#10

When upgrading RabbitMQ to 3.x, primary-controller manifests should be updated to set HA policy for all queues:
https://bugs.launchpad.net/fuel/+bug/1296922
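
In RabbitMQ 3.x, queue mirroring is configured through policies. As an illustration only, a policy that mirrors every queue to all nodes could be set as sketched below; the policy name `ha-all`, the catch-all pattern, and the `set_ha_policy_all` helper are assumptions, not what the Fuel manifests contain.

```shell
#!/bin/sh
# Sketch only: policy name, pattern, and helper name are assumptions.
# Applies a policy matching every queue name and mirroring it to all nodes.
# RABBITMQCTL can be overridden (for example with "echo") for a dry run.
set_ha_policy_all() {
    "${RABBITMQCTL:-rabbitmqctl}" set_policy ha-all '.*' '{"ha-mode":"all"}'
}
```

Running `RABBITMQCTL=echo set_ha_policy_all` prints the arguments without touching the broker; `rabbitmqctl list_policies` can be used afterwards to inspect what is set.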

Andrew Woodward (xarses) on 2014-04-08

tags:	added: ha

Bogdan Dobrelya (bogdando) wrote on 2014-04-18:

#11

Considering Ryan's report, I am closing this issue as not reproducible with the RabbitMQ 3 we currently have in the master branch.

Changed in fuel:
status:	Triaged → Fix Committed

Dmitry Borodaenko (angdraug) wrote on 2014-04-18:

#12

Bogdan, please have a look at the rules for tracking bugs targeted for backporting described here:
https://lists.launchpad.net/fuel-dev/msg00698.html

RabbitMQ 3 hasn't been uploaded for 4.1.1 yet, so this should be changed to In Progress with the target milestone set to 4.1.1.

Bogdan Dobrelya (bogdando) wrote on 2014-04-18:

#13

Yes, my mistake, you are completely right

Dmitry Borodaenko (angdraug) on 2014-04-18

Changed in fuel:
status:	Fix Committed → In Progress
milestone:	5.0 → 4.1.1

Mike Scherbakov (mihgen) on 2014-05-08

tags:	added: release-notes

Dmitry Borodaenko (angdraug) wrote on 2014-05-08:

#14

4.1.1 now has RabbitMQ 3.2.

Changed in fuel:
status:	In Progress → Fix Committed

Meg McRoberts (dreidellhasa) wrote on 2014-05-16:

#15

Added to "Fixed Issues" list in 5.0 Release Notes.
