RabbitMQ cluster locks up when a member is removed

Bug #1288831 reported by Bogdan Dobrelya
This bug affects 3 people
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: High
Assigned to: Fuel Library (Deprecated)

Bug Description

{"build_id": "2014-03-04_12-31-13", "mirantis": "yes", "build_number": "112", "nailgun_sha": "d98b61e073d32c45c98099a11ff263a68b7ba205", "ostf_sha": "dc54d99ddff2f497b131ad1a42362515f2a61afa", "fuelmain_sha": "16637e2ea0ae6fe9a773aceb9d76c6e3a75f6c3b", "astute_sha": "f15f5615249c59c826ea05d26707f062c88db32a", "release": "4.1", "fuellib_sha": "15a55ccff0f59929b32d087679d19e896bde8e0d"}

Reproduce:
* Deploy CentOS HA, nova-network FlatDHCP, tagged interfaces, DEBUG=TRUE: 3 controllers, 1 compute
* Log on to the 1st controller node and run the following commands:
service rabbitmq-server stop
sleep 30; . openrc; heat list
service rabbitmq-server start
rabbitmqctl list_queues
heat list
* If RabbitMQ startup, list_queues, and heat list all succeed (normal results are marked (OK) below):
- repeat the same steps on the other controllers, one by one.
* Otherwise, if any of them fails on a given controller node (failures are marked (Hangs) below):
- reboot that node, then check the OpenStack services and RabbitMQ:
chkconfig | grep openstack | awk '{print $1}' | xargs -n1 -I{} service {} status
rabbitmqctl list_queues
. openrc; heat list
- check the results: all OpenStack services will be stopped and RabbitMQ will not be able to list its queues. That is the subject of this issue.
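
Because the failure mode here is a hang rather than an error, the checks above are easier to script if each command is given a time limit. The following is a minimal sketch, assuming GNU coreutils `timeout` is available on the controller; the helper name `check_cmd` and the 60-second limit in the usage comments are illustrative and not part of the original report.

```shell
# check_cmd LIMIT CMD [ARGS...]
# Runs CMD under a time limit and prints a marker in the style used in
# this report: (OK) on success, (Hangs) if the limit was hit.
# Hypothetical helper; relies on GNU timeout exiting with status 124
# when the command is killed for running too long.
check_cmd() {
    check_limit=$1
    shift
    if timeout "$check_limit" "$@" >/dev/null 2>&1; then
        echo "(OK)"
    else
        check_status=$?
        if [ "$check_status" -eq 124 ]; then
            echo "(Hangs)"
        else
            echo "(Failed: exit $check_status)"
        fi
    fi
}

# Example usage on a controller (the limit is arbitrary):
#   check_cmd 60 rabbitmqctl list_queues
#   check_cmd 60 sh -c '. ./openrc; heat list'
```

The `heat list` check is wrapped in `sh -c` because `timeout` can only run external commands, and `openrc` has to be sourced in the same shell that runs the client.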

Issue:
- Once stopped, RabbitMQ is broken and will not start again; after a reboot it starts but remains non-operational.
- None of the OpenStack services start after the controller node is rebooted.
- 'heat list' hangs every time after RabbitMQ has been stopped for the 1st time.

Console actions and results:

*Pre-patched behavior*
{"build_id": "2014-02-26_13-39-45", "mirantis": "yes", "build_number": "211", "nailgun_sha": "ea08cef3e06a72f47cfaa8cd8fe6d034e2cf722e", "ostf_sha": "8e6681b6d06c7cb20a84c1cc740d5f2492fb9d85", "fuelmain_sha": "baa8bb07393698f1186cb67bb65f1b93907c59bd", "astute_sha": "10cccc87f2ee35510e43c8fa19d2bf916ca1fced", "release": "4.1", "fuellib_sha": "0a2e5bdc01c1e3bb285acb7b39125101e950ac72"}
CentOS HA, nova-network FlatDHCP, tagged interfaces: 3 controllers, 1 compute

[root@node-7 ~]# service rabbitmq-server stop
[root@node-7 ~]# . openrc; heat list
(OK)
[root@node-7 ~]# service rabbitmq-server start
Starting rabbitmq-server: RabbitMQ is going to make 3 attempts to find master node and start.
3 attempts left to start RabbitMQ Server before consider start failed.
SUCCESS
rabbitmq-server.
[root@node-7 ~]# rabbitmqctl list_queues
(OK)

Reboot the node, and check:
[root@node-7 ~]# chkconfig | grep openstack | awk '{print $1}' | xargs -n1 -I{} service {} status
(All OpenStack services are running)

*Patched behavior*
{"build_id": "2014-03-04_12-31-13", "mirantis": "yes", "build_number": "112", "nailgun_sha": "d98b61e073d32c45c98099a11ff263a68b7ba205", "ostf_sha": "dc54d99ddff2f497b131ad1a42362515f2a61afa", "fuelmain_sha": "16637e2ea0ae6fe9a773aceb9d76c6e3a75f6c3b", "astute_sha": "f15f5615249c59c826ea05d26707f062c88db32a", "release": "4.1", "fuellib_sha": "15a55ccff0f59929b32d087679d19e896bde8e0d"}
CentOS HA, nova-network FlatDHCP, tagged interfaces: 3 controllers, 1 compute

[root@node-7 ~]# service rabbitmq-server stop
[root@node-7 ~]# . openrc; heat list
(Hangs)

[root@node-1 ~]# service rabbitmq-server start
Starting rabbitmq-server: RabbitMQ is going to make 3 attempts to find master node and start.
3 attempts left to start RabbitMQ Server before consider start failed.
(Hangs)

If the node is rebooted, RabbitMQ starts, but:
[root@node-1 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-1' ...
[{nodes,[{disc,['rabbit@node-3','rabbit@node-2','rabbit@node-1']}]},
 {running_nodes,['rabbit@node-3','rabbit@node-2','rabbit@node-1']}]
...done.
[root@node-1 ~]# rabbitmqctl list_queues
Listing queues ...

=ERROR REPORT==== 6-Mar-2014::14:50:04 ===
Discarding message {'$gen_call',{<0.17752.11>,#Ref<0.0.1.206018>},{info,[name,messages]}} from <0.17752.11> to <0.1694.0> in an old incarnation (1) of this node (2)
(Hangs)

[root@node-1 ~]# rabbitmqctl list_consumers
Listing consumers ...

=ERROR REPORT==== 6-Mar-2014::14:55:37 ===
Discarding message {'$gen_call',{<0.2839.13>,#Ref<0.0.2.95633>},consumers} from <0.2839.13> to <0.1694.0> in an old incarnation (1) of this node (2)
(Hangs)

Reboot the node, and check:
[root@node-7 ~]# chkconfig | grep openstack | awk '{print $1}' | xargs -n1 -I{} service {} status
(All OpenStack services are stopped)
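
The "old incarnation" errors in the patched transcript are the clearest signature of this bug, since `cluster_status` still reports all three nodes as running. As a rough sketch, assuming the `rabbitmqctl` output was captured to a file, the occurrences can be counted as below. The helper name `count_old_incarnation` and the capture file name are hypothetical.

```shell
# count_old_incarnation FILE
# Prints how many lines in FILE mention a message discarded because it
# was addressed to an old incarnation of the node.
# Note: like grep itself, this returns a non-zero status when the count is 0.
count_old_incarnation() {
    grep -c 'in an old incarnation' "$1"
}

# Example usage (file name is illustrative):
#   timeout 60 rabbitmqctl list_queues > list_queues.out 2>&1
#   count_old_incarnation list_queues.out
```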

description: updated
Changed in fuel:
status: New → Incomplete
Nastya Urlapova (aurlapova) wrote :

Bogdan, sorry, but I can't reproduce the issue. After rebooting the primary controller, RabbitMQ works fine without error reports.
{
build_id: "2014-03-05_07-31-01",
mirantis: "yes",
build_number: "235",
nailgun_sha: "f58aad317829112913f364347b14f1f0518ad371",
ostf_sha: "dc54d99ddff2f497b131ad1a42362515f2a61afa",
fuelmain_sha: "16637e2ea0ae6fe9a773aceb9d76c6e3a75f6c3b",
astute_sha: "f15f5615249c59c826ea05d26707f062c88db32a",
release: "4.1",
fuellib_sha: "73313007c0914e602246ea41fa5e8ca2dfead9f8"
}

Bogdan Dobrelya (bogdando) wrote :

Reproduced for node-2, see logs (node-1 was OK though).

I've also saved the logs in version control (2 commits) here: https://github.com/bogdando/log_snapshots
The 1st commit is the stable state after deployment; the 2nd one is the reproduced issue on node-2.
So you can just clone it and run git diff against the node-2 dir to see what has happened, e.g.
git difftool HEAD~1 fuel-snapshot-2014-03-06_18-46-23/localhost/var/log/remote/node-2.test.domain.local

Ryan Moe (rmoe)
Changed in fuel:
status: Incomplete → Confirmed
Ryan Moe (rmoe)
summary: - RabbitMQ HA regression
+ RabbitMQ cluster locks up when a member is removed
Changed in fuel:
milestone: 4.1 → 4.1.1
importance: Critical → High
Ryan Moe (rmoe) wrote :

This is only an issue on CentOS. I'm able to reproduce this issue and the only workaround I've found is rebooting all 3 controllers.

Ryan Moe (rmoe) wrote :

After upgrading RabbitMQ to 3.2.4 I can't reproduce this issue anymore.

Dmitry Borodaenko (angdraug) wrote :

The fix is to upgrade to RabbitMQ 3, which we have already almost done for 4.1.1.

Changed in fuel:
status: Confirmed → Triaged
description: updated
Nastya Urlapova (aurlapova) wrote :

What is the status of the upgrade to RabbitMQ 3?

Dmitry Borodaenko (angdraug) wrote :

We have RabbitMQ packages ready in OSCI-1016, we should test them extensively with 5.0 before backporting them for 4.1.1.

Changed in fuel:
milestone: 4.1.1 → 5.0
tags: added: backports-4.1.1
Dmitry Borodaenko (angdraug) wrote :

When upgrading RabbitMQ to 3.x, primary-controller manifests should be updated to set HA policy for all queues:
https://bugs.launchpad.net/fuel/+bug/1296922
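
For illustration only: in RabbitMQ 3.x, queue mirroring is configured through policies rather than per-queue arguments, so a policy matching every queue could look like the sketch below. The policy name `ha-all`, the catch-all pattern `^`, the helper name `set_ha_all`, and the `RABBITMQCTL` override variable are assumptions for this example; the actual manifest change is tracked in the linked bug and may differ.

```shell
# Allow the rabbitmqctl binary to be overridden (e.g. for a dry run);
# this variable is an assumption of the sketch, not a RabbitMQ feature.
RABBITMQCTL=${RABBITMQCTL:-rabbitmqctl}

# set_ha_all
# Declares a policy named "ha-all" (hypothetical name) that matches all
# queue names via the pattern "^" and mirrors them to every cluster node.
set_ha_all() {
    "$RABBITMQCTL" set_policy ha-all '^' '{"ha-mode":"all"}'
}

# Example usage on one controller:
#   set_ha_all
#   rabbitmqctl list_policies
```

Setting `RABBITMQCTL=echo` before calling `set_ha_all` prints the arguments instead of contacting a broker, which is a convenient way to review the command first.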

Andrew Woodward (xarses)
tags: added: ha
Bogdan Dobrelya (bogdando) wrote :

Considering Ryan's report, I am closing this issue as not reproducible with the RabbitMQ 3 we currently have in the master branch.

Changed in fuel:
status: Triaged → Fix Committed
Dmitry Borodaenko (angdraug) wrote :

Bogdan, please have a look at the rules for tracking bugs targeted for backporting described here:
https://lists.launchpad.net/fuel-dev/msg00698.html

RabbitMQ3 wasn't yet uploaded for 4.1.1, so this should be changed to In Progress with target milestone set to 4.1.1.

Bogdan Dobrelya (bogdando) wrote :

Yes, my mistake, you are completely right

Changed in fuel:
status: Fix Committed → In Progress
milestone: 5.0 → 4.1.1
Mike Scherbakov (mihgen)
tags: added: release-notes
Dmitry Borodaenko (angdraug) wrote :

4.1.1 now has RabbitMQ 3.2.

Changed in fuel:
status: In Progress → Fix Committed
Meg McRoberts (dreidellhasa) wrote :

Added to "Fixed Issues" list in 5.0 Release Notes.
