RabbitMQ cluster locks up when a member is removed

Bug #1288831 reported by Bogdan Dobrelya on 2014-03-06
This bug affects 3 people
Affects: Fuel for OpenStack
Importance: High
Assigned to: Fuel Library (Deprecated)

Bug Description

{"build_id": "2014-03-04_12-31-13", "mirantis": "yes", "build_number": "112", "nailgun_sha": "d98b61e073d32c45c98099a11ff263a68b7ba205", "ostf_sha": "dc54d99ddff2f497b131ad1a42362515f2a61afa", "fuelmain_sha": "16637e2ea0ae6fe9a773aceb9d76c6e3a75f6c3b", "astute_sha": "f15f5615249c59c826ea05d26707f062c88db32a", "release": "4.1", "fuellib_sha": "15a55ccff0f59929b32d087679d19e896bde8e0d"}

Reproduce:
* Deploy CentOS HA, nova-network FlatDHCP, tagged interfaces, DEBUG=TRUE: 3 controllers, 1 compute
* Log on to the 1st controller node and issue the following commands:
service rabbitmq-server stop
sleep 30; . openrc; heat list
service rabbitmq-server start
rabbitmqctl list_queues
heat list
* If RabbitMQ startup, list_queues and heat list all show no issues (normal results are marked (OK) in the console output below):
- repeat the same steps for the other controllers, one by one.
* Otherwise, if there were any issues on the given controller node (issues are marked (Hangs) in the console output below):
- reboot the given node and check the OpenStack services and RabbitMQ:
chkconfig | grep openstack | awk '{print $1}' | xargs -n1 -I{} service {} status
rabbitmqctl list_queues
. openrc; heat list
- check the results: all OpenStack services will be stopped and RabbitMQ will not be able to list its queues. That is the subject of this issue.
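
Because the failing steps hang rather than fail, it may help to run each check under a time limit. The following is a minimal sketch, not part of the original report: check_step is a hypothetical helper name, and it assumes the coreutils timeout command is available on the controller. It labels each result the way the console output below does.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): run a command under a time
# limit and label the outcome as (OK), (Hangs) or (Failed).
# Assumes the coreutils `timeout` command, which exits with status 124
# when the time limit is reached.
check_step() {
    limit="$1"; shift
    timeout "$limit" "$@" >/dev/null 2>&1
    rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "(OK) $*"
    elif [ "$rc" -eq 124 ]; then
        echo "(Hangs) $*"
    else
        echo "(Failed rc=$rc) $*"
    fi
    return "$rc"
}

# Example usage on a controller (left commented out on purpose;
# the 60-second limit is an arbitrary choice):
#   . openrc
#   check_step 60 heat list
#   check_step 60 rabbitmqctl list_queues
```

With this, a hung 'heat list' or 'rabbitmqctl list_queues' returns control after the limit instead of blocking the session.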

Issue:
- Once stopped, RabbitMQ is broken and won't start again; after a reboot it starts but remains non-operational.
- None of the OpenStack services start after the controller node is rebooted.
- 'heat list' hangs every time after RabbitMQ has been stopped for the 1st time.

Console actions and results:

*Pre-patched behavior*
{"build_id": "2014-02-26_13-39-45", "mirantis": "yes", "build_number": "211", "nailgun_sha": "ea08cef3e06a72f47cfaa8cd8fe6d034e2cf722e", "ostf_sha": "8e6681b6d06c7cb20a84c1cc740d5f2492fb9d85", "fuelmain_sha": "baa8bb07393698f1186cb67bb65f1b93907c59bd", "astute_sha": "10cccc87f2ee35510e43c8fa19d2bf916ca1fced", "release": "4.1", "fuellib_sha": "0a2e5bdc01c1e3bb285acb7b39125101e950ac72"}
Centos HA, nova-network FLATdhcp, tagged interfaces: 3 controllers, 1 compute

[root@node-7 ~]# service rabbitmq-server stop
[root@node-7 ~]# . openrc; heat list
(OK)
[root@node-7 ~]# service rabbitmq-server start
Starting rabbitmq-server: RabbitMQ is going to make 3 attempts to find master node and start.
3 attempts left to start RabbitMQ Server before consider start failed.
SUCCESS
rabbitmq-server.
[root@node-7 ~]# rabbitmqctl list_queues
(OK)

Reboot the node, and check:
[root@node-7 ~]# chkconfig | grep openstack | awk '{print $1}' | xargs -n1 -I{} service {} status
(All OpenStack services are running)

*Patched behavior*
{"build_id": "2014-03-04_12-31-13", "mirantis": "yes", "build_number": "112", "nailgun_sha": "d98b61e073d32c45c98099a11ff263a68b7ba205", "ostf_sha": "dc54d99ddff2f497b131ad1a42362515f2a61afa", "fuelmain_sha": "16637e2ea0ae6fe9a773aceb9d76c6e3a75f6c3b", "astute_sha": "f15f5615249c59c826ea05d26707f062c88db32a", "release": "4.1", "fuellib_sha": "15a55ccff0f59929b32d087679d19e896bde8e0d"}
Centos HA, nova-network FLATdhcp, tagged interfaces: 3 controllers, 1 compute

[root@node-7 ~]# service rabbitmq-server stop
[root@node-7 ~]# . openrc; heat list
(Hangs)

[root@node-1 ~]# service rabbitmq-server start
Starting rabbitmq-server: RabbitMQ is going to make 3 attempts to find master node and start.
3 attempts left to start RabbitMQ Server before consider start failed.
(Hangs)

If the node is rebooted, RabbitMQ starts, but:
[root@node-1 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-1' ...
[{nodes,[{disc,['rabbit@node-3','rabbit@node-2','rabbit@node-1']}]},
 {running_nodes,['rabbit@node-3','rabbit@node-2','rabbit@node-1']}]
...done.
[root@node-1 ~]# rabbitmqctl list_queues
Listing queues ...

=ERROR REPORT==== 6-Mar-2014::14:50:04 ===
Discarding message {'$gen_call',{<0.17752.11>,#Ref<0.0.1.206018>},{info,[name,messages]}} from <0.17752.11> to <0.1694.0> in an old incarnation (1) of this node (2)
(Hangs)

[root@node-1 ~]# rabbitmqctl list_consumers
Listing consumers ...

=ERROR REPORT==== 6-Mar-2014::14:55:37 ===
Discarding message {'$gen_call',{<0.2839.13>,#Ref<0.0.2.95633>},consumers} from <0.2839.13> to <0.1694.0> in an old incarnation (1) of this node (2)
(Hangs)

Reboot the node, and check:
[root@node-7 ~]# chkconfig | grep openstack | awk '{print $1}' | xargs -n1 -I{} service {} status
(All OpenStack services are stopped)
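
The "old incarnation" error quoted above is a distinctive marker of the broken state. As a rough sketch (has_old_incarnation is a hypothetical helper name, and it assumes the error text reaches the console as it did in the transcript above), the output of a command can be scanned for it:

```shell
#!/bin/sh
# Hypothetical helper (not from the report): succeed if the text on stdin
# contains the "old incarnation" discard error quoted in this report.
has_old_incarnation() {
    grep -q "in an old incarnation"
}

# Example usage on a controller (left commented out on purpose). The
# command is wrapped in coreutils `timeout` because it hangs in the
# broken state; the 60-second limit is an arbitrary choice:
#   timeout 60 rabbitmqctl list_queues 2>&1 | has_old_incarnation \
#       && echo "node is talking to an old incarnation"
```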

description: updated
Changed in fuel:
status: New → Incomplete
Nastya Urlapova (aurlapova) wrote :

Bogdan, sorry, but I can't reproduce the issue. After rebooting the primary controller, RabbitMQ works fine without error reports.
{
build_id: "2014-03-05_07-31-01",
mirantis: "yes",
build_number: "235",
nailgun_sha: "f58aad317829112913f364347b14f1f0518ad371",
ostf_sha: "dc54d99ddff2f497b131ad1a42362515f2a61afa",
fuelmain_sha: "16637e2ea0ae6fe9a773aceb9d76c6e3a75f6c3b",
astute_sha: "f15f5615249c59c826ea05d26707f062c88db32a",
release: "4.1",
fuellib_sha: "73313007c0914e602246ea41fa5e8ca2dfead9f8"
}

Bogdan Dobrelya (bogdando) wrote :

Reproduced for node-2, see logs (node-1 was OK though).

I've also saved the logs in a git repository (2 commits) here: https://github.com/bogdando/log_snapshots
The 1st commit is the stable state after deployment; the 2nd one is the reproduced issue for node-2.
So you can just clone it and run git diff against the node-2 dir to see what has happened, e.g.:
git difftool HEAD~1 fuel-snapshot-2014-03-06_18-46-23/localhost/var/log/remote/node-2.test.domain.local

Ryan Moe (rmoe) on 2014-03-06
Changed in fuel:
status: Incomplete → Confirmed
Ryan Moe (rmoe) on 2014-03-06
summary: - RabbitMQ HA regression
+ RabbitMQ cluster locks up when a member is removed
Changed in fuel:
milestone: 4.1 → 4.1.1
importance: Critical → High
Ryan Moe (rmoe) wrote :

This is only an issue on CentOS. I'm able to reproduce this issue and the only workaround I've found is rebooting all 3 controllers.

Ryan Moe (rmoe) wrote :

After upgrading RabbitMQ to 3.2.4 I can't reproduce this issue anymore.
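
To check whether a given controller already runs a RabbitMQ at or above that version, something like the following sketch could be used. It is illustrative only: rabbit_version and version_ge are hypothetical helper names, the parsing assumes that 'rabbitmqctl status' prints a tuple of the form {rabbit,"RabbitMQ","<version>"}, and the comparison relies on GNU sort -V.

```shell
#!/bin/sh
# Hypothetical helpers (not from the report).

# Extract the RabbitMQ version from `rabbitmqctl status` output on stdin.
# Assumption: the output contains a tuple like {rabbit,"RabbitMQ","3.2.4"}.
rabbit_version() {
    sed -n 's/.*{rabbit,"RabbitMQ","\([^"]*\)"}.*/\1/p' | head -n 1
}

# Succeed if version $1 is greater than or equal to version $2.
# Relies on GNU `sort -V` (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example usage on a controller (left commented out on purpose):
#   v=$(rabbitmqctl status | rabbit_version)
#   version_ge "$v" 3.2.4 && echo "RabbitMQ $v is at or above 3.2.4"
```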

Dmitry Borodaenko (angdraug) wrote :

The fix is to upgrade to RabbitMQ 3, which is already almost done for 4.1.1.

Changed in fuel:
status: Confirmed → Triaged
description: updated
Nastya Urlapova (aurlapova) wrote :

What is the status for upgrade to RabbitMQ 3?

Dmitry Borodaenko (angdraug) wrote :

We have RabbitMQ packages ready in OSCI-1016; we should test them extensively with 5.0 before backporting them for 4.1.1.

Changed in fuel:
milestone: 4.1.1 → 5.0
tags: added: backports-4.1.1
Dmitry Borodaenko (angdraug) wrote :

When upgrading RabbitMQ to 3.x, primary-controller manifests should be updated to set HA policy for all queues:
https://bugs.launchpad.net/fuel/+bug/1296922
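
For illustration only, here is what setting such a policy might look like from the shell. This is a sketch and not the actual manifest change tracked in the bug above: the policy name ha-all and the match-everything pattern ^ are assumptions, and set_ha_all_policy and the RABBITMQCTL override variable are hypothetical names. The underlying 'rabbitmqctl set_policy' command with an "ha-mode" definition is how RabbitMQ 3.x configures queue mirroring.

```shell
#!/bin/sh
# Hypothetical wrapper (not the Fuel manifest code).
# RABBITMQCTL can be overridden, e.g. with `echo`, for a dry run.
RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"

# Mirror every queue in the default vhost across all cluster nodes.
# Assumptions: policy name "ha-all" and pattern "^" (matches all queue names).
set_ha_all_policy() {
    "$RABBITMQCTL" set_policy ha-all '^' '{"ha-mode":"all"}'
}

# Example usage on the primary controller (left commented out on purpose):
#   set_ha_all_policy
#   rabbitmqctl list_policies
```

In RabbitMQ 3.x mirroring is configured through policies rather than declared by clients, which is why the upgrade calls for a step like this.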

Andrew Woodward (xarses) on 2014-04-08
tags: added: ha
Bogdan Dobrelya (bogdando) wrote :

Considering Ryan's report, I am closing this issue as not reproducible with the RabbitMQ 3 we currently have in the master branch.

Changed in fuel:
status: Triaged → Fix Committed
Dmitry Borodaenko (angdraug) wrote :

Bogdan, please have a look at the rules for tracking bugs targeted for backporting described here:
https://lists.launchpad.net/fuel-dev/msg00698.html

RabbitMQ 3 hasn't been uploaded for 4.1.1 yet, so this should be changed to In Progress with the target milestone set to 4.1.1.

Bogdan Dobrelya (bogdando) wrote :

Yes, my mistake, you are completely right.

Changed in fuel:
status: Fix Committed → In Progress
milestone: 5.0 → 4.1.1
Mike Scherbakov (mihgen) on 2014-05-08
tags: added: release-notes
Dmitry Borodaenko (angdraug) wrote :

4.1.1 now has RabbitMQ 3.2.

Changed in fuel:
status: In Progress → Fix Committed
Meg McRoberts (dreidellhasa) wrote :

Added to "Fixed Issues" list in 5.0 Release Notes.
