RabbitMQ cluster locks up when a member is removed
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Fuel for OpenStack | Fix Committed | High | Fuel Library (Deprecated) | |
Bug Description
{"build_id": "2014-03-
Reproduce:
* Deploy Centos HA, nova-network FLATdhcp, tagged interfaces, DEBUG=TRUE: 3 controllers, 1 compute
* log on to the 1st controller node and issue the commands (see below):
service rabbitmq-server stop
sleep 30; . openrc; heat list
service rabbitmq-server start
rabbitmqctl list_queues
heat list
* If RabbitMQ startup, list_queues and heat list all behave normally (see below, normal results are marked as (OK)):
- repeat the same steps for other controllers, one by one.
* Otherwise, if any of them misbehaves on the given controller node (see below, problems are marked as (Hangs)):
- reboot the given node and check OS services and rabbitmq:
chkconfig | grep openstack | awk '{print $1}' | xargs -n1 -I{} service {} status
rabbitmqctl list_queues
. openrc; heat list
- check the results: all OpenStack services are stopped and RabbitMQ cannot list its queues. That is the subject of this issue.
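The (OK) / (Hangs) labelling used in this report can be produced automatically instead of by watching the console. Below is a minimal sketch, assuming coreutils `timeout` is available on the controller. `check_step` and `STEP_TIMEOUT` are hypothetical names introduced here for illustration and are not part of Fuel or RabbitMQ.

```shell
#!/bin/sh
# Hypothetical helper: run one command with a time limit and label the
# outcome the way this report does, (OK) or (Hangs).
# STEP_TIMEOUT is an assumed knob (seconds), not something Fuel defines.
STEP_TIMEOUT="${STEP_TIMEOUT:-60}"

check_step() {
    # coreutils `timeout` exits with status 124 when the limit is reached.
    timeout "$STEP_TIMEOUT" "$@" >/dev/null 2>&1
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "(Hangs) $*"
    elif [ "$rc" -eq 0 ]; then
        echo "(OK) $*"
    else
        echo "(Failed rc=$rc) $*"
    fi
    return "$rc"
}

# Usage on a controller node (left commented so the file can be sourced safely):
# check_step service rabbitmq-server stop
# sleep 30; . openrc
# check_step heat list
# check_step service rabbitmq-server start
# check_step rabbitmqctl list_queues
# check_step heat list
```

A command that exits non-zero within the limit is reported as failed rather than hung, which keeps a genuine hang distinguishable from an ordinary error.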
Issue:
- Once stopped, RabbitMQ is broken and will not start again; after a reboot it starts but remains inoperable.
- None of the OpenStack services start after the controller node is rebooted.
- 'heat list' hangs every time once RabbitMQ has been stopped for the first time.
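To check the second point without reading every status line, the service check from the reproduce steps can be reduced to a count. This is a rough sketch, assuming the init scripts follow the LSB convention that `service NAME status` exits 0 only when the service is running. `count_stopped` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: read service names from stdin, one per line, and
# report those whose `status` action does not exit 0.
# Assumes LSB-style init scripts (exit status 0 means running).
count_stopped() {
    stopped=0
    while read -r name; do
        # </dev/null keeps the status command from consuming the name list.
        if ! service "$name" status </dev/null >/dev/null 2>&1; then
            echo "not running: $name"
            stopped=$((stopped + 1))
        fi
    done
    echo "stopped=$stopped"
}

# Usage on a controller node:
# chkconfig | grep openstack | awk '{print $1}' | count_stopped
```

On a healthy controller the last line should read `stopped=0`; in the failing case described here every OpenStack service would be listed.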
Console actions and results:
*Pre-patched behavior*
{"build_id": "2014-02-
Centos HA, nova-network FLATdhcp, tagged interfaces: 3 controllers, 1 compute
[root@node-7 ~]# service rabbitmq-server stop
[root@node-7 ~]# . openrc; heat list
(OK)
[root@node-7 ~]# service rabbitmq-server start
Starting rabbitmq-server: RabbitMQ is going to make 3 attempts to find master node and start.
3 attempts left to start RabbitMQ Server before consider start failed.
SUCCESS
rabbitmq-server.
[root@node-7 ~]# rabbitmqctl list_queues
(OK)
Reboot the node, and check:
[root@node-7 ~]# chkconfig | grep openstack | awk '{print $1}' | xargs -n1 -I{} service {} status
(All OS services are running)
*Patched behavior*
{"build_id": "2014-03-
Centos HA, nova-network FLATdhcp, tagged interfaces: 3 controllers, 1 compute
[root@node-7 ~]# service rabbitmq-server stop
[root@node-7 ~]# . openrc; heat list
(Hangs)
[root@node-1 ~]# service rabbitmq-server start
Starting rabbitmq-server: RabbitMQ is going to make 3 attempts to find master node and start.
3 attempts left to start RabbitMQ Server before consider start failed.
(Hangs)
If the node is rebooted, RabbitMQ starts, but:
[root@node-1 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-1' ...
[{nodes,
{running_
...done.
[root@node-1 ~]# rabbitmqctl list_queues
Listing queues ...
=ERROR REPORT==== 6-Mar-2014:
Discarding message {'$gen_
(Hangs)
[root@node-1 ~]# rabbitmqctl list_consumers
Listing consumers ...
=ERROR REPORT==== 6-Mar-2014:
Discarding message {'$gen_
(Hangs)
Reboot the node, and check:
[root@node-7 ~]# chkconfig | grep openstack | awk '{print $1}' | xargs -n1 -I{} service {} status
(All OS services are stopped)
description: | updated |
Changed in fuel: | |
status: | New → Incomplete |
Changed in fuel: | |
status: | Incomplete → Confirmed |
summary: | RabbitMQ HA regression → RabbitMQ cluster locks up when a member is removed |
Changed in fuel: | |
milestone: | 4.1 → 4.1.1 |
importance: | Critical → High |
description: | updated |
Changed in fuel: | |
milestone: | 4.1.1 → 5.0 |
tags: | added: backports-4.1.1 |
tags: | added: ha |
Changed in fuel: | |
status: | Fix Committed → In Progress |
milestone: | 5.0 → 4.1.1 |
tags: | added: release-notes |
Bogdan, sorry, but I can't reproduce the issue. After rebooting the primary controller, RabbitMQ works fine without error reports.
{
build_id: "2014-03-05_07-31-01",
mirantis: "yes",
build_number: "235",
nailgun_sha: "f58aad317829112913f364347b14f1f0518ad371",
ostf_sha: "dc54d99ddff2f497b131ad1a42362515f2a61afa",
fuelmain_sha: "16637e2ea0ae6fe9a773aceb9d76c6e3a75f6c3b",
astute_sha: "f15f5615249c59c826ea05d26707f062c88db32a",
release: "4.1",
fuellib_sha: "73313007c0914e602246ea41fa5e8ca2dfead9f8"
}