RabbitMQ cluster contains offline node after failover
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Fuel for OpenStack | Invalid | High | Fuel Library (Deprecated) | |
| 5.1.x | Won't Fix | High | Denis Meltsaykin | |
| 6.0.x | Won't Fix | High | Denis Meltsaykin | |
| 6.1.x | Fix Released | High | Bogdan Dobrelya | |
| 7.0.x | Won't Fix | High | Fuel Library (Deprecated) | |
Bug Description
Fuel version info (6.1 build #478): http://
After shutting down the controller node that is the master for a 3-node RabbitMQ cluster, one of the 2 remaining controllers does not kick the offline node out of the cluster, which leads to endless RabbitMQ server restarts by Pacemaker:
<30>Jun 3 09:52:48 node-15 lrmd: INFO: p_rabbitmq-server: su_rabbit_cmd(): the invoked command exited 2: /usr/sbin/
<27>Jun 3 09:52:48 node-15 lrmd: ERROR: p_rabbitmq-server: join_to_cluster(): Can't join to cluster by node 'rabbit@node-5'. Stopping.
<30>Jun 3 09:52:48 node-15 lrmd: INFO: p_rabbitmq-server: stop: action begin.
...
<30>Jun 3 09:53:00 node-15 lrmd: INFO: p_rabbitmq-server: notify: post-start end.
<28>Jun 3 09:53:00 node-15 lrmd: WARNING: p_rabbitmq-server: notify: Failed to join the cluster on post-start. The resource will be restarted.
...
Jun 03 09:53:00 [18700] node-15.
are present."} ]
Jun 03 09:53:00 [18700] node-15.
<29>Jun 3 09:53:00 node-15 lrmd[18700]: notice: operation_finished: p_rabbitmq-
Jun 03 09:53:00 [18703] node-15.
Jun 03 09:53:00 [18703] node-15.
<29>Jun 3 09:53:00 node-15 crmd[18703]: notice: process_lrm_event: Operation p_rabbitmq-
Steps to reproduce:
1. Deploy environment: CentOS, NovaVlan, Ceph, Classic Provisioning
2. Destroy primary controller
3. Check rabbitmq cluster status on controllers.
Expected result:
- rabbitmq cluster is re-assembled, offline controller is removed from it
Actual result:
- rabbitmq cluster on 1 of the controllers still contains the offline node
I reproduced this bug on a bare-metal lab. On an Ubuntu environment (same hardware and ISO) I did not observe this issue.
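As a manual workaround for the actual result above, the stale member can be inspected and evicted from any healthy controller. A minimal sketch, assuming `rabbitmqctl` is on PATH and using `rabbit@node-5` (the node name from the logs above) only as an example of the offline member:

```shell
# Sketch: evict a stale RabbitMQ cluster member from a live controller.
# The node name passed in is an example; substitute the controller that
# actually went down.
forget_offline_node() {
    dead="$1"
    # cluster_status prints both the configured node list and the
    # currently running nodes; a member present in the former but not
    # the latter is the stale entry to remove.
    rabbitmqctl cluster_status
    # forget_cluster_node succeeds only while the target node is down.
    rabbitmqctl forget_cluster_node "$dead"
}
```

Both `cluster_status` and `forget_cluster_node` are standard `rabbitmqctl` subcommands; this does not replace the automatic re-assembly that pacemaker/OCF is supposed to perform.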
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
importance: Undecided → High
On the master (node-1) failover, the node unjoin from the cluster failed at node-15 due to a race between the rabbit-fence daemon and the rabbitmq stop_app logic in the OCF script. As a result, joining the cluster fails in a start/stop loop, with no reset attempts at all.
The solution is to:
a) not stop the rabbit app locally in the OCF logic when it can see that the rabbit-fence daemon is trying to kick some node out of the cluster and assumes the rabbit app is running locally;
b) introduce an additional reset action if joining the cluster has failed.
While (b) should be enough to handle this situation and let the failed node join the cluster after a mnesia reset, the complete solution should include (a) as well.
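Point (b) can be sketched as a retry wrapper around the join: if the first `join_cluster` fails (e.g. because of stale mnesia cluster state referencing the dead master), wipe the local state with `rabbitmqctl reset` and try once more. This is an illustrative sketch under those assumptions, not the actual fuel-library OCF patch; the function name is hypothetical:

```shell
# Illustrative sketch of the "reset on failed join" fallback; the real
# fix lives in the fuel-library OCF resource agent.
join_with_reset_fallback() {
    master="$1"
    # join_cluster (and reset) require the local rabbit app to be stopped.
    rabbitmqctl stop_app
    if ! rabbitmqctl join_cluster "$master"; then
        # Join failed: wipe the local mnesia state and retry once.
        rabbitmqctl reset
        rabbitmqctl join_cluster "$master" || return 1
    fi
    rabbitmqctl start_app
}
```

`stop_app`, `join_cluster`, `reset`, and `start_app` are all standard `rabbitmqctl` subcommands; the value added by (b) is the single reset-and-retry path instead of the endless start/stop loop seen in the logs.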