RabbitMQ cluster is broken after controller destroy: 'On the controller node-3.test.domain.local, resource master_p_rabbitmq-server is active but failed to start (managed)'
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | In Progress | High | Valeriy Saharov |
8.0.x | Confirmed | High | Fuel Library (Deprecated) |
Bug Description
Fuel version info (8.0 build #264): http://
System tests 'ha_neutron_
Master/Slave Set: master_
p_
Masters: [ node-1.
root@node-3:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,
Here is a part of pacemaker logs:
Dec 08 04:28:58 [14416] node-3.
Dec 08 04:28:58 [14416] node-3.
Dec 08 04:28:58 [14416] node-3.
Dec 08 04:28:58 [14416] node-3.
Dec 08 04:28:58 [14416] node-3.
Dec 08 04:28:58 [14416] node-3.
Dec 08 04:28:58 [14416] node-3.
Dec 08 04:28:58 [14416] node-3.
Dec 08 04:28:58 [14416] node-3.
Dec 08 04:28:58 [14416] node-3.
Dec 08 04:28:58 [14416] node-3.
Dec 08 04:28:58 [14416] node-3.
Dec 08 04:28:58 [14416] node-3.
Dec 08 04:28:58 [14416] node-3.
Dec 08 04:28:58 [14416] node-3.
Dec 08 04:28:58 [14416] node-3.
Dec 08 04:28:58 [14416] node-3.
Dec 08 04:28:58 [14416] node-3.
Dec 08 04:28:58 [14416] node-3.
Please note that the test waited more than 20 minutes for a working RabbitMQ cluster (passed OSTF tests), but the cluster never recovered, even 1 hour later when I reverted the environment manually.
Steps to reproduce:
1. Deploy environment with 3 controllers
2. Destroy first controller
3. Wait 20 minutes
4. Check pacemaker status on alive controllers
Expected result: all resources are running
Actual: 'p_rabbitmq-server' resource is stopped/failed on one of the controllers
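For anyone reproducing this by hand, steps 3 and 4 can be approximated with a small polling script run on one of the surviving controllers. This is only a rough sketch, not what the system test does: the helper names `wait_until` and `rabbitmq_cluster_ok` are made up for illustration, the 30-second polling interval is an assumption, and a zero exit code from `rabbitmqctl cluster_status` only shows that the local node answers, which is a much weaker check than the OSTF tests.

```python
import subprocess
import time


def wait_until(check, timeout=20 * 60, interval=30,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or `timeout` seconds elapse.

    Returns True as soon as check() succeeds, False once the deadline
    has passed. `clock` and `sleep` are injectable so the loop can be
    exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def rabbitmq_cluster_ok():
    """Rough liveness check (assumption: exit code 0 means the node answers).

    Runs `rabbitmqctl cluster_status` on the local node and reports
    whether it exited successfully. It does not inspect the node list.
    """
    try:
        result = subprocess.run(
            ["rabbitmqctl", "cluster_status"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired):
        # rabbitmqctl missing, not executable, or hung
        return False
    return result.returncode == 0
```

Calling `wait_until(rabbitmq_cluster_ok)` as root on node-3 would wait up to 20 minutes, matching step 3, and return False if the node still does not answer at the end of that window. The pacemaker resource state from step 4 still needs to be checked separately.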
tags: added: area-mos
Changed in fuel:
status: New → Confirmed
tags: added: swarm-fail-driver
Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Dmitriy Ukhlov (dukhlov)
tags: removed: swarm-fail-driver
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Valeriy Saharov (vsakharov)
status: Confirmed → In Progress
The attached snapshot is missing the lrmd.log files for the controller nodes, and these are critical for analyzing such issues. Please reproduce the issue and attach a fresh snapshot. Also, please make sure that the lrmd logs are included; if they are missing, attach them separately.