Controller replacement fails: RabbitMQ goes down after node deletion
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Fix Released | High | Bogdan Dobrelya |
8.0.x | Fix Released | High | Bogdan Dobrelya |
Bug Description
Environment deployment fails if a controller node is replaced (the old node is removed and a new one is added):
2016-02-02 02:39:16 DEBUG [796] Task time summary: dump_rabbitmq_
2016-02-02 02:39:16 DEBUG [796] Task time summary: dump_rabbitmq_
2016-02-02 02:39:16 DEBUG [796] Task time summary: dump_rabbitmq_
2016-02-02 02:39:16 INFO [796] Casting message to Nailgun:
2016-02-02 02:39:16 INFO [796] Casting message to Nailgun:
2016-02-02 02:39:16 INFO [796] Casting message to Nailgun:
node-2 2016-02-
root@node-2:~# pcs status | grep -A 2 p_rabbitmq-server
Master/Slave Set: master_
Masters: [ node-1.
Slaves: [ node-2.
root@node-2:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@
Error: unable to connect to node 'rabbit@
Segmentation fault
Steps to reproduce:
1. Deploy an environment with 3 controllers, 2 computes and 1 compute+cinder node
2. Remove 1 controller node and add 1 controller+cinder node
3. Deploy changes
Expected result: controller is replaced, deployment is successful, environment passes OSTF
Actual result: deployment of controllers fails on dump_rabbitmq_
Changed in fuel:
status: New → Confirmed
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Dmitry Bilunov (dbilunov)
tags: added: team-bugfix
description: updated
tags: added: on-verification
tags: removed: on-verification
tags: added: on-verification
The rabbitmq-server process definitely wasn't running on node-2. It looks like it went down about 1 minute after node-4 was fenced (00:46). The snapshot doesn't contain any rabbitmq-server logs, though, so I can't tell what happened on node-2 at that time. lrmd.log on node-2 (http://paste.openstack.org/show/485778/) has some more information.
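For anyone re-checking this on a live controller, here is a rough sketch of how the "broker is down" condition could be detected from a script. It only matches the error text quoted in the description ("unable to connect to node"). The function names `broker_unreachable` and `check_local_broker` are made up for illustration and are not part of Fuel or RabbitMQ, and the exit-code check is an assumption about how `rabbitmqctl` reports failure.

```python
import subprocess

# Error text taken from the rabbitmqctl output quoted in the bug description.
UNREACHABLE_MARKER = "unable to connect to node"


def broker_unreachable(cluster_status_output: str) -> bool:
    """Return True if the given rabbitmqctl output reports an unreachable broker."""
    return UNREACHABLE_MARKER in cluster_status_output


def check_local_broker() -> bool:
    """Run `rabbitmqctl cluster_status` locally; return True if the broker answers.

    Hypothetical helper: treats a non-zero exit code (assumed to signal
    failure) or the marker text above as "broker is down".
    """
    result = subprocess.run(
        ["rabbitmqctl", "cluster_status"],
        capture_output=True,
        text=True,
    )
    combined = result.stdout + result.stderr
    return result.returncode == 0 and not broker_unreachable(combined)
```

Calling `check_local_broker()` on node-2 in the state described above would be expected to return False, but that only confirms the symptom; the cause still has to come from the Pacemaker and rabbitmq-server logs.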