RabbitMQ cluster is broken after destroying controllers: no running rabbit nodes, no master
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Fuel for OpenStack | Fix Released | High | Volodymyr Shypyguzov | |
| 8.0.x | Fix Released | High | Volodymyr Shypyguzov | |
Bug Description
Check that the RabbitMQ cluster is healthy after repeated failover.
Scenario:
1. Deploy environment with at least 3 controllers
2. Get rabbit master node
3. Destroy controller with master rabbit
4. Run OSTF
Expected result:
OSTF passed
Actual result:
OSTF failed; the RabbitMQ cluster is broken:
- RabbitMQ availability (failure) Number of RabbitMQ nodes is not equal to number of cluster nodes.
- RabbitMQ replication (failure) Failed to establish AMQP connection to 5673/tcp port on 10.109.26.4 from controller node! Please refer to OpenStack logs for more details.
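The first failing OSTF check can be approximated by comparing the nodes RabbitMQ knows about with the nodes actually running. A minimal sketch, assuming simplified parsing; the `cluster_status` text below is illustrative, since the real output captured on node-3 is truncated in this report:

```python
import re

# Illustrative cluster_status text (the real output in this report is
# truncated): two disc nodes known to the cluster, only one running.
sample = """Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-3']}]},
 {running_nodes,['rabbit@node-3']}]"""

def node_counts(status):
    """Return (nodes_in_cluster, running_nodes) parsed from cluster_status text."""
    nodes = re.search(r"\{nodes,\[(.*?)\]\}", status, re.S).group(1)
    running = re.search(r"running_nodes,\[(.*?)\]", status, re.S).group(1)
    count = lambda s: len(set(re.findall(r"rabbit@[\w.-]+", s)))
    return count(nodes), count(running)

total, running = node_counts(sample)
if running != total:
    # Mirrors the OSTF failure message quoted above
    print("FAIL: %d of %d RabbitMQ nodes running" % (running, total))
```

With the sample above this reports one of two nodes running, which is exactly the "number of RabbitMQ nodes is not equal to number of cluster nodes" condition.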
PCS status:
root@node-3:~# crm_mon -1
Last updated: Fri Feb 12 13:17:29 2016
Last change: Fri Feb 12 12:49:02 2016
Stack: corosync
Current DC: node-1.test.domain.local
Version: 1.1.12-561c4cf
3 Nodes configured
46 Resources configured
Online: [ node-1.test.domain.local node-3.test.domain.local ]
OFFLINE: [ node-2.test.domain.local ]
sysinfo_
Clone Set: clone_p_vrouter [p_vrouter]
    Started: [ node-1.test.domain.local node-3.test.domain.local ]
vip__management (ocf::fuel:
vip__vrouter_pub (ocf::fuel:
vip__vrouter (ocf::fuel:
vip__public (ocf::fuel:
Master/Slave Set: master_p_conntrackd [p_conntrackd]
    Masters: [ node-1.test.domain.local ]
    Slaves: [ node-3.test.domain.local ]
Clone Set: clone_p_haproxy [p_haproxy]
    Started: [ node-1.test.domain.local node-3.test.domain.local ]
Clone Set: clone_p_mysql [p_mysql]
    Started: [ node-1.test.domain.local node-3.test.domain.local ]
Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
    Slaves: [ node-1.test.domain.local node-3.test.domain.local ]
...
root@node-3:~#
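Step 2 of the scenario ("get rabbit master node") can be scripted by parsing the `Masters:` line of the rabbitmq multi-state resource in `crm_mon -1` output; in the broken state above that line is simply absent. A minimal sketch, using an illustrative healthy sample rather than this environment's output, and assuming the text passed in is already narrowed to the rabbitmq resource section:

```python
import re

# Illustrative healthy output; while the cluster is broken, crm_mon shows
# no "Masters:" line for master_p_rabbitmq-server at all.
healthy = """Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-1.test.domain.local ]
     Slaves: [ node-2.test.domain.local node-3.test.domain.local ]"""

def rabbit_masters(crm_mon_text):
    """Return the node names on the Masters: line, or [] if no master is elected."""
    m = re.search(r"Masters:\s*\[\s*([^\]]*)\]", crm_mon_text)
    return m.group(1).split() if m else []

print(rabbit_masters(healthy))  # -> ['node-1.test.domain.local']
```

An empty result from this helper corresponds to the "no master elected" state analysed at the end of this report.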
root@node-3:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@
[{nodes,
root@node-3:~#
Trying 10.109.26.8...
telnet: Unable to connect to remote host: Connection refused
root@node-3:~# telnet 10.109.26.8 5673
Trying 10.109.26.8...
telnet: Unable to connect to remote host: Connection refused
root@node-3:~# telnet 10.109.26.8 15673
Trying 10.109.26.8...
telnet: Unable to connect to remote host: Connection refused
root@node-3:~# telnet 10.109.26.4 15673
Trying 10.109.26.4...
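The telnet probes above can be automated with a small socket check against the AMQP (5673) and management (15673) ports. A hedged sketch: the hosts and ports are taken from the session above, and the one-second timeout is an assumption, not part of the original report:

```python
import socket

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False

# Hosts/ports mirror the manual telnet session in this report;
# run from a controller node to reproduce the checks.
for host in ("10.109.26.8", "10.109.26.4"):
    for port in (5673, 15673):
        state = "open" if port_open(host, port) else "refused/unreachable"
        print(host, port, state)
```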
VERSION:
feature_groups:
- mirantis
production: "docker"
release: "8.0"
api: "1.0"
build_number: "553"
build_id: "553"
fuel-nailgun_sha: "ed2e0cde96ae7b
python-
fuel-agent_sha: "658be72c4b42d3
fuel-
astute_sha: "b81577a5b7857c
fuel-library_sha: "33634ec27be77e
fuel-ostf_sha: "3bc76a63a9e7d1
fuel-mirror_sha: "fb45b80d7bee58
fuelmenu_sha: "78ffc73065a967
shotgun_sha: "63645dea384a37
network-
fuel-upgrade_sha: "616a7490ec7199
fuelmain_sha: "d605bcbabf3153
Changed in fuel:
status: Confirmed → In Progress
status: In Progress → Confirmed
tags: added: move-to-mu
tags: added: area-qa non-release system-tests; removed: area-mos mos-oslo move-to-mu release-notes
Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Volodymyr Shypyguzov (vshypyguzov)
Changed in fuel:
status: Confirmed → In Progress
tags: removed: non-release
tags: added: area-qasystem-tests; removed: area-qa system-tests
tags: added: area-qa system-tests; removed: area-qasystem-tests
What was found so far: `pcs resource` shows that no master is elected:
Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
    Slaves: [ node-1.test.domain.local node-3.test.domain.local ]
It stays this way for a long time. In pacemaker.log on node-1, the following entry can be seen periodically:
Feb 12 15:05:23 [6652] node-1.test.domain.local pengine: info: master_color: master_p_rabbitmq-server: Promoted 0 instances of a possible 1 to master
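This pengine message can be checked for mechanically: whenever pacemaker promotes fewer instances than possible, the rabbitmq master is missing. A minimal sketch over the log line quoted above (reconstructed from this report, with the hostname rejoined):

```python
import re

# The pengine log line from node-1's pacemaker.log, as quoted in this report.
line = ("Feb 12 15:05:23 [6652] node-1.test.domain.local pengine: info: "
        "master_color: master_p_rabbitmq-server: Promoted 0 instances "
        "of a possible 1 to master")

m = re.search(r"Promoted (\d+) instances of a possible (\d+) to master", line)
promoted, possible = int(m.group(1)), int(m.group(2))
if promoted < possible:
    # 0 of 1 promoted means no RabbitMQ master has been elected
    print("no rabbitmq master: %d of %d instances promoted" % (promoted, possible))
```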