Rabbit cluster is broken after destroying controllers: no running rabbit nodes, no master

Bug #1544973 reported by Tatyanka
This bug affects 1 person
Affects              Status        Importance  Assigned to            Milestone
Fuel for OpenStack   Fix Released  High        Volodymyr Shypyguzov
8.0.x                Fix Released  High        Volodymyr Shypyguzov

Bug Description

Check that the RabbitMQ cluster stays healthy after repeated failover.

Scenario:
1. Deploy environment with at least 3 controllers
2. Get rabbit master node
3. Destroy controller with master rabbit
4. Run OSTF

Expected result:
OSTF passed

Actual result:
OSTF failed, the rabbit cluster is broken:
 - RabbitMQ availability (failure) Number of RabbitMQ nodes is not equal to number of cluster nodes.
  - RabbitMQ replication (failure) Failed to establish AMQP connection to 5673/tcp port on 10.109.26.4 from controller node! Please refer to OpenStack logs for more details.
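
A rough manual equivalent of these two OSTF checks, for reference (a sketch only; the exact OSTF logic may differ, and the address and port are taken from the error above):

rabbitmqctl cluster_status | grep -E 'nodes|running_nodes'   # nodes known to the cluster vs. nodes actually running
crm_mon -1 | grep -E 'Online|OFFLINE'                        # Pacemaker's view of the controllers, for comparison
telnet 10.109.26.4 5673                                      # the AMQP port probed by the replication check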

PCS status:
root@node-3:~# crm_mon -1
Last updated: Fri Feb 12 13:17:29 2016
Last change: Fri Feb 12 12:49:02 2016
Stack: corosync
Current DC: node-1.test.domain.local (1) - partition with quorum
Version: 1.1.12-561c4cf
3 Nodes configured
46 Resources configured

Online: [ node-1.test.domain.local node-3.test.domain.local ]
OFFLINE: [ node-2.test.domain.local ]

 sysinfo_node-1.test.domain.local (ocf::pacemaker:SysInfo): Started node-1.test.domain.local
 Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 vip__management (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__vrouter_pub (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__vrouter (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__public (ocf::fuel:ns_IPaddr2): Started node-3.test.domain.local
 Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Masters: [ node-1.test.domain.local ]
     Slaves: [ node-3.test.domain.local ]
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Slaves: [ node-1.test.domain.local node-3.test.domain.local ]
 ...
root@node-3:~#

root@node-3:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@messaging-node-3' ...
[{nodes,[{disc,['rabbit@messaging-node-1','rabbit@messaging-node-3']}]}]
root@node-3:~#
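
Note that the status above contains only the {nodes,...} section. On a healthy cluster rabbitmqctl also reports a {running_nodes,[...]} list, so a quick way to confirm that no node is actually running the rabbit application is (a sketch):

rabbitmqctl cluster_status | grep running_nodes   # prints nothing when the rabbit app is not running anywhere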

telnet:
Trying 10.109.26.8...
telnet: Unable to connect to remote host: Connection refused
root@node-3:~# telnet 10.109.26.8 5673
Trying 10.109.26.8...
telnet: Unable to connect to remote host: Connection refused
root@node-3:~# telnet 10.109.26.8 15673
Trying 10.109.26.8...
telnet: Unable to connect to remote host: Connection refused
root@node-3:~# telnet 10.109.26.4 15673
Trying 10.109.26.4...
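
The same connectivity check can be scripted instead of probing each address with telnet by hand; a minimal sketch using bash's /dev/tcp (the addresses and ports are the ones probed above):

for ip in 10.109.26.4 10.109.26.8; do
  for port in 5673 15673; do
    if timeout 3 bash -c "echo > /dev/tcp/$ip/$port" 2>/dev/null; then
      echo "$ip:$port open"
    else
      echo "$ip:$port closed or filtered"
    fi
  done
done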

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "553"
  build_id: "553"
  fuel-nailgun_sha: "ed2e0cde96ae7bc064e689f7409470e69c57772e"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "33634ec27be77ecfb0b56b7e07497ad86d1fdcd3"
  fuel-ostf_sha: "3bc76a63a9e7d195ff34eadc29552f4235fa6c52"
  fuel-mirror_sha: "fb45b80d7bee5899d931f926e5c9512e2b442749"
  fuelmenu_sha: "78ffc73065a9674b707c081d128cb7eea611474f"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "d605bcbabf315382d56d0ce8143458be67c53434"

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

What has been found so far: 'pcs resource' shows that there is no master elected:

 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Slaves: [ node-1.test.domain.local node-3.test.domain.local ]

It has been like this for a long time. In pacemaker.log on node-1, the following entries can be seen periodically:

Feb 12 15:05:23 [6652] node-1.test.domain.local pengine: info: master_color: master_p_rabbitmq-server: Promoted 0 instances of a possible 1 to master
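
A quick way to see why no instance gets promoted is to look at the multi-state resource and at any location constraints restricting the master role (a sketch; both commands are used elsewhere in this report):

crm_mon -1 | grep -A 2 master_p_rabbitmq-server       # only a "Slaves:" line, no "Masters:" line
crm configure show | grep -A 1 'location.*rabbitmq'   # location rules that may block promotion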

Changed in fuel:
status: New → Confirmed
Changed in fuel:
status: Confirmed → In Progress
status: In Progress → Confirmed
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

This is tested during regular SWARM runs, and this is the first time we've run into it (1 of 5 similar cases failed today).

We believe this is some corner case which is not handled correctly by our OCF scripts. Given that we haven't seen it before, it must happen rarely.

If it is reproduced, the workaround is to restart the RabbitMQ cluster manually.
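
One way to perform that restart on a Pacemaker-managed cluster is through the multi-state resource (a sketch, using the resource name from the crm_mon output above; crmsh is available on the controllers in this environment):

crm resource restart master_p_rabbitmq-server
crm_mon -1 | grep -A 2 master_p_rabbitmq-server   # a "Masters:" line should reappear once a node is re-elected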

tags: added: release-notes
Changed in fuel:
importance: Critical → High
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Based on the above, I suggest we downgrade the bug importance to High so it does not block the 8.0 release, and continue the investigation.

tags: added: move-to-mu
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

I figured out that it is a problem in our tests. 'crm configure show' on the affected environment shows the following rules for RabbitMQ:

location location-p_rabbitmq-server-1 p_rabbitmq-server \
        rule $role=master -inf: #uname ne node-2.test.domain.local
location master_p_rabbitmq-server-on-node-1.test.domain.local master_p_rabbitmq-server 100: node-1.test.domain.local
location master_p_rabbitmq-server-on-node-2.test.domain.local master_p_rabbitmq-server 100: node-2.test.domain.local
location master_p_rabbitmq-server-on-node-3.test.domain.local master_p_rabbitmq-server 100: node-3.test.domain.local

Pay attention to the top rule: it didn't come from the deployment or from Pacemaker itself. It was set by the tests here:
https://github.com/openstack/fuel-qa/blob/7c70499309ec4882480963eed5bbd9d9975a6a8b/fuelweb_test/tests/tests_strength/test_failover_base.py#L1365-L1368

Just a little below, the rule is deleted, and that generally seems to work. But this time two duplicate rules were created, named location-p_rabbitmq-server and location-p_rabbitmq-server-1. While the first one was successfully deleted, the second remained and caused the tests to fail.

Attached are 5 cib files demonstrating the changes in the rules. In cib-49.raw location-p_rabbitmq-server was added. Then in cib-50.raw location-p_rabbitmq-server-1 was added. In cib-51.raw the first rule was deleted but the second remained.
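
For illustration, a sketch of how such a pair of constraints can appear and why deleting only one of them leaves the cluster stuck (this is an assumption about the invocation, the actual fuel-qa code may differ; the generated ids match the 'crm configure show' output above):

# pcs auto-generates the constraint id location-p_rabbitmq-server for this ban
pcs constraint location p_rabbitmq-server rule role=master score=-INFINITY '#uname' ne node-2.test.domain.local
# running the same command a second time does not fail, it creates location-p_rabbitmq-server-1,
# so removing only the first id leaves the duplicate ban in place
pcs constraint remove location-p_rabbitmq-server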

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

A clarification to the post above: the undeleted rule blocks promotion to master on nodes other than node-2. But node-2 was destroyed by the tests; node-1 and node-3 survived, but with the rule in effect they could not be elected as masters.

Also, the stats of the cib files I have uploaded correspond to the time when the rule was added, according to the test log: https://product-ci.infra.mirantis.net/job/8.0.system_test.ubuntu.ha_neutron_destructive/138/console
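
If an environment is stuck in this state, the leftover constraint can be removed by hand (a sketch, using the constraint id from the 'crm configure show' output above), after which Pacemaker is free to promote a master again:

crm configure delete location-p_rabbitmq-server-1
crm_mon -1 | grep -A 2 master_p_rabbitmq-server   # wait for a "Masters:" line to reappear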

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Tatyana agreed to look into the tests.

Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Fuel QA Team (fuel-qa)
tags: added: area-qa non-release system-tests
removed: area-mos mos-oslo move-to-mu release-notes
Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Volodymyr Shypyguzov (vshypyguzov)
Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/280798
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=f770d426c81fca60ea9ded6e48256c3700d4a3ab
Submitter: Jenkins
Branch: master

commit f770d426c81fca60ea9ded6e48256c3700d4a3ab
Author: Volodymyr Shypyguzov <email address hidden>
Date: Tue Feb 16 18:24:11 2016 +0200

    Fix duplicate pacemaker constraint command invocation

    Edit test docstring according to the test script
    Add show_step to the test

    Related-Bug:#1458830
    Closes-Bug:#1544973

    Change-Id: I05322f648440447f7e23df0cfb9adffbfe7e7aec

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/282229

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (stable/8.0)

Reviewed: https://review.openstack.org/282229
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=e29ffb3087d296f3ed3ba9fbbd68a321f2139bf7
Submitter: Jenkins
Branch: stable/8.0

commit e29ffb3087d296f3ed3ba9fbbd68a321f2139bf7
Author: Volodymyr Shypyguzov <email address hidden>
Date: Tue Feb 16 18:24:11 2016 +0200

    Fix duplicate pacemaker constraint command invocation

    Edit test docstring according to the test script
    Add show_step to the test

    Related-Bug:#1458830
    Closes-Bug:#1544973

    Change-Id: I05322f648440447f7e23df0cfb9adffbfe7e7aec
    (cherry picked from commit f770d426c81fca60ea9ded6e48256c3700d4a3ab)

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Test 'ha_rabbitmq_stability_check' passed on the latest swarm (test plan 8.0 iso #586)

Changed in fuel:
status: Fix Committed → Fix Released
tags: removed: non-release
tags: added: area-qasystem-tests
removed: area-qa system-tests
tags: added: area-qa system-tests
removed: area-qasystem-tests