Rabbit cluster is broken after destroying controllers: no running rabbit nodes, no master

Bug #1544973 reported by Tatyanka
This bug affects 1 person
Affects              Status        Importance  Assigned to            Milestone
Fuel for OpenStack   Fix Released  High        Volodymyr Shypyguzov
8.0.x                Fix Released  High        Volodymyr Shypyguzov

Bug Description

Check that the RabbitMQ cluster stays healthy after repeated failover.

Scenario:
1. Deploy environment with at least 3 controllers
2. Get rabbit master node
3. Destroy controller with master rabbit
4. Run OSTF

Expected result:
OSTF passed

Actual result:
OSTF failed, the rabbit cluster is broken:
 - RabbitMQ availability (failure) Number of RabbitMQ nodes is not equal to number of cluster nodes.
  - RabbitMQ replication (failure) Failed to establish AMQP connection to 5673/tcp port on 10.109.26.4 from controller node! Please refer to OpenStack logs for more details.
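
A rough manual equivalent of these two OSTF checks, for reference (a sketch only; the exact OSTF logic may differ, and the address and port are taken from the error above):

rabbitmqctl cluster_status | grep -E 'nodes|running_nodes'   # nodes known to the cluster vs. nodes actually running
crm_mon -1 | grep -E 'Online|OFFLINE'                        # Pacemaker's view of the controllers, for comparison
telnet 10.109.26.4 5673                                      # the AMQP port probed by the replication check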

PCS status:
root@node-3:~# crm_mon -1
Last updated: Fri Feb 12 13:17:29 2016
Last change: Fri Feb 12 12:49:02 2016
Stack: corosync
Current DC: node-1.test.domain.local (1) - partition with quorum
Version: 1.1.12-561c4cf
3 Nodes configured
46 Resources configured

Online: [ node-1.test.domain.local node-3.test.domain.local ]
OFFLINE: [ node-2.test.domain.local ]

 sysinfo_node-1.test.domain.local (ocf::pacemaker:SysInfo): Started node-1.test.domain.local
 Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 vip__management (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__vrouter_pub (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__vrouter (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__public (ocf::fuel:ns_IPaddr2): Started node-3.test.domain.local
 Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Masters: [ node-1.test.domain.local ]
     Slaves: [ node-3.test.domain.local ]
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Slaves: [ node-1.test.domain.local node-3.test.domain.local ]
 ...
root@node-3:~#

root@node-3:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@messaging-node-3' ...
[{nodes,[{disc,['rabbit@messaging-node-1','rabbit@messaging-node-3']}]}]
root@node-3:~#
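
Note that the status above contains only the {nodes,...} section. On a healthy cluster rabbitmqctl also reports a {running_nodes,[...]} list, so a quick way to confirm that no node is actually running the rabbit application is (a sketch):

rabbitmqctl cluster_status | grep running_nodes   # prints nothing when the rabbit app is not running anywhere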

telnet:
Trying 10.109.26.8...
telnet: Unable to connect to remote host: Connection refused
root@node-3:~# telnet 10.109.26.8 5673
Trying 10.109.26.8...
telnet: Unable to connect to remote host: Connection refused
root@node-3:~# telnet 10.109.26.8 15673
Trying 10.109.26.8...
telnet: Unable to connect to remote host: Connection refused
root@node-3:~# telnet 10.109.26.4 15673
Trying 10.109.26.4...
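
The same connectivity check can be scripted instead of probing each address with telnet by hand; a minimal sketch using bash's /dev/tcp (the addresses and ports are the ones probed above):

for ip in 10.109.26.4 10.109.26.8; do
  for port in 5673 15673; do
    if timeout 3 bash -c "echo > /dev/tcp/$ip/$port" 2>/dev/null; then
      echo "$ip:$port open"
    else
      echo "$ip:$port closed or filtered"
    fi
  done
done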

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "553"
  build_id: "553"
  fuel-nailgun_sha: "ed2e0cde96ae7bc064e689f7409470e69c57772e"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "33634ec27be77ecfb0b56b7e07497ad86d1fdcd3"
  fuel-ostf_sha: "3bc76a63a9e7d195ff34eadc29552f4235fa6c52"
  fuel-mirror_sha: "fb45b80d7bee5899d931f926e5c9512e2b442749"
  fuelmenu_sha: "78ffc73065a9674b707c081d128cb7eea611474f"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "d605bcbabf315382d56d0ce8143458be67c53434"

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

What has been found so far: 'pcs resource' shows that there is no master elected:

 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Slaves: [ node-1.test.domain.local node-3.test.domain.local ]

It has been like this for a long time. In pacemaker.log on node-1, the following entries can be seen periodically:

Feb 12 15:05:23 [6652] node-1.test.domain.local pengine: info: master_color: master_p_rabbitmq-server: Promoted 0 instances of a possible 1 to master
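
A quick way to see why no instance gets promoted is to look at the multi-state resource and at any location constraints restricting the master role (a sketch; both commands are used elsewhere in this report):

crm_mon -1 | grep -A 2 master_p_rabbitmq-server       # only a "Slaves:" line, no "Masters:" line
crm configure show | grep -A 1 'location.*rabbitmq'   # location rules that may block promotion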

Changed in fuel:
status: New → Confirmed
Changed in fuel:
status: Confirmed → In Progress
status: In Progress → Confirmed
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

This is tested during regular SWARM runs, and this is the first time we've run into it (1 of 5 similar cases failed today).

We believe this is some corner case which is not handled correctly by our OCF scripts. Given that we haven't seen it before, it must happen rarely.

If it is reproduced, the workaround is to restart the RabbitMQ cluster manually.
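
One way to perform that restart on a Pacemaker-managed cluster is through the multi-state resource (a sketch, using the resource name from the crm_mon output above; crmsh is available on the controllers in this environment):

crm resource restart master_p_rabbitmq-server
crm_mon -1 | grep -A 2 master_p_rabbitmq-server   # a "Masters:" line should reappear once a node is re-elected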

tags: added: release-notes
Changed in fuel:
importance: Critical → High
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Based on the above, I suggest we downgrade the bug importance to High so it does not block the 8.0 release, and continue the investigation.

tags: added: move-to-mu
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

I figured out that it is a problem in our tests. 'crm configure show' on the affected environment shows the following rules for RabbitMQ:

location location-p_rabbitmq-server-1 p_rabbitmq-server \
        rule $role=master -inf: #uname ne node-2.test.domain.local
location master_p_rabbitmq-server-on-node-1.test.domain.local master_p_rabbitmq-server 100: node-1.test.domain.local
location master_p_rabbitmq-server-on-node-2.test.domain.local master_p_rabbitmq-server 100: node-2.test.domain.local
location master_p_rabbitmq-server-on-node-3.test.domain.local master_p_rabbitmq-server 100: node-3.test.domain.local

Pay attention to the top rule: it didn't come from the deployment or from Pacemaker itself. It was set by the tests here:
https://github.com/openstack/fuel-qa/blob/7c70499309ec4882480963eed5bbd9d9975a6a8b/fuelweb_test/tests/tests_strength/test_failover_base.py#L1365-L1368

Just a little below, the rule is deleted, and that generally seems to work. But this time two duplicate rules were created, named location-p_rabbitmq-server and location-p_rabbitmq-server-1. While the first one was successfully deleted, the second remained and caused the tests to fail.

Attached are 5 cib files demonstrating the changes in the rules. In cib-49.raw location-p_rabbitmq-server was added. Then in cib-50.raw location-p_rabbitmq-server-1 was added. In cib-51.raw the first rule was deleted but the second remained.
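
For illustration, a sketch of how such a pair of constraints can appear and why deleting only one of them leaves the cluster stuck (this is an assumption about the invocation, the actual fuel-qa code may differ; the generated ids match the 'crm configure show' output above):

# pcs auto-generates the constraint id location-p_rabbitmq-server for this ban
pcs constraint location p_rabbitmq-server rule role=master score=-INFINITY '#uname' ne node-2.test.domain.local
# running the same command a second time does not fail, it creates location-p_rabbitmq-server-1,
# so removing only the first id leaves the duplicate ban in place
pcs constraint remove location-p_rabbitmq-server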

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

A clarification to the post above: the undeleted rule blocks promotion to master on nodes other than node-2. But node-2 was destroyed by the tests; node-1 and node-3 survived, but with the rule in effect they could not be elected as masters.

Also, the stats of the cib files I have uploaded correspond to the time when the rule was added, according to the test log: https://product-ci.infra.mirantis.net/job/8.0.system_test.ubuntu.ha_neutron_destructive/138/console
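
If an environment is stuck in this state, the leftover constraint can be removed by hand (a sketch, using the constraint id from the 'crm configure show' output above), after which Pacemaker is free to promote a master again:

crm configure delete location-p_rabbitmq-server-1
crm_mon -1 | grep -A 2 master_p_rabbitmq-server   # wait for a "Masters:" line to reappear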

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Tatyana agreed to look into the tests.

Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Fuel QA Team (fuel-qa)
tags: added: area-qa non-release system-tests
removed: area-mos mos-oslo move-to-mu release-notes
Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Volodymyr Shypyguzov (vshypyguzov)
Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/280798
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=f770d426c81fca60ea9ded6e48256c3700d4a3ab
Submitter: Jenkins
Branch: master

commit f770d426c81fca60ea9ded6e48256c3700d4a3ab
Author: Volodymyr Shypyguzov <email address hidden>
Date: Tue Feb 16 18:24:11 2016 +0200

    Fix duplicate pacemaker constraint command invocation

    Edit test docstring according to the test script
    Add show_step to the test

    Related-Bug:#1458830
    Closes-Bug:#1544973

    Change-Id: I05322f648440447f7e23df0cfb9adffbfe7e7aec

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/282229

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (stable/8.0)

Reviewed: https://review.openstack.org/282229
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=e29ffb3087d296f3ed3ba9fbbd68a321f2139bf7
Submitter: Jenkins
Branch: stable/8.0

commit e29ffb3087d296f3ed3ba9fbbd68a321f2139bf7
Author: Volodymyr Shypyguzov <email address hidden>
Date: Tue Feb 16 18:24:11 2016 +0200

    Fix duplicate pacemaker constraint command invocation

    Edit test docstring according to the test script
    Add show_step to the test

    Related-Bug:#1458830
    Closes-Bug:#1544973

    Change-Id: I05322f648440447f7e23df0cfb9adffbfe7e7aec
    (cherry picked from commit f770d426c81fca60ea9ded6e48256c3700d4a3ab)

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Test 'ha_rabbitmq_stability_check' passed on the latest swarm (test plan 8.0 iso #586)

Changed in fuel:
status: Fix Committed → Fix Released
tags: removed: non-release
tags: added: area-qasystem-tests
removed: area-qa system-tests
tags: added: area-qa system-tests
removed: area-qasystem-tests