There are 2 RabbitMQ clusters after failover of the controller that was previously the rabbit master

Bug #1518264 reported by Tatyanka
Affects: Fuel for OpenStack
Status: Fix Released
Importance: High
Assigned to: MOS Oslo

Bug Description

Steps to reproduce:
1. Deploy an HA environment with 3 controllers and 1 compute + cinder node.
2. When the cluster is ready, run the health check; all tests should pass.
3. SSH to the first controller and use pcs status to find the controller where the rabbit master is running.
4. Destroy that controller (virsh destroy <domain_name>).
5. Wait 10 minutes while the cluster recovers.
6. Run OSTF (a command-level sketch of these steps is shown below).
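
A minimal command-level sketch of steps 3-6, run as root; the libvirt domain name is environment-specific, and the grep pattern assumes the resource name master_p_rabbitmq-server seen in the pcs output below:

# On any controller: locate the controller currently holding the rabbit master
pcs status | grep -A 2 master_p_rabbitmq-server

# On the virtualization host: hard-kill that controller's VM
virsh list --all
virsh destroy <domain_name>

# After ~10 minutes, re-check cluster state on a surviving controller
pcs status
rabbitmqctl cluster_status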

Expected result:
A new rabbit master is elected, there is a single rabbit cluster, pcs status shows the same output on the two remaining controllers, and rabbitmqctl cluster_status reports the same membership on each online controller.

Actual:
No rabbit master was elected, and rabbitmqctl on each online controller reports a single-node cluster containing only itself:
root@node-3:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-3']}]}]
root@node-3:~# OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/fuel/rabbitmq-server status ; echo $?
0

On the other online controller:
root@node-1:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-1' ...
[{nodes,[{disc,['rabbit@node-1']}]},
 {running_nodes,['rabbit@node-1']},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]}]
root@node-1:~# OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/fuel/rabbitmq-server status ; echo $?
0
root@node-1:~#
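
A quick way to spot this kind of split is to compare the node list that each online controller reports; a sketch assuming passwordless SSH between controllers and the hostnames shown above:

# Each member of a healthy cluster lists the same node set;
# here every node lists only itself, i.e. two one-node clusters.
for node in node-1 node-3; do
    echo "=== $node ==="
    ssh $node rabbitmqctl cluster_status
done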

pcs status output (note that master_p_rabbitmq-server below has only slaves and no master):
Cluster name:
Last updated: Fri Nov 20 10:55:27 2015
Last change: Fri Nov 20 05:42:14 2015
Stack: corosync
Current DC: node-1.test.domain.local (1) - partition with quorum
Version: 1.1.12-561c4cf
3 Nodes configured
46 Resources configured

Online: [ node-1.test.domain.local node-3.test.domain.local ]
OFFLINE: [ node-2.test.domain.local ]

Full list of resources:

 Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 vip__management (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__vrouter_pub (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__vrouter (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__public (ocf::fuel:ns_IPaddr2): Started node-3.test.domain.local
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_dns [p_dns]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Slaves: [ node-1.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_neutron-plugin-openvswitch-agent [p_neutron-plugin-openvswitch-agent]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_neutron-l3-agent [p_neutron-l3-agent]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_neutron-dhcp-agent [p_neutron-dhcp-agent]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Masters: [ node-1.test.domain.local ]
     Slaves: [ node-3.test.domain.local ]
 sysinfo_node-1.test.domain.local (ocf::pacemaker:SysInfo): Started node-1.test.domain.local
 sysinfo_node-3.test.domain.local (ocf::pacemaker:SysInfo): Started node-3.test.domain.local
 sysinfo_node-2.test.domain.local (ocf::pacemaker:SysInfo): Stopped
 Clone Set: clone_ping_vip__public [ping_vip__public]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_ntp [p_ntp]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]

Health check (OSTF) tests fail.
[root@nailgun ~]# cat /etc/fuel/version.yaml
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  openstack_version: "2015.1.0-8.0"
  api: "1.0"
  build_number: "181"
  build_id: "181"
  fuel-nailgun_sha: "e5600e8a87745323c2a9c524b7c0647c9829ebb9"
  python-fuelclient_sha: "e685d68c1c0d0fa0491a250f07d9c3a8d0f9608c"
  fuel-agent_sha: "6faa1e0ba836ef114b3c1a6c4d12469fc66ae402"
  fuel-nailgun-agent_sha: "db1738a65012d2a1d2b7e83cc2a44a196e4290b4"
  astute_sha: "c8400f51b0b92254da206de55ef89d17fdf35393"
  fuel-library_sha: "f9281f9c3b08ed16b2fb411e30f5d48809446030"
  fuel-ostf_sha: "7e24fc802a95d2f0627512e585f8977f587aea18"
  fuel-createmirror_sha: "22a7aacd95bbdca69f9e0f08b70facabdec8fb28"
  fuelmenu_sha: "d12061b1aee82f81b3d074de74ea27a6e962a686"
  shotgun_sha: "c377d163519f6d10b69a654019d6086ba5f14edc"
  network-checker_sha: "a57e1d69acb5e765eb22cab0251c589cd76f51da"
  fuel-upgrade_sha: "1e894e26d4e1423a9b0d66abd6a79505f4175ff6"
  fuelmain_sha: "cd084cf5c4372a46184fb7c2f24568da4e030be2"
[root@nailgun ~]#

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The issue is caused by pacemaker, which for some reason cannot select a new master. The 'pcs resource' output shows that the resource keeps running with 2 slaves even though there is no master: http://paste.openstack.org/show/479567/

Here is the 'crm_mon -fotAW -1' output, though I cannot make anything useful out of it: http://paste.openstack.org/show/479568/

Also, pacemaker logs contain these entries repeating every 10-20 seconds:

Nov 20 11:49:51 [16248] node-1.test.domain.local pengine: info: clone_print: Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
Nov 20 11:49:51 [16248] node-1.test.domain.local pengine: info: native_color: Resource p_rabbitmq-server:2 cannot run anywhere
Nov 20 11:49:51 [16248] node-1.test.domain.local pengine: info: master_color: master_p_rabbitmq-server: Promoted 0 instances of a possible 1 to master
Nov 20 11:49:51 [16248] node-1.test.domain.local pengine: info: LogActions: Leave p_rabbitmq-server:0 (Slave node-1.test.domain.local)
Nov 20 11:49:51 [16248] node-1.test.domain.local pengine: info: LogActions: Leave p_rabbitmq-server:1 (Slave node-3.test.domain.local)
Nov 20 11:49:51 [16248] node-1.test.domain.local pengine: info: LogActions: Leave p_rabbitmq-server:2 (Stopped)
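
The 'cannot run anywhere' and 'Promoted 0 instances' messages suggest that no node carries a positive master promotion score. Those scores are kept as transient node attributes; a sketch for inspecting them (the attribute name follows pacemaker's usual master-<resource-id> convention, an assumption rather than something taken from these logs):

# Show node attributes, including any master-* promotion scores
crm_mon -1A

# Query the promotion score for a single node
crm_attribute --node node-1.test.domain.local \
    --name master-p_rabbitmq-server --lifetime reboot --query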

tags: added: swarm-blocker
Changed in fuel:
status: New → Confirmed
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Dima,

Does this need to be worked on by another team? Can you please move it along?

Thanks,
Dims

tags: added: blocker-for-qa
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/rabbitmq-server (master)

Fix proposed to branch: master
Change author: Alexey Lebedeff <email address hidden>
Review: https://review.fuel-infra.org/14569

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/rabbitmq-server (8.0)

Fix proposed to branch: 8.0
Change author: Alexey Lebedeff <email address hidden>
Review: https://review.fuel-infra.org/14586

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/trusty/rabbitmq-server (master)

Reviewed: https://review.fuel-infra.org/14569
Submitter: Pkgs Jenkins <email address hidden>
Branch: master

Commit: 23c9115f9084778225763793619787b6a4f3ce20
Author: Alexey Lebedeff <email address hidden>
Date: Thu Dec 10 12:34:24 2015

Backport fix for internal state corruption.

Upstream patch - https://github.com/rabbitmq/rabbitmq-common/pull/18

Change-Id: I1d2cd47305c0dfa8279e418088d8ab0e98e4ecc7
Partial-Bug: #1518264

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/trusty/rabbitmq-server (8.0)

Reviewed: https://review.fuel-infra.org/14586
Submitter: Pkgs Jenkins <email address hidden>
Branch: 8.0

Commit: 5f342fcfecd0389bee484b4889ef76001ab9fa96
Author: Alexey Lebedeff <email address hidden>
Date: Thu Dec 10 15:32:23 2015

Backport fix for internal state corruption.

Upstream patch - https://github.com/rabbitmq/rabbitmq-common/pull/18

Change-Id: I1d2cd47305c0dfa8279e418088d8ab0e98e4ecc7
Partial-Bug: #1518264

tags: added: swarm-fail-driver
removed: swarm-blocker
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Partial fixes were merged; what remains to be done?

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

We think that the fix Alexey posted above, together with https://review.fuel-infra.org/#/c/14487/, should fix the issue. Please reopen the bug if it reoccurs.

Changed in fuel:
status: In Progress → Fix Committed
tags: removed: swarm-fail-driver
tags: added: on-verification
Revision history for this message
ElenaRossokhina (esolomina) wrote :

iso #427:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "427"
  build_id: "427"
  fuel-nailgun_sha: "9ebbaa0473effafa5adee40270da96acf9c7d58a"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "df16d41cd7a9445cf82ad9fd8f0d53824711fcd8"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "fae42170a54b98d8e8c8db99b0fbb312633c693c"
  fuel-ostf_sha: "214e794835acc7aa0c1c5de936e93696a90bb57a"
  fuel-mirror_sha: "b62f3cce5321fd570c6589bc2684eab994c3f3f2"
  fuelmenu_sha: "85de57080a18fda18e5325f06eaf654b1b931592"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "e8e36cff332644576d7853c80b8a53d5b955420a"

Verified using the initial scenario.
After destroying the master rabbit node:
root@node-3:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-3','rabbit@node-5']}]},
 {running_nodes,['rabbit@node-5','rabbit@node-3']},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]}]
root@node-5:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-5' ...
[{nodes,[{disc,['rabbit@node-3','rabbit@node-5']}]},
 {running_nodes,['rabbit@node-3','rabbit@node-5']},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]}]

crm status output shows a master elected:
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-5.test.domain.local ]
     Slaves: [ node-3.test.domain.local ]

Changed in fuel:
status: Fix Committed → Fix Released
tags: removed: on-verification