There are 2 RabbitMQ clusters after failover of the controller that was previously the rabbit master

Bug #1518264 reported by Tatyanka
Affects: Fuel for OpenStack
Status: Fix Released
Importance: High
Assigned to: MOS Oslo

Bug Description

Steps to reproduce:
1. Deploy an HA environment with 3 controllers and 1 compute + cinder node.
2. When the cluster is ready, run the health check; all tests should pass.
3. SSH to the first controller and use pcs status to find the controller where the rabbit master is running.
4. Destroy that controller (virsh destroy <domain_name>).
5. Wait 10 minutes while the cluster recovers.
6. Run OSTF (a command-level sketch of these steps is shown below).
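
A minimal command-level sketch of steps 3-6, run as root; the libvirt domain name is environment-specific, and the grep pattern assumes the resource name master_p_rabbitmq-server seen in the pcs output below:

# On any controller: locate the controller currently holding the rabbit master
pcs status | grep -A 2 master_p_rabbitmq-server

# On the virtualization host: hard-kill that controller's VM
virsh list --all
virsh destroy <domain_name>

# After ~10 minutes, re-check cluster state on a surviving controller
pcs status
rabbitmqctl cluster_status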

Expected result:
A new rabbit master is elected, there is a single rabbit cluster, pcs status shows the same output on the two remaining controllers, and rabbitmqctl cluster_status reports the same membership on each online controller.

Actual:
No rabbit master was elected, and rabbitmqctl on each online controller reports a single-node cluster containing only itself:
root@node-3:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-3']}]}]
root@node-3:~# OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/fuel/rabbitmq-server status ; echo $?
0

On the other online controller:
root@node-1:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-1' ...
[{nodes,[{disc,['rabbit@node-1']}]},
 {running_nodes,['rabbit@node-1']},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]}]
root@node-1:~# OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/fuel/rabbitmq-server status ; echo $?
0
root@node-1:~#
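
A quick way to spot this kind of split is to compare the node list that each online controller reports; a sketch assuming passwordless SSH between controllers and the hostnames shown above:

# Each member of a healthy cluster lists the same node set;
# here every node lists only itself, i.e. two one-node clusters.
for node in node-1 node-3; do
    echo "=== $node ==="
    ssh $node rabbitmqctl cluster_status
done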

pcs status output (note that master_p_rabbitmq-server below has only slaves and no master):
Cluster name:
Last updated: Fri Nov 20 10:55:27 2015
Last change: Fri Nov 20 05:42:14 2015
Stack: corosync
Current DC: node-1.test.domain.local (1) - partition with quorum
Version: 1.1.12-561c4cf
3 Nodes configured
46 Resources configured

Online: [ node-1.test.domain.local node-3.test.domain.local ]
OFFLINE: [ node-2.test.domain.local ]

Full list of resources:

 Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 vip__management (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__vrouter_pub (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__vrouter (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__public (ocf::fuel:ns_IPaddr2): Started node-3.test.domain.local
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_dns [p_dns]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Slaves: [ node-1.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_neutron-plugin-openvswitch-agent [p_neutron-plugin-openvswitch-agent]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_neutron-l3-agent [p_neutron-l3-agent]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_neutron-dhcp-agent [p_neutron-dhcp-agent]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Masters: [ node-1.test.domain.local ]
     Slaves: [ node-3.test.domain.local ]
 sysinfo_node-1.test.domain.local (ocf::pacemaker:SysInfo): Started node-1.test.domain.local
 sysinfo_node-3.test.domain.local (ocf::pacemaker:SysInfo): Started node-3.test.domain.local
 sysinfo_node-2.test.domain.local (ocf::pacemaker:SysInfo): Stopped
 Clone Set: clone_ping_vip__public [ping_vip__public]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_ntp [p_ntp]
     Started: [ node-1.test.domain.local node-3.test.domain.local ]

Health check (OSTF) tests fail.
[root@nailgun ~]# cat /etc/fuel/version.yaml
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  openstack_version: "2015.1.0-8.0"
  api: "1.0"
  build_number: "181"
  build_id: "181"
  fuel-nailgun_sha: "e5600e8a87745323c2a9c524b7c0647c9829ebb9"
  python-fuelclient_sha: "e685d68c1c0d0fa0491a250f07d9c3a8d0f9608c"
  fuel-agent_sha: "6faa1e0ba836ef114b3c1a6c4d12469fc66ae402"
  fuel-nailgun-agent_sha: "db1738a65012d2a1d2b7e83cc2a44a196e4290b4"
  astute_sha: "c8400f51b0b92254da206de55ef89d17fdf35393"
  fuel-library_sha: "f9281f9c3b08ed16b2fb411e30f5d48809446030"
  fuel-ostf_sha: "7e24fc802a95d2f0627512e585f8977f587aea18"
  fuel-createmirror_sha: "22a7aacd95bbdca69f9e0f08b70facabdec8fb28"
  fuelmenu_sha: "d12061b1aee82f81b3d074de74ea27a6e962a686"
  shotgun_sha: "c377d163519f6d10b69a654019d6086ba5f14edc"
  network-checker_sha: "a57e1d69acb5e765eb22cab0251c589cd76f51da"
  fuel-upgrade_sha: "1e894e26d4e1423a9b0d66abd6a79505f4175ff6"
  fuelmain_sha: "cd084cf5c4372a46184fb7c2f24568da4e030be2"
[root@nailgun ~]#

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The issue is caused by pacemaker, which for some reason cannot select a new master. The 'pcs resource' output shows that the resource keeps running with 2 slaves even though there is no master: http://paste.openstack.org/show/479567/

Here is the 'crm_mon -fotAW -1' output, though I cannot make anything useful out of it: http://paste.openstack.org/show/479568/

Also, pacemaker logs contain these entries repeating every 10-20 seconds:

Nov 20 11:49:51 [16248] node-1.test.domain.local pengine: info: clone_print: Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
Nov 20 11:49:51 [16248] node-1.test.domain.local pengine: info: native_color: Resource p_rabbitmq-server:2 cannot run anywhere
Nov 20 11:49:51 [16248] node-1.test.domain.local pengine: info: master_color: master_p_rabbitmq-server: Promoted 0 instances of a possible 1 to master
Nov 20 11:49:51 [16248] node-1.test.domain.local pengine: info: LogActions: Leave p_rabbitmq-server:0 (Slave node-1.test.domain.local)
Nov 20 11:49:51 [16248] node-1.test.domain.local pengine: info: LogActions: Leave p_rabbitmq-server:1 (Slave node-3.test.domain.local)
Nov 20 11:49:51 [16248] node-1.test.domain.local pengine: info: LogActions: Leave p_rabbitmq-server:2 (Stopped)
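
The 'cannot run anywhere' and 'Promoted 0 instances' messages suggest that no node carries a positive master promotion score. Those scores are kept as transient node attributes; a sketch for inspecting them (the attribute name follows pacemaker's usual master-<resource-id> convention, an assumption rather than something taken from these logs):

# Show node attributes, including any master-* promotion scores
crm_mon -1A

# Query the promotion score for a single node
crm_attribute --node node-1.test.domain.local \
    --name master-p_rabbitmq-server --lifetime reboot --query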

tags: added: swarm-blocker
Changed in fuel:
status: New → Confirmed
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Dima,

Does this need to be worked on by another team? Can you please move it along?

Thanks,
Dims

tags: added: blocker-for-qa
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/rabbitmq-server (master)

Fix proposed to branch: master
Change author: Alexey Lebedeff <email address hidden>
Review: https://review.fuel-infra.org/14569

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/rabbitmq-server (8.0)

Fix proposed to branch: 8.0
Change author: Alexey Lebedeff <email address hidden>
Review: https://review.fuel-infra.org/14586

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/trusty/rabbitmq-server (master)

Reviewed: https://review.fuel-infra.org/14569
Submitter: Pkgs Jenkins <email address hidden>
Branch: master

Commit: 23c9115f9084778225763793619787b6a4f3ce20
Author: Alexey Lebedeff <email address hidden>
Date: Thu Dec 10 12:34:24 2015

Backport fix for internal state corruption.

Upstream patch - https://github.com/rabbitmq/rabbitmq-common/pull/18

Change-Id: I1d2cd47305c0dfa8279e418088d8ab0e98e4ecc7
Partial-Bug: #1518264

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/trusty/rabbitmq-server (8.0)

Reviewed: https://review.fuel-infra.org/14586
Submitter: Pkgs Jenkins <email address hidden>
Branch: 8.0

Commit: 5f342fcfecd0389bee484b4889ef76001ab9fa96
Author: Alexey Lebedeff <email address hidden>
Date: Thu Dec 10 15:32:23 2015

Backport fix for internal state corruption.

Upstream patch - https://github.com/rabbitmq/rabbitmq-common/pull/18

Change-Id: I1d2cd47305c0dfa8279e418088d8ab0e98e4ecc7
Partial-Bug: #1518264

tags: added: swarm-fail-driver
removed: swarm-blocker
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Partial fixes were merged; what remains to be done?

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

We think that the fix Alexey posted above, together with https://review.fuel-infra.org/#/c/14487/, should fix the issue. Please reopen the bug if it reoccurs.

Changed in fuel:
status: In Progress → Fix Committed
tags: removed: swarm-fail-driver
tags: added: on-verification
Revision history for this message
ElenaRossokhina (esolomina) wrote :

iso #427:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "427"
  build_id: "427"
  fuel-nailgun_sha: "9ebbaa0473effafa5adee40270da96acf9c7d58a"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "df16d41cd7a9445cf82ad9fd8f0d53824711fcd8"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "fae42170a54b98d8e8c8db99b0fbb312633c693c"
  fuel-ostf_sha: "214e794835acc7aa0c1c5de936e93696a90bb57a"
  fuel-mirror_sha: "b62f3cce5321fd570c6589bc2684eab994c3f3f2"
  fuelmenu_sha: "85de57080a18fda18e5325f06eaf654b1b931592"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "e8e36cff332644576d7853c80b8a53d5b955420a"

Verified using the initial scenario.
After destroying the master rabbit node:
root@node-3:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-3','rabbit@node-5']}]},
 {running_nodes,['rabbit@node-5','rabbit@node-3']},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]}]
root@node-5:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-5' ...
[{nodes,[{disc,['rabbit@node-3','rabbit@node-5']}]},
 {running_nodes,['rabbit@node-3','rabbit@node-5']},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]}]

crm status output shows a master elected:
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-5.test.domain.local ]
     Slaves: [ node-3.test.domain.local ]

Changed in fuel:
status: Fix Committed → Fix Released
tags: removed: on-verification