Consequent -1 node failures lead to Corosync split-brain

Bug #1436343 reported by Alexey Khivin
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Critical
Bogdan Dobrelya
6.0.x
Undecided
Bogdan Dobrelya

Bug Description

Multi-node HA with 3 controllers (node-10, node-7, node-8) + 2 nodes (compute+storage), Juno on Ubuntu 14.04

Test case:
1. Kill one of the controllers (the RabbitMQ master controller or a slave controller)
2. Wait until the RabbitMQ cluster is reassembled by the OCF script and becomes stable
3. Power the controller back on
4. Wait until the RabbitMQ cluster is reassembled by the OCF script and becomes stable
5. Go to step 1

After roughly 5-7 cycles we see two different RabbitMQ clusters within the cloud:
http://paste.openstack.org/show/196563/

This state is _permanent_ and will not be fixed by the OCF script without admin action.

node-7 and node-10 both have master status, each with node-8 as the slave.

[root@fuel ~]# fuel --f
DEPRECATION WARNING: file /etc/fuel/client/config.yaml is found and will be used as a source for settings. However, it deprecated and will not be used by default in the ongoing version of python-fuelclient.
api: '1.0'
astute_sha: 4a117a1ca6bdcc34fe4d086959ace1a6d18eeca9
auth_required: true
build_id: 2015-03-23_15-29-20
build_number: '218'
feature_groups:
- mirantis
fuellib_sha: a0265ae47bb2307a6967a3f1dd06fe222c561265
fuelmain_sha: a05ab877af31924585c81081f45305700961458e
nailgun_sha: 7c100f47450ea1a910e19fa09f78d586cb2bc0d3
ostf_sha: a4cf5f218c6aea98105b10c97a4aed8115c15867
production: docker
python-fuelclient_sha: 3624051242c83fdbdd1df9a0e466797c06b75043
release: '6.1'
release_versions:
  2014.2-6.1:
    VERSION:
      api: '1.0'
      astute_sha: 4a117a1ca6bdcc34fe4d086959ace1a6d18eeca9
      build_id: 2015-03-23_15-29-20
      build_number: '218'
      feature_groups:
      - mirantis
      fuellib_sha: a0265ae47bb2307a6967a3f1dd06fe222c561265
      fuelmain_sha: a05ab877af31924585c81081f45305700961458e
      nailgun_sha: 7c100f47450ea1a910e19fa09f78d586cb2bc0d3
      ostf_sha: a4cf5f218c6aea98105b10c97a4aed8115c15867
      production: docker
      python-fuelclient_sha: 3624051242c83fdbdd1df9a0e466797c06b75043
      release: '6.1'

Alexey Khivin (akhivin)
description: updated
Alexey Khivin (akhivin)
Changed in fuel:
importance: Undecided → High
Revision history for this message
Alexey Khivin (akhivin) wrote :

The time periods between the actions were not less than 10-20 minutes. Every time, the RabbitMQ cluster became fully assembled and stable.

Revision history for this message
Alexey Khivin (akhivin) wrote :
Alexey Khivin (akhivin)
description: updated
Alexey Khivin (akhivin)
summary: - RabbitMQ split brain
+ RabbitMQ split-brain
description: updated
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Re: RabbitMQ split-brain

There is possibly some overlap between this and these two:
https://bugs.launchpad.net/fuel/+bug/1435254
https://bugs.launchpad.net/fuel/+bug/1435250

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

"crm status" on the 3 nodes node-7, node-8, node-10 are here:

http://paste.openstack.org/show/196621/

They seem to indicate that node-7 and node-10 both think of themselves as master, with node-8 as their slave. Presumably it goes downhill from that point on, ending up with the broken RabbitMQ cluster as well.

Alexey Khivin (akhivin)
description: updated
summary: - RabbitMQ split-brain
+ Pacemaker cluster failure leads to RabbitMQ split-brain
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :
Changed in fuel:
milestone: none → 6.1
summary: - Pacemaker cluster failure leads to RabbitMQ split-brain
+ Pacemaker cluster failure leads to Corosync split-brain
Changed in fuel:
status: New → Confirmed
summary: - Pacemaker cluster failure leads to Corosync split-brain
+ Consequent -1 node failures lead to Corosync split-brain
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The split-brain happened in the Corosync cluster, and as a result both RabbitMQ and Galera are affected, as well as all of the Pacemaker resources. Split-brain clusters cannot operate, and such cases should be handled by STONITHing the minority partitions.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Please elaborate on the bug description: exactly which controller node should be restarted in a loop in order to reproduce this split?

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The solution is:
* configure Pacemaker's built-in STONITH, for example with the help of the HA fencing plugin [0]
* set no-quorum-policy to suicide. But this should be tested, of course.

[0] https://blueprints.launchpad.net/fuel/+spec/fencing-in-puppet-manifests
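For illustration, the two suggested settings could be applied with crmsh roughly as follows. This is a hedged sketch, not the fencing plugin's actual configuration: the fencing agent (fence_ipmilan), its parameters, and the credentials are all assumptions that depend on the actual hardware.

```
# Hypothetical crmsh sketch; the fencing agent and its parameters
# are assumptions, not the HA fencing plugin's actual config.
crm configure primitive ipmi-fencing stonith:fence_ipmilan \
    params pcmk_host_list="node-7 node-8 node-10" \
           ipaddr="10.20.0.100" login="admin" passwd="secret" \
    op monitor interval=60s
crm configure property stonith-enabled=true
# Make a partition that loses quorum fence itself:
crm configure property no-quorum-policy=suicide
```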

Alexey Khivin (akhivin)
tags: added: corosync
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I can sometimes reproduce this issue even with a single failover and failback:
- Destroy the controller node
- Power it up

Watch the split brain as a result, like:

node-3, node-4 reports:
Current DC: node-3 (1) - partition with quorum
Version: 1.1.10-42f2063
3 Nodes configured
25 Resources configured
Online: [ node-3 node-4 ]
OFFLINE: [ node-9 ]

vs node-9 reports:
Current DC: node-3 (1) - partition with quorum
Version: 1.1.10-42f2063
3 Nodes configured
25 Resources configured
Online: [ node-3 node-9 ]
OFFLINE: [ node-4 ]

I'm not sure whether this is a real split-brain or just some Pacemaker/Corosync issue.
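What makes this look like a split view is that each node reports its own membership set, and the views disagree. A small illustrative sketch (not Fuel or Pacemaker code) that flags such inconsistent membership views:

```python
def views_consistent(views):
    """Check whether every node reports the same set of online peers.

    `views` maps a node name to the set of nodes it sees online.
    In a healthy cluster all views are identical; in the split
    above, node-4 and node-9 each see a different partition.
    """
    return len({frozenset(v) for v in views.values()}) == 1

# Membership as reported in the crm_mon output above:
views = {
    "node-3": {"node-3", "node-4"},
    "node-4": {"node-3", "node-4"},
    "node-9": {"node-3", "node-9"},
}
print(views_consistent(views))  # False: node-9 disagrees with node-3/node-4
```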

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note, I reproduce it on Ubuntu Trusty:
root@node-9:~# apt-cache policy pacemaker
pacemaker:
  Installed: 1.1.10+git20130802-1ubuntu2.3
 corosync:
  Installed: 2.3.3-1ubuntu1

So this issue may be related only to these specific package versions.
We need more details on reproducing this on CentOS 6.5 with pacemaker 1.1.12.

Changed in fuel:
importance: High → Critical
Changed in fuel:
importance: Critical → High
Alexey Khivin (akhivin)
description: updated
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This issue looks related to the pacemaker 1.1.10 version shipped with Ubuntu Trusty. I have never seen this issue on CentOS 6.5 with pacemaker 1.1.12. I suggest pushing a custom build of pacemaker 1.1.12 to Ubuntu as well.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note: the reproduction steps can be simplified as follows:
instead of killing a controller node, just restart the corosync service and then the pacemaker service.

tags: added: to-be-covered-by-tests
Revision history for this message
Sergey Yudin (tsipa740) wrote :

Bogdan told me that this issue is related to https://bugs.launchpad.net/fuel/+bug/1438823

I was able to solve my issue (https://bugs.launchpad.net/fuel/+bug/1438823) by upgrading pacemaker to 1.1.12 and building 1.1.12 against corosync 1.4.6-2, so I did not have to use corosync 2 on Ubuntu while still having the issue solved.
You can cherry-pick this commit if you decide to follow the same approach: https://gerrit.mirantis.com/#/c/43794/

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Can someone please bump this to critical? thanks!

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Raised to critical as the issue impacts OpenStack environment operations

Changed in fuel:
importance: High → Critical
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I managed to reproduce this issue once again, with the following steps: http://pastebin.com/SkqyG4Hd
As you can see, the nodes report different pcs statuses.
Logs attached.

tags: removed: rabbitmq
Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

There is another split case, reproduced while testing this:

When there is a minority partition without quorum:

node-3: Mon Apr 13 09:05:38 UTC 2015
Current DC: node-3.test.domain.local (3) - partition with quorum
Online: [ node-2.test.domain.local node-3.test.domain.local ]
OFFLINE: [ node-1.test.domain.local ]

node-2: Mon Apr 13 09:05:38 UTC 2015
Current DC: node-2.test.domain.local (2) - partition WITHOUT quorum
Online: [ node-2.test.domain.local ]
OFFLINE: [ node-1.test.domain.local node-3.test.domain.local ]

This case clearly shows that no-quorum-policy=stop works excellently and prevents split-brain. But the node in the minority partition still has to be "healed" manually or STONITHed in order to return to the cluster. Note that this minority partition later gains quorum, which is *very* strange, but the resources are still kept stopped, so it is acceptable:

  Current DC: node-2.test.domain.local (2) - partition with quorum
  Online: [ node-2.test.domain.local ]
  OFFLINE: [ node-1.test.domain.local node-3.test.domain.local ]
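The behaviour above follows from the strict-majority quorum rule: a partition has quorum only if it holds more than half of all votes, so a single node out of three must stop its resources under no-quorum-policy=stop. A minimal illustrative sketch of that rule (one vote per node, for simplicity; not actual Corosync code):

```python
def has_quorum(partition_size, total_nodes):
    """Strict-majority quorum: a partition has quorum only if it
    holds more than half of the cluster's votes (one vote per node)."""
    return partition_size > total_nodes / 2

# The split above: node-2 is alone in a 3-node cluster.
print(has_quorum(2, 3))  # True  -> node-3's partition keeps running resources
print(has_quorum(1, 3))  # False -> node-2's partition must stop its resources
```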

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

According to the testing results, as a solution I suggest configuring version pinning for corosync and the related packages to the Vivid versions.
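Such pinning is typically done with an apt preferences file. A hypothetical sketch, using the package versions named in the related fixes below; the file name and the exact version patterns are assumptions:

```
# /etc/apt/preferences.d/ha-stack (hypothetical sketch)
# Pin the HA stack to the versions imported from Vivid.
Package: corosync
Pin: version 2.3.4*
Pin-Priority: 1001

Package: pacemaker
Pin: version 1.1.12*
Pin-Priority: 1001

Package: libqb0
Pin: version 0.17*
Pin-Priority: 1001
```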

Changed in fuel:
status: In Progress → Triaged
tags: added: scale
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to packages/trusty/libqb (6.1)

Related fix proposed to branch: 6.1
Change author: Aleksandr Mogylchenko <email address hidden>
Review: https://review.fuel-infra.org/5906

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to fuel-infra/jeepyb-config (master)

Related fix proposed to branch: master
Change author: Aleksandr Mogylchenko <email address hidden>
Review: https://review.fuel-infra.org/5917

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to fuel-infra/jeepyb-config (master)

Reviewed: https://review.fuel-infra.org/5917
Submitter: Andrey Nikitin <email address hidden>
Branch: master

Commit: 06d46aa96a06d3c483f788a2a8bc06a47c2f65bd
Author: Aleksandr Mogylchenko <email address hidden>
Date: Fri Apr 17 10:22:28 2015

Adding corosync to Trusty package repositories.

Since it was decided that we need to rebuild the whole stack, corosync
should be added (see bugs descriptions for more details).

Change-Id: I8a7e18661f5350e673be9056699f667d6ad60017
Partial-Bug: #1443800
Related-Bug: #1439120
Related-Bug: #1436343

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to packages/trusty/corosync (6.1)

Related fix proposed to branch: 6.1
Change author: Aleksandr Mogylchenko <email address hidden>
Review: https://review.fuel-infra.org/5950

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to packages/trusty/libqb (6.1)

Reviewed: https://review.fuel-infra.org/5906
Submitter: Michael Semenov <email address hidden>
Branch: 6.1

Commit: 3ec75bd020b44c453e7ce64a3749674e5a06508c
Author: Aleksandr Mogylchenko <email address hidden>
Date: Tue Apr 28 12:40:02 2015

Adding libqb 0.17 from Vivid.

Sources:
http://archive.ubuntu.com/ubuntu/pool/main/libq/libqb/libqb_0.17.0.orig.tar.gz

It was decided that we need to import HA stack (corosync, pacemaker and libqb)
from Vivid, since testing revealed unstability of versions in Trusty:
https://bugs.launchpad.net/ubuntu/+source/libqb/+bug/1341496

Partial-Bug: #1443800
Related-Bug: #1439120
Related-Bug: #1436343
Change-Id: I23bda853b11f7a9b442d31e6ad1ba3cb5beb4e9e

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to packages/trusty/corosync (6.1)

Reviewed: https://review.fuel-infra.org/5950
Submitter: Michael Semenov <email address hidden>
Branch: 6.1

Commit: bac08f0040e307695003483aa1834872bfcc4ee2
Author: Aleksandr Mogylchenko <email address hidden>
Date: Tue Apr 28 12:38:33 2015

Add corosync 2.3.4 from Vivid

It was decided that we need to import HA stack (corosync, pacemaker and libqb)
from Vivid, since testing revealed unstability of versions in Trusty,
causing split brain issues after node reboot. (see
bugs descriptions for more details).

Sources:
http://archive.ubuntu.com/ubuntu/pool/main/c/corosync/corosync_2.3.4.orig.tar.gz

Partial-Bug: #1443800
Related-Bug: #1439120
Related-Bug: #1436343
Change-Id: I6b6b0e40e6d39cf78d35a0af825d655f9e70ac10
