corosync split brain during rally test due to network flapping

Bug #1441651 reported by Leontii Istomin
Affects: Fuel for OpenStack
Status: Triaged
Importance: Critical
Assigned to: Fuel Library (Deprecated)

Bug Description

[root@fuel ~]# fuel --fuel-version
DEPRECATION WARNING: file /etc/fuel/client/config.yaml is found and will be used as a source for settings. However, it deprecated and will not be used by default in the ongoing version of python-fuelclient.
api: '1.0'
astute_sha: 3f1ece0318e5e93eaf48802fefabf512ca1dce40
auth_required: true
build_id: 2015-03-26_21-32-43
build_number: '233'
feature_groups:
- mirantis
fuellib_sha: 9c7716bc2ce6075065d7d9dcf96f4c94662c0b56
fuelmain_sha: 320b5f46fc1b2798f9e86ed7df51d3bda1686c10
nailgun_sha: b163f6fc77d6639aaffd9dd992e1ad96951c3bbf
ostf_sha: a4cf5f218c6aea98105b10c97a4aed8115c15867
production: docker
python-fuelclient_sha: e5e8389d8d481561a4d7107a99daae07c6ec5177
release: '6.1'
release_versions:
  2014.2-6.1:
    VERSION:
      api: '1.0'
      astute_sha: 3f1ece0318e5e93eaf48802fefabf512ca1dce40
      build_id: 2015-03-26_21-32-43
      build_number: '233'
      feature_groups:
      - mirantis
      fuellib_sha: 9c7716bc2ce6075065d7d9dcf96f4c94662c0b56
      fuelmain_sha: 320b5f46fc1b2798f9e86ed7df51d3bda1686c10
      nailgun_sha: b163f6fc77d6639aaffd9dd992e1ad96951c3bbf
      ostf_sha: a4cf5f218c6aea98105b10c97a4aed8115c15867
      production: docker
      python-fuelclient_sha: e5e8389d8d481561a4d7107a99daae07c6ec5177
      release: '6.1'

Successfully deployed the following configuration:
Baremetal, Ubuntu, IBP, HA, Neutron-vlan, Ceph-all, Debug, 6.1_233
Controllers: 3, Computes: 47

During the rally tests the MySQL cluster failed.
wsrep + crm:
node-52 http://paste.openstack.org/show/200444/
node-58 http://paste.openstack.org/show/200442/
node-92 http://paste.openstack.org/show/200443/

I've tried to restart galera, but it's still dead.
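
For reference, a quick way to check the Galera and Pacemaker view of the database cluster from a controller (a sketch only; the clone_p_mysql resource name is the one Fuel typically creates, and passwordless root access over the local MySQL socket is assumed):

root@node-58:~# mysql -e "SHOW STATUS LIKE 'wsrep_%'"
root@node-58:~# crm resource status clone_p_mysql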

You can find logs from the controller nodes here:
http://mos-scale-share.mirantis.com/logs.tar.gz

Tags: galera scale
no longer affects: mos
Changed in fuel:
importance: Undecided → Critical
assignee: nobody → Fuel Library Team (fuel-library)
tags: added: galera
Changed in fuel:
status: New → Confirmed
milestone: none → 6.1
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Alexander, how did you try to restart galera?

Revision history for this message
Bogdan Dobrelya (bogdando) wrote : Re: corosync split brain during rally test

This is a corosync split brain; MySQL and other services, such as RabbitMQ, will fail as a result of the split brain as well.
Due to a yet unknown root cause, your corosync cluster has been split into two partitions:

1st partition with quorum (DC node-58)

root@node-92:~# crm status
..Current DC: node-58.domain.tld (2) - partition with quorum

root@node-58:~# crm status
...Current DC: node-58.domain.tld (2) - partition with quorum

And a 2nd partition that also thinks it has quorum, with DC node-53:

root@node-53:~# crm status
...Current DC: node-53.domain.tld (1) - partition with quorum
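
One way to confirm the split from each node is to compare the corosync membership and quorum views directly (a sketch, assuming corosync 2.x where these tools are available):

root@node-53:~# corosync-quorumtool -s
root@node-53:~# corosync-cmapctl | grep members

In a split brain, the member lists and quorum flags reported on node-53 and node-58 will disagree.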

summary: - mysql is unreachable during rally test
+ corosync split brain during rally test
Revision history for this message
Alexander Ignatov (aignatov) wrote :

@Vladimir, I've just moved this issue from mos to fuel space in LP. The question should be addressed to @Leontiy.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This bug is potentially a dup of https://bugs.launchpad.net/bugs/1439120, but please let's keep it separate for a while. I have to investigate the logs and the environment itself.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

According to the atop logs, the rally test started at approximately 2015/04/07 23:49:48 (the keystone CPU load suddenly jumped to 714%).
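
The exact start time can be read back from the raw atop history, for example (a sketch, assuming the default daily raw file name):

root@node-53:~# atop -r /var/log/atop/atop_20150407 -b 23:45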

And according to the messages log, there was repeated network flapping; just search for "link is not ready":
  <6>Apr 7 23:40:51 node-53 kernel: [32416.408741] IPv6: ADDRCONF(NETDEV_UP): hapr-m: link is not ready
  <6>Apr 7 23:40:51 node-53 kernel: [32416.433304] IPv6: ADDRCONF(NETDEV_UP): vr-mgmt: link is not ready
  <6>Apr 7 23:44:51 node-53 kernel: [32656.745252] IPv6: ADDRCONF(NETDEV_UP): hapr-m: link is not ready
  <6>Apr 7 23:48:52 node-53 kernel: [32897.083971] IPv6: ADDRCONF(NETDEV_UP): hapr-p: link is not ready
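
A quick way to pull these entries out of the controller logs (a sketch; the standard /var/log/messages location is assumed):

root@node-53:~# grep 'link is not ready' /var/log/messages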

And this flapping caused the split brain in corosync:
  <28>Apr 7 23:42:51 node-53 pengine[11858]: warning: stage6: Node node-58.domain.tld is unclean!
  <28>Apr 7 23:42:51 node-53 pengine[11858]: warning: stage6: Node node-92.domain.tld is unclean!
  <28>Apr 7 23:42:51 node-53 pengine[11858]: warning: stage6: YOUR RESOURCES ARE NOW LIKELY COMPROMISED
  <27>Apr 7 23:42:51 node-53 pengine[11858]: error: stage6: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE
  <28>Apr 7 23:44:51 node-53 pengine[11858]: warning: pe_fence_node: Node node-58.domain.tld is unclean because it is partially and/or un-expectedly down
  <28>Apr 7 23:44:51 node-53 pengine[11858]: warning: determine_online_status: Node node-58.domain.tld is unclean
  <28>Apr 7 23:44:51 node-53 pengine[11858]: warning: pe_fence_node: Node node-92.domain.tld is unclean because it is partially and/or un-expectedly down
  <28>Apr 7 23:44:51 node-53 pengine[11858]: warning: determine_online_status: Node node-92.domain.tld is unclean

This issue cannot be fixed unless the pengine recommendation is followed, which is: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE.
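
For illustration, enabling fencing with an IPMI-based agent would look roughly like this (a sketch only; the primitive name, IPMI address and credentials are placeholders, and fence_ipmilan from the fence-agents package is assumed to be installed):

root@node-53:~# crm configure primitive stonith-node-58 stonith:fence_ipmilan params pcmk_host_list="node-58.domain.tld" ipaddr="10.20.0.58" login="ipmi_user" passwd="ipmi_pass" lanplus="true" op monitor interval="60s"
root@node-53:~# crm configure property stonith-enabled=true

A similar primitive would be needed for every controller, plus location constraints so that a node does not run its own fencing resource.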

Changed in fuel:
status: Confirmed → Triaged
summary: - corosync split brain during rally test
+ corosync split brain during rally test due to network flapping and no
+ STONITH enabled
summary: - corosync split brain during rally test due to network flapping and no
- STONITH enabled
+ corosync split brain during rally test due to network flapping
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I managed to reproduce this issue once again with the following steps: http://pastebin.com/SkqyG4Hd
As you can see, the nodes report different pcs statuses.
Logs attached.
