corosync split brain during rally test due to network flapping

Bug #1441651 reported by Leontii Istomin
Affects: Fuel for OpenStack
Status: Triaged
Importance: Critical
Assigned to: Fuel Library (Deprecated)

Bug Description

[root@fuel ~]# fuel --fuel-version
DEPRECATION WARNING: file /etc/fuel/client/config.yaml is found and will be used as a source for settings. However, it deprecated and will not be used by default in the ongoing version of python-fuelclient.
api: '1.0'
astute_sha: 3f1ece0318e5e93eaf48802fefabf512ca1dce40
auth_required: true
build_id: 2015-03-26_21-32-43
build_number: '233'
feature_groups:
- mirantis
fuellib_sha: 9c7716bc2ce6075065d7d9dcf96f4c94662c0b56
fuelmain_sha: 320b5f46fc1b2798f9e86ed7df51d3bda1686c10
nailgun_sha: b163f6fc77d6639aaffd9dd992e1ad96951c3bbf
ostf_sha: a4cf5f218c6aea98105b10c97a4aed8115c15867
production: docker
python-fuelclient_sha: e5e8389d8d481561a4d7107a99daae07c6ec5177
release: '6.1'
release_versions:
  2014.2-6.1:
    VERSION:
      api: '1.0'
      astute_sha: 3f1ece0318e5e93eaf48802fefabf512ca1dce40
      build_id: 2015-03-26_21-32-43
      build_number: '233'
      feature_groups:
      - mirantis
      fuellib_sha: 9c7716bc2ce6075065d7d9dcf96f4c94662c0b56
      fuelmain_sha: 320b5f46fc1b2798f9e86ed7df51d3bda1686c10
      nailgun_sha: b163f6fc77d6639aaffd9dd992e1ad96951c3bbf
      ostf_sha: a4cf5f218c6aea98105b10c97a4aed8115c15867
      production: docker
      python-fuelclient_sha: e5e8389d8d481561a4d7107a99daae07c6ec5177
      release: '6.1'

Successfully deployed the following configuration:
Baremetal, Ubuntu, IBP, HA, Neutron-vlan, Ceph-all, Debug, 6.1_233
Controllers: 3, Computes: 47

During the rally tests the MySQL cluster failed.
wsrep + crm:
node-52 http://paste.openstack.org/show/200444/
node-58 http://paste.openstack.org/show/200442/
node-92 http://paste.openstack.org/show/200443/

I've tried to restart galera, but it's still dead.
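
For reference, a quick way to check the Galera and Pacemaker view of the database cluster from a controller (a sketch only; the clone_p_mysql resource name is the one Fuel typically creates, and passwordless root access over the local MySQL socket is assumed):

root@node-58:~# mysql -e "SHOW STATUS LIKE 'wsrep_%'"
root@node-58:~# crm resource status clone_p_mysql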

You can find logs from the controller nodes here:
http://mos-scale-share.mirantis.com/logs.tar.gz

Tags: galera scale
no longer affects: mos
Changed in fuel:
importance: Undecided → Critical
assignee: nobody → Fuel Library Team (fuel-library)
tags: added: galera
Changed in fuel:
status: New → Confirmed
milestone: none → 6.1
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Alexander, how did you try to restart galera?

Revision history for this message
Bogdan Dobrelya (bogdando) wrote : Re: corosync split brain during rally test

This is a corosync split brain; MySQL and other services, such as RabbitMQ, will fail as a result of the split brain as well.
Due to a yet unknown root cause, your corosync cluster has been split into two partitions:

1st partition with quorum (DC node-58)

root@node-92:~# crm status
..Current DC: node-58.domain.tld (2) - partition with quorum

root@node-58:~# crm status
...Current DC: node-58.domain.tld (2) - partition with quorum

And a 2nd partition that also thinks it has quorum, with DC node-53:

root@node-53:~# crm status
...Current DC: node-53.domain.tld (1) - partition with quorum
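
One way to confirm the split from each node is to compare the corosync membership and quorum views directly (a sketch, assuming corosync 2.x where these tools are available):

root@node-53:~# corosync-quorumtool -s
root@node-53:~# corosync-cmapctl | grep members

In a split brain, the member lists and quorum flags reported on node-53 and node-58 will disagree.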

summary: - mysql is unreachable during rally test
+ corosync split brain during rally test
Revision history for this message
Alexander Ignatov (aignatov) wrote :

@Vladimir, I've just moved this issue from mos to fuel space in LP. The question should be addressed to @Leontiy.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This bug is potentially a dup of https://bugs.launchpad.net/bugs/1439120, but please let's keep it separate for a while. I have to investigate the logs and the environment itself.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

According to the atop logs, the rally test started at approximately 2015/04/07 23:49:48 (the keystone CPU load suddenly jumped to 714%).
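
The exact start time can be read back from the raw atop history, for example (a sketch, assuming the default daily raw file name):

root@node-53:~# atop -r /var/log/atop/atop_20150407 -b 23:45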

And according to the messages log, there was repeated network flapping; just search for "link is not ready":
  <6>Apr 7 23:40:51 node-53 kernel: [32416.408741] IPv6: ADDRCONF(NETDEV_UP): hapr-m: link is not ready
  <6>Apr 7 23:40:51 node-53 kernel: [32416.433304] IPv6: ADDRCONF(NETDEV_UP): vr-mgmt: link is not ready
  <6>Apr 7 23:44:51 node-53 kernel: [32656.745252] IPv6: ADDRCONF(NETDEV_UP): hapr-m: link is not ready
  <6>Apr 7 23:48:52 node-53 kernel: [32897.083971] IPv6: ADDRCONF(NETDEV_UP): hapr-p: link is not ready
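
A quick way to pull these entries out of the controller logs (a sketch; the standard /var/log/messages location is assumed):

root@node-53:~# grep 'link is not ready' /var/log/messages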

And this flapping caused the split brain in corosync:
  <28>Apr 7 23:42:51 node-53 pengine[11858]: warning: stage6: Node node-58.domain.tld is unclean!
  <28>Apr 7 23:42:51 node-53 pengine[11858]: warning: stage6: Node node-92.domain.tld is unclean!
  <28>Apr 7 23:42:51 node-53 pengine[11858]: warning: stage6: YOUR RESOURCES ARE NOW LIKELY COMPROMISED
  <27>Apr 7 23:42:51 node-53 pengine[11858]: error: stage6: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE
  <28>Apr 7 23:44:51 node-53 pengine[11858]: warning: pe_fence_node: Node node-58.domain.tld is unclean because it is partially and/or un-expectedly down
  <28>Apr 7 23:44:51 node-53 pengine[11858]: warning: determine_online_status: Node node-58.domain.tld is unclean
  <28>Apr 7 23:44:51 node-53 pengine[11858]: warning: pe_fence_node: Node node-92.domain.tld is unclean because it is partially and/or un-expectedly down
  <28>Apr 7 23:44:51 node-53 pengine[11858]: warning: determine_online_status: Node node-92.domain.tld is unclean

This issue cannot be fixed unless the pengine recommendation is followed, which is: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE.
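
For illustration, enabling fencing with an IPMI-based agent would look roughly like this (a sketch only; the primitive name, IPMI address and credentials are placeholders, and fence_ipmilan from the fence-agents package is assumed to be installed):

root@node-53:~# crm configure primitive stonith-node-58 stonith:fence_ipmilan params pcmk_host_list="node-58.domain.tld" ipaddr="10.20.0.58" login="ipmi_user" passwd="ipmi_pass" lanplus="true" op monitor interval="60s"
root@node-53:~# crm configure property stonith-enabled=true

A similar primitive would be needed for every controller, plus location constraints so that a node does not run its own fencing resource.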

Changed in fuel:
status: Confirmed → Triaged
summary: - corosync split brain during rally test
+ corosync split brain during rally test due to network flapping and no
+ STONITH enabled
summary: - corosync split brain during rally test due to network flapping and no
- STONITH enabled
+ corosync split brain during rally test due to network flapping
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I managed to reproduce this issue once again with the following steps: http://pastebin.com/SkqyG4Hd
As you can see, the nodes report different pcs statuses.
Logs attached.
