Comment 5 for bug 1441651

Bogdan Dobrelya (bogdando) wrote : Re: corosync split brain during rally test

According to the atop logs, the rally test started at approximately 2015/04/07 23:49:48 (when the keystone CPU load suddenly jumped to 714%).

And according to the messages log, the network flapped multiple times; just search for "link is not ready":
  <6>Apr 7 23:40:51 node-53 kernel: [32416.408741] IPv6: ADDRCONF(NETDEV_UP): hapr-m: link is not ready
  <6>Apr 7 23:40:51 node-53 kernel: [32416.433304] IPv6: ADDRCONF(NETDEV_UP): vr-mgmt: link is not ready
  <6>Apr 7 23:44:51 node-53 kernel: [32656.745252] IPv6: ADDRCONF(NETDEV_UP): hapr-m: link is not ready
  <6>Apr 7 23:48:52 node-53 kernel: [32897.083971] IPv6: ADDRCONF(NETDEV_UP): hapr-p: link is not ready
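
The search above can also be scripted. The following is a minimal sketch, assuming the messages log uses the same line layout as the excerpt; the function name find_link_flaps and the regular expression are illustrative only and are not part of any existing tool:

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpt above:
#   <6>Apr 7 23:40:51 node-53 kernel: [...] IPv6: ADDRCONF(NETDEV_UP): hapr-m: link is not ready
LINK_DOWN = re.compile(
    r"^(?:<\d+>)?(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:.*?"
    r"(?P<iface>\S+): link is not ready"
)

def find_link_flaps(lines):
    """Return (timestamp, interface) pairs for each 'link is not ready' line."""
    flaps = []
    for line in lines:
        match = LINK_DOWN.search(line.strip())
        if match:
            flaps.append((match.group("ts"), match.group("iface")))
    return flaps

# The four kernel lines quoted in this comment, used as sample input.
sample = [
    "<6>Apr 7 23:40:51 node-53 kernel: [32416.408741] IPv6: ADDRCONF(NETDEV_UP): hapr-m: link is not ready",
    "<6>Apr 7 23:40:51 node-53 kernel: [32416.433304] IPv6: ADDRCONF(NETDEV_UP): vr-mgmt: link is not ready",
    "<6>Apr 7 23:44:51 node-53 kernel: [32656.745252] IPv6: ADDRCONF(NETDEV_UP): hapr-m: link is not ready",
    "<6>Apr 7 23:48:52 node-53 kernel: [32897.083971] IPv6: ADDRCONF(NETDEV_UP): hapr-p: link is not ready",
]

flaps = find_link_flaps(sample)
per_iface = Counter(iface for _, iface in flaps)

for ts, iface in flaps:
    print(ts, iface)
print(dict(per_iface))
```

To run it against a real log, pass an open file object instead of the sample list; the path to the messages log depends on the node and is not assumed here.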

And this flapping caused the split brain in corosync:
  <28>Apr 7 23:42:51 node-53 pengine[11858]: warning: stage6: Node node-58.domain.tld is unclean!
  <28>Apr 7 23:42:51 node-53 pengine[11858]: warning: stage6: Node node-92.domain.tld is unclean!
  <28>Apr 7 23:42:51 node-53 pengine[11858]: warning: stage6: YOUR RESOURCES ARE NOW LIKELY COMPROMISED
  <27>Apr 7 23:42:51 node-53 pengine[11858]: error: stage6: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE
  <28>Apr 7 23:44:51 node-53 pengine[11858]: warning: pe_fence_node: Node node-58.domain.tld is unclean because it is partially and/or un-expectedly down
  <28>Apr 7 23:44:51 node-53 pengine[11858]: warning: determine_online_status: Node node-58.domain.tld is unclean
  <28>Apr 7 23:44:51 node-53 pengine[11858]: warning: pe_fence_node: Node node-92.domain.tld is unclean because it is partially and/or un-expectedly down
  <28>Apr 7 23:44:51 node-53 pengine[11858]: warning: determine_online_status: Node node-92.domain.tld is unclean
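
To line the pengine warnings up against the link events, it may help to pull out when each node was first reported unclean. This is only a sketch under the assumption that the pengine lines look like the ones quoted above; first_unclean_reports is a hypothetical helper name, and an "unclean" report by itself does not prove a split brain:

```python
import re

# Assumed line layout, taken from the pengine excerpt above:
#   <28>Apr 7 23:42:51 node-53 pengine[11858]: warning: stage6: Node node-58.domain.tld is unclean!
UNCLEAN = re.compile(
    r"(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+pengine\[\d+\]:.*"
    r"Node (?P<node>\S+) is unclean"
)

def first_unclean_reports(lines):
    """Map each node name to the timestamp of its first 'is unclean' report."""
    first_seen = {}
    for line in lines:
        match = UNCLEAN.search(line)
        if match:
            # setdefault keeps the earliest timestamp, given lines in log order.
            first_seen.setdefault(match.group("node"), match.group("ts"))
    return first_seen

# A few of the pengine lines quoted in this comment, used as sample input.
pengine_sample = [
    "<28>Apr 7 23:42:51 node-53 pengine[11858]: warning: stage6: Node node-58.domain.tld is unclean!",
    "<28>Apr 7 23:42:51 node-53 pengine[11858]: warning: stage6: Node node-92.domain.tld is unclean!",
    "<28>Apr 7 23:44:51 node-53 pengine[11858]: warning: determine_online_status: Node node-58.domain.tld is unclean",
]

reports = first_unclean_reports(pengine_sample)
for node, ts in reports.items():
    print(node, "first reported unclean at", ts)
```

For the lines quoted here, both node-58 and node-92 are first reported unclean at 23:42:51, two minutes after the first "link is not ready" messages at 23:40:51.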

This issue cannot be fixed unless the pengine recommendation is followed: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE