According to the atop logs, the rally test started at approximately 2015/04/07 23:49:48 (when keystone CPU load suddenly jumped to 714%).
According to the messages log, there were multiple network flaps; search for "link is not ready":
<6>Apr 7 23:40:51 node-53 kernel: [32416.408741] IPv6: ADDRCONF(NETDEV_UP): hapr-m: link is not ready
<6>Apr 7 23:40:51 node-53 kernel: [32416.433304] IPv6: ADDRCONF(NETDEV_UP): vr-mgmt: link is not ready
<6>Apr 7 23:44:51 node-53 kernel: [32656.745252] IPv6: ADDRCONF(NETDEV_UP): hapr-m: link is not ready
<6>Apr 7 23:48:52 node-53 kernel: [32897.083971] IPv6: ADDRCONF(NETDEV_UP): hapr-p: link is not ready
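To make the search for "link is not ready" repeatable on other nodes, here is a minimal sketch that groups those kernel lines by interface. The regex is only an assumption based on the four lines quoted above, and the function name link_not_ready_events is made up for illustration; adjust both if your syslog format differs.

```python
import re
from collections import defaultdict

# Pattern derived from the quoted kernel lines, e.g.
# <6>Apr 7 23:40:51 node-53 kernel: [32416.408741] IPv6: ADDRCONF(NETDEV_UP): hapr-m: link is not ready
# (assumption: other nodes log in the same format)
LINK_NOT_READY = re.compile(
    r"^<\d+>(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:.*"
    r"ADDRCONF\(NETDEV_UP\):\s+(?P<iface>\S+):\s+link is not ready"
)


def link_not_ready_events(lines):
    """Return {interface: [timestamps]} for every 'link is not ready' line."""
    events = defaultdict(list)
    for line in lines:
        match = LINK_NOT_READY.search(line)
        if match:
            events[match.group("iface")].append(match.group("stamp"))
    return dict(events)


# The four lines quoted from the messages log on node-53.
sample = [
    "<6>Apr 7 23:40:51 node-53 kernel: [32416.408741] IPv6: ADDRCONF(NETDEV_UP): hapr-m: link is not ready",
    "<6>Apr 7 23:40:51 node-53 kernel: [32416.433304] IPv6: ADDRCONF(NETDEV_UP): vr-mgmt: link is not ready",
    "<6>Apr 7 23:44:51 node-53 kernel: [32656.745252] IPv6: ADDRCONF(NETDEV_UP): hapr-m: link is not ready",
    "<6>Apr 7 23:48:52 node-53 kernel: [32897.083971] IPv6: ADDRCONF(NETDEV_UP): hapr-p: link is not ready",
]

for iface, stamps in link_not_ready_events(sample).items():
    print(iface, stamps)
```

On a real node, pass an open file handle for the messages log instead of the sample list.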
This flapping caused a split brain in corosync:
<28>Apr 7 23:42:51 node-53 pengine[11858]: warning: stage6: Node node-58.domain.tld is unclean!
<28>Apr 7 23:42:51 node-53 pengine[11858]: warning: stage6: Node node-92.domain.tld is unclean!
<28>Apr 7 23:42:51 node-53 pengine[11858]: warning: stage6: YOUR RESOURCES ARE NOW LIKELY COMPROMISED
<27>Apr 7 23:42:51 node-53 pengine[11858]: error: stage6: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE
<28>Apr 7 23:44:51 node-53 pengine[11858]: warning: pe_fence_node: Node node-58.domain.tld is unclean because it is partially and/or un-expectedly down
<28>Apr 7 23:44:51 node-53 pengine[11858]: warning: determine_online_status: Node node-58.domain.tld is unclean
<28>Apr 7 23:44:51 node-53 pengine[11858]: warning: pe_fence_node: Node node-92.domain.tld is unclean because it is partially and/or un-expectedly down
<28>Apr 7 23:44:51 node-53 pengine[11858]: warning: determine_online_status: Node node-92.domain.tld is unclean
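The same kind of check can be done for the pengine warnings. This is a rough sketch that collects the nodes reported as unclean and notes whether pengine complained about STONITH. The regex is an assumption based only on the lines quoted above, and summarize_pengine is a hypothetical helper name.

```python
import re

# Pattern derived from the quoted pengine warnings, e.g.
# ... pengine[11858]: warning: stage6: Node node-58.domain.tld is unclean!
# (assumption: all "unclean" warnings follow this shape)
UNCLEAN = re.compile(
    r"pengine\[\d+\]:\s+warning:\s+\w+:\s+Node\s+(?P<node>\S+)\s+is unclean"
)
STONITH_HINT = "ENABLE STONITH TO KEEP YOUR RESOURCES SAFE"


def summarize_pengine(lines):
    """Return (sorted unclean node names, True if the STONITH hint was logged)."""
    nodes = set()
    stonith_hint_seen = False
    for line in lines:
        match = UNCLEAN.search(line)
        if match:
            nodes.add(match.group("node"))
        if STONITH_HINT in line:
            stonith_hint_seen = True
    return sorted(nodes), stonith_hint_seen


# A subset of the pengine lines quoted from node-53.
sample = [
    "<28>Apr 7 23:42:51 node-53 pengine[11858]: warning: stage6: Node node-58.domain.tld is unclean!",
    "<28>Apr 7 23:42:51 node-53 pengine[11858]: warning: stage6: YOUR RESOURCES ARE NOW LIKELY COMPROMISED",
    "<27>Apr 7 23:42:51 node-53 pengine[11858]: error: stage6: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE",
    "<28>Apr 7 23:44:51 node-53 pengine[11858]: warning: determine_online_status: Node node-92.domain.tld is unclean",
]

nodes, hint = summarize_pengine(sample)
print(nodes, hint)
```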
This issue cannot be fixed unless the pengine recommendation is followed: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE, i.e. STONITH (fencing) has to be enabled for the cluster.