VIP set as corosync node address, cluster state desynced, VIP down
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack HA Cluster Charm | Confirmed | Undecided | Unassigned |
Bug Description
I deployed an openstack-
Originally the cluster was fully in sync at both the pacemaker and mysql level, and mysql/1 was the VIP owner. Then the node hosting the VIP had a network/machine outage of ~10s due to an OpenStack live migration.
mysql/0 juju-36680f-train-7
mysql/1 juju-36680f-
mysql/2 juju-36680f-
After that I found machines in the following problematic state:
- mysql/1 "crm status" shows all other nodes offline except itself. Both the VIP and cl_mysql_monitor are stopped on all nodes. The last-updated time is current, but the last change is reported as "Tue Nov 17 05:11:25 2020 by hacluster via crmd on juju-36680f-
- mysql/0 and mysql/2 "crm status" show all 3 nodes online, with the VIP still started on mysql/1. The last-updated time is current, but the last change is reported ~40 minutes newer as "Tue Nov 17 05:52:05 2020 by hacluster via crmd on juju-36680f-
- mysql/0 and mysql/2 continuously log the following messages:
Nov 17 06:50:39 juju-36680f-train-7 corosync[13083]: notice [TOTEM ] A new membership (10.5.0.130:5940) was formed. Members
Nov 17 06:50:39 juju-36680f-train-7 corosync[13083]: [TOTEM ] A new membership (10.5.0.130:5940) was formed. Members
- mysql/1 journalctl for corosync shows:
Nov 17 05:50:33 juju-36680f-
Nov 17 05:50:33 juju-36680f-
Nov 17 05:50:33 juju-36680f-
Nov 17 05:50:33 juju-36680f-
Nov 17 06:47:52 juju-36680f-
Nov 17 06:47:52 juju-36680f-
Nov 17 06:47:59 juju-36680f-
Nov 17 06:47:59 juju-36680f-
- mysql/0 and mysql/2 both have 10.5.100.0 (the VIP) as the ring0_addr for corosync.conf
- mysql/1 has its real IP 10.5.2.197 in corosync.conf
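To make the mismatch concrete, the divergent nodelist entries would look roughly like this (a sketch of the corosync.conf nodelist format; nodeids and surrounding layout are illustrative, not copied from the affected units — only the two addresses above are from the deployment):

```
# /etc/corosync/corosync.conf nodelist entry on mysql/0 and mysql/2 (broken)
nodelist {
    node {
        ring0_addr: 10.5.100.0   # the VIP, not the peer's own address
        nodeid: 1000             # illustrative nodeid
    }
    ...
}

# corresponding entry on mysql/1 (expected form)
nodelist {
    node {
        ring0_addr: 10.5.2.197   # the unit's real IP
        nodeid: 1000             # illustrative nodeid
    }
    ...
}
```

With the VIP as ring0_addr, whichever node currently holds the VIP receives totem traffic addressed to "itself" from its peers, which is consistent with the one-sided membership view described above.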
In the past we hit a bug where the corosync messenger was hung, and pacemaker has no heartbeat mechanism of its own to detect this, hence the cluster state desync. Perhaps the same has happened here? Need to find the bug link.
There are 2 main bugs to consider:
- The VIP should not get used as the juju address on the relation. There are a number of bugs related to this that I suspect are largely to do with not using a network space binding. This bug may not happen on a MAAS deployment but we should check.
- corosync should not become frozen/desynced in a way that leaves the other nodes believing everything is still OK. This may be a bug that needs backporting and could occur in other scenarios.
Will potentially split into a second bug once either of these are confirmed.
tags: | added: seg |
tags: | added: sts |
Changed in charm-hacluster: | |
status: | New → Confirmed |
You allude to this at the end but, is this caused by https://bugs.launchpad.net/juju/+bug/1863916 ?