[RFE] nrpe check for 'expected votes' > cluster_count
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack HA Cluster Charm | Triaged | Wishlist | Unassigned |
Bug Description
In some cases, pacemaker and corosync may end up with stale entries (especially as long as LP#1821109 remains open).
Offline pacemaker nodes are already caught by the check_crm nrpe check, but nothing alerts on extraneous corosync nodes.
This can actually be quite dangerous. Consider the following example:
* a normal 3-node cluster has one stale entry in corosync.conf
* the spurious entry is always down (because it doesn't exist anymore)
* expected votes = 4 and quorum = 3
Under normal circumstances, all units are up and total votes == 3, so the ring has quorum and everything runs fine.
If one of the units goes offline, however, votes drop to 2 and quorum is lost, even though a healthy 3-node cluster would tolerate that failure. Pacemaker would then react accordingly, possibly tearing down all VIPs.
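The arithmetic in this example can be checked with a short sketch. It assumes the usual majority rule (quorum is more than half of the expected votes, one vote per node) and ignores special votequorum options such as two_node. The function names are made up for illustration.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for quorum: a strict majority of the expected votes."""
    return expected_votes // 2 + 1


def has_quorum(expected_votes: int, live_votes: int) -> bool:
    """True if the live votes reach the majority threshold."""
    return live_votes >= quorum_threshold(expected_votes)


# (expected_votes, live_votes) pairs:
#   clean 3-node cluster, all up / one unit down
#   3 real nodes plus one stale entry, all up / one unit down
for expected, live in [(3, 3), (3, 2), (4, 3), (4, 2)]:
    print(
        f"expected={expected} live={live} "
        f"quorum={quorum_threshold(expected)} "
        f"quorate={has_quorum(expected, live)}"
    )
```

Only the last case (expected votes = 4, live votes = 2) loses quorum, which is the single-unit failure described above.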
I propose we create a new nrpe check that would warn if `corosync-cmapctl runtime.
Changed in charm-hacluster:
status: New → Triaged
importance: Undecided → Wishlist
The above assumes that we want cluster_count to mean "the minimum number of nodes needed to form a cluster".
Another way to solve this would be to explicitly set expected_votes in corosync.conf to cluster_count and alert only when the number of configured nodes exceeds expected_votes. This would make the cluster more resilient against stale entries, but the documentation should then explicitly warn about the danger of running a large cluster with the default cluster_size setting (which is currently not a problem).
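A rough sketch of what such a check could look like, assuming expected_votes is set explicitly in the quorum section and every member appears as a `node { ... }` block in the nodelist. The parsing is deliberately naive, the function names and the sample configuration are hypothetical, and this is not the charm's actual implementation. A real check would read /etc/corosync/corosync.conf and exit with the returned status code.

```python
import re

# Standard Nagios/NRPE plugin exit codes.
OK, WARNING = 0, 1

# Hypothetical corosync.conf: three real nodes plus one stale entry.
SAMPLE_CONF = """
quorum {
    provider: corosync_votequorum
    expected_votes: 3
}
nodelist {
    node {
        ring0_addr: 10.0.0.1
        nodeid: 1000
    }
    node {
        ring0_addr: 10.0.0.2
        nodeid: 1001
    }
    node {
        ring0_addr: 10.0.0.3
        nodeid: 1002
    }
    node {
        ring0_addr: 10.0.0.4
        nodeid: 1003
    }
}
"""


def count_nodes(conf_text: str) -> int:
    """Count lines that open a `node {` block (does not match `nodelist {`)."""
    return len(re.findall(r"^\s*node\s*\{", conf_text, flags=re.MULTILINE))


def read_expected_votes(conf_text: str):
    """Return the explicit expected_votes value, or None if it is not set."""
    match = re.search(r"^\s*expected_votes\s*:\s*(\d+)", conf_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def check_configured_nodes(conf_text: str):
    """Return (exit_code, message) in the usual NRPE plugin style."""
    nodes = count_nodes(conf_text)
    votes = read_expected_votes(conf_text)
    if votes is None:
        return WARNING, "WARNING: expected_votes is not set explicitly"
    if nodes > votes:
        return WARNING, f"WARNING: {nodes} nodes configured, expected_votes={votes}"
    return OK, f"OK: {nodes} nodes configured, expected_votes={votes}"


code, message = check_configured_nodes(SAMPLE_CONF)
print(code, message)
```

With the sample above, the stale fourth entry pushes the configured node count past expected_votes, so the check returns a warning instead of silently changing the quorum threshold.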