[RFE] nrpe check for 'expected votes' > cluster_count

Bug #1835418 reported by Andrea Ieri
This bug affects 3 people
Affects: OpenStack HA Cluster Charm
Status: Triaged
Importance: Wishlist
Assigned to: Unassigned
Milestone: (none)

Bug Description

In some cases, pacemaker and corosync may end up with stale entries (especially as long as LP#1821109 remains open).

Offline pacemaker nodes are already caught by the check_crm nrpe check, but nothing alerts on extraneous corosync nodes.

This can actually be quite dangerous. Consider the following example:
* a normal 3-node cluster has one stale entry in corosync.conf
* the spurious entry is always down (because it doesn't exist anymore)
* expected votes = 4 and quorum = 3

Under normal circumstances, all units are up and total votes == 3, so the ring has quorum and everything runs fine.

If one of the units goes offline, however, votes drop to 2 and quorum is lost. Pacemaker would then react accordingly, possibly tearing down all VIPs.

I propose we create a new nrpe check that would warn if the value reported by `corosync-cmapctl runtime.votequorum.ev_barrier` is higher than the charm option cluster_count.
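
A rough sketch of what such a check could look like (the script name, the -c/--cluster-count flag, and the exact messages are illustrative assumptions, not existing charm code):

#!/usr/bin/env python3
# Hypothetical NRPE check sketch: warn when corosync's expected-votes barrier
# exceeds the hacluster charm's cluster_count option.

import argparse
import subprocess
import sys

NAGIOS_OK, NAGIOS_WARNING, NAGIOS_UNKNOWN = 0, 1, 3


def expected_votes():
    """Read runtime.votequorum.ev_barrier from the corosync CMAP."""
    out = subprocess.check_output(
        ["corosync-cmapctl", "-g", "runtime.votequorum.ev_barrier"],
        universal_newlines=True,
    )
    # Output looks like: "runtime.votequorum.ev_barrier (u32) = 4"
    return int(out.strip().rsplit("=", 1)[1])


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--cluster-count", type=int, required=True,
                        help="cluster_count value from the charm config")
    args = parser.parse_args()

    try:
        ev = expected_votes()
    except (subprocess.CalledProcessError, OSError, ValueError) as exc:
        print("UNKNOWN: could not read ev_barrier: {}".format(exc))
        return NAGIOS_UNKNOWN

    if ev > args.cluster_count:
        print("WARNING: expected votes {} > cluster_count {} "
              "(possible stale corosync nodes)".format(ev, args.cluster_count))
        return NAGIOS_WARNING

    print("OK: expected votes {} <= cluster_count {}".format(ev, args.cluster_count))
    return NAGIOS_OK


if __name__ == "__main__":
    sys.exit(main())

The charm would presumably template its own cluster_count value into the nrpe command line, e.g. something along the lines of check_expected_votes.py -c 3.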

Andrea Ieri (aieri) wrote:

The above assumes that we want cluster_count to mean "the minimum number of nodes needed to form a cluster".
Another way to solve this would be to explicitly set expected_votes in corosync.conf equal to cluster_count, and to alert when the number of configured nodes is higher than expected_votes. This would make the cluster more resilient against stale entries, but the documentation should then explicitly warn about the danger of running a large cluster with the default cluster_count setting (which is currently not a problem).
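
As an illustration of that alternative, the rendered quorum section of corosync.conf would pin expected_votes to the configured cluster_count (the value 3 below is hypothetical):

quorum {
    provider: corosync_votequorum
    expected_votes: 3
}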

James Page (james-page)
Changed in charm-hacluster:
status: New → Triaged
importance: Undecided → Wishlist
Trent Lloyd (lathiat) wrote:

For future travellers: in the situation where corosync.conf and "crm status" show the correct node list but you still have no quorum, the workaround may be simply to run "corosync-cfgtool -R" to reload the configuration file from disk.

Full details on such cases in this comment:
https://bugs.launchpad.net/charms/+source/hacluster/+bug/1400481/comments/15
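
For reference, a possible recovery sequence in that situation, using standard corosync tooling on a cluster node:

corosync-cmapctl runtime.votequorum.ev_barrier   # inspect the stale expected-votes barrier
corosync-cfgtool -R                              # tell corosync to reload corosync.conf from disk
corosync-quorumtool -s                           # confirm quorum state and expected votes afterwards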
