[RFE] nrpe check for 'expected votes' > cluster_count

Bug #1835418 reported by Andrea Ieri
This bug affects 3 people
Affects: OpenStack HA Cluster Charm
Status: Triaged
Importance: Wishlist
Assigned to: Unassigned
Milestone: (none)

Bug Description

In some cases, pacemaker and corosync may end up with stale entries (especially as long as LP#1821109 remains open).

Offline pacemaker nodes are already caught by the check_crm nrpe check, but nothing alerts on extraneous corosync nodes.

This can actually be quite dangerous. Consider the following example:
* a normal 3-node cluster has one stale entry in corosync.conf
* the spurious entry is always down (because it doesn't exist anymore)
* expected votes = 4 and quorum = 3

Under normal circumstances, all units are up and total votes == 3, so the ring has quorum and everything runs fine.

If one of the units goes offline, however, votes drop to 2 and quorum is lost. Pacemaker would then react accordingly, possibly tearing down all VIPs.

I propose we create a new nrpe check that would warn if the value reported by `corosync-cmapctl runtime.votequorum.ev_barrier` is higher than the charm option cluster_count.
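
A rough sketch of what such a check could look like (the script name, the -c/--cluster-count flag, and the exact messages are illustrative assumptions, not existing charm code):

#!/usr/bin/env python3
# Hypothetical NRPE check sketch: warn when corosync's expected-votes barrier
# exceeds the hacluster charm's cluster_count option.

import argparse
import subprocess
import sys

NAGIOS_OK, NAGIOS_WARNING, NAGIOS_UNKNOWN = 0, 1, 3


def expected_votes():
    """Read runtime.votequorum.ev_barrier from the corosync CMAP."""
    out = subprocess.check_output(
        ["corosync-cmapctl", "-g", "runtime.votequorum.ev_barrier"],
        universal_newlines=True,
    )
    # Output looks like: "runtime.votequorum.ev_barrier (u32) = 4"
    return int(out.strip().rsplit("=", 1)[1])


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--cluster-count", type=int, required=True,
                        help="cluster_count value from the charm config")
    args = parser.parse_args()

    try:
        ev = expected_votes()
    except (subprocess.CalledProcessError, OSError, ValueError) as exc:
        print("UNKNOWN: could not read ev_barrier: {}".format(exc))
        return NAGIOS_UNKNOWN

    if ev > args.cluster_count:
        print("WARNING: expected votes {} > cluster_count {} "
              "(possible stale corosync nodes)".format(ev, args.cluster_count))
        return NAGIOS_WARNING

    print("OK: expected votes {} <= cluster_count {}".format(ev, args.cluster_count))
    return NAGIOS_OK


if __name__ == "__main__":
    sys.exit(main())

The charm would presumably template its own cluster_count value into the nrpe command line, e.g. something along the lines of check_expected_votes.py -c 3.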

Andrea Ieri (aieri) wrote:

The above assumes that we want cluster_count to mean "the minimum number of nodes needed to form a cluster".
Another way to solve this would be to explicitly set expected_votes in corosync.conf equal to cluster_count, and to alert when the number of configured nodes is higher than expected_votes. This would make the cluster more resilient against stale entries, but the documentation should then explicitly warn about the danger of running a large cluster with the default cluster_count setting (which is currently not a problem).
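
As an illustration of that alternative, the rendered quorum section of corosync.conf would pin expected_votes to the configured cluster_count (the value 3 below is hypothetical):

quorum {
    provider: corosync_votequorum
    expected_votes: 3
}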

James Page (james-page)
Changed in charm-hacluster:
status: New → Triaged
importance: Undecided → Wishlist
Trent Lloyd (lathiat) wrote:

For future travellers: in the situation where corosync.conf and "crm status" show the correct node list but you still have no quorum, the workaround may be simply to run "corosync-cfgtool -R" to reload the configuration file from disk.

Full details on such cases in this comment:
https://bugs.launchpad.net/charms/+source/hacluster/+bug/1400481/comments/15
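
For reference, a possible recovery sequence in that situation, using standard corosync tooling on a cluster node:

corosync-cmapctl runtime.votequorum.ev_barrier   # inspect the stale expected-votes barrier
corosync-cfgtool -R                              # tell corosync to reload corosync.conf from disk
corosync-quorumtool -s                           # confirm quorum state and expected votes afterwards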
