[RFE] support ignoring quorum issues

Bug #1850829 reported by Andrea Ieri on 2019-10-31
This bug affects 2 people
Affects: OpenStack hacluster charm
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Pacemaker clusters are generally built under the assumption that resources can be corrupted by uncontrolled access during split-brain situations. For this reason, resources are shut down when quorum is lost, as it is considered preferable for a given resource to run nowhere rather than in too many places.

In OpenStack clouds the hacluster charm is often used in API units to manage haproxy and VIPs. These resources are stateless, and cannot cause data corruption.

I would like to propose here a new charm option 'ignore_quorum' that would reconfigure pacemaker to ignore quorum issues (no-quorum-policy: ignore) and effectively ensure resources always stay up in at least one place.
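As a rough sketch of what the option could translate to, the snippet below maps a boolean setting onto a crm shell invocation that sets the Pacemaker cluster property. The function name `quorum_policy_command` is hypothetical and not part of the charm; the fallback value `stop` is Pacemaker's default for `no-quorum-policy`. The charm's actual code path for applying cluster properties may differ.

```python
from typing import List


def quorum_policy_command(ignore_quorum: bool) -> List[str]:
    """Build the crm shell command for the requested quorum policy.

    Hypothetical helper for illustration only: 'ignore_quorum' is the
    option proposed in this report, not an existing charm option.
    """
    # 'stop' is Pacemaker's default: stop all resources in a partition
    # that does not have quorum. 'ignore' keeps managing resources anyway.
    policy = "ignore" if ignore_quorum else "stop"
    return ["crm", "configure", "property", f"no-quorum-policy={policy}"]


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe anywhere.
    print(" ".join(quorum_policy_command(True)))
    print(" ".join(quorum_policy_command(False)))
```

With the option left at false the cluster keeps the default behavior, which matches the proposal that the new option be opt-in.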

Why would this be useful? After all, if you lose one AZ you still have quorum, and if you lose two AZs rabbit/mysql/ceph would not be fully functional anyway.
That's true, but there are cases in which this can add resiliency, for example in improperly configured clouds. It has happened in production that, during a full AZ outage, some API endpoints stopped responding because their units had been distributed across AZs incorrectly and two of them went down at once. This shouldn't happen, of course, but extra resiliency in the face of (not yet known) imperfect configurations is still very valuable.
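The failure mode can be illustrated with a little quorum arithmetic. This is a simplified sketch that assumes one vote per unit and a plain majority rule (more than half of the expected votes); it ignores corosync votequorum tuning options, and the AZ names and placements are made up for the example.

```python
from typing import Dict


def has_quorum(live_votes: int, expected_votes: int) -> bool:
    """Simple majority: strictly more than half of the expected votes."""
    return live_votes >= expected_votes // 2 + 1


def survives_az_loss(placement: Dict[str, int], lost_az: str) -> bool:
    """Return True if the remaining units still have quorum.

    'placement' maps an AZ name to the number of cluster units in it.
    """
    expected = sum(placement.values())
    live = expected - placement.get(lost_az, 0)
    return has_quorum(live, expected)


# Intended layout: one unit per AZ.
well_distributed = {"az1": 1, "az2": 1, "az3": 1}
# Mistaken layout: two of the three units share an AZ.
badly_distributed = {"az1": 2, "az2": 1}

if __name__ == "__main__":
    print(survives_az_loss(well_distributed, "az1"))   # 2 of 3 votes remain
    print(survives_az_loss(badly_distributed, "az1"))  # 1 of 3 votes remains
```

In the mistaken layout, losing az1 leaves a single unit without quorum, so under the default policy its haproxy and VIP resources are stopped even though the unit itself is healthy.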

Downsides to ignoring quorum in the haproxy/vip scenario described above:
* haproxy: none
* VIP resource in a true "the majority of my peers are dead" scenario: none
* VIP resource in a split-brain scenario: instantiating VIPs in multiple places might, depending on the type of network failure, confuse the ARP caches of the connected switches. This could lead to inconsistent routing of client packets and cause some client connections to fail. This is in my opinion acceptable, as some client connections would still succeed, and that's better than having all of them fail. Additionally, adding redundant ring functionality (LP#1850822) would reduce the likelihood of quorum loss during partial network outages.

There is currently one option that relates to quorum, cluster_count. It could approximate the behavior I'm seeking, but it currently serves a different purpose: preventing race conditions during application deployment. Having a separate ignore_quorum option would ensure safe behavior during installation while allowing resource agents to operate even when quorum has not been achieved.

The proposed option should of course default to false and only be enabled on specific applications.

Changed in charm-hacluster:
status: New → Triaged
importance: Undecided → Wishlist