[RFE] support ignoring quorum issues

Bug #1850829 reported by Andrea Ieri
This bug affects 2 people
Affects: OpenStack HA Cluster Charm
Status: Fix Released
Importance: Wishlist
Assigned to: Xav Paice
Milestone: 21.10

Bug Description

Pacemaker clusters are generally built under the assumption that resources can be corrupted by uncontrolled access during split-brain situations. For this reason, resources are shut down when quorum has not been achieved, as it is considered preferable to have a given resource running nowhere rather than running in too many places.

In OpenStack clouds the hacluster charm is often used on API units to manage haproxy and VIPs. These resources are stateless and cannot cause data corruption.

I would like to propose a new charm option, 'ignore_quorum', that would reconfigure pacemaker to ignore quorum issues (no-quorum-policy: ignore) and effectively ensure that resources always stay up in at least one place.
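
For illustration only, this is how an operator might enable such an option once exposed; the application name and the option name below are assumptions based on this proposal, not an existing interface:

juju config keystone-hacluster ignore_quorum=true

The point is that the policy would be opted into per hacluster application rather than changed cloud-wide.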

Why would this be useful? After all, if you lose one AZ you still have quorum, and if you lose two AZs rabbit/mysql/ceph would not be fully functional anyway.
That's true, but there are cases in which this can add resiliency, for example in improperly configured clouds: it has happened in production that, during a full AZ outage, some API endpoints stopped responding because units had by mistake been distributed unevenly across AZs and two of them were down. This shouldn't happen, of course, but having extra resiliency in the face of (not yet known) imperfect configurations is still very valuable.

Downsides to ignoring quorum in the haproxy/vip scenario described above:
* haproxy: none
* VIP resource in a true "the majority of my peers are dead" scenario: none
* VIP resource in a split-brain scenario: instantiating VIPs in multiple places might, depending on the type of network failure, confuse the ARP caches of the connected switches. This could lead to inconsistent routing of client packets and cause some client connections to fail. In my opinion this is acceptable, as some client connections would still succeed, which is better than having all of them fail. Additionally, adding redundant ring functionality (LP#1850822) would reduce the likelihood of quorum loss during partial network outages.

There is currently one option that relates to quorum, cluster_count: it could approximate the behavior I'm seeking, but it serves a different purpose, namely preventing race conditions during application deployment. Having a separate ignore_quorum option would ensure safe behavior during installation while allowing resource agents to operate even when quorum has not been achieved.

The proposed option should of course default to false and only be enabled on specific applications.

Changed in charm-hacluster:
status: New → Triaged
importance: Undecided → Wishlist
Xav Paice (xavpaice) wrote :

This can be enabled by setting the cluster property no-quorum-policy=ignore (in cib-bootstrap-options). The default is 'stop'.

This can be queried with: crm_attribute --query --name no-quorum-policy

Update:

crm_attribute --name no-quorum-policy --update ignore

I suggest we initially make this an optional config option: if it is set, switch to that setting; if unset, clear the setting back to the default with:

crm_attribute --name no-quorum-policy --delete
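
A minimal sketch of that apply/clear logic in shell, assuming the configured value is available in a variable named NO_QUORUM_POLICY (the variable name is only for illustration):

if [ -n "$NO_QUORUM_POLICY" ]; then
    # apply the operator-requested policy, e.g. ignore, freeze, stop or suicide
    crm_attribute --name no-quorum-policy --update "$NO_QUORUM_POLICY"
else
    # option unset: delete the attribute so pacemaker falls back to its default (stop)
    crm_attribute --name no-quorum-policy --delete
fi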

Xav Paice (xavpaice) wrote :
Changed in charm-hacluster:
status: Triaged → In Progress
assignee: nobody → Xav Paice (xavpaice)
Andrea Ieri (aieri) wrote :

As a first pass a new charm option would definitely help, but longer term I think it would be more idiomatic if this option were passed by the principal charm over the relation. Allowing the operator to override the principal charm's choice can still be useful, though.

Changed in charm-hacluster:
milestone: none → 21.10
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-hacluster (master)

Reviewed: https://review.opendev.org/c/openstack/charm-hacluster/+/779648
Committed: https://opendev.org/openstack/charm-hacluster/commit/d17fdd276ef1c6614bfc43491e4cb9e1ee5ce612
Submitter: "Zuul (22348)"
Branch: master

commit d17fdd276ef1c6614bfc43491e4cb9e1ee5ce612
Author: Xav Paice <email address hidden>
Date: Wed Mar 10 16:15:58 2021 +1300

    Add option for no-quorum-policy

    Adds a config item for what to do when the cluster does not have quorum.
    This is useful with stateless services where, e.g., we only need a VIP
    and that can be up on a single host with no problem.

    Though this would be a good relation data setting, many sites would
    prefer to stop the resources rather than have a VIP on multiple hosts,
    causing arp issues with the switch.

    Closes-bug: #1850829
    Change-Id: I961b6b32e7ed23f967b047dd0ecb45b0c0dff49a

Changed in charm-hacluster:
status: In Progress → Fix Committed
Changed in charm-hacluster:
status: Fix Committed → Fix Released