HA: galera cannot recover from a network split on a 2-node

Bug #1903051 reported by Damien Ciabrini
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Undecided
Assigned to: Damien Ciabrini
Milestone: (none listed)

Bug Description

Galera and pacemaker each have their own notion of quorum. When a
network split occurs in a two-node overcloud, both nodes become
inquorate from both galera's and pacemaker's point of view.

The pacemaker resource agent always demotes a node when it loses
galera quorum; however it cannot promote it back because it waits for
the other node to advertise its DB sequence number in the CIB, and
that information is unavailable during the network split.

Pacemaker can recover from its quorum loss if one of the nodes
manages to fence the other peer. From that moment onward, the
pacemaker cluster is unblocked and the HA services can be restarted
and run on a single node temporarily.

However, the galera resource agent currently cannot decide on its own
to restart the resource, even after pacemaker has fenced the other
node and determined that the local node is the survivor in the
cluster.

So the DB service stays down and cannot recover until the network
disruption is resolved.
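
The decision problem described above can be modelled with a short sketch.
This is a toy illustration only, not the resource agent's code (the real
agent is not written in Python). The function name, its parameters and the
tie-breaking rule are all hypothetical; the only facts taken from this
report are that promotion normally waits for the peer's sequence number,
and that a "2-node mode" heuristic can let the surviving node restart once
pacemaker has fenced its peer.

```python
from typing import Dict, Optional


def can_bootstrap(
    local: str,
    seqnos: Dict[str, Optional[int]],
    two_node_mode: bool = False,
    pacemaker_quorate: bool = False,
    peer_fenced: bool = False,
) -> bool:
    """Toy model: may `local` restart galera as the bootstrap node?

    `seqnos` maps each cluster node to the DB sequence number it has
    advertised, or None when that information is unavailable (for
    example during a network split). All names here are hypothetical.
    """
    local_seqno = seqnos.get(local)
    if local_seqno is None:
        return False

    peers = {name: s for name, s in seqnos.items() if name != local}

    # Default behaviour described in the report: wait until every peer
    # has advertised its sequence number, then promote the most
    # up-to-date node. Ties are broken by node name (an assumption made
    # only to keep this sketch deterministic).
    if all(s is not None for s in peers.values()):
        best = max(seqnos, key=lambda name: (seqnos[name], name))
        return local == best

    # Assumed shape of a 2-node heuristic: the single peer is
    # unreachable, but pacemaker has regained quorum by fencing it, so
    # the survivor is allowed to restart on its own.
    if two_node_mode and len(peers) == 1:
        return pacemaker_quorate and peer_fenced

    return False


# During the split, node-1's sequence number is unknown to node-0.
split = {"node-0": 42, "node-1": None}
print(can_bootstrap("node-0", split))
print(can_bootstrap("node-0", split, two_node_mode=True,
                    pacemaker_quorate=True, peer_fenced=True))
```

Without the heuristic the first call reports that node-0 must keep waiting,
which is the stuck state this bug describes; with the heuristic enabled and
the peer fenced, the second call lets node-0 restart alone.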

Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.opendev.org/758153
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=5836bcc15b3e28160b13f25e1022a79002a71dd2
Submitter: Zuul
Branch: master

commit 5836bcc15b3e28160b13f25e1022a79002a71dd2
Author: Damien Ciabrini <email address hidden>
Date: Wed Oct 14 16:59:27 2020 +0200

    galera: expose 2-node mode for the galera resource

    When deploying a 2-node HA overcloud, the galera resource
    agent can be configured to enable a "2-node mode" heuristic,
    that allows it to restart a galera node in the event of a
    network split.

    Make this resource agent's option available in puppet via
    the new parameter "two_node_mode".

    Closes-Bug: #1903051

    Change-Id: I543ee77ec38b6429989435122ae0c257d279e507

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/761992

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/victoria)

Reviewed: https://review.opendev.org/761992
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=24d0ff54f38a1dc1fcc59e8999b0b55989e12070
Submitter: Zuul
Branch: stable/victoria

commit 24d0ff54f38a1dc1fcc59e8999b0b55989e12070
Author: Damien Ciabrini <email address hidden>
Date: Wed Oct 14 16:59:27 2020 +0200

    galera: expose 2-node mode for the galera resource

    When deploying a 2-node HA overcloud, the galera resource
    agent can be configured to enable a "2-node mode" heuristic,
    that allows it to restart a galera node in the event of a
    network split.

    Make this resource agent's option available in puppet via
    the new parameter "two_node_mode".

    Closes-Bug: #1903051

    Change-Id: I543ee77ec38b6429989435122ae0c257d279e507
    (cherry picked from commit 5836bcc15b3e28160b13f25e1022a79002a71dd2)

tags: added: in-stable-victoria
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/762675

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/ussuri)

Reviewed: https://review.opendev.org/762675
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=7e7aa969b78acd7196cbe993173e26033744fbe2
Submitter: Zuul
Branch: stable/ussuri

commit 7e7aa969b78acd7196cbe993173e26033744fbe2
Author: Damien Ciabrini <email address hidden>
Date: Wed Oct 14 16:59:27 2020 +0200

    galera: expose 2-node mode for the galera resource

    When deploying a 2-node HA overcloud, the galera resource
    agent can be configured to enable a "2-node mode" heuristic,
    that allows it to restart a galera node in the event of a
    network split.

    Make this resource agent's option available in puppet via
    the new parameter "two_node_mode".

    Closes-Bug: #1903051

    Change-Id: I543ee77ec38b6429989435122ae0c257d279e507
    (cherry picked from commit 5836bcc15b3e28160b13f25e1022a79002a71dd2)
    (cherry picked from commit 24d0ff54f38a1dc1fcc59e8999b0b55989e12070)

tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/762837

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/train)

Reviewed: https://review.opendev.org/762837
Committed: https://opendev.org/openstack/puppet-tripleo/commit/3e9b801d5d4843cb767478567a89aadbeb2d07c7
Submitter: Zuul
Branch: stable/train

commit 3e9b801d5d4843cb767478567a89aadbeb2d07c7
Author: Damien Ciabrini <email address hidden>
Date: Wed Oct 14 16:59:27 2020 +0200

    galera: expose 2-node mode for the galera resource

    When deploying a 2-node HA overcloud, the galera resource
    agent can be configured to enable a "2-node mode" heuristic,
    that allows it to restart a galera node in the event of a
    network split.

    Make this resource agent's option available in puppet via
    the new parameter "two_node_mode".

    Closes-Bug: #1903051

    Change-Id: I543ee77ec38b6429989435122ae0c257d279e507
    (cherry picked from commit 5836bcc15b3e28160b13f25e1022a79002a71dd2)
    (cherry picked from commit 24d0ff54f38a1dc1fcc59e8999b0b55989e12070)
    (cherry picked from commit 7e7aa969b78acd7196cbe993173e26033744fbe2)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 13.5.0

This issue was fixed in the openstack/puppet-tripleo 13.5.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 14.0.0

This issue was fixed in the openstack/puppet-tripleo 14.0.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 12.5.0

This issue was fixed in the openstack/puppet-tripleo 12.5.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 11.5.0

This issue was fixed in the openstack/puppet-tripleo 11.5.0 release.
