MariaDB deployment does not respect quorum and may break cluster

Bug #1859145 reported by Radosław Piliszek
This bug affects 1 person
Affects         Status         Importance   Assigned to         Milestone
kolla-ansible   Fix Released   High         Radosław Piliszek
Rocky           Won't Fix      High         Unassigned
Stein           Won't Fix      High         Unassigned
Train           Fix Released   High         Radosław Piliszek
Ussuri          Fix Released   High         Radosław Piliszek

Bug Description

Current code does not wait for MariaDB to recalculate quorum when starting and stopping MariaDB containers. This may lead to WSREP issues (and failures) and, in the worst case, require a cluster recovery.

This most notably affects upgrades of MariaDB clusters, but also their reconfiguration (due to user action or our change). Hence the high importance.

Radosław Piliszek (yoctozepto) wrote :

The fix is to separate bootstrap, deployment of new members and restart of old members, ensuring that restart uses 3 phases (aka batches) so as not to break quorum.
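
The batching idea can be illustrated with a small Python sketch. It is not the kolla-ansible implementation: the function names, the host names and the index-modulo assignment are assumptions made for illustration, and the majority check assumes equally weighted nodes.

```python
from typing import List


def split_into_batches(members: List[str], batches: int = 3) -> List[List[str]]:
    """Assign each member to a batch by its index modulo the batch count."""
    result: List[List[str]] = [[] for _ in range(batches)]
    for index, member in enumerate(members):
        result[index % batches].append(member)
    return result


def keeps_majority(cluster_size: int, batch_size: int) -> bool:
    """True if the nodes left running form a strict majority of the cluster."""
    return 2 * (cluster_size - batch_size) > cluster_size


members = ["db1", "db2", "db3"]  # hypothetical host names
for batch in split_into_batches(members):
    # Each batch would be restarted and allowed to rejoin before the next one.
    assert keeps_majority(len(members), len(batch))
```

Under this particular assignment a four-member cluster would place two nodes in one batch, and `keeps_majority(4, 2)` is false, so the number of batches has to be chosen with the cluster size in mind.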

Radosław Piliszek (yoctozepto) wrote :

And also wait for the Synced status on already existing members.
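
A minimal sketch of such a wait, in Python. Galera exposes the node state through the `wsrep_local_state_comment` status variable (for example via `SHOW STATUS LIKE 'wsrep_local_state_comment'`). The `get_state` callable, the retry count and the delay below are hypothetical and are not taken from the kolla-ansible change.

```python
import time
from typing import Callable


def wait_for_synced(
    get_state: Callable[[], str],
    retries: int = 10,
    delay: float = 6.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the node state until it reports 'Synced' or the retries run out."""
    for attempt in range(retries):
        if get_state() == "Synced":
            return True
        if attempt < retries - 1:
            # Give the node time to finish state transfer before asking again.
            sleep(delay)
    return False
```

In practice `get_state` would wrap a database query against the node being restarted, and the next batch would start only after every restarted member has returned `True`.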

Changed in kolla-ansible:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/701010
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=9f14ad651a9e6516d02c90d9eb0ec4b7a4702e7e
Submitter: Zuul
Branch: master

commit 9f14ad651a9e6516d02c90d9eb0ec4b7a4702e7e
Author: Radosław Piliszek <email address hidden>
Date: Fri Jan 3 11:20:00 2020 +0100

    Fix multiple issues with MariaDB handling

    These affected both deploy (and reconfigure) and upgrade
    resulting in WSREP issues, failed deploys or need to
    recover the cluster.

    This patch makes sure k-a does not abruptly terminate
    nodes to break cluster.
    This is achieved by cleaner separation between stages
    (bootstrap, restart current, deploy new) and 3 phases
    for restarts (to keep the quorum).

    Upgrade actions, which operate on a healthy cluster,
    went to its section.

    Service restart was refactored.

    We no longer rely on the master/slave distinction as
    all nodes are masters in Galera.

    Closes-bug: #1857908
    Closes-bug: #1859145
    Change-Id: I83600c69141714fc412df0976f49019a857655f5

Changed in kolla-ansible:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/705414

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/706078

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/train)

Reviewed: https://review.opendev.org/705414
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=8acf5c132df02002e05a17c1754f5d79143a8d75
Submitter: Zuul
Branch: stable/train

commit 8acf5c132df02002e05a17c1754f5d79143a8d75
Author: Radosław Piliszek <email address hidden>
Date: Fri Jan 3 11:20:00 2020 +0100

    Fix multiple issues with MariaDB handling

    These affected both deploy (and reconfigure) and upgrade
    resulting in WSREP issues, failed deploys or need to
    recover the cluster.

    This patch makes sure k-a does not abruptly terminate
    nodes to break cluster.
    This is achieved by cleaner separation between stages
    (bootstrap, restart current, deploy new) and 3 phases
    for restarts (to keep the quorum).

    Upgrade actions, which operate on a healthy cluster,
    went to its section.

    Service restart was refactored.

    We no longer rely on the master/slave distinction as
    all nodes are masters in Galera.

    Backport includes also the:
    Followup on MariaDB handling fixes

    This fixes issues reported by Mark:
    - possible failure with 4-node cluster (however unlikely)
    - failure to stop all nodes from progressing when conditions are
      not valid (due to: "any_errors_fatal: False")

    Closes-bug: #1857908
    Closes-bug: #1859145
    Change-Id: I83600c69141714fc412df0976f49019a857655f5
    (cherry picked from commit 9f14ad651a9e6516d02c90d9eb0ec4b7a4702e7e)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/713501

OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla-ansible (stable/rocky)

Change abandoned by Radosław Piliszek (<email address hidden>) on branch: stable/rocky
Review: https://review.opendev.org/713501
Reason: no time to pursue; Rocky is already in Extended Maintenance (EM) and the code has diverged considerably, so the fix could have different stability characteristics there

OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla-ansible (stable/stein)

Change abandoned by Radosław Piliszek (<email address hidden>) on branch: stable/stein
Review: https://review.opendev.org/706078
Reason: not pursuing, stein is oldie ;-)
