kolla-ansible

MariaDB deployment does not respect quorum and may break cluster

Bug #1859145 reported by Radosław Piliszek on 2020-01-10

Affects	Status	Importance	Assigned to	Milestone
kolla-ansible	Fix Released	High	Radosław Piliszek	kolla-ansible 10.0.0 "ussuri"
Rocky	Won't Fix	High	Unassigned
Stein	Won't Fix	High	Unassigned
Train	Fix Released	High	Radosław Piliszek	kolla-ansible 9.1.0 "Train"
Ussuri	Fix Released	High	Radosław Piliszek	kolla-ansible 10.0.0 "ussuri"

Bug Description

The current code does not wait for MariaDB to recalculate quorum when starting and stopping MariaDB containers. This may lead to WSREP issues (and failures) and, in the worst case, require a cluster recovery.

This most notably affects upgrades of MariaDB clusters, but also their reconfiguration (whether due to user action or a change on our side). Hence the high importance.

Radosław Piliszek (yoctozepto) wrote on 2020-01-10:

The fix is to separate bootstrap, deployment of new members, and restart of old members, and to ensure that the restart uses 3 phases (aka batches) so as not to break quorum.
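
As a rough illustration of why restarts are batched, the sketch below checks whether taking a batch of nodes down at once would still leave a strict majority running. It is not the kolla-ansible implementation: the function names and the round-robin split are assumptions made for this example, and it assumes equal node weights and treats every restarting node as abruptly missing.

```python
from typing import List, Sequence


def has_quorum(total: int, running: int) -> bool:
    """Strict-majority rule, assuming every node carries equal weight."""
    return 2 * running > total


def split_into_phases(hosts: Sequence[str], phases: int = 3) -> List[List[str]]:
    """Distribute hosts round-robin over at most `phases` restart batches.

    Hypothetical helper: the real playbooks may split hosts differently.
    """
    batches = [list(hosts[i::phases]) for i in range(phases)]
    return [batch for batch in batches if batch]


def unsafe_batches(hosts: Sequence[str], phases: int = 3) -> List[List[str]]:
    """Return the batches whose simultaneous restart would drop the majority."""
    total = len(hosts)
    return [
        batch
        for batch in split_into_phases(hosts, phases)
        if not has_quorum(total, total - len(batch))
    ]
```

With three or five hosts, `unsafe_batches` returns an empty list. With four hosts, the first batch holds two nodes and leaves only two of four running, which is not a strict majority, so a fixed three-way split needs care on even-sized clusters.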

Radosław Piliszek (yoctozepto) wrote on 2020-01-10:

Also, wait for the Synced status on already existing members.
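
Waiting for the Synced status could be sketched as a simple polling loop. This is a minimal example, not the code from the fix: `get_state` is a hypothetical callable that is assumed to return the node's `wsrep_local_state_comment` value (for instance, read via `SHOW STATUS LIKE 'wsrep_local_state_comment'`), and the timeout and interval defaults are arbitrary.

```python
import time
from typing import Callable


def wait_for_synced(
    get_state: Callable[[], str],
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the node reports "Synced" or the timeout expires.

    `get_state` is a hypothetical hook supplied by the caller; it should
    return the current wsrep_local_state_comment of the node.
    Returns True once the node is Synced, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_state() == "Synced":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A caller would proceed to the next restart batch only after this returns True for every member of the current batch, and stop otherwise.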

OpenStack Infra (hudson-openstack) on 2020-01-10

Changed in kolla-ansible:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2020-01-21: Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/701010
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=9f14ad651a9e6516d02c90d9eb0ec4b7a4702e7e
Submitter: Zuul
Branch: master

commit 9f14ad651a9e6516d02c90d9eb0ec4b7a4702e7e
Author: Radosław Piliszek <email address hidden>
Date: Fri Jan 3 11:20:00 2020 +0100

    Fix multiple issues with MariaDB handling

    These affected both deploy (and reconfigure) and upgrade
    resulting in WSREP issues, failed deploys or need to
    recover the cluster.

    This patch makes sure k-a does not abruptly terminate
    nodes to break cluster.
    This is achieved by cleaner separation between stages
    (bootstrap, restart current, deploy new) and 3 phases
    for restarts (to keep the quorum).

    Upgrade actions, which operate on a healthy cluster,
    went to its section.

    Service restart was refactored.

    We no longer rely on the master/slave distinction as
    all nodes are masters in Galera.

    Closes-bug: #1857908
    Closes-bug: #1859145
    Change-Id: I83600c69141714fc412df0976f49019a857655f5

Changed in kolla-ansible:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2020-02-03: Fix proposed to kolla-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/705414

OpenStack Infra (hudson-openstack) wrote on 2020-02-05: Fix proposed to kolla-ansible (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/706078

OpenStack Infra (hudson-openstack) wrote on 2020-02-06: Fix merged to kolla-ansible (stable/train)

Reviewed: https://review.opendev.org/705414
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=8acf5c132df02002e05a17c1754f5d79143a8d75
Submitter: Zuul
Branch: stable/train

commit 8acf5c132df02002e05a17c1754f5d79143a8d75
Author: Radosław Piliszek <email address hidden>
Date: Fri Jan 3 11:20:00 2020 +0100

    Fix multiple issues with MariaDB handling

    These affected both deploy (and reconfigure) and upgrade
    resulting in WSREP issues, failed deploys or need to
    recover the cluster.

    Upgrade actions, which operate on a healthy cluster,
    went to its section.

    Service restart was refactored.

    We no longer rely on the master/slave distinction as
    all nodes are masters in Galera.

    Backport includes also the:
    Followup on MariaDB handling fixes

    This fixes issues reported by Mark:
    - possible failure with 4-node cluster (however unlikely)
    - failure to stop all nodes from progressing when conditions are
      not valid (due to: "any_errors_fatal: False")

    Closes-bug: #1857908
    Closes-bug: #1859145
    Change-Id: I83600c69141714fc412df0976f49019a857655f5
    (cherry picked from commit 9f14ad651a9e6516d02c90d9eb0ec4b7a4702e7e)

OpenStack Infra (hudson-openstack) wrote on 2020-03-17: Fix proposed to kolla-ansible (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/713501

OpenStack Infra (hudson-openstack) wrote on 2020-04-04: Change abandoned on kolla-ansible (stable/rocky)

Change abandoned by Radosław Piliszek (<email address hidden>) on branch: stable/rocky
Review: https://review.opendev.org/713501
Reason: no time to pursue, rocky already em and code diverged much - could have different characteristics regarding stability

OpenStack Infra (hudson-openstack) wrote on 2020-06-15: Change abandoned on kolla-ansible (stable/stein)

Change abandoned by Radosław Piliszek (<email address hidden>) on branch: stable/stein
Review: https://review.opendev.org/706078
Reason: not pursuing, stein is oldie ;-)
