New MariaDB instance deployment fails due to missing container or possibly breaks the cluster

Bug #1857908 reported by Radosław Piliszek
This bug affects 3 people
Affects        Status        Importance  Assigned to         Milestone
kolla-ansible  Fix Released  Medium      Radosław Piliszek
Rocky          Won't Fix     Medium      Radosław Piliszek
Stein          Won't Fix     Medium      Radosław Piliszek
Train          Fix Released  Medium      Radosław Piliszek
Ussuri         Fix Released  Medium      Radosław Piliszek

Bug Description

Scenario:
- Have a working kolla-ansible (k-a) deployment with MariaDB on n hosts.
- Add an (n+1)th host to the mariadb group and make it the first host in the group (easy with an INI inventory, down to chance with a YAML one).
- k-a Stein then fails, complaining about a missing container:
RUNNING HANDLER [mariadb : remove restart policy from master mariadb]
Error response from daemon: No such container: mariadb

Users have reported this before, but the all-in-one (AIO) case had a different root cause: the volume already existed but the container did not (see https://bugs.launchpad.net/kolla-ansible/+bug/1855268 ).
In this scenario neither the volume nor the container exists, so deployment *should* work, yet it fails.
The culprit is that lookup_cluster ends up *not* registering the master MariaDB server properly; it keeps the first host in the group as master instead. This breaks the Stein deploy action completely.
In other series (Rocky, Train) this violates the handler logic, which may break the MariaDB cluster if container images are about to change in the same action. Otherwise, those series are affected only cosmetically: k-a claims it is starting the master on the new node. :-)

PS: It looks like this has been broken since the refactoring in Pike: https://review.opendev.org/433480
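
The lookup problem described above can be illustrated with a minimal Python sketch. This is not the kolla-ansible code: the host names, the has_container map, and both function names are hypothetical, chosen only to contrast "take the first host in the group" with "take a host that already runs MariaDB".

```python
from typing import Dict, List, Optional


def pick_master_naive(group: List[str]) -> str:
    """Behaviour as described in the report: always the first host in the group."""
    return group[0]


def pick_master_existing(group: List[str],
                         has_container: Dict[str, bool]) -> Optional[str]:
    """Prefer the first host that already runs a mariadb container."""
    for host in group:
        if has_container.get(host, False):
            return host
    return None  # no running member found: a fresh bootstrap would be needed


# Hypothetical inventory: "new1" was just added and placed first in the group.
group = ["new1", "db1", "db2", "db3"]
has_container = {"new1": False, "db1": True, "db2": True, "db3": True}

naive = pick_master_naive(group)                     # the new, empty host
fixed = pick_master_existing(group, has_container)   # an existing cluster member
```

With the naive choice, any later step that acts on the "master" container targets a host that has no mariadb container yet, which is consistent with the "No such container" error in the scenario.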

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/700785

Changed in kolla-ansible:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla-ansible (master)

Change abandoned by Radosław Piliszek (<email address hidden>) on branch: master
Review: https://review.opendev.org/700785
Reason: superseded by https://review.opendev.org/701010 (3 phases approach + separation of deploys and upgrades)

Radosław Piliszek (yoctozepto) wrote :

Simply fixing the lookup revealed other issues with the deployment code. These have been addressed by https://review.opendev.org/701010

tags: removed: stein
description: updated
summary: - [Stein] New MariaDB instance deployment fails due to missing container
+ New MariaDB instance deployment fails due to missing container or
+ possibly breaks the cluster
tags: added: galera wsrep
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/701010
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=9f14ad651a9e6516d02c90d9eb0ec4b7a4702e7e
Submitter: Zuul
Branch: master

commit 9f14ad651a9e6516d02c90d9eb0ec4b7a4702e7e
Author: Radosław Piliszek <email address hidden>
Date: Fri Jan 3 11:20:00 2020 +0100

    Fix multiple issues with MariaDB handling

    These affected both deploy (and reconfigure) and upgrade
    resulting in WSREP issues, failed deploys or need to
    recover the cluster.

    This patch makes sure k-a does not abruptly terminate
    nodes to break cluster.
    This is achieved by cleaner separation between stages
    (bootstrap, restart current, deploy new) and 3 phases
    for restarts (to keep the quorum).

    Upgrade actions, which operate on a healthy cluster,
    went to its section.

    Service restart was refactored.

    We no longer rely on the master/slave distinction as
    all nodes are masters in Galera.

    Closes-bug: #1857908
    Closes-bug: #1859145
    Change-Id: I83600c69141714fc412df0976f49019a857655f5

Changed in kolla-ansible:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/705414

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/706078

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/train)

Reviewed: https://review.opendev.org/705414
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=8acf5c132df02002e05a17c1754f5d79143a8d75
Submitter: Zuul
Branch: stable/train

commit 8acf5c132df02002e05a17c1754f5d79143a8d75
Author: Radosław Piliszek <email address hidden>
Date: Fri Jan 3 11:20:00 2020 +0100

    Fix multiple issues with MariaDB handling

    These affected both deploy (and reconfigure) and upgrade
    resulting in WSREP issues, failed deploys or need to
    recover the cluster.

    This patch makes sure k-a does not abruptly terminate
    nodes to break cluster.
    This is achieved by cleaner separation between stages
    (bootstrap, restart current, deploy new) and 3 phases
    for restarts (to keep the quorum).

    Upgrade actions, which operate on a healthy cluster,
    went to its section.

    Service restart was refactored.

    We no longer rely on the master/slave distinction as
    all nodes are masters in Galera.

    Backport includes also the:
    Followup on MariaDB handling fixes

    This fixes issues reported by Mark:
    - possible failure with 4-node cluster (however unlikely)
    - failure to stop all nodes from progressing when conditions are
      not valid (due to: "any_errors_fatal: False")

    Closes-bug: #1857908
    Closes-bug: #1859145
    Change-Id: I83600c69141714fc412df0976f49019a857655f5
    (cherry picked from commit 9f14ad651a9e6516d02c90d9eb0ec4b7a4702e7e)
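
The idea of restarting in phases to keep the quorum, mentioned in the commit message above, can be sketched in Python. This is only an illustration under stated assumptions: the batching rule (node index modulo the number of phases), the function names, and the node names are hypothetical and not taken from the patch, and the check used is a conservative "strict majority stays up" test.

```python
from typing import List


def restart_batches(nodes: List[str], phases: int) -> List[List[str]]:
    """Split nodes into `phases` batches by index modulo `phases` (assumed rule)."""
    return [nodes[i::phases] for i in range(phases)]


def keeps_majority(total: int, down: int) -> bool:
    """True if the nodes still up form a strict majority of the cluster."""
    return (total - down) * 2 > total


def safe_plan(nodes: List[str], phases: int) -> bool:
    """True if every batch can be restarted while a majority stays up."""
    return all(keeps_majority(len(nodes), len(batch))
               for batch in restart_batches(nodes, phases))


three = ["db1", "db2", "db3"]
four = ["db1", "db2", "db3", "db4"]

ok_three = safe_plan(three, 3)        # one node down at a time
risky_four = safe_plan(four, 3)       # first batch holds two of four nodes
ok_four = safe_plan(four, 4)          # one node down at a time again
```

Under this assumed rule, a 4-node cluster split into 3 batches puts two nodes in the first batch, leaving only half the cluster up. That is one way a 4-node cluster could run into trouble, though whether it matches the issue fixed by the follow-up is not stated in the commit message.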

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/713501

OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla-ansible (stable/rocky)

Change abandoned by Radosław Piliszek (<email address hidden>) on branch: stable/rocky
Review: https://review.opendev.org/713501
Reason: no time to pursue, rocky already em and code diverged much - could have different characteristics regarding stability

OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla-ansible (stable/stein)

Change abandoned by Radosław Piliszek (<email address hidden>) on branch: stable/stein
Review: https://review.opendev.org/706078
Reason: not pursuing, stein is oldie ;-)
