kolla-ansible

New MariaDB instance deployment fails due to missing container or possibly breaks the cluster

Bug #1857908 reported by Radosław Piliszek on 2019-12-30

This bug affects 3 people

	Status	Importance	Assigned to	Milestone
kolla-ansible	Fix Released	Medium	Radosław Piliszek	kolla-ansible 10.0.0 "ussuri"
Rocky	Won't Fix	Medium	Radosław Piliszek
Stein	Won't Fix	Medium	Radosław Piliszek
Train	Fix Released	Medium	Radosław Piliszek	kolla-ansible 9.1.0 "Train"
Ussuri	Fix Released	Medium	Radosław Piliszek	kolla-ansible 10.0.0 "ussuri"

Bug Description

Scenario:
- Have a working k-a deployment with MariaDB on n hosts.
- Add an (n+1)th host to the mariadb group and make it the first host in the group (easy to arrange with an INI inventory; the order is effectively random with a YAML one).
- k-a Stein then fails, complaining about a missing container:
RUNNING HANDLER [mariadb : remove restart policy from master mariadb]
Error response from daemon: No such container: mariadb

Users have reported this failure before, but the AIO case had a different root cause: the volume already existed but the container did not (see https://bugs.launchpad.net/kolla-ansible/+bug/1855268 ).
Here neither the volume nor the container exists, so it *should* work, yet it fails.
The culprit is that lookup_cluster ends up *not* registering the master MariaDB server properly and keeps the first host in the group as master instead. This breaks the Stein deploy action completely.
In other series (Rocky, Train) this violates the handler logic, which may break the MariaDB cluster if container images are about to change in the same action. Otherwise those series are affected only cosmetically: k-a claims it starts the master on the new node. :-)
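
The lookup problem can be illustrated with a minimal Python sketch. This is not the kolla-ansible implementation: the function names, the host names and the `has_data` mapping are hypothetical, standing in for whatever per-host check detects an existing MariaDB volume or container.

```python
def naive_master(group):
    """Always pick the first host in the inventory group."""
    return group[0]


def lookup_master(group, has_data):
    """Pick the first host that already holds MariaDB data.

    Fall back to the first host only when no host has data,
    i.e. when the cluster has to be bootstrapped from scratch.
    """
    for host in group:
        if has_data.get(host, False):
            return host
    return group[0]


# Hypothetical inventory: "new1" was just added and placed first.
group = ["new1", "db1", "db2"]
has_data = {"new1": False, "db1": True, "db2": True}

print(naive_master(group))             # new1 (no container exists there yet)
print(lookup_master(group, has_data))  # db1
```

With the naive choice, any handler that later acts on the "master" targets a host with no mariadb container, which is consistent with the `No such container: mariadb` error in the scenario.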

PS: This appears to have been broken since the refactoring in Pike: https://review.opendev.org/433480

OpenStack Infra (hudson-openstack) wrote on 2019-12-30: Fix proposed to kolla-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/700785

Changed in kolla-ansible:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2020-01-04: Change abandoned on kolla-ansible (master)

Change abandoned by Radosław Piliszek (<email address hidden>) on branch: master
Review: https://review.opendev.org/700785
Reason: superseded by https://review.opendev.org/701010 (3 phases approach + separation of deploys and upgrades)

Radosław Piliszek (yoctozepto) wrote on 2020-01-10:

Simply fixing the lookup revealed other issues with the deployment code. These have been addressed by https://review.opendev.org/701010

tags:	removed: stein
description:	updated
summary:	- [Stein] New MariaDB instance deployment fails due to missing container
	+ New MariaDB instance deployment fails due to missing container or
	+ possibly breaks the cluster

Radosław Piliszek (yoctozepto) on 2020-01-10

tags:	added: galera wsrep

OpenStack Infra (hudson-openstack) wrote on 2020-01-21: Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/701010
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=9f14ad651a9e6516d02c90d9eb0ec4b7a4702e7e
Submitter: Zuul
Branch: master

commit 9f14ad651a9e6516d02c90d9eb0ec4b7a4702e7e
Author: Radosław Piliszek <email address hidden>
Date: Fri Jan 3 11:20:00 2020 +0100

    Fix multiple issues with MariaDB handling

    These affected both deploy (and reconfigure) and upgrade
    resulting in WSREP issues, failed deploys or need to
    recover the cluster.

    This patch makes sure k-a does not abruptly terminate
    nodes to break cluster.
    This is achieved by cleaner separation between stages
    (bootstrap, restart current, deploy new) and 3 phases
    for restarts (to keep the quorum).

    Upgrade actions, which operate on a healthy cluster,
    went to its section.

    Service restart was refactored.

    We no longer rely on the master/slave distinction as
    all nodes are masters in Galera.

    Closes-bug: #1857908
    Closes-bug: #1859145
    Change-Id: I83600c69141714fc412df0976f49019a857655f5

Changed in kolla-ansible:
status:	In Progress → Fix Released
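
The commit message mentions restarting in phases "to keep the quorum". The idea can be sketched in Python under a simplified majority rule: a cluster of n nodes stays in quorum while more than n/2 nodes are up. The function names and node names below are hypothetical, and the batches are not meant to correspond to the three phases of the actual patch.

```python
def max_down(n):
    """Largest number of nodes that can be down while a strict majority stays up."""
    return (n - 1) // 2


def restart_batches(nodes):
    """Split nodes into consecutive restart batches.

    With 3 or more nodes, every batch leaves a strict majority running.
    With 1 or 2 nodes no restart can preserve a majority, so nodes are
    simply restarted one at a time.
    """
    size = max(1, max_down(len(nodes)))
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


print(restart_batches(["db1", "db2", "db3"]))
# [['db1'], ['db2'], ['db3']]
print(restart_batches(["db1", "db2", "db3", "db4", "db5"]))
# [['db1', 'db2'], ['db3', 'db4'], ['db5']]
```

Under this rule a 4-node cluster tolerates only one node down at a time, the same as a 3-node cluster, so an even node count adds no restart headroom.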

OpenStack Infra (hudson-openstack) wrote on 2020-02-03: Fix proposed to kolla-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/705414

OpenStack Infra (hudson-openstack) wrote on 2020-02-05: Fix proposed to kolla-ansible (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/706078

OpenStack Infra (hudson-openstack) wrote on 2020-02-06: Fix merged to kolla-ansible (stable/train)

Reviewed: https://review.opendev.org/705414
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=8acf5c132df02002e05a17c1754f5d79143a8d75
Submitter: Zuul
Branch: stable/train

commit 8acf5c132df02002e05a17c1754f5d79143a8d75
Author: Radosław Piliszek <email address hidden>
Date: Fri Jan 3 11:20:00 2020 +0100

    Fix multiple issues with MariaDB handling

    These affected both deploy (and reconfigure) and upgrade
    resulting in WSREP issues, failed deploys or need to
    recover the cluster.

    Upgrade actions, which operate on a healthy cluster,
    went to its section.

    Service restart was refactored.

    We no longer rely on the master/slave distinction as
    all nodes are masters in Galera.

    Backport includes also the:
    Followup on MariaDB handling fixes

    This fixes issues reported by Mark:
    - possible failure with 4-node cluster (however unlikely)
    - failure to stop all nodes from progressing when conditions are
      not valid (due to: "any_errors_fatal: False")

    Closes-bug: #1857908
    Closes-bug: #1859145
    Change-Id: I83600c69141714fc412df0976f49019a857655f5
    (cherry picked from commit 9f14ad651a9e6516d02c90d9eb0ec4b7a4702e7e)

OpenStack Infra (hudson-openstack) wrote on 2020-03-17: Fix proposed to kolla-ansible (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/713501

OpenStack Infra (hudson-openstack) wrote on 2020-04-04: Change abandoned on kolla-ansible (stable/rocky)

Change abandoned by Radosław Piliszek (<email address hidden>) on branch: stable/rocky
Review: https://review.opendev.org/713501
Reason: no time to pursue, rocky already em and code diverged much - could have different characteristics regarding stability

OpenStack Infra (hudson-openstack) wrote on 2020-06-15: Change abandoned on kolla-ansible (stable/stein)

Change abandoned by Radosław Piliszek (<email address hidden>) on branch: stable/stein
Review: https://review.opendev.org/706078
Reason: not pursuing, stein is oldie ;-)
