Multinode MariaDB 10.5 (galera 4) deployments may fail on WSREP

Bug #1947485 reported by Radosław Piliszek
This bug affects 1 person
Affects        Status        Importance  Assigned to        Milestone
kolla-ansible  Fix Released  High        Radosław Piliszek
Wallaby        Fix Released  High        Radosław Piliszek
Xena           Fix Released  High        Radosław Piliszek
Yoga           Fix Released  High        Radosław Piliszek

Bug Description

There seems to be a bug in Galera that causes TASK [mariadb : Check MariaDB service WSREP sync status] to fail.
One node (in a 3-node cluster) or more (possible with clusters of more than 3 nodes) may "lose the race" and get stuck in the "initialized" WSREP state. This is entirely random, as is the case with most race issues.
Restarting the MariaDB service on the affected node fixes the situation, but that is unwieldy.
This likely happens because Kolla Ansible starts and waits for all new nodes at once.
It did not bother the old Galera (Galera 3), which figured out the ordering by itself and let each node join the cluster properly.
The proposed workaround is to start and wait for the nodes serially.
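
A stuck node can be spotted by querying the standard Galera status variables on each host. A minimal check, assuming a typical Kolla deployment (the "mariadb" container name and the bare mysql invocation are assumptions; adjust container name and credentials for your environment):

    # A healthy, joined node reports "Synced"; an affected node stays in
    # "Initialized" and usually sees a cluster size of 1.
    docker exec mariadb mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"
    docker exec mariadb mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';"

As noted above, restarting the MariaDB service on the node that reports "Initialized" brings it back into the cluster.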

tags: added: mariadb
tags: added: galera
tags: added: multinode
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

In Wallaby, this affects only Debian, as only Debian uses MariaDB 10.5 and Galera 4.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/814187
Committed: https://opendev.org/openstack/kolla-ansible/commit/c94cc4a61a6c213da21b4f161ec43ce1d6f067e7
Submitter: "Zuul (22348)"
Branch: master

commit c94cc4a61a6c213da21b4f161ec43ce1d6f067e7
Author: Radosław Piliszek <email address hidden>
Date: Fri Oct 15 14:38:17 2021 +0000

    [mariadb] Start new nodes serially

    There seems to be a bug in Galera that causes
    TASK [mariadb : Check MariaDB service WSREP sync status]
    to fail.
    One (in case of 3-node cluster) or more (possible with
    more-than-3-node clusters) nodes may "lose the race" and get stuck
    in the "initialized" state of WSREP.
    This is entirely random as is the case with most race issues.
    MariaDB service restart on that node will fix the situation but
    it's unwieldy.
    The above may happen because Kolla Ansible starts and waits for
    all new nodes at once.
    This did not bother the old galera (galera 3) which figured out
    the ordering for itself and let each node join the cluster properly.
    The proposed workaround is to start and wait for nodes serially.

    Change-Id: I449d4c2073d4e3953e9f09725577d2e1c9d563c9
    Closes-Bug: #1947485

Changed in kolla-ansible:
status: In Progress → Fix Released
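
The idea behind the change can be illustrated with plain Ansible. This is only a sketch, not the actual kolla-ansible tasks: the container handling is simplified to shell commands, the retry values are made up, and the key point is that "throttle: 1" runs the start-and-wait on one host at a time instead of on all new nodes in parallel.

    # Illustrative sketch only -- not the real kolla-ansible implementation.
    - name: Start MariaDB container on the new node
      command: docker start mariadb        # container name per a typical Kolla setup
      throttle: 1                          # one host at a time

    - name: Check MariaDB service WSREP sync status
      command: >
        docker exec mariadb mysql -e
        "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'"
      register: wsrep_state
      until: "'Synced' in wsrep_state.stdout"
      retries: 12
      delay: 10
      throttle: 1                          # wait for this node before starting the next

With Galera 3 the parallel start was tolerated because the nodes ordered the join among themselves; serializing the start as above avoids the race described in this bug with Galera 4.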
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/814269

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/814450

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/814269
Committed: https://opendev.org/openstack/kolla-ansible/commit/be6fd7b7df1f931993454004880df4703ef10eec
Submitter: "Zuul (22348)"
Branch: stable/xena

commit be6fd7b7df1f931993454004880df4703ef10eec
Author: Radosław Piliszek <email address hidden>
Date: Fri Oct 15 14:38:17 2021 +0000

    [mariadb] Start new nodes serially

    There seems to be a bug in Galera that causes
    TASK [mariadb : Check MariaDB service WSREP sync status]
    to fail.
    One (in case of 3-node cluster) or more (possible with
    more-than-3-node clusters) nodes may "lose the race" and get stuck
    in the "initialized" state of WSREP.
    This is entirely random as is the case with most race issues.
    MariaDB service restart on that node will fix the situation but
    it's unwieldy.
    The above may happen because Kolla Ansible starts and waits for
    all new nodes at once.
    This did not bother the old galera (galera 3) which figured out
    the ordering for itself and let each node join the cluster properly.
    The proposed workaround is to start and wait for nodes serially.

    Change-Id: I449d4c2073d4e3953e9f09725577d2e1c9d563c9
    Closes-Bug: #1947485
    (cherry picked from commit c94cc4a61a6c213da21b4f161ec43ce1d6f067e7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/814450
Committed: https://opendev.org/openstack/kolla-ansible/commit/8109217a736fd8ba0cfc7406c6d994e703d8a4fc
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 8109217a736fd8ba0cfc7406c6d994e703d8a4fc
Author: Radosław Piliszek <email address hidden>
Date: Fri Oct 15 14:38:17 2021 +0000

    [mariadb] Start new nodes serially

    There seems to be a bug in Galera that causes
    TASK [mariadb : Check MariaDB service WSREP sync status]
    to fail.
    One (in case of 3-node cluster) or more (possible with
    more-than-3-node clusters) nodes may "lose the race" and get stuck
    in the "initialized" state of WSREP.
    This is entirely random as is the case with most race issues.
    MariaDB service restart on that node will fix the situation but
    it's unwieldy.
    The above may happen because Kolla Ansible starts and waits for
    all new nodes at once.
    This did not bother the old galera (galera 3) which figured out
    the ordering for itself and let each node join the cluster properly.
    The proposed workaround is to start and wait for nodes serially.

    Change-Id: I449d4c2073d4e3953e9f09725577d2e1c9d563c9
    Closes-Bug: #1947485
    (cherry picked from commit c94cc4a61a6c213da21b4f161ec43ce1d6f067e7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 13.0.0.0rc2

This issue was fixed in the openstack/kolla-ansible 13.0.0.0rc2 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 12.3.0

This issue was fixed in the openstack/kolla-ansible 12.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 14.0.0.0rc1

This issue was fixed in the openstack/kolla-ansible 14.0.0.0rc1 release candidate.
