Multinode MariaDB 10.5 (galera 4) deployments may fail on WSREP

Bug #1947485 reported by Radosław Piliszek
This bug affects 1 person
Affects        Status        Importance  Assigned to        Milestone
kolla-ansible  Fix Released  High        Radosław Piliszek
Wallaby        Fix Released  High        Radosław Piliszek
Xena           Fix Released  High        Radosław Piliszek
Yoga           Fix Released  High        Radosław Piliszek

Bug Description

There seems to be a bug in Galera that causes TASK [mariadb : Check MariaDB service WSREP sync status] to fail.
One node (in a 3-node cluster) or more (possible with clusters of more than 3 nodes) may "lose the race" and get stuck in the "initialized" WSREP state. This is entirely random, as is the case with most race issues.
Restarting the MariaDB service on the affected node fixes the situation, but that is unwieldy.
This likely happens because Kolla Ansible starts and waits for all new nodes at once.
It did not bother the old Galera (Galera 3), which figured out the ordering by itself and let each node join the cluster properly.
The proposed workaround is to start and wait for the nodes serially.
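
A stuck node can be spotted by querying the standard Galera status variables on each host. A minimal check, assuming a typical Kolla deployment (the "mariadb" container name and the bare mysql invocation are assumptions; adjust container name and credentials for your environment):

    # A healthy, joined node reports "Synced"; an affected node stays in
    # "Initialized" and usually sees a cluster size of 1.
    docker exec mariadb mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"
    docker exec mariadb mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';"

As noted above, restarting the MariaDB service on the node that reports "Initialized" brings it back into the cluster.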

tags: added: mariadb
tags: added: galera
tags: added: multinode
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

In Wallaby, this affects only Debian, as only Debian uses MariaDB 10.5 and Galera 4.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/814187
Committed: https://opendev.org/openstack/kolla-ansible/commit/c94cc4a61a6c213da21b4f161ec43ce1d6f067e7
Submitter: "Zuul (22348)"
Branch: master

commit c94cc4a61a6c213da21b4f161ec43ce1d6f067e7
Author: Radosław Piliszek <email address hidden>
Date: Fri Oct 15 14:38:17 2021 +0000

    [mariadb] Start new nodes serially

    There seems to be a bug in Galera that causes
    TASK [mariadb : Check MariaDB service WSREP sync status]
    to fail.
    One (in case of 3-node cluster) or more (possible with
    more-than-3-node clusters) nodes may "lose the race" and get stuck
    in the "initialized" state of WSREP.
    This is entirely random as is the case with most race issues.
    MariaDB service restart on that node will fix the situation but
    it's unwieldy.
    The above may happen because Kolla Ansible starts and waits for
    all new nodes at once.
    This did not bother the old galera (galera 3) which figured out
    the ordering for itself and let each node join the cluster properly.
    The proposed workaround is to start and wait for nodes serially.

    Change-Id: I449d4c2073d4e3953e9f09725577d2e1c9d563c9
    Closes-Bug: #1947485

Changed in kolla-ansible:
status: In Progress → Fix Released
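
The idea behind the change can be illustrated with plain Ansible. This is only a sketch, not the actual kolla-ansible tasks: the container handling is simplified to shell commands, the retry values are made up, and the key point is that "throttle: 1" runs the start-and-wait on one host at a time instead of on all new nodes in parallel.

    # Illustrative sketch only -- not the real kolla-ansible implementation.
    - name: Start MariaDB container on the new node
      command: docker start mariadb        # container name per a typical Kolla setup
      throttle: 1                          # one host at a time

    - name: Check MariaDB service WSREP sync status
      command: >
        docker exec mariadb mysql -e
        "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'"
      register: wsrep_state
      until: "'Synced' in wsrep_state.stdout"
      retries: 12
      delay: 10
      throttle: 1                          # wait for this node before starting the next

With Galera 3 the parallel start was tolerated because the nodes ordered the join among themselves; serializing the start as above avoids the race described in this bug with Galera 4.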
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/814269

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/814450

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/814269
Committed: https://opendev.org/openstack/kolla-ansible/commit/be6fd7b7df1f931993454004880df4703ef10eec
Submitter: "Zuul (22348)"
Branch: stable/xena

commit be6fd7b7df1f931993454004880df4703ef10eec
Author: Radosław Piliszek <email address hidden>
Date: Fri Oct 15 14:38:17 2021 +0000

    [mariadb] Start new nodes serially

    There seems to be a bug in Galera that causes
    TASK [mariadb : Check MariaDB service WSREP sync status]
    to fail.
    One (in case of 3-node cluster) or more (possible with
    more-than-3-node clusters) nodes may "lose the race" and get stuck
    in the "initialized" state of WSREP.
    This is entirely random as is the case with most race issues.
    MariaDB service restart on that node will fix the situation but
    it's unwieldy.
    The above may happen because Kolla Ansible starts and waits for
    all new nodes at once.
    This did not bother the old galera (galera 3) which figured out
    the ordering for itself and let each node join the cluster properly.
    The proposed workaround is to start and wait for nodes serially.

    Change-Id: I449d4c2073d4e3953e9f09725577d2e1c9d563c9
    Closes-Bug: #1947485
    (cherry picked from commit c94cc4a61a6c213da21b4f161ec43ce1d6f067e7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/814450
Committed: https://opendev.org/openstack/kolla-ansible/commit/8109217a736fd8ba0cfc7406c6d994e703d8a4fc
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 8109217a736fd8ba0cfc7406c6d994e703d8a4fc
Author: Radosław Piliszek <email address hidden>
Date: Fri Oct 15 14:38:17 2021 +0000

    [mariadb] Start new nodes serially

    There seems to be a bug in Galera that causes
    TASK [mariadb : Check MariaDB service WSREP sync status]
    to fail.
    One (in case of 3-node cluster) or more (possible with
    more-than-3-node clusters) nodes may "lose the race" and get stuck
    in the "initialized" state of WSREP.
    This is entirely random as is the case with most race issues.
    MariaDB service restart on that node will fix the situation but
    it's unwieldy.
    The above may happen because Kolla Ansible starts and waits for
    all new nodes at once.
    This did not bother the old galera (galera 3) which figured out
    the ordering for itself and let each node join the cluster properly.
    The proposed workaround is to start and wait for nodes serially.

    Change-Id: I449d4c2073d4e3953e9f09725577d2e1c9d563c9
    Closes-Bug: #1947485
    (cherry picked from commit c94cc4a61a6c213da21b4f161ec43ce1d6f067e7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 13.0.0.0rc2

This issue was fixed in the openstack/kolla-ansible 13.0.0.0rc2 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 12.3.0

This issue was fixed in the openstack/kolla-ansible 12.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 14.0.0.0rc1

This issue was fixed in the openstack/kolla-ansible 14.0.0.0rc1 release candidate.
