Multinode galera upgrade from Rocky to Stein fails on CentOS

Bug #1834191 reported by Mark Goddard
This bug affects 1 person
Affects         Status         Importance   Assigned to    Milestone
kolla-ansible   Fix Released   Critical     Mark Goddard
Stein           Fix Released   Critical     Mark Goddard

Bug Description

Currently the kolla-ansible-centos-source-upgrade-ceph job is failing on the stable/stein branch.

The problem occurs with mariadb, when performing an upgrade to the Stein release which has a new version of mariadb. It appears that when the slave mariadb services are shut down, we do not wait for the container to stop, so the service may not shut down cleanly. This prevents it from starting up successfully.

Example:

http://logs.openstack.org/periodic/opendev.org/openstack/kolla-ansible/stable/stein/kolla-ansible-centos-source-upgrade-ceph/d4dd98f/secondary1/logs/kolla/mariadb/mariadb.txt.gz
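The missing wait described above can be illustrated with a small polling loop. This is only a sketch, not the kolla-ansible implementation: the function names, the timeout values, and the use of `docker inspect` to read the container state are assumptions made for illustration.

```python
import subprocess
import time


def container_running(name):
    """Return True if Docker reports the named container as running."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Running}}", name],
        capture_output=True,
        text=True,
    )
    # A non-zero exit code (for example, no such container) counts as stopped.
    return result.returncode == 0 and result.stdout.strip() == "true"


def wait_for_stop(name, timeout=60.0, interval=1.0,
                  is_running=container_running, sleep=time.sleep,
                  clock=time.monotonic):
    """Poll until the container has stopped; return False on timeout."""
    deadline = clock() + timeout
    while is_running(name):
        if clock() >= deadline:
            return False
        sleep(interval)
    return True
```

Calling `wait_for_stop("mariadb")` after issuing the shutdown command would block until the container has exited or the timeout expires. The `is_running`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a Docker daemon.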

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/667363

Mark Goddard (mgoddard)
Changed in kolla-ansible:
importance: Undecided → Critical
assignee: nobody → Mark Goddard (mgoddard)
Mark Goddard (mgoddard) wrote :

Another issue was found. During the upgrade, both rocky and stein mariadb containers can be running. In Stein we switched from xtrabackup to mariabackup for the galera state sync, which means that stein and rocky containers cannot sync. I didn't hit this locally, but it was seen in CI. Here are the relevant error messages from the primary node at that time:

2019-06-25 19:05:50 140049019632384 [Note] WSREP: sst_donor_thread signaled with 0
2019-06-25 19:05:50 140044555761408 [Note] WSREP: async IST sender starting to serve tcp://10.209.96.149:4568 sending 8524-8557
sh: wsrep_sst_mariabackup: command not found
2019-06-25 19:05:50 140044564154112 [ERROR] WSREP: Failed to read from: wsrep_sst_mariabackup --role 'donor' --address '10.209.96.149:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --binlog 'mysql-bin' --gtid 'b99680cf-9773-11e9-b90b-e6ca413f0ef1:8523' --gtid-domain-id '0' --bypass
2019-06-25 19:05:50 140044564154112 [ERROR] WSREP: Process completed with error: wsrep_sst_mariabackup --role 'donor' --address '10.209.96.149:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --binlog 'mysql-bin' --gtid 'b99680cf-9773-11e9-b90b-e6ca413f0ef1:8523' --gtid-domain-id '0' --bypass: 2 (No such file or directory)
2019-06-25 19:05:50 140044564154112 [ERROR] WSREP: Command did not run: wsrep_sst_mariabackup --role 'donor' --address '10.209.96.149:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --binlog 'mysql-bin' --gtid 'b99680cf-9773-11e9-b90b-e6ca413f0ef1:8523' --gtid-domain-id '0' --bypass
2019-06-25 19:05:50 140049068836608 [Warning] WSREP: 1.0 (secondary1): State transfer to 0.0 (secondary2) failed: -2 (No such file or directory)

http://logs.openstack.org/63/667363/3/check/kolla-ansible-centos-source-upgrade-ceph/479cd15/secondary1/logs/kolla/mariadb/mariadb.txt.gz#_2019-06-25_19_05_50

I think we need to shut down all nodes and perform a recovery in this case.
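To spot this condition in a collected mariadb log, one option is to scan for the shell error shown in the excerpt above. The following is a minimal sketch; the function name and the regular expression are hypothetical and are based only on the `command not found` line in the log.

```python
import re

# Matches the shell error emitted when the requested SST script is absent,
# e.g. "sh: wsrep_sst_mariabackup: command not found".
MISSING_SST = re.compile(r"^sh: wsrep_sst_([\w-]+): command not found$")


def missing_sst_methods(log_lines):
    """Return the set of SST methods whose scripts the shell could not find."""
    found = set()
    for line in log_lines:
        match = MISSING_SST.match(line.strip())
        if match:
            found.add(match.group(1))
    return found
```

A non-empty result suggests that a donor was asked to use an SST method its container image does not ship, which is the mixed Rocky and Stein situation described in this comment.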

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/stein)

Reviewed: https://review.opendev.org/667363
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=99cd5ec10c910bff3c238942a613faafdce0a2e2
Submitter: Zuul
Branch: stable/stein

commit 99cd5ec10c910bff3c238942a613faafdce0a2e2
Author: Mark Goddard <email address hidden>
Date: Tue Jun 25 12:59:41 2019 +0100

    Wait for mariadb to stop after shutdown

    Stein only.

    Currently the kolla-ansible-centos-source-upgrade-ceph job is failing on
    the stable/stein branch.

    The problem occurs with mariadb, when performing an upgrade to the Stein
    release which has a new version of mariadb. It appears that when the
    slave mariadb services are shut down, we do not wait for the container
    to stop, so the service may not shut down cleanly. This prevents it from
    starting up successfully.

    This change waits for the container to stop after the shutdown command
    has been executed. It also temporarily replaces the restart policy of
    the container to prevent it from starting up again after the shutdown.

    This is not required in other branches since the mariadb shutdown
    workaround was only added in the stein branch for bug 1820325.

    There is a second issue that is addressed here. The Stein release
    switched from using xtrabackup to mariabackup for galera state syncing.
    If we run both container versions at the same time on different hosts
    then we can get an error such as the following:

    sh: wsrep_sst_mariabackup: command not found

    We therefore now stop the cluster and perform a recovery during an
    upgrade, if we detect that xtrabackup is in use.

    Finally, we now wait for the bootstrap host to report that it is in an
    OPERATIONAL state. Without this we can see errors where the MariaDB
    cluster is not ready when used by other services.

    Change-Id: I513bcf31adaee8441d43c6b578ca06f66820e52b
    Closes-Bug: #1834191
    Related-Bug: #1820325
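The final step in the commit message, waiting for the bootstrap host to report an OPERATIONAL state, can be sketched as a retry loop. This is an illustration under assumptions, not the code from the change: it assumes the state is read from the Galera `wsrep_evs_state` status variable, that the status output is tab-separated (as the `mysql` client produces in batch mode), and that `query_status` is a caller-supplied function that runs the query. The function names and retry values are hypothetical.

```python
import time


def evs_state(status_output):
    """Extract the wsrep_evs_state value from 'name<TAB>value' status output."""
    for line in status_output.splitlines():
        name, _, value = line.partition("\t")
        if name.strip() == "wsrep_evs_state":
            return value.strip()
    return None


def wait_until_operational(query_status, retries=10, delay=6.0,
                           sleep=time.sleep):
    """Retry until the node reports OPERATIONAL; return False if it never does."""
    for attempt in range(retries):
        if evs_state(query_status()) == "OPERATIONAL":
            return True
        # Do not sleep after the last attempt.
        if attempt < retries - 1:
            sleep(delay)
    return False
```

Here `query_status` might wrap a command such as `mysql --batch -e "SHOW STATUS LIKE 'wsrep_evs_state'"` run inside the mariadb container. Services that depend on the database would be started only after the function returns True.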

Changed in kolla-ansible:
status: New → Fix Released
status: Fix Released → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 8.0.0.0rc2

This issue was fixed in the openstack/kolla-ansible 8.0.0.0rc2 release candidate.

Mark Goddard (mgoddard)
Changed in kolla-ansible:
status: Fix Committed → Fix Released