kolla-ansible

Multinode galera upgrade from Rocky to Stein fails on CentOS

Bug #1834191 reported by Mark Goddard on 2019-06-25

This bug affects 1 person

Affects          Status         Importance   Assigned to    Milestone
kolla-ansible    Fix Released   Critical     Mark Goddard
Stein            Fix Released   Critical     Mark Goddard   kolla-ansible 8.0.0 "Stein"

Bug Description

Currently the kolla-ansible-centos-source-upgrade-ceph job is failing on the stable/stein branch.

The problem occurs with mariadb, when performing an upgrade to the Stein release which has a new version of mariadb. It appears that when the slave mariadb services are shut down, we do not wait for the container to stop, so the service may not shut down cleanly. This prevents it from starting up successfully.

Example:

http://logs.openstack.org/periodic/opendev.org/openstack/kolla-ansible/stable/stein/kolla-ansible-centos-source-upgrade-ceph/d4dd98f/secondary1/logs/kolla/mariadb/mariadb.txt.gz
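
The missing step described above is a bounded wait for the container to stop after the shutdown command is issued. A minimal Python sketch of such a wait follows. It is an illustration only, not the kolla-ansible implementation: `wait_until` and `container_stopped_after` are hypothetical names, and the container check is a stand-in that a real deployment would replace with a query of the container runtime.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout: float = 60.0,
               interval: float = 1.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll predicate until it returns True or timeout seconds elapse.

    Returns True if the predicate succeeded, False on timeout.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def container_stopped_after(polls: int) -> Callable[[], bool]:
    """Hypothetical stand-in for a real container state check.

    Reports "stopped" on the given poll number and on every later poll.
    """
    remaining = [polls]

    def check() -> bool:
        remaining[0] -= 1
        return remaining[0] <= 0

    return check


if __name__ == "__main__":
    # The simulated container reports "stopped" on the third poll.
    stopped = wait_until(container_stopped_after(3), timeout=10, interval=0.1)
    print("stopped cleanly" if stopped else "timed out waiting for stop")
```

Only once the wait reports success is it safe to start the upgraded container; on timeout the upgrade should fail rather than continue.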

OpenStack Infra (hudson-openstack) wrote on 2019-06-25: Fix proposed to kolla-ansible (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/667363

Mark Goddard (mgoddard) on 2019-06-25

Changed in kolla-ansible:
importance:	Undecided → Critical
assignee:	nobody → Mark Goddard (mgoddard)

Mark Goddard (mgoddard) wrote on 2019-06-26:

Another issue was found. During the upgrade, both rocky and stein mariadb containers can be running. In Stein we switched from xtrabackup to mariabackup for the galera state sync, which means that stein and rocky containers cannot sync. I didn't hit this locally, but it was seen in CI. Here are the relevant error messages from the primary node at that time:

2019-06-25 19:05:50 140049019632384 [Note] WSREP: sst_donor_thread signaled with 0
2019-06-25 19:05:50 140044555761408 [Note] WSREP: async IST sender starting to serve tcp://10.209.96.149:4568 sending 8524-8557
sh: wsrep_sst_mariabackup: command not found
2019-06-25 19:05:50 140044564154112 [ERROR] WSREP: Failed to read from: wsrep_sst_mariabackup --role 'donor' --address '10.209.96.149:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --binlog 'mysql-bin' --gtid 'b99680cf-9773-11e9-b90b-e6ca413f0ef1:8523' --gtid-domain-id '0' --bypass
2019-06-25 19:05:50 140044564154112 [ERROR] WSREP: Process completed with error: wsrep_sst_mariabackup --role 'donor' --address '10.209.96.149:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --binlog 'mysql-bin' --gtid 'b99680cf-9773-11e9-b90b-e6ca413f0ef1:8523' --gtid-domain-id '0' --bypass: 2 (No such file or directory)
2019-06-25 19:05:50 140044564154112 [ERROR] WSREP: Command did not run: wsrep_sst_mariabackup --role 'donor' --address '10.209.96.149:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --binlog 'mysql-bin' --gtid 'b99680cf-9773-11e9-b90b-e6ca413f0ef1:8523' --gtid-domain-id '0' --bypass
2019-06-25 19:05:50 140049068836608 [Warning] WSREP: 1.0 (secondary1): State transfer to 0.0 (secondary2) failed: -2 (No such file or directory)

http://logs.openstack.org/63/667363/3/check/kolla-ansible-centos-source-upgrade-ceph/479cd15/secondary1/logs/kolla/mariadb/mariadb.txt.gz#_2019-06-25_19_05_50

I think we need to shut down all nodes and perform a recovery in this case.
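
One way to decide whether a full stop and recovery is needed is to check which state transfer method the running node is configured with. The Python sketch below is a hedged illustration and not what the fix does verbatim: it assumes the method is set through the MariaDB option `wsrep_sst_method` in a my.cnf-style file, that the value seen on Rocky is something like `xtrabackup-v2` (an assumption), and the function names are hypothetical.

```python
import re
from typing import Optional

# Matches uncommented "wsrep_sst_method = value" lines in a my.cnf-style file.
_SST_RE = re.compile(r"^\s*wsrep_sst_method\s*=\s*(\S+)", re.MULTILINE)


def sst_method(config_text: str) -> Optional[str]:
    """Return the last wsrep_sst_method value found, or None if it is unset."""
    matches = _SST_RE.findall(config_text)
    return matches[-1].strip("'\"") if matches else None


def needs_full_cluster_restart(running_config: str,
                               target_method: str = "mariabackup") -> bool:
    """True when the running node uses a different SST method than the target.

    In that case old and new nodes cannot sync state with each other, so the
    whole cluster has to be stopped and recovered instead of upgraded node
    by node.
    """
    method = sst_method(running_config)
    return method is not None and method != target_method


if __name__ == "__main__":
    # Example value only; the actual Rocky setting is an assumption here.
    rocky_cnf = "[mysqld]\nwsrep_sst_method = xtrabackup-v2\n"
    print(needs_full_cluster_restart(rocky_cnf))
```

The sketch does not handle the hyphenated spelling `wsrep-sst-method` or included config files; a real check would need to.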

OpenStack Infra (hudson-openstack) wrote on 2019-07-05: Fix merged to kolla-ansible (stable/stein)

Reviewed: https://review.opendev.org/667363
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=99cd5ec10c910bff3c238942a613faafdce0a2e2
Submitter: Zuul
Branch: stable/stein

commit 99cd5ec10c910bff3c238942a613faafdce0a2e2
Author: Mark Goddard <email address hidden>
Date: Tue Jun 25 12:59:41 2019 +0100

    Wait for mariadb to stop after shutdown

    Stein only.

    Currently the kolla-ansible-centos-source-upgrade-ceph job is failing on
    the stable/stein branch.

    The problem occurs with mariadb, when performing an upgrade to the Stein
    release which has a new version of mariadb. It appears that when the
    slave mariadb services are shut down, we do not wait for the container
    to stop, so the service may not shut down cleanly. This prevents it from
    starting up successfully.

    This change waits for the container to stop after the shutdown command
    has been executed. It also temporarily replaces the restart policy of
    the container to prevent it from starting up again after the shutdown.

    This is not required in other branches since the mariadb shutdown
    workaround was only added in the stein branch for bug 1820325.

    There is a second issue that is addressed here. The Stein release
    switched from using xtrabackup to mariabackup for galera state syncing.
    If we run both container versions at the same time on different hosts
    then we can get an error such as the following:

    sh: wsrep_sst_mariabackup: command not found

    We therefore now stop the cluster and perform a recovery during an
    upgrade, if we detect that xtrabackup is in use.

    Finally, we now wait for the bootstrap host to report that it is in an
    OPERATIONAL state. Without this we can see errors where the MariaDB
    cluster is not ready when used by other services.

    Change-Id: I513bcf31adaee8441d43c6b578ca06f66820e52b
    Closes-Bug: #1834191
    Related-Bug: #1820325
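
The final point in the commit message, waiting for the bootstrap host to report an OPERATIONAL state, can be sketched as a check on the server's status output. The Python below is an assumption-laden illustration rather than the code from the change: it assumes the state is read from the Galera status variable `wsrep_evs_state` and that the input is the tab-separated batch output of a `SHOW STATUS LIKE 'wsrep_evs_state'` query. The function names are hypothetical.

```python
from typing import Optional


def evs_state(show_status_output: str) -> Optional[str]:
    """Extract the wsrep_evs_state value from SHOW STATUS batch output.

    Expects whitespace-separated "name value" rows, optionally preceded by a
    "Variable_name Value" header row. Returns None if the variable is absent.
    """
    for line in show_status_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "wsrep_evs_state":
            return fields[1]
    return None


def is_operational(show_status_output: str) -> bool:
    """True only when the node reports the OPERATIONAL state."""
    return evs_state(show_status_output) == "OPERATIONAL"


if __name__ == "__main__":
    sample = "Variable_name\tValue\nwsrep_evs_state\tOPERATIONAL\n"
    print(is_operational(sample))
```

In practice this check would be retried with a timeout until it succeeds, so that services depending on the database only start once the cluster is ready.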

Radosław Piliszek (yoctozepto) on 2019-07-12

Changed in kolla-ansible:
status:	New → Fix Released
status:	Fix Released → Fix Committed

OpenStack Infra (hudson-openstack) wrote on 2019-07-18: Fix included in openstack/kolla-ansible 8.0.0.0rc2

This issue was fixed in the openstack/kolla-ansible 8.0.0.0rc2 release candidate.

Mark Goddard (mgoddard) on 2019-08-07

Changed in kolla-ansible:
status:	Fix Committed → Fix Released
