kolla-ansible

Upgrading from Rocky to Stein fails on MariaDB

Bug #1820325 reported by Mark Goddard on 2019-03-15

This bug affects 2 people

	Status	Importance	Assigned to	Milestone
kolla-ansible	Fix Released	Critical	Mark Goddard	kolla-ansible 8.0.0 "Stein"
Rocky	Fix Committed	Medium	Mark Goddard
Stein	Fix Released	Critical	Mark Goddard	kolla-ansible 8.0.0 "Stein"

Bug Description

MariaDB was upgraded from 10.1 to 10.3 in the Stein release. It seems upgrades do not work currently, as seen in kayobe upgrade testing and anecdotally in IRC.

There are no useful error logs either in `docker logs mariadb` or /var/log/kolla/mariadb/mariadb.log.

Using the following command I was able to get some logs on startup:

docker run --rm -it --name mariadb_test -e KOLLA_CONFIG_STRATEGY=COPY_ALWAYS -e KOLLA_BASE_DISTRO=centos -e KOLLA_INSTALL_TYPE=binary -v kolla_logs:/var/log/kolla -v /etc/localtime:/etc/localtime:ro -v mariadb:/var/lib/mysql -v /etc/kolla/mariadb:/var/lib/kolla/config_files:ro --net=host kolla/centos-binary-mariadb:master bash -c 'sudo -E kolla_set_configs && . kolla_extend_start && bash -x mysqld_safe'

For some reason the -x causes logs to be dumped to stdout/err.

Here's a dump of the output: http://paste.openstack.org/show/747900/. It's verbose due to the -x.

The pertinent logs seem to be these:

nohup /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql/ --plugin-dir=/usr/lib64/mariadb/plugin --wsrep_provider=/usr/lib64/galera/libgalera_smm.so --wsrep_on=ON --log-error=/var/log/kolla/mariadb/mariadb.log --pid-file=/var/lib/mysql/mariadb.pid --port=3306 --wsrep_recover --disable-log-error --pid-file=/var/lib/mysql//sossidge.novalocal-recover.pid
++ grep 'WSREP: Recovered position:' /tmp/wsrep_recovery.kZaWpN
+ local rp=
+ '[' -z '' ']'
++ grep WSREP /tmp/wsrep_recovery.kZaWpN
++ grep 'skipping position recovery'
+ local skipped=
+ '[' -z '' ']'
++ cat /tmp/wsrep_recovery.kZaWpN
+ log_error 'WSREP: Failed to recover position: '\''nohup: appending output to '\''/var/lib/mysql/nohup.out'\''
2019-03-15 17:58:59 0 [Note] /usr/libexec/mysqld (mysqld 10.3.10-MariaDB-log) starting as process 184 ...
2019-03-15 17:58:59 0 [Note] InnoDB: Using Linux native AIO
2019-03-15 17:58:59 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2019-03-15 17:58:59 0 [Note] InnoDB: Uses event mutexes
2019-03-15 17:58:59 0 [Note] InnoDB: Compressed tables use zlib 1.2.7
2019-03-15 17:58:59 0 [Note] InnoDB: Number of pools: 1
2019-03-15 17:58:59 0 [Note] InnoDB: Using SSE2 crc32 instructions
2019-03-15 17:58:59 0 [Note] InnoDB: Initializing buffer pool, total size = 4G, instances = 8, chunk size = 128M
2019-03-15 17:59:00 0 [Note] InnoDB: Completed initialization of buffer pool
2019-03-15 17:59:00 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2019-03-15 17:59:00 0 [ERROR] InnoDB: Upgrade after a crash is not supported. This redo log was created before MariaDB 10.2.2.
2019-03-15 17:59:00 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
2019-03-15 17:59:00 0 [Note] InnoDB: Starting shutdown...
2019-03-15 17:59:00 0 [ERROR] Plugin '\''InnoDB'\'' init function returned error.
2019-03-15 17:59:00 0 [ERROR] Plugin '\''InnoDB'\'' registration as a STORAGE ENGINE failed.
2019-03-15 17:59:00 0 [Note] Plugin '\''FEEDBACK'\'' is disabled.
2019-03-15 17:59:00 0 [ERROR] Unknown/unsupported storage engine: innodb
2019-03-15 17:59:00 0 [ERROR] Aborting'\'''

Tags:

Revision history for this message

Mark Goddard (mgoddard) wrote on 2019-03-15:

Following https://serverfault.com/questions/379714/unknown-unsupported-storage-engine-innodb-mysql-ubuntu, I removed /var/lib/mysql/ib_logfile0 and /var/lib/mysql/ib_logfile1, which allowed the service to start. There are various bad looking logs, e.g.

[Warning] InnoDB: Table mysql/innodb_index_stats has length mismatch in the column name table_name. Please run mysql_upgrade

This is clearly not the right course of action, but does show what is blocking the startup.

We also see these messages, suggesting we need to run mysql_upgrade:

[Warning] InnoDB: Table mysql/innodb_table_stats has length mismatch in the column name table_name. Please run mysql_upgrade

Revision history for this message

Mark Goddard (mgoddard) wrote on 2019-03-15:

In order to run mysql_upgrade, I had to remove the directory /var/lib/mysql/.pki, otherwise we get the following error:

mysqlcheck: Got error: 1102: Incorrect database name '#mysql50#.pki' when selecting the database
FATAL ERROR: Upgrade failed

Command used was this:

mysql_upgrade -u root -p<DB root password>

Revision history for this message

Mark Goddard (mgoddard) wrote on 2019-03-15:

Looks like we're killing the maridb service, which leaves around old state, preventing the upgrade.

docker stop -t 60 mariadb

sudo journalctl -u docker | tail -n 1

Container c2a33c47667f80452ab18fe10c1564309955db0e6b4cb765348371746f11a72d failed to exit within 60 seconds of signal 15 - using the force"

Revision history for this message

Mark Goddard (mgoddard) wrote on 2019-03-15:

dumb-init is executed with --single-process so only forwards the term signal to mysqld_safe script, which apparently just ignores it. We need to call this to shutdown:

mysqladmin shutdown -p<password>

Perhaps we need to wrap mysqld_admin in a script that monitors signals and shuts down mysql as necessary.

For stopping existing containers, we might need to do something in ansible.

Mark Goddard (mgoddard) on 2019-03-17

Changed in kolla-ansible:
importance:	Undecided → High
assignee:	nobody → Mark Goddard (mgoddard)

Eduardo Gonzalez (egonzalez90) on 2019-03-18

Changed in kolla-ansible:
importance:	High → Critical

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-03-18: Fix proposed to kolla-ansible (master)

Fix proposed to branch: master
Review: https://review.openstack.org/644250

Changed in kolla-ansible:
status:	New → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-03-26: Fix merged to kolla-ansible (master)

Reviewed: https://review.openstack.org/644250
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=b25c0ee4772fdf04fe6f263a8e81571204f706ff
Submitter: Zuul
Branch: master

commit b25c0ee4772fdf04fe6f263a8e81571204f706ff
Author: Mark Goddard <email address hidden>
Date: Sun Mar 17 17:48:54 2019 +0000

Fix MariaDB 10.3 upgrade

    Upgrading MariaDB from Rocky to Stein currently fails, with the new
    container left continually restarting. The problem is that the Rocky
    container does not shutdown cleanly, leaving behind state that the new
    container cannot recover. The container does not shutdown cleanly
    because we run dumb-init with a --single-child argument, causing it to
    forward signals to only the process executed by dumb-init. In our case
    this is mysqld_safe, which ignores various signals, including SIGTERM.
    After a (default 10 second) timeout, Docker then kills the container.

    A Kolla change [1] removes the --single-child argument from dumb-init
    for the MariaDB container, however we still need to support upgrading
    from Rocky images that don't have this change. To do that, we add new
    handlers to execute 'mysqladmin shutdown' to cleanly shutdown the
    service.

    A second issue with the current upgrade approach is that we don't
    execute mysql_upgrade after starting the new service. This can leave the
    database state using the format of the previous release. This patch also
    adds handlers to execute mysql_upgrade.

[1] https://review.openstack.org/644244

    Depends-On: https://review.openstack.org/644244
    Depends-On: https://review.openstack.org/645990
    Change-Id: I08a655a359ff9cfa79043f2166dca59199c7d67f
    Closes-Bug: #1820325

Changed in kolla-ansible:
status:	In Progress → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-05: Fix included in openstack/kolla 8.0.0.0rc1

This issue was fixed in the openstack/kolla 8.0.0.0rc1 release candidate.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-05: Fix included in openstack/kolla-ansible 8.0.0.0rc1

This issue was fixed in the openstack/kolla-ansible 8.0.0.0rc1 release candidate.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-08: Related fix proposed to kolla-ansible (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/650855

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-09: Related fix merged to kolla-ansible (master)

#10

Reviewed: https://review.openstack.org/650855
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=d93c604d7af03e8b096faea8acaaaef718df2cd5
Submitter: Zuul
Branch: master

commit d93c604d7af03e8b096faea8acaaaef718df2cd5
Author: Mark Goddard <email address hidden>
Date: Fri Apr 5 19:37:54 2019 +0100

Remove shutdown of MariaDB

    Since we are now in the Train cycle, we can be sure that any running
    MariaDB containers can be safely stopped, and we do not need to perform
    an explicit shutdown prior to restarting them.

Change-Id: I5450690f1cbe0c995e8e4b01a76e90dac2574d61
Related-Bug: #1820325

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-11: Fix included in openstack/kolla 7.0.2

#11

This issue was fixed in the openstack/kolla 7.0.2 release.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-06-25: Related fix proposed to kolla-ansible (stable/stein)

#12

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/667363

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-07-05: Related fix merged to kolla-ansible (stable/stein)

#13

Reviewed: https://review.opendev.org/667363
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=99cd5ec10c910bff3c238942a613faafdce0a2e2
Submitter: Zuul
Branch: stable/stein

commit 99cd5ec10c910bff3c238942a613faafdce0a2e2
Author: Mark Goddard <email address hidden>
Date: Tue Jun 25 12:59:41 2019 +0100

Wait for mariadb to stop after shutdown

Stein only.

Currently the kolla-ansible-centos-source-upgrade-ceph job is failing on
the stable/stein branch.

    The problem occurs with mariadb, when performing an upgrade to the Stein
    release which has a new version of mariadb. It appears that when the
    slave mariadb services are shut down, we do not wait for the container
    to stop, so the service may not shut down cleanly. This prevents it from
    starting up successfully.

    This change waits for the container to stop after the shutdown command
    has been executed. It also temporarily replaces the restart policy of
    the container to prevent it from starting up again after the shutdown.

This is not required in other branches since the mariadb shutdown
workaround was only added in the stein branch for bug 1820325.

    There is a second issue that is addressed here. The Stein release
    switched from using xtrabackup to mariabackup for galera state syncing.
    If we run both container versions at the same time on different hosts
    then we can get an error such as the following:

sh: wsrep_sst_mariabackup: command not found

We therefore now stop the cluster and perform a recovery during an
upgrade, if we detect that xtrabackup is in use.

    Finally, we now wait for the bootstrap host to report that it is in an
    OPERATIONAL state. Without this we can see errors where the MariaDB
    cluster is not ready when used by other services.

    Change-Id: I513bcf31adaee8441d43c6b578ca06f66820e52b
    Closes-Bug: #1834191
    Related-Bug: #1820325

Reviewed:  https://review.opendev.org/667363
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=99cd5ec10c910bff3c238942a613faafdce0a2e2
Submitter: Zuul
Branch:    stable/stein

commit 99cd5ec10c910bff3c238942a613faafdce0a2e2
Author: Mark Goddard <mark@stackhpc.com>
Date:   Tue Jun 25 12:59:41 2019 +0100

Wait for mariadb to stop after shutdown
    
    Stein only.
    
    Currently the kolla-ansible-centos-source-upgrade-ceph job is failing on
    the stable/stein branch.
    
    The problem occurs with mariadb, when performing an upgrade to the Stein
    release which has a new version of mariadb. It appears that when the
    slave mariadb services are shut down, we do not wait for the container
    to stop, so the service may not shut down cleanly. This prevents it from
    starting up successfully.
    
    This change waits for the container to stop after the shutdown command
    has been executed. It also temporarily replaces the restart policy of
    the container to prevent it from starting up again after the shutdown.
    
    This is not required in other branches since the mariadb shutdown
    workaround was only added in the stein branch for bug 1820325.
    
    There is a second issue that is addressed here. The Stein release
    switched from using xtrabackup to mariabackup for galera state syncing.
    If we run both container versions at the same time on different hosts
    then we can get an error such as the following:
    
    sh: wsrep_sst_mariabackup: command not found
    
    We therefore now stop the cluster and perform a recovery during an
    upgrade, if we detect that xtrabackup is in use.
    
    Finally, we now wait for the bootstrap host to report that it is in an
    OPERATIONAL state. Without this we can see errors where the MariaDB
    cluster is not ready when used by other services.
    
    Change-Id: I513bcf31adaee8441d43c6b578ca06f66820e52b
    Closes-Bug: #1834191
    Related-Bug: #1820325

tags:

added: in-stable-stein

Report a bug

This report contains Public information

Everyone can see this information.

Duplicates of this bug

Bug #1808993

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.