Upgrading from Rocky to Stein fails on MariaDB

Bug #1820325 reported by Mark Goddard
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
kolla-ansible
Fix Released
Critical
Mark Goddard
Rocky
Fix Committed
Medium
Mark Goddard
Stein
Fix Released
Critical
Mark Goddard

Bug Description

MariaDB was upgraded from 10.1 to 10.3 in the Stein release. It seems upgrades do not work currently, as seen in kayobe upgrade testing and anecdotally in IRC.

There are no useful error logs either in `docker logs mariadb` or /var/log/kolla/mariadb/mariadb.log.

Using the following command I was able to get some logs on startup:

docker run --rm -it --name mariadb_test -e KOLLA_CONFIG_STRATEGY=COPY_ALWAYS -e KOLLA_BASE_DISTRO=centos -e KOLLA_INSTALL_TYPE=binary -v kolla_logs:/var/log/kolla -v /etc/localtime:/etc/localtime:ro -v mariadb:/var/lib/mysql -v /etc/kolla/mariadb:/var/lib/kolla/config_files:ro --net=host kolla/centos-binary-mariadb:master bash -c 'sudo -E kolla_set_configs && . kolla_extend_start && bash -x mysqld_safe'

For some reason the -x causes logs to be dumped to stdout/err.

Here's a dump of the output: http://paste.openstack.org/show/747900/. It's verbose due to the -x.

The pertinent logs seem to be these:

nohup /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql/ --plugin-dir=/usr/lib64/mariadb/plugin --wsrep_provider=/usr/lib64/galera/libgalera_smm.so --wsrep_on=ON --log-error=/var/log/kolla/mariadb/mariadb.log --pid-file=/var/lib/mysql/mariadb.pid --port=3306 --wsrep_recover --disable-log-error --pid-file=/var/lib/mysql//sossidge.novalocal-recover.pid
++ grep 'WSREP: Recovered position:' /tmp/wsrep_recovery.kZaWpN
+ local rp=
+ '[' -z '' ']'
++ grep WSREP /tmp/wsrep_recovery.kZaWpN
++ grep 'skipping position recovery'
+ local skipped=
+ '[' -z '' ']'
++ cat /tmp/wsrep_recovery.kZaWpN
+ log_error 'WSREP: Failed to recover position: '\''nohup: appending output to '\''/var/lib/mysql/nohup.out'\''
2019-03-15 17:58:59 0 [Note] /usr/libexec/mysqld (mysqld 10.3.10-MariaDB-log) starting as process 184 ...
2019-03-15 17:58:59 0 [Note] InnoDB: Using Linux native AIO
2019-03-15 17:58:59 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2019-03-15 17:58:59 0 [Note] InnoDB: Uses event mutexes
2019-03-15 17:58:59 0 [Note] InnoDB: Compressed tables use zlib 1.2.7
2019-03-15 17:58:59 0 [Note] InnoDB: Number of pools: 1
2019-03-15 17:58:59 0 [Note] InnoDB: Using SSE2 crc32 instructions
2019-03-15 17:58:59 0 [Note] InnoDB: Initializing buffer pool, total size = 4G, instances = 8, chunk size = 128M
2019-03-15 17:59:00 0 [Note] InnoDB: Completed initialization of buffer pool
2019-03-15 17:59:00 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2019-03-15 17:59:00 0 [ERROR] InnoDB: Upgrade after a crash is not supported. This redo log was created before MariaDB 10.2.2.
2019-03-15 17:59:00 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
2019-03-15 17:59:00 0 [Note] InnoDB: Starting shutdown...
2019-03-15 17:59:00 0 [ERROR] Plugin '\''InnoDB'\'' init function returned error.
2019-03-15 17:59:00 0 [ERROR] Plugin '\''InnoDB'\'' registration as a STORAGE ENGINE failed.
2019-03-15 17:59:00 0 [Note] Plugin '\''FEEDBACK'\'' is disabled.
2019-03-15 17:59:00 0 [ERROR] Unknown/unsupported storage engine: innodb
2019-03-15 17:59:00 0 [ERROR] Aborting'\'''

Revision history for this message
Mark Goddard (mgoddard) wrote :

Following https://serverfault.com/questions/379714/unknown-unsupported-storage-engine-innodb-mysql-ubuntu, I removed /var/lib/mysql/ib_logfile0 and /var/lib/mysql/ib_logfile1, which allowed the service to start. There are various bad looking logs, e.g.

[Warning] InnoDB: Table mysql/innodb_index_stats has length mismatch in the column name table_name. Please run mysql_upgrade

This is clearly not the right course of action, but does show what is blocking the startup.

We also see these messages, suggesting we need to run mysql_upgrade:

[Warning] InnoDB: Table mysql/innodb_table_stats has length mismatch in the column name table_name. Please run mysql_upgrade

Revision history for this message
Mark Goddard (mgoddard) wrote :

In order to run mysql_upgrade, I had to remove the directory /var/lib/mysql/.pki, otherwise we get the following error:

mysqlcheck: Got error: 1102: Incorrect database name '#mysql50#.pki' when selecting the database
FATAL ERROR: Upgrade failed

Command used was this:

mysql_upgrade -u root -p<DB root password>

Revision history for this message
Mark Goddard (mgoddard) wrote :

Looks like we're killing the maridb service, which leaves around old state, preventing the upgrade.

docker stop -t 60 mariadb

sudo journalctl -u docker | tail -n 1

Container c2a33c47667f80452ab18fe10c1564309955db0e6b4cb765348371746f11a72d failed to exit within 60 seconds of signal 15 - using the force"

Revision history for this message
Mark Goddard (mgoddard) wrote :

dumb-init is executed with --single-process so only forwards the term signal to mysqld_safe script, which apparently just ignores it. We need to call this to shutdown:

mysqladmin shutdown -p<password>

Perhaps we need to wrap mysqld_admin in a script that monitors signals and shuts down mysql as necessary.

For stopping existing containers, we might need to do something in ansible.

Mark Goddard (mgoddard)
Changed in kolla-ansible:
importance: Undecided → High
assignee: nobody → Mark Goddard (mgoddard)
Changed in kolla-ansible:
importance: High → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)

Fix proposed to branch: master
Review: https://review.openstack.org/644250

Changed in kolla-ansible:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.openstack.org/644250
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=b25c0ee4772fdf04fe6f263a8e81571204f706ff
Submitter: Zuul
Branch: master

commit b25c0ee4772fdf04fe6f263a8e81571204f706ff
Author: Mark Goddard <email address hidden>
Date: Sun Mar 17 17:48:54 2019 +0000

    Fix MariaDB 10.3 upgrade

    Upgrading MariaDB from Rocky to Stein currently fails, with the new
    container left continually restarting. The problem is that the Rocky
    container does not shutdown cleanly, leaving behind state that the new
    container cannot recover. The container does not shutdown cleanly
    because we run dumb-init with a --single-child argument, causing it to
    forward signals to only the process executed by dumb-init. In our case
    this is mysqld_safe, which ignores various signals, including SIGTERM.
    After a (default 10 second) timeout, Docker then kills the container.

    A Kolla change [1] removes the --single-child argument from dumb-init
    for the MariaDB container, however we still need to support upgrading
    from Rocky images that don't have this change. To do that, we add new
    handlers to execute 'mysqladmin shutdown' to cleanly shutdown the
    service.

    A second issue with the current upgrade approach is that we don't
    execute mysql_upgrade after starting the new service. This can leave the
    database state using the format of the previous release. This patch also
    adds handlers to execute mysql_upgrade.

    [1] https://review.openstack.org/644244

    Depends-On: https://review.openstack.org/644244
    Depends-On: https://review.openstack.org/645990
    Change-Id: I08a655a359ff9cfa79043f2166dca59199c7d67f
    Closes-Bug: #1820325

Changed in kolla-ansible:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla 8.0.0.0rc1

This issue was fixed in the openstack/kolla 8.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 8.0.0.0rc1

This issue was fixed in the openstack/kolla-ansible 8.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/650855

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (master)

Reviewed: https://review.openstack.org/650855
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=d93c604d7af03e8b096faea8acaaaef718df2cd5
Submitter: Zuul
Branch: master

commit d93c604d7af03e8b096faea8acaaaef718df2cd5
Author: Mark Goddard <email address hidden>
Date: Fri Apr 5 19:37:54 2019 +0100

    Remove shutdown of MariaDB

    Since we are now in the Train cycle, we can be sure that any running
    MariaDB containers can be safely stopped, and we do not need to perform
    an explicit shutdown prior to restarting them.

    Change-Id: I5450690f1cbe0c995e8e4b01a76e90dac2574d61
    Related-Bug: #1820325

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla 7.0.2

This issue was fixed in the openstack/kolla 7.0.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/667363

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (stable/stein)

Reviewed: https://review.opendev.org/667363
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=99cd5ec10c910bff3c238942a613faafdce0a2e2
Submitter: Zuul
Branch: stable/stein

commit 99cd5ec10c910bff3c238942a613faafdce0a2e2
Author: Mark Goddard <email address hidden>
Date: Tue Jun 25 12:59:41 2019 +0100

    Wait for mariadb to stop after shutdown

    Stein only.

    Currently the kolla-ansible-centos-source-upgrade-ceph job is failing on
    the stable/stein branch.

    The problem occurs with mariadb, when performing an upgrade to the Stein
    release which has a new version of mariadb. It appears that when the
    slave mariadb services are shut down, we do not wait for the container
    to stop, so the service may not shut down cleanly. This prevents it from
    starting up successfully.

    This change waits for the container to stop after the shutdown command
    has been executed. It also temporarily replaces the restart policy of
    the container to prevent it from starting up again after the shutdown.

    This is not required in other branches since the mariadb shutdown
    workaround was only added in the stein branch for bug 1820325.

    There is a second issue that is addressed here. The Stein release
    switched from using xtrabackup to mariabackup for galera state syncing.
    If we run both container versions at the same time on different hosts
    then we can get an error such as the following:

    sh: wsrep_sst_mariabackup: command not found

    We therefore now stop the cluster and perform a recovery during an
    upgrade, if we detect that xtrabackup is in use.

    Finally, we now wait for the bootstrap host to report that it is in an
    OPERATIONAL state. Without this we can see errors where the MariaDB
    cluster is not ready when used by other services.

    Change-Id: I513bcf31adaee8441d43c6b578ca06f66820e52b
    Closes-Bug: #1834191
    Related-Bug: #1820325

tags: added: in-stable-stein
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.