kolla extend_start.sh doesn't check the start of MariaDB server correctly in bootstrap

Bug #1775864 reported by Sergii Golovatiuk on 2018-06-08
Affects: kolla (Importance: High, Assigned to: Unassigned)
Affects: tripleo (Importance: High, Assigned to: Sergii Golovatiuk)

Bug Description

Description of problem:

kolla extend_start.sh doesn't check the start of MariaDB server correctly in bootstrap

How reproducible:

very hard to reproduce, caught on CI

Here is a log of how it failed:

http://logs.openstack.org/19/572319/3/check/tripleo-ci-centos-7-undercloud-containers/ed6d2af/logs/undercloud/home/zuul/undercloud_install.log.txt.gz#_2018-06-05_16_24_34

Here is the MySQL log from the same time:
http://logs.openstack.org/19/572319/3/gate/tripleo-ci-centos-7-undercloud-containers/8099829/logs/undercloud/var/log/containers/mysql/mariadb.log.txt.gz

Steps to Reproduce:
There should be innodb recovery during bootstrap. In this case there will be delay between pid and mysql.sock

Actual results:

MariaDB bootstrap failed

Expected results:

MariaDB bootstrap happened

Additional info:

There is flawed logic in [1]

The existence of the pid file does not mean that MariaDB is up and running. In this case, MariaDB needed some time to recover InnoDB data, so the PID file was present but MariaDB was not yet ready to accept connections.
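The race can be illustrated with a minimal shell sketch (illustrative only, not the actual extend_start.sh code; the pid file path comes from the discussion below, the function name is made up):

```shell
# Illustrative sketch of the racy logic: a pid-file existence test
# succeeds as soon as mysqld forks, which can be well before the
# server logs "ready for connections" (e.g. during InnoDB recovery).
pidfile_says_started() {
    [ -f "$1" ]   # true the moment the pid file exists on disk
}

# Example: the pid file appears, but the server may still be
# replaying InnoDB logs and refusing connections.
# pidfile_says_started /var/lib/mysql/mariadb.pid && echo "looks up"
```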

2018-06-08 7:33:46 140020369176768 [Note] /usr/libexec/mysqld: ready for connections.

happened after

2018-06-08 07:33:45 | "++ sudo -E kolla_security_reset",

so the check of MariaDB [1] didn't happen

[1] https://github.com/openstack/kolla/blob/master/docker/mariadb/extend_start.sh#L9-L11

RHBZ#1589267

Changed in kolla:
status: New → Confirmed
Changed in tripleo:
assignee: nobody → Sergii Golovatiuk (sgolovatiuk)
importance: Undecided → Medium
importance: Medium → High
status: New → Confirmed
milestone: none → rocky-3
no longer affects: tripleo/pike
no longer affects: tripleo/queens
Changed in tripleo:
status: Confirmed → Triaged
Damien Ciabrini (dciabrin) wrote :

[Current analysis of what failed in the CI job - sorry for the long post]

TL;DR using "mysqladmin ping" is probably a better way to ensure mysqld started properly during kolla bootstrap.
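A hedged sketch of what such a check could look like (the probe command wiring and the default timeout are assumptions for illustration, not kolla's actual code):

```shell
# Poll a probe command once per second until it succeeds or the
# timeout elapses. With probe="mysqladmin --silent ping", success
# means mysqld is actually answering connections, not merely that
# it forked and wrote a pid file.
wait_ready() {
    probe=$1
    timeout=${2:-60}   # assumed default, cf. DB_MAX_TIMEOUT
    elapsed=0
    until $probe; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1   # server never became ready in time
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# wait_ready "mysqladmin --silent ping" "$DB_MAX_TIMEOUT" || exit 1
```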

There's a small typo in the link in the description, the failure log is available at [1].

Quick tripleo refresher: during undercloud install, we bootstrap the mysql db by running the kolla bootstrap script [2] in a transient container (before that time no DB existed yet).

The bootstrapping happens in two steps: run mysql_install_db, and run some kolla-specific commands to set up root password.

extend_start.sh is sourced by kolla_start, which sets the -x and -e shell flags, so all shell commands are traced and the script stops on the first error.

1) From the install log, we see that mysql_install_db went ok, because we continue and call bootstrap_db:

2018-06-05 16:24:34 | "++ mysql_install_db",
2018-06-05 16:24:34 | "2018-06-05 16:24:14 139787632818368 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295",
2018-06-05 16:24:34 | "2018-06-05 16:24:14 139787632818368 [Note] /usr/libexec/mysqld (mysqld 10.1.20-MariaDB) starting as process 46 ...",
2018-06-05 16:24:34 | "2018-06-05 16:24:18 139859719940288 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295",
2018-06-05 16:24:34 | "2018-06-05 16:24:18 139859719940288 [Note] /usr/libexec/mysqld (mysqld 10.1.20-MariaDB) starting as process 75 ...",
2018-06-05 16:24:34 | "2018-06-05 16:24:21 139917192714432 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295",
2018-06-05 16:24:34 | "2018-06-05 16:24:21 139917192714432 [Note] /usr/libexec/mysqld (mysqld 10.1.20-MariaDB) starting as process 104 ...",
2018-06-05 16:24:34 | "++ bootstrap_db",

(internally this caused three transient mysqld runs, as seen in [3])
2018-06-08 7:33:11 139673585006784 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 0
2018-06-08 7:33:15 139700819671232 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 1622818
2018-06-08 7:33:19 139684735236288 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 1622828

Finishing OK means any pidfile and sock file on disk were deleted when mysqld_safe stopped.

2) Then comes the kolla-specific initialization.
 . In that step, we start a mysql server _without network connectivity_ and _without galera replication_.
 . A while loop [4] waits until it sees that both the mysql{d}.sock unix socket _and_ /var/lib/mysql/mariadb.pid have been created.

The while loop is a fairly good guarantee that mysql is up, because mysqld_safe force-deletes any lingering /var/lib/mysql/mariadb.pid [5] before starting the mysqld server. Also, mysql{d}.sock is always deleted and recreated because of the way named UNIX sockets are initialized.

I've tested it manually, and the while loop only finishes when both the socket and the pid file have been created.
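For reference, the general shape of such a wait loop (a sketch with illustrative paths and plain existence tests; the real loop in [4] may differ):

```shell
# Block until both paths exist on disk. This proves mysqld_safe
# created the socket and pid file, but NOT that the server has
# finished InnoDB recovery -- which is exactly the gap in this bug.
wait_for_files() {
    while [ ! -e "$1" ] || [ ! -e "$2" ]; do
        sleep 1
    done
}

# wait_for_files /var/lib/mysql/mysql.sock /var/lib/mysql/mariadb.pid
```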

From mariadb logs [3], we see that the server properly started at 7:33...


Change abandoned by Sergii Golovatiuk (<email address hidden>) on branch: master
Review: https://review.openstack.org/573742
Reason: it doesn't solve the initial issue

Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
Damien Ciabrini (dciabrin) wrote :

OK, I revisited this bug and I think I've understood all the missing bits and strange logs. I'm fairly confident the issue is understood at this point.

TL;DR: under heavy load, the setup of the mariadb container may fail because the mysql user configuration times out before the server has a chance to evaluate kolla's SQL script.

If you want the full story, brace yourself and read the absurdly long explanation below...

In my previous comment I got one detail wrong, and I couldn't figure out why mariadb was stopped a couple of seconds _after_ the bootstrap script failed. Let me clear up all the doubts here.

Logs used to analyze the failed CI jobs are available here:
 . undercloud installation log [1]
 . undercloud's journal [2]
 . mariadb log [3]

0) First a refresher:
---------------------

In a tripleo-deployed stack, the mysql server runs three times throughout deployment:

* during deployment step 2 - it runs twice in a transient container 'mysql_bootstrap'
  1. mysql server is first started for the kolla bootstrap [4], without network, without wsrep replication
     it's started to set root password, delete default users, and then stopped
  2. another mysql server is started by t-h-t [5], without network, without wsrep replication
     it's started to create a healthcheck user, and then stopped

* during deployment step 3 - it runs the database service in container 'mysql'
  3. it's started by t-h-t, with network. this one is never stopped

undercloud_install.log [1] errors out at 7:33:45 and dumps all the stdout and stderr of the container since it started at 7:33:07.

Now let me update the incomplete/incorrect explanations I gave in comment #1. The complete sequence of events is the following:

1) Kolla prepares the mysql db on disk
--------------------------------------

Around 7:33:08, during the kolla bootstrap, the first thing that runs is mysql_install_db, which internally invokes InnoDB three times to create all the files on disk, but it _doesn't start_ mysqld just yet.

Jun 08 07:33:08 undercloud.localdomain dockerd-current[28532]: ++ mysql_install_db
Jun 08 07:33:08 undercloud.localdomain dockerd-current[28532]: Installing MariaDB/MySQL system tables in '/var/lib/mysql' ...

The three InnoDB initializations (not to be confused with the 3 runs of mysqld) show up in mariadb.log:

2018-06-08 7:33:11 139673585006784 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 0
2018-06-08 7:33:15 139700819671232 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 1622818
2018-06-08 7:33:19 139684735236288 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 1622828

At this point, the DB has been created on disk, mysqld has _not_ started yet, and neither a pid file nor mysqld.sock has been created.

That part completes successfully.

2) Kolla spawns mysql server and _correctly_ waits for it to be available
-------------------------------------------------------------------------

Kolla's bash function 'bootstrap_db' [4] starts at 7:33:21, and schedules a background bash command to start the mysql server

the next...

Changed in tripleo:
milestone: stein-1 → stein-2

Reviewed: https://review.openstack.org/612996
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=8e2906a59e663d43aaac2369b5b74e5142b61d53
Submitter: Zuul
Branch: master

commit 8e2906a59e663d43aaac2369b5b74e5142b61d53
Author: Damien Ciabrini <email address hidden>
Date: Wed Oct 24 13:27:23 2018 +0200

    mariadb: make security_reset.expect's timeout configurable

    Under heavy load, when the security_reset script is being run to set
    up root password, it might happen that the mysql server doesn't get a
    change to evaluate the SQL statement sent by security_reset.expect
    before the hardcoded 10 seconds timeout is reached.

    Allow the timeout to be overriden via env variable DB_MAX_TIMEOUT,
    which is already used as a timeout variable in extend_start.sh

    Change-Id: If82c6daca82a48d899605829dc99cf4a352cddcd
    Closes-Bug: #1775864

Changed in kolla:
status: Confirmed → Fix Released
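On the shell side, the fix described in the commit message amounts to a default parameter expansion; a sketch (the function name is illustrative, and the 10-second fallback mirrors the previously hardcoded timeout):

```shell
# Resolve the timeout: honour DB_MAX_TIMEOUT if it is set in the
# environment, otherwise fall back to the old hardcoded 10 seconds.
effective_timeout() {
    echo "${DB_MAX_TIMEOUT:-10}"
}
```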

Reviewed: https://review.openstack.org/614554
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=60959a0b5cd5eb918f083e3d1179f2e6c9e49f39
Submitter: Zuul
Branch: stable/rocky

commit 60959a0b5cd5eb918f083e3d1179f2e6c9e49f39
Author: Damien Ciabrini <email address hidden>
Date: Wed Oct 24 13:27:23 2018 +0200

    mariadb: make security_reset.expect's timeout configurable

    Under heavy load, when the security_reset script is being run to set
    up root password, it might happen that the mysql server doesn't get a
    change to evaluate the SQL statement sent by security_reset.expect
    before the hardcoded 10 seconds timeout is reached.

    Allow the timeout to be overriden via env variable DB_MAX_TIMEOUT,
    which is already used as a timeout variable in extend_start.sh

    Change-Id: If82c6daca82a48d899605829dc99cf4a352cddcd
    Closes-Bug: #1775864
    (cherry picked from commit 8e2906a59e663d43aaac2369b5b74e5142b61d53)

tags: added: in-stable-rocky

This issue was fixed in the openstack/kolla 8.0.0.0b1 development milestone.

This issue was fixed in the openstack/kolla 7.0.1 release.

Changed in tripleo:
milestone: stein-2 → stein-3
Changed in tripleo:
status: Triaged → Fix Released
Mark Goddard (mgoddard) on 2019-03-29
Changed in kolla:
milestone: none → 8.0.0
importance: Undecided → High