OK I revisited this bug and I think I understood all the missing bits and strange logs. I'm fairly confident the issue is understood at this point. TL;DR: under heavy load, the setup of the mariadb container may fail because the mysql user configuration times out before the server has a chance to evaluate kolla's SQL script. If your want the full story, brace yourself and read the absurdly long explanation below... In my previous comment I had one comment wrong, and I couldn't figure out why mariadb was stopped a couple of seconds _after_ the bootstrap script failed. Let me clear out all the doubts here. Logs used to analyze the failed CI jobs are available here: . undercloud installation log [1] . undercloud's journal [2] . mariadb log [3] 0) First a refresher: --------------------- In a tripleo-deployed stack, the mysql server runs three times throughout deployment: * during deployment step 2 - it runs twice in a transient container 'mysql_bootstrap' 1. mysql server is first started for the kolla bootstrap [4], without network, without wsrep replication it's started to set root password, delete default users, and then stopped 2. another mysql server is started by t-h-t [5], without network, without wsrep replication it's started to create a healcheck user, and then stopped * during deployment step 3 - it runs the database service in container 'mysql' 3. it's started by t-h-t, with network. this one is never stopped undercloud_install.log [1] errors out at 7:33:45 and dumps all the stdout and stderr of the container since it started at 7:33:07. Now let me update the incomplete/incorrect explanations I gave in the comment #1. The complete sequence of events is the following: 1) Kolla prepares the mysql db on disk -------------------------------------- Around 7:33:08, during the kolla bootstrap, the first thing that runs is mysql_install_db, which internally invokes InnoDB three times to create all the files on disk, but it _ doesn't start _ mysqld just yet.. Jun 08 07:33:08 undercloud.localdomain dockerd-current[28532]: ++ mysql_install_db Jun 08 07:33:08 undercloud.localdomain dockerd-current[28532]: Installing MariaDB/MySQL system tables in '/var/lib/mysql' ... the three InnoDB initialization (not to be confused with the 3 runs of mysqld) show up in the mariadb.log: 2018-06-08 7:33:11 139673585006784 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 0 2018-06-08 7:33:15 139700819671232 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 1622818 2018-06-08 7:33:19 139684735236288 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 1622828 At this point, the DB is created on disk, mysqld _didn't_ start yet, and no pidfile nor mysqld.sock has been created. That part completes successfully. 2) Kolla spawns mysql server and _correctly_ waits for it to be available ------------------------------------------------------------------------- Kolla's bash function 'bootstrap_db' [4] starts at 7:33:21, and schedules a background bash command to start the mysql server the next thing that bash executes is the first test in the while loop that waits for mysql availability (from journal [2]): Jun 08 07:33:21 undercloud.localdomain dockerd-current[28532]: ++ TIMEOUT=60 Jun 08 07:33:21 undercloud.localdomain dockerd-current[28532]: ++ [[ ! -S /var/lib/mysql/mysql.sock ]] Jun 08 07:33:21 undercloud.localdomain dockerd-current[28532]: ++ [[ ! -S /var/run/mysqld/mysqld.sock ]] which stops there, because the server is not started yet then the mysql server is spawned in background Jun 08 07:33:21 undercloud.localdomain dockerd-current[28532]: ++ mysqld_safe --wsrep-new-cluster --skip-networking --wsrep-on=OFF --pid-file=/var/lib/mysql/mariadb.pid the loop decrements a counter and sleeps Jun 08 07:33:21 undercloud.localdomain dockerd-current[28532]: ++ [[ 60 -gt 0 ]] Jun 08 07:33:21 undercloud.localdomain dockerd-current[28532]: ++ let TIMEOUT-=1 Jun 08 07:33:21 undercloud.localdomain dockerd-current[28532]: ++ sleep 1 at this point, mysql becomes ready for connection (from mariadb.log [3]): 2018-06-08 7:33:22 140670143490240 [Note] /usr/libexec/mysqld: ready for connections. Version: '10.1.20-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server which is why the wait loop can now finish, and execute the rest of the bootstrap process: kolla_security_reset Jun 08 07:33:22 undercloud.localdomain dockerd-current[28532]: ++ [[ ! -S /var/lib/mysql/mysql.sock ]] Jun 08 07:33:22 undercloud.localdomain dockerd-current[28532]: ++ [[ ! -f /var/lib/mysql/mariadb.pid ]] Jun 08 07:33:22 undercloud.localdomain dockerd-current[28532]: ++ sudo -E kolla_security_reset So as hinted already in comment #1, we have a proof that kolla works and is not running SQL commands before the mysql server is ready to process them. 3) kolla_security_reset [6] fails to register mysql output and errors out. -------------------------------------------------------------------------- kolla_security_reset in an 'expect' script which runs mysql_secure_installation to set up root password and clean up unwanted mysql users in the DB. The entire output of that script in the journal is mangled due to a docker systemd log bug [7]: Jun 08 07:33:22 undercloud.localdomain dockerd-current[28532]: [32B blob data] Jun 08 07:33:22 undercloud.localdomain dockerd-current[28532]: [1B blob data] Jun 08 07:33:22 undercloud.localdomain dockerd-current[28532]: [70B blob data] Jun 08 07:33:22 undercloud.localdomain dockerd-current[28532]: [67B blob data] Jun 08 07:33:22 undercloud.localdomain dockerd-current[28532]: [1B blob data] Jun 08 07:33:22 undercloud.localdomain dockerd-current[28532]: [66B blob data] Jun 08 07:33:22 undercloud.localdomain dockerd-current[28532]: [67B blob data] Jun 08 07:33:22 undercloud.localdomain dockerd-current[28532]: [67B blob data] Jun 08 07:33:22 undercloud.localdomain dockerd-current[28532]: [37B blob data] Jun 08 07:33:22 undercloud.localdomain dockerd-current[28532]: [1B blob data] Jun 08 07:33:22 undercloud.localdomain dockerd-current[28532]: [51B blob data] But the undercloud install logs [1] shows the content unmangled: spawn mysql_secure_installation\r \r NOTE: RUNNING ALL PARTS OF THIS SCRIPT IS RECOMMENDED FOR ALL MariaDB\r SERVERS IN PRODUCTION USE! PLEASE READ EACH STEP CAREFULLY!\r \r In order to log into MariaDB to secure it, we'll need the current\r password for the root user. If you've just installed MariaDB, and\r you haven't set the root password yet, the password will be blank,\r so you should just press enter here.\r \r Enter current password for root (enter for none): \r 10 seconds after mysql server has started (and mysql_secure_installation has started), kolla_security_reset errors out: Jun 08 07:33:32 undercloud.localdomain dockerd-current[28532]: Failed to get 'Set root password?' prompt Which means two things: 1) kolla_security_reset correctly detected the pattern "Enter current password for root (enter for none):" and sent '\r' in result (to signify "empty root password"). 2) the spawned script mysql_secure_installation never has a chance to connect to the mysql server with this empty password, because it never sent the pattern "Set root password?" In fact, the shell script mysql_secure_installation has received the empty password [8] and tried to connect to mysql with a simple mysql command that roughly translates to "mysql -uroot --pasword='' < <(echo)" and timed out after 10 seconds, before the mysql server has a chance to receive/evaluate the SQL command. The most likely explanation for that failure is that - the time out being really small and the machine hosting the CI job being really loaded - the timeout kicked in before the system has time to schedule the mysql server to evaluate the SQL command. As a hint to bake my claim, I'd said it's not uncommon to experience slowdowns during mysql DB creation, where DB setup last 14 seconds where it usually takes less than 1 when system has no load [9]. since kolla's bootstrap_db is run with -e, the bash script errors out and terminates before running any mysql shutdown command, which leaves the mysql server running. 4) the running mysql server is mysteriously stopped before the mysql container exits. ------------------------------------------------------------------------------------- In fact, the kolla bootstrap is run by kolla_start [10], which has a special shebang which makes it run under dumb-init [11]. dumb-init acts as a reaper's parent process, which will send SIGTERM to any remaining children that got spawned by kolla_start before exiting. By sending SIGTERM to mysqld [12], mariadb triggers the usual 'Normal shutdown' log in mariadb.log. 2018-06-08 7:33:42 140670142007040 [Note] /usr/libexec/mysqld: Normal shutdown But the entire shutdown sequence is not ran until the end, because the container doesn't wait for mysqld to finish and exits unconditionally: Jun 08 07:33:44 undercloud.localdomain dockerd-current[28532]: time="2018-06-08T07:33:44.860381653Z" level=debug msg="containerd: process exited" id=44bec7775d18f31ec4708e5e697a2639c6f3276737fc260d3ee70889c5059ffc pid=init status=0 systemPid=22771 5) a new docker container 'mysql' starts the last mysql server -------------------------------------------------------------- for the record, even if the previous docker container 'mysql_bootstrap' failed, paunch starts the other docker container 'mysql' unconditionally; the latter starts a mysql server: 2018-06-08 7:33:45 140020369176768 [Note] /usr/libexec/mysqld (mysqld 10.1.20-MariaDB) starting as process 167 ... That server complains about the previous stop being incomplete, and runs a crash recovery, as spotted in the the original bug report: 2018-06-08 7:33:46 140020369176768 [Note] InnoDB: The log sequence numbers 1622838 and 1622838 in ibdata files do not match the log sequence number 1622848 in the ib_logfiles! 2018-06-08 7:33:46 140020369176768 [Note] InnoDB: Database was not shutdown normally! 2018-06-08 7:33:46 140020369176768 [Note] InnoDB: Starting crash recovery. 6) CONCLUSIONS -------------- I draw several conclusions from that analysis: 1) kolla correctly waits for the mysql server to start, so that code is safe 2) the default 10s timeout has proved to be small. this error seldom occurs in CI, which further hints this has to be a perf issue. This is the error that should get fixed in the upcoming review that I'll post. 3) the docker systemd log cannot log \r characters in the journal, this is inconvenient and should be tracked separately. 4) when investigating that bug, i found out that due to Ib583fa92d279fc3166cceaf58e152b5cce315d44, I broke the "bail out on error" behaviour of kolla_start. I need to provide a fix for that in a separate review ## Reference [1] http://file.cdg.redhat.com/~dciabrin/lp1775864/undercloud_install.log.txt [2] http://file.cdg.redhat.com/~dciabrin/lp1775864/journal.txt [3] http://file.cdg.redhat.com/~dciabrin/lp1775864/mariadb.log.txt [4] https://github.com/openstack/kolla/blob/master/docker/mariadb/extend_start.sh#L4 [5] https://github.com/openstack/tripleo-heat-templates/blob/56bec75c02b56bab79f35bc94c9110ee93eab82f/docker/services/database/mysql.yaml#L149 [6] https://github.com/openstack/kolla/blob/master/docker/mariadb/security_reset.expect [7] https://github.com/moby/moby/issues/15722 [8] https://github.com/MariaDB/server/blob/10.3/scripts/mysql_secure_installation.sh#L262-L282 [9] I can't link any log unfortunately, but I see that on my own deployments on a 8-core containerized undercloud. [10] https://github.com/openstack/tripleo-heat-templates/blob/56bec75c02b56bab79f35bc94c9110ee93eab82f/docker/services/database/mysql.yaml#L148 [11] https://github.com/Yelp/dumb-init/blob/master/dumb-init.c [12] https://github.com/MariaDB/server/blob/10.1/sql/mysqld.cc#L1895