Comment 1 for bug 1775864

Revision history for this message
Damien Ciabrini (dciabrin) wrote :

[Current analysis of what failed in the CI job - sorry for the long post]

TL;DR using "mysqladmin ping" is probably a better way to ensure mysqld started properly during kolla bootstrap.

There's a small typo in the link in the description, the failure log is available at [1].

Quick tripleo refresher: during undercloud install, we bootstrap the mysql db by running the kolla bootstrap script [2] in a transient container (before that time no DB existed yet).

The bootstrapping happens in two steps: run mysql_install_db, and run some kolla-specific commands to set up root password.

extend_start.sh is sourced by kolla_start, which has +x and +e flags, so we trace all shell commands and stop on first error.

1) From the install log, we see that mysql_install_db went ok, because we continue and call bootstrap_db:

2018-06-05 16:24:34 | "++ mysql_install_db",
2018-06-05 16:24:34 | "2018-06-05 16:24:14 139787632818368 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295",
2018-06-05 16:24:34 | "2018-06-05 16:24:14 139787632818368 [Note] /usr/libexec/mysqld (mysqld 10.1.20-MariaDB) starting as process 46 ...",
2018-06-05 16:24:34 | "2018-06-05 16:24:18 139859719940288 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295",
2018-06-05 16:24:34 | "2018-06-05 16:24:18 139859719940288 [Note] /usr/libexec/mysqld (mysqld 10.1.20-MariaDB) starting as process 75 ...",
2018-06-05 16:24:34 | "2018-06-05 16:24:21 139917192714432 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295",
2018-06-05 16:24:34 | "2018-06-05 16:24:21 139917192714432 [Note] /usr/libexec/mysqld (mysqld 10.1.20-MariaDB) starting as process 104 ...",
2018-06-05 16:24:34 | "++ bootstrap_db",

(internally this caused three transient mysqld run as seen in [3])
2018-06-08 7:33:11 139673585006784 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 0
2018-06-08 7:33:15 139700819671232 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 1622818
2018-06-08 7:33:19 139684735236288 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 1622828

Finishing OK means any pidfile and sock file on disk were deleted when mysqld_safe stopped.

2) Then comes the kolla-specific initialization.
 . In that step, we start a mysql server _without network connectivity_ and _witout galera replication_.
 . A while loops [4] waits until it sees that both the mysql{d}.sock unix socket _and_ /var/lib/mysql/mariadb.pid are created.

The while loop is a fairly good guarantee that mysql is up because mysqld_safe force deletes any lingering /var/lib/mysql/mariadb.pid [5] before starting the mysqld server. Also mysql{d}.sock is always deleted and recreated because of the way named UNIX socket are initialized.

I've tested it manually and it seems that the while loop only finished when both the socket and the pidfile are created.

From mariadb logs [3], we see that the server properly started at 7:33:22

2018-06-08 7:33:22 140670143490240 [Note] /usr/libexec/mysqld: ready for connections.
Version: '10.1.20-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 0 MariaDB Server

But for some reasons, kolla_security_reset [6] (an script based on 'expect') failed to run until the end.
If passes the first expect string (because it doesn't require connection to mysqld), but failed at the second (which requires sending SQL command to mysqld):

2018-06-08 07:33:45 | "Failed to get 'Set root password?' prompt",

---

It's unclear to me yet who stops the mysqld server right before the kolla script failure:

2018-06-08 7:33:42 140670142007040 [Note] /usr/libexec/mysqld: Normal shutdown

Nor who is starting another mysql after the whole bootstrap container exit in error:

2018-06-08 07:33:45 | "Error running ['docker', 'run', '--name', 'mysql_bootstrap', '--label', 'config_id=tripleo_step2', '--label', 'container_name=mysql_bootstrap', '--label', 'managed_by=paunch', '--label', 'config_data={\"start_order\": 1, \"image\": \"docker.io/tripleomaster/centos-binary-

180608 07:33:45 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql

(it's not the init container though, because we only start mysqld without network and the last start has network enabled.)

I can only assume from this point that 1) the mysql server could not process SQL command for whatever reason even if it seems up and ready from the logs, and that's what confused expect.

Maybe using a stronger test in the while loop like using "mysqladmin ping" would ensure that we can talk to the DB or fail cleanly before running the next kolla steps.

[1] http://logs.openstack.org/19/572319/3/gate/tripleo-ci-centos-7-undercloud-containers/8099829/logs/undercloud/home/zuul/undercloud_install.log.txt.gz

[2] https://github.com/openstack/kolla/blob/master/docker/mariadb/extend_start.sh

[3] http://logs.openstack.org/19/572319/3/gate/tripleo-ci-centos-7-undercloud-containers/8099829/logs/undercloud/var/log/containers/mysql/mariadb.log.txt.gz

[4] https://github.com/openstack/kolla/blob/master/docker/mariadb/extend_start.sh#L9-L18

[5] https://github.com/MariaDB/server/blob/10.3/scripts/mysqld_safe.sh#L960

[6] https://github.com/openstack/kolla/blob/master/docker/mariadb/security_reset.expect