Sometimes the DB fails with WSREP has not yet prepared node for application use

Bug #1938987 reported by Michele Baldessari
Affects: tripleo
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

I have now seen this twice: once locally in my env and once in RDO CI. Keystone, nova, or other components fail either during tempest or during the deploy with errors related to the DB.

* ctrl-2 05:31:41 https://logserver.rdoproject.org/26/803526/3/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039/5e8fc15/logs/overcloud-controller-2/var/log/containers/keystone/keystone.log.txt.gz
2021-08-05 05:31:41.516 238 ERROR keystone.server.flask.request_processing.middleware.auth_context [req-0e51e6d1-893f-4c3c-bf3f-4b3f87bef405 - - - - -] (pymysql.err.OperationalError) (1047, 'WSREP has not yet prepared node for application use')

* ctrl-1 05:39:31 https://logserver.rdoproject.org/26/803526/3/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039/5e8fc15/logs/overcloud-controller-1/var/log/containers/keystone/keystone.log.txt.gz
2021-08-05 05:39:31.956 244 ERROR oslo_db.sqlalchemy.engines [req-87b6ed8d-f6fc-418c-9e8e-60716f7bc684 - - - - -] Database connection was found disconnected; reconnecting: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (BrokenPipeError(32, 'Broken pipe'))")
[SQL: SELECT 1]
(Background on this error at: http://sqlalche.me/e/13/e3q8)

* ctrl-0 ... https://logserver.rdoproject.org/26/803526/3/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039/5e8fc15/logs/overcloud-controller-0/var/log/containers/keystone/keystone.log.txt.gz
No issues logged on ctrl-0
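
For triage: error 1047 just means the wsrep provider on that node is not in a synced/Primary state, so a quick first step is to ask each controller for its WSREP status directly. A minimal sketch, assuming mysql is reachable on the internal addresses that show up in the galera logs below and that clustercheck-style credentials exist; host, user and password are placeholders to adapt to the actual environment:

# Minimal sketch: query the WSREP status variables on each controller.
# Hosts, user and password below are assumptions, not values from this bug.
import pymysql

CONTROLLERS = ["172.17.0.31", "172.17.0.131", "172.17.0.133"]  # addresses seen in the galera logs

def wsrep_state(host):
    """Return wsrep_ready / wsrep_cluster_status / wsrep_local_state_comment for one node."""
    conn = pymysql.connect(host=host, user="clustercheck", password="password",
                           connect_timeout=5)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SHOW GLOBAL STATUS WHERE Variable_name IN "
                "('wsrep_ready', 'wsrep_cluster_status', 'wsrep_local_state_comment')"
            )
            return dict(cur.fetchall())
    finally:
        conn.close()

for host in CONTROLLERS:
    try:
        print(host, wsrep_state(host))
    except pymysql.err.OperationalError as exc:
        # Error 1047 here means the node is up but not yet synced with the cluster.
        print(host, "unavailable:", exc)

A healthy node reports wsrep_ready = ON, wsrep_cluster_status = Primary and wsrep_local_state_comment = Synced; anything else would explain the 1047 errors above.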

In the corresponding galera logs we see:
* ctrl-0
2021-08-05 5:09:57 171 [Warning] 'proxies_priv' entry '@% <email address hidden>' ignored in --skip-name-resolve mode.
2021-08-05 5:25:54 0 [Warning] WSREP: Failed to report last committed 4004, -110 (Connection timed out)
2021-08-05 5:31:28 0 [Note] WSREP: (30419cad, 'ssl://172.17.0.131:4567') turning message relay requesting on, nonlive peers: ssl://172.17.0.133:4567
2021-08-05 5:31:29 0 [Note] WSREP: (30419cad, 'ssl://172.17.0.131:4567') reconnecting to 379eaa65 (ssl://172.17.0.133:4567), attempt 0

* ctrl-1
2021-08-05 5:09:57 2 [Warning] 'proxies_priv' entry '@% <email address hidden>' ignored in --skip-name-resolve mode.
2021-08-05 5:26:51 0 [Warning] WSREP: Failed to report last committed 4003, -110 (Connection timed out)
210805 05:32:31 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
210805 05:32:31 mysqld_safe WSREP: Running position recovery with --disable-log-error --pid-file='/var/lib/mysql/overcloud-controller-1.ooo.test-recover.pid'
210805 05:32:33 mysqld_safe WSREP: Recovered position 2ab56ff0-f5ab-11eb-ab22-2b57afeb185b:4189
2021-08-05 5:32:33 0 [Note] WSREP: Read nil XID from storage engines, skipping position init
2021-08-05 5:32:33 0 [Note] /usr/libexec/mysqld (mysqld 10.3.28-MariaDB) starting as process 1893 ...

Did it crash, and did pacemaker restart it?

There are about 60 messages like this one:
Aug 05 05:31:34 overcloud-controller-1.ooo.test podman[208956]: Error: healthcheck command exceeded timeout of 30s

* ctrl-2
2021-08-05 5:09:57 2 [Warning] 'proxies_priv' entry '@% <email address hidden>' ignored in --skip-name-resolve mode.
2021-08-05 5:27:53 0 [Warning] WSREP: Failed to report last committed 4003, -110 (Connection timed out)
2021-08-05 5:31:29 0 [Note] WSREP: (2ab4f6f2, 'ssl://172.17.0.31:4567') turning message relay requesting on, nonlive peers: ssl://172.17.0.133:4567
2021-08-05 5:31:30 0 [Note] WSREP: (2ab4f6f2, 'ssl://172.17.0.31:4567') reconnecting to 379eaa65 (ssl://172.17.0.133:4567), attempt 0
2021-08-05 5:31:33 0 [Note] WSREP: evs::proto(2ab4f6f2, OPERATIONAL, view_id(REG,2ab4f6f2,3)) suspecting node: 379eaa65

It feels like the whole podman layer froze and pacemaker restarted it. I'll keep an eye on this one.
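
To check that theory on the next reproducer, one option is to pull the mysqld restart markers out of the mariadb log and line them up with the keystone errors and the podman healthcheck timeouts. A rough sketch; the log path is the usual tripleo container location and is an assumption:

# Rough sketch: list mysqld restart markers so they can be correlated with the
# keystone 1047 errors and the podman healthcheck timeouts above.
# The log path is an assumption (usual tripleo container location).
MARIADB_LOG = "/var/log/containers/mysql/mysqld.log"

RESTART_MARKERS = (
    "mysqld_safe Starting mysqld daemon",
    "WSREP: Recovered position",
    "starting as process",
)

def restart_events(path=MARIADB_LOG):
    """Yield log lines that indicate mysqld was (re)started."""
    with open(path, errors="replace") as fh:
        for line in fh:
            if any(marker in line for marker in RESTART_MARKERS):
                yield line.rstrip()

if __name__ == "__main__":
    for event in restart_events():
        print(event)

If the restart timestamps cluster right after the ~60 healthcheck timeouts, that points at the frozen podman/healthcheck layer rather than a plain mysqld crash.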

Revision history for this message
Jakob Meng (jm1337) wrote:

We are still facing this issue one year later in C8 Train, e.g. in our job periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-train [1]. It seems to be intermittent, though.

[1] https://logserver.rdoproject.org/openstack-periodic-integration-stable4/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-train/bd0a94d/logs/overcloud-controller-2/var/log/containers/keystone/keystone.log.txt.gz

Revision history for this message
Jakob Meng (jm1337) wrote:

And again in C8 Train, this time in job periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-train [1]:

  2022-09-01 11:05:11.692 23 ERROR keystone.server.flask.request_processing.middleware.auth_context pymysql.err.OperationalError: (1047, 'WSREP has not yet prepared node for application use')

[1] https://logserver.rdoproject.org/62/44662/6/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-train/5753657/logs/overcloud-controller-0/var/log/containers/keystone/keystone.log.txt.gz
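
For reference, 1047 is a transient "node not yet synced" state that clears once the node rejoins the primary component; in the overcloud the haproxy/clustercheck layer normally keeps new connections away from such a node, but already-established connections can still hit the error. Purely as an illustration (this is not how keystone/oslo.db handle it), a client-side retry sketch:

# Illustrative only: retry a query while the node reports
# "WSREP has not yet prepared node for application use" (error 1047).
import time

import pymysql

def query_with_retry(conn_kwargs, sql, attempts=5, delay=2.0):
    """Run one query, retrying while the node is not yet synced (error 1047)."""
    for attempt in range(attempts):
        try:
            conn = pymysql.connect(**conn_kwargs)
            try:
                with conn.cursor() as cur:
                    cur.execute(sql)
                    return cur.fetchall()
            finally:
                conn.close()
        except pymysql.err.OperationalError as exc:
            # args[0] is the MySQL error code; re-raise anything else.
            if exc.args[0] != 1047 or attempt == attempts - 1:
                raise
            time.sleep(delay)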
