Sometimes the DB fails with WSREP has not yet prepared node for application use

Bug #1938987 reported by Michele Baldessari
Affects: tripleo
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

I have now seen this twice: once locally in my env and once in RDO CI. Keystone, nova, or other components fail either during tempest or during the deploy with errors related to the DB.

* ctrl-2 05:31:41 https://logserver.rdoproject.org/26/803526/3/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039/5e8fc15/logs/overcloud-controller-2/var/log/containers/keystone/keystone.log.txt.gz
2021-08-05 05:31:41.516 238 ERROR keystone.server.flask.request_processing.middleware.auth_context [req-0e51e6d1-893f-4c3c-bf3f-4b3f87bef405 - - - - -] (pymysql.err.OperationalError) (1047, 'WSREP has not yet prepared node for application use')

* ctrl-1 05:39:31 https://logserver.rdoproject.org/26/803526/3/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039/5e8fc15/logs/overcloud-controller-1/var/log/containers/keystone/keystone.log.txt.gz
2021-08-05 05:39:31.956 244 ERROR oslo_db.sqlalchemy.engines [req-87b6ed8d-f6fc-418c-9e8e-60716f7bc684 - - - - -] Database connection was found disconnected; reconnecting: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (BrokenPipeError(32, 'Broken pipe'))")
[SQL: SELECT 1]
(Background on this error at: http://sqlalche.me/e/13/e3q8)

* ctrl-0 ... https://logserver.rdoproject.org/26/803526/3/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039/5e8fc15/logs/overcloud-controller-0/var/log/containers/keystone/keystone.log.txt.gz
No issues logged on ctrl-0
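
For triage: error 1047 just means the wsrep provider on that node is not in a synced/Primary state, so a quick first step is to ask each controller for its WSREP status directly. A minimal sketch, assuming mysql is reachable on the internal addresses that show up in the galera logs below and that clustercheck-style credentials exist; host, user and password are placeholders to adapt to the actual environment:

# Minimal sketch: query the WSREP status variables on each controller.
# Hosts, user and password below are assumptions, not values from this bug.
import pymysql

CONTROLLERS = ["172.17.0.31", "172.17.0.131", "172.17.0.133"]  # addresses seen in the galera logs

def wsrep_state(host):
    """Return wsrep_ready / wsrep_cluster_status / wsrep_local_state_comment for one node."""
    conn = pymysql.connect(host=host, user="clustercheck", password="password",
                           connect_timeout=5)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SHOW GLOBAL STATUS WHERE Variable_name IN "
                "('wsrep_ready', 'wsrep_cluster_status', 'wsrep_local_state_comment')"
            )
            return dict(cur.fetchall())
    finally:
        conn.close()

for host in CONTROLLERS:
    try:
        print(host, wsrep_state(host))
    except pymysql.err.OperationalError as exc:
        # Error 1047 here means the node is up but not yet synced with the cluster.
        print(host, "unavailable:", exc)

A healthy node reports wsrep_ready = ON, wsrep_cluster_status = Primary and wsrep_local_state_comment = Synced; anything else would explain the 1047 errors above.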

In the corresponding galera logs we see:
* ctrl-0
2021-08-05 5:09:57 171 [Warning] 'proxies_priv' entry '@% <email address hidden>' ignored in --skip-name-resolve mode.
2021-08-05 5:25:54 0 [Warning] WSREP: Failed to report last committed 4004, -110 (Connection timed out)
2021-08-05 5:31:28 0 [Note] WSREP: (30419cad, 'ssl://172.17.0.131:4567') turning message relay requesting on, nonlive peers: ssl://172.17.0.133:4567
2021-08-05 5:31:29 0 [Note] WSREP: (30419cad, 'ssl://172.17.0.131:4567') reconnecting to 379eaa65 (ssl://172.17.0.133:4567), attempt 0

* ctrl-1
2021-08-05 5:09:57 2 [Warning] 'proxies_priv' entry '@% <email address hidden>' ignored in --skip-name-resolve mode.
2021-08-05 5:26:51 0 [Warning] WSREP: Failed to report last committed 4003, -110 (Connection timed out)
210805 05:32:31 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
210805 05:32:31 mysqld_safe WSREP: Running position recovery with --disable-log-error --pid-file='/var/lib/mysql/overcloud-controller-1.ooo.test-recover.pid'
210805 05:32:33 mysqld_safe WSREP: Recovered position 2ab56ff0-f5ab-11eb-ab22-2b57afeb185b:4189
2021-08-05 5:32:33 0 [Note] WSREP: Read nil XID from storage engines, skipping position init
2021-08-05 5:32:33 0 [Note] /usr/libexec/mysqld (mysqld 10.3.28-MariaDB) starting as process 1893 ...

Did it crash, and did pacemaker restart it?

There are about 60 messages like this one:
Aug 05 05:31:34 overcloud-controller-1.ooo.test podman[208956]: Error: healthcheck command exceeded timeout of 30s

* ctrl-2
2021-08-05 5:09:57 2 [Warning] 'proxies_priv' entry '@% <email address hidden>' ignored in --skip-name-resolve mode.
2021-08-05 5:27:53 0 [Warning] WSREP: Failed to report last committed 4003, -110 (Connection timed out)
2021-08-05 5:31:29 0 [Note] WSREP: (2ab4f6f2, 'ssl://172.17.0.31:4567') turning message relay requesting on, nonlive peers: ssl://172.17.0.133:4567
2021-08-05 5:31:30 0 [Note] WSREP: (2ab4f6f2, 'ssl://172.17.0.31:4567') reconnecting to 379eaa65 (ssl://172.17.0.133:4567), attempt 0
2021-08-05 5:31:33 0 [Note] WSREP: evs::proto(2ab4f6f2, OPERATIONAL, view_id(REG,2ab4f6f2,3)) suspecting node: 379eaa65

It feels like the whole podman layer froze and pacemaker restarted it. I'll keep an eye on this one.
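
To check that theory on the next reproducer, one option is to pull the mysqld restart markers out of the mariadb log and line them up with the keystone errors and the podman healthcheck timeouts. A rough sketch; the log path is the usual tripleo container location and is an assumption:

# Rough sketch: list mysqld restart markers so they can be correlated with the
# keystone 1047 errors and the podman healthcheck timeouts above.
# The log path is an assumption (usual tripleo container location).
MARIADB_LOG = "/var/log/containers/mysql/mysqld.log"

RESTART_MARKERS = (
    "mysqld_safe Starting mysqld daemon",
    "WSREP: Recovered position",
    "starting as process",
)

def restart_events(path=MARIADB_LOG):
    """Yield log lines that indicate mysqld was (re)started."""
    with open(path, errors="replace") as fh:
        for line in fh:
            if any(marker in line for marker in RESTART_MARKERS):
                yield line.rstrip()

if __name__ == "__main__":
    for event in restart_events():
        print(event)

If the restart timestamps cluster right after the ~60 healthcheck timeouts, that points at the frozen podman/healthcheck layer rather than a plain mysqld crash.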

Revision history for this message
Jakob Meng (jm1337) wrote:

We are still facing this issue one year later in C8 Train, e.g. in our job periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-train [1]. It seems to be intermittent, though.

[1] https://logserver.rdoproject.org/openstack-periodic-integration-stable4/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-train/bd0a94d/logs/overcloud-controller-2/var/log/containers/keystone/keystone.log.txt.gz

Revision history for this message
Jakob Meng (jm1337) wrote:

And again in C8 Train, this time in job periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-train [1]:

  2022-09-01 11:05:11.692 23 ERROR keystone.server.flask.request_processing.middleware.auth_context pymysql.err.OperationalError: (1047, 'WSREP has not yet prepared node for application use')

[1] https://logserver.rdoproject.org/62/44662/6/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-train/5753657/logs/overcloud-controller-0/var/log/containers/keystone/keystone.log.txt.gz
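
For reference, 1047 is a transient "node not yet synced" state that clears once the node rejoins the primary component; in the overcloud the haproxy/clustercheck layer normally keeps new connections away from such a node, but already-established connections can still hit the error. Purely as an illustration (this is not how keystone/oslo.db handle it), a client-side retry sketch:

# Illustrative only: retry a query while the node reports
# "WSREP has not yet prepared node for application use" (error 1047).
import time

import pymysql

def query_with_retry(conn_kwargs, sql, attempts=5, delay=2.0):
    """Run one query, retrying while the node is not yet synced (error 1047)."""
    for attempt in range(attempts):
        try:
            conn = pymysql.connect(**conn_kwargs)
            try:
                with conn.cursor() as cur:
                    cur.execute(sql)
                    return cur.fetchall()
            finally:
                conn.close()
        except pymysql.err.OperationalError as exc:
            # args[0] is the MySQL error code; re-raise anything else.
            if exc.args[0] != 1047 or attempt == attempts - 1:
                raise
            time.sleep(delay)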
