Sometimes the DB fails with WSREP has not yet prepared node for application use
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | New | Undecided | Unassigned |
Bug Description
I have now seen this twice: once locally in my env and once in RDO CI. Keystone, nova, or other components fail with DB-related errors, either during tempest or during the deploy.
* ctrl-2 05:31:41 https:/
2021-08-05 05:31:41.516 238 ERROR keystone.
* ctrl-1 05:39:31 https:/
2021-08-05 05:39:31.956 244 ERROR oslo_db.
[SQL: SELECT 1]
(Background on this error at: http://
* ctrl-0 ... https:/
No issues logged on ctrl-0
In the corresponding galera logs we see:
* ctrl-0
2021-08-05 5:09:57 171 [Warning] 'proxies_priv' entry '@% <email address hidden>' ignored in --skip-name-resolve mode.
2021-08-05 5:25:54 0 [Warning] WSREP: Failed to report last committed 4004, -110 (Connection timed out)
2021-08-05 5:31:28 0 [Note] WSREP: (30419cad, 'ssl://
2021-08-05 5:31:29 0 [Note] WSREP: (30419cad, 'ssl://
* ctrl-1
2021-08-05 5:09:57 2 [Warning] 'proxies_priv' entry '@% <email address hidden>' ignored in --skip-name-resolve mode.
2021-08-05 5:26:51 0 [Warning] WSREP: Failed to report last committed 4003, -110 (Connection timed out)
210805 05:32:31 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
210805 05:32:31 mysqld_safe WSREP: Running position recovery with --disable-log-error --pid-file=
210805 05:32:33 mysqld_safe WSREP: Recovered position 2ab56ff0-
2021-08-05 5:32:33 0 [Note] WSREP: Read nil XID from storage engines, skipping position init
2021-08-05 5:32:33 0 [Note] /usr/libexec/mysqld (mysqld 10.3.28-MariaDB) starting as process 1893 ...
Did it crash, and did pacemaker restart it?
There are about 60 messages like this one:
Aug 05 05:31:34 overcloud-
* ctrl-2
2021-08-05 5:09:57 2 [Warning] 'proxies_priv' entry '@% <email address hidden>' ignored in --skip-name-resolve mode.
2021-08-05 5:27:53 0 [Warning] WSREP: Failed to report last committed 4003, -110 (Connection timed out)
2021-08-05 5:31:29 0 [Note] WSREP: (2ab4f6f2, 'ssl://
2021-08-05 5:31:30 0 [Note] WSREP: (2ab4f6f2, 'ssl://
2021-08-05 5:31:33 0 [Note] WSREP: evs::proto(
It feels like the whole podman layer froze and pacemaker restarted the service. I'll keep an eye on this one.
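For triaging further occurrences, a minimal sketch along the following lines could count the two symptoms quoted above in a galera log: the "Failed to report last committed ... -110 (Connection timed out)" warnings and the `mysqld_safe Starting mysqld daemon` restart marker. The regular expressions are derived only from the excerpts in this report, and the function name `summarize_galera_log` is hypothetical; this is not part of any tripleo or Galera tooling.

```python
import re
from collections import Counter

# Patterns derived from the log excerpts quoted in this report (assumption:
# other deployments may format these messages slightly differently).
TIMEOUT_RE = re.compile(
    r"WSREP: Failed to report last committed \d+, -110 \(Connection timed out\)"
)
RESTART_RE = re.compile(r"mysqld_safe Starting mysqld daemon")


def summarize_galera_log(lines):
    """Count commit-report timeouts and mysqld_safe restarts in log lines."""
    counts = Counter()
    for line in lines:
        if TIMEOUT_RE.search(line):
            counts["commit_report_timeouts"] += 1
        if RESTART_RE.search(line):
            counts["mysqld_restarts"] += 1
    return counts
```

Passing an open log file (any iterable of lines) to `summarize_galera_log` for each controller would show which nodes logged timeouts and which were actually restarted; in the excerpts above, only ctrl-1 shows the restart marker.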
We are still facing this issue one year later on c8 train, e.g. in our job periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-train [1]. It seems to be intermittent, though.
[1] https://logserver.rdoproject.org/openstack-periodic-integration-stable4/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-train/bd0a94d/logs/overcloud-controller-2/var/log/containers/keystone/keystone.log.txt.gz