Hello,
I have noticed some random failures in CI this week during ruck/rover. The failures are inconsistent, the frequency is low, and the affected jobs pass on rerun.
Jobs fail intermittently while starting some containers (so far observed on *db_sync containers).
https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset039-wallaby/d0a6625/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
~~~
2022-07-07 00:24:58 | 2022-07-07 00:24:58.267625 | fa163e7c-2d07-b378-483a-000000008f78 | FATAL | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_3 | overcloud-controller-0 | error={"changed": false, "msg": "Failed containers: nova_db_sync"}
2022-07-07 00:24:58 | 2022-07-07 00:24:58.268448 | fa163e7c-2d07-b378-483a-000000008f78 | TIMING | tripleo_container_manage : Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_3 | overcloud-controller-0 | 0:35:03.382647 | 58.44s
~~~
https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset039-wallaby/d0a6625/logs/overcloud-controller-0/var/log/extra/podman/containers/nova_db_sync/stdout.log.txt.gz
~~~
Running command: '/usr/bin/bootstrap_host_exec nova_conductor su nova -s /bin/bash -c '/usr/bin/nova-manage db sync ''
+ CMD='/usr/bin/bootstrap_host_exec nova_conductor su nova -s /bin/bash -c '\''/usr/bin/nova-manage db sync '\'''
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
+ echo 'Running command: '\''/usr/bin/bootstrap_host_exec nova_conductor su nova -s /bin/bash -c '\''/usr/bin/nova-manage db sync '\'''\'''
+ exec /usr/bin/bootstrap_host_exec nova_conductor su nova -s /bin/bash -c ''\''/usr/bin/nova-manage' db sync \'
An error has occurred:
Traceback (most recent call last):
File "/usr/lib64/python3.9/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
self.dialect.do_execute(
File "/usr/lib64/python3.9/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
cursor.execute(statement, parameters)
File "/usr/lib/python3.9/site-packages/pymysql/cursors.py", line 163, in execute
result = self._query(query)
File "/usr/lib/python3.9/site-packages/pymysql/cursors.py", line 321, in _query
conn.query(q)
File "/usr/lib/python3.9/site-packages/pymysql/connections.py", line 505, in query
self._affected_rows = self._read_query_result(unbuffered=unbuffered)
File "/usr/lib/python3.9/site-packages/pymysql/connections.py", line 724, in _read_query_result
result.read()
File "/usr/lib/python3.9/site-packages/pymysql/connections.py", line 1069, in read
first_packet = self.connection._read_packet()
File "/usr/lib/python3.9/site-packages/pymysql/connections.py", line 676, in _read_packet
packet.raise_for_error()
File "/usr/lib/python3.9/site-packages/pymysql/protocol.py", line 223, in raise_for_error
err.raise_mysql_exception(self._data)
File "/usr/lib/python3.9/site-packages/pymysql/err.py", line 107, in raise_mysql_exception
raise errorclass(errno, errval)
pymysql.err.OperationalError: (1047, 'WSREP has not yet prepared node for application use')
~~~
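Error 1047 means the Galera node was not yet in a usable (Synced) state when the db_sync container connected. As a rough illustration of a possible workaround, the sketch below polls a readiness probe before starting a sync-style task; `wait_for_wsrep_ready`, `probe`, `attempts`, and `delay` are hypothetical names for illustration, not part of tripleo. In a real check, the probe would run `SHOW STATUS LIKE 'wsrep_ready'` against the local node.

```python
import time


class NotReadyError(Exception):
    """Stand-in for pymysql's OperationalError
    (1047, 'WSREP has not yet prepared node for application use')."""


def wait_for_wsrep_ready(probe, attempts=10, delay=1.0):
    """Poll `probe` (a callable returning True once the node reports
    wsrep_ready=ON) before starting a db_sync-style task.

    Returns the number of failed probes before success, or raises
    NotReadyError if the node never becomes ready.
    """
    for i in range(attempts):
        if probe():
            return i
        time.sleep(delay)
    raise NotReadyError("node never reached wsrep_ready=ON")
```

This is only a sketch of the retry idea; the actual fix would belong in the container startup ordering rather than in the sync command itself.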
Another example where multiple containers failed:
~~~
2022-07-07 00:04:24 | 2022-07-07 00:04:24.188154 | fa163e28-8c0a-f976-258d-000000007385 | FATAL | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_3 | overcloud-controller-0 | error={"changed": false, "msg": "Failed containers: glance_api_db_sync, keystone_db_sync, neutron_db_sync, nova_api_db_sync, nova_api_ensure_default_cell, keystone_bootstrap"}
2022-07-07 00:04:24 | 2022-07-07 00:04:24.188710 | fa163e28-8c0a-f976-258d-000000007385 | TIMING | tripleo_container_manage : Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_3 | overcloud-controller-0 | 0:27:18.507050 | 137.92s
~~~
This one shows an error similar to https://bugs.launchpad.net/tripleo/+bug/1980921, but since this job uses SSL endpoints there are also some SSL warnings in the logs.
There is a similar Galera bug: https://jira.mariadb.org/browse/MDEV-28688
~~~
2022-07-07  0:20:16 0 [Warning] WSREP: read_completion_condition(): unexpected eof while reading (167772454: 'error:0A000126:SSL routines::unexpected eof while reading')
2022-07-07  0:20:16 0 [Warning] WSREP: read_handler(): unexpected eof while reading (167772454: 'error:0A000126:SSL routines::unexpected eof while reading')
2022-07-07  0:20:16 0 [Note] WSREP: (880dbb5d-adac, 'ssl://172.17.0.226:4567') turning message relay requesting on, nonlive peers: ssl://172.17.0.23:4567
2022-07-07  0:20:17 0 [Note] WSREP: (880dbb5d-adac, 'ssl://172.17.0.226:4567') reconnecting to 8d698aa5-8ffd (ssl://172.17.0.23:4567), attempt 0
2022-07-07  0:20:17 0 [Note] WSREP: (880dbb5d-adac, 'ssl://172.17.0.226:4567') connection established to 8d698aa5-8ffd ssl://172.17.0.23:4567
2022-07-07  0:20:21 0 [Note] WSREP: (880dbb5d-adac, 'ssl://172.17.0.226:4567') turning message relay requesting off
2022-07-07  0:20:36 0 [Note] WSREP: (880dbb5d-adac, 'ssl://172.17.0.226:4567') connection to peer 82f5ae11-ab9a with addr ssl://172.17.0.203:4567 timed out, no messages seen in PT3S, socket stats: rtt: 2678 rttvar: 4608 rto: 1624000 lost: 1 last_data_recv: 3247 cwnd: 1 last_queued_since: 500077861 last_delivered_since: 3247760837 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2022-07-07  0:20:36 0 [Note] WSREP: (880dbb5d-adac, 'ssl://172.17.0.226:4567') connection to peer 8d698aa5-8ffd with addr ssl://172.17.0.23:4567 timed out, no messages seen in PT3S, socket stats: rtt: 1276 rttvar: 1889 rto: 1616000 lost: 1 last_data_recv: 3248 cwnd: 1 last_queued_since: 6030 last_delivered_since: 3247917910 send_queue_length: 1 send_queue_bytes: 212 segment: 0 messages: 1
2022-07-07  0:20:36 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: ssl://172.17.0.23:4567
~~~
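For ruck/rover triage it can help to separate the SSL "unexpected eof" warnings (the OpenSSL 3 behaviour discussed in MDEV-28688, where a peer closing without close_notify is now reported as an error) from actual peer timeouts, which are what break cluster membership. A minimal counting sketch; the regex patterns are assumptions based on the messages seen in this job, not an exhaustive list:

```python
import re

# Flag lines indicating a peer actually dropped (evs timeouts) versus the
# SSL eof warnings described in MDEV-28688.
TIMEOUT_RE = re.compile(r"timed out, no messages seen in PT\d+S")
EOF_RE = re.compile(r"unexpected eof while reading")


def triage(lines):
    """Return (eof_warnings, peer_timeouts) counts for a log excerpt."""
    eof = sum(1 for line in lines if EOF_RE.search(line))
    timeouts = sum(1 for line in lines if TIMEOUT_RE.search(line))
    return eof, timeouts
```

If the eof warnings appear without accompanying timeouts, the cluster likely recovered on its own and the warnings alone are probably noise.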