CI jobs are randomly failing to start *_db_sync containers

Bug #1980921 reported by Sandeep Yadav
Affects: tripleo
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

Hello,

I have noticed some random failures in CI this week during ruck/rover. These failures are inconsistent, their frequency is not high, and the affected jobs pass on rerun.

Jobs are randomly failing while starting some containers (so far observed on *_db_sync containers).

https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset039-wallaby/d0a6625/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

~~~
2022-07-07 00:24:58 | 2022-07-07 00:24:58.267625 | fa163e7c-2d07-b378-483a-000000008f78 | FATAL | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_3 | overcloud-controller-0 | error={"changed": false, "msg": "Failed containers: nova_db_sync"}
2022-07-07 00:24:58 | 2022-07-07 00:24:58.268448 | fa163e7c-2d07-b378-483a-000000008f78 | TIMING | tripleo_container_manage : Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_3 | overcloud-controller-0 | 0:35:03.382647 | 58.44s
~~~

https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset039-wallaby/d0a6625/logs/overcloud-controller-0/var/log/extra/podman/containers/nova_db_sync/stdout.log.txt.gz

~~~
Running command: '/usr/bin/bootstrap_host_exec nova_conductor su nova -s /bin/bash -c '/usr/bin/nova-manage db sync ''
+ CMD='/usr/bin/bootstrap_host_exec nova_conductor su nova -s /bin/bash -c '\''/usr/bin/nova-manage db sync '\'''
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
+ echo 'Running command: '\''/usr/bin/bootstrap_host_exec nova_conductor su nova -s /bin/bash -c '\''/usr/bin/nova-manage db sync '\'''\'''
+ exec /usr/bin/bootstrap_host_exec nova_conductor su nova -s /bin/bash -c ''\''/usr/bin/nova-manage' db sync \'
An error has occurred:
Traceback (most recent call last):
  File "/usr/lib64/python3.9/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
    self.dialect.do_execute(
  File "/usr/lib64/python3.9/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/lib/python3.9/site-packages/pymysql/cursors.py", line 163, in execute
    result = self._query(query)
  File "/usr/lib/python3.9/site-packages/pymysql/cursors.py", line 321, in _query
    conn.query(q)
  File "/usr/lib/python3.9/site-packages/pymysql/connections.py", line 505, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "/usr/lib/python3.9/site-packages/pymysql/connections.py", line 724, in _read_query_result
    result.read()
  File "/usr/lib/python3.9/site-packages/pymysql/connections.py", line 1069, in read
    first_packet = self.connection._read_packet()
  File "/usr/lib/python3.9/site-packages/pymysql/connections.py", line 676, in _read_packet
    packet.raise_for_error()
  File "/usr/lib/python3.9/site-packages/pymysql/protocol.py", line 223, in raise_for_error
    err.raise_mysql_exception(self._data)
  File "/usr/lib/python3.9/site-packages/pymysql/err.py", line 107, in raise_mysql_exception
    raise errorclass(errno, errval)
pymysql.err.OperationalError: (1047, 'WSREP has not yet prepared node for application use')
~~~
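
For context, MySQL/MariaDB error 1047 ("WSREP has not yet prepared node for application use") means the local Galera node is not yet part of a Primary, synced component at the moment nova-manage connects. Below is a minimal sketch of the kind of readiness probe that could be retried before a db_sync step, using the same pymysql library that appears in the traceback; the host name and credentials are hypothetical, not taken from the job, and this is not part of the tripleo_container_manage workflow.

~~~
#!/usr/bin/env python3
"""Wait until the local Galera node reports wsrep_ready=ON (sketch only)."""
import sys
import time

import pymysql

HOST = "overcloud-controller-0.internalapi"  # assumption, not from the logs
USER = "nova"                                # assumption
PASSWORD = "secret"                          # assumption


def wsrep_ready(conn):
    """Return True if the node reports wsrep_ready=ON."""
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_ready'")
        row = cur.fetchone()  # e.g. ('wsrep_ready', 'ON')
        return row is not None and row[1] == "ON"


def wait_for_galera(timeout=300, interval=5):
    """Poll until wsrep_ready=ON or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            conn = pymysql.connect(host=HOST, user=USER, password=PASSWORD)
            try:
                if wsrep_ready(conn):
                    return True
            finally:
                conn.close()
        except pymysql.err.OperationalError:
            # Covers both "can't connect yet" and errors such as 1047
            # ("WSREP has not yet prepared node for application use").
            pass
        time.sleep(interval)
    return False


if __name__ == "__main__":
    sys.exit(0 if wait_for_galera() else 1)
~~~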

Another example where multiple containers failed:

~~~
2022-07-07 00:04:24 | 2022-07-07 00:04:24.188154 | fa163e28-8c0a-f976-258d-000000007385 | FATAL | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_3 | overcloud-controller-0 | error={"changed": false, "msg": "Failed containers: glance_api_db_sync, keystone_db_sync, neutron_db_sync, nova_api_db_sync, nova_api_ensure_default_cell, keystone_bootstrap"}
2022-07-07 00:04:24 | 2022-07-07 00:04:24.188710 | fa163e28-8c0a-f976-258d-000000007385 | TIMING | tripleo_container_manage : Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_3 | overcloud-controller-0 | 0:27:18.507050 | 137.92s
~~~

Tags: ci
Rabi Mishra (rabi) wrote:

This one has an error similar to https://bugs.launchpad.net/tripleo/+bug/1980921, but it is using SSL endpoints, hence some SSL warnings.

Similar Galera bug: https://jira.mariadb.org/browse/MDEV-28688

~~~
2022-07-07 0:20:16 0 [Warning] WSREP: read_completion_condition(): unexpected eof while reading (167772454: 'error:0A000126:SSL routines::unexpected eof while reading')
2022-07-07 0:20:16 0 [Warning] WSREP: read_handler(): unexpected eof while reading (167772454: 'error:0A000126:SSL routines::unexpected eof while reading')
2022-07-07 0:20:16 0 [Note] WSREP: (880dbb5d-adac, 'ssl://172.17.0.226:4567') turning message relay requesting on, nonlive peers: ssl://172.17.0.23:4567
2022-07-07 0:20:17 0 [Note] WSREP: (880dbb5d-adac, 'ssl://172.17.0.226:4567') reconnecting to 8d698aa5-8ffd (ssl://172.17.0.23:4567), attempt 0
2022-07-07 0:20:17 0 [Note] WSREP: (880dbb5d-adac, 'ssl://172.17.0.226:4567') connection established to 8d698aa5-8ffd ssl://172.17.0.23:4567
2022-07-07 0:20:21 0 [Note] WSREP: (880dbb5d-adac, 'ssl://172.17.0.226:4567') turning message relay requesting off
2022-07-07 0:20:36 0 [Note] WSREP: (880dbb5d-adac, 'ssl://172.17.0.226:4567') connection to peer 82f5ae11-ab9a with addr ssl://172.17.0.203:4567 timed out, no messages seen in PT3S, socket stats: rtt: 2678 rttvar: 4608 rto: 1624000 lost: 1 last_data_recv: 3247 cwnd: 1 last_queued_since: 500077861 last_delivered_since: 3247760837 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2022-07-07 0:20:36 0 [Note] WSREP: (880dbb5d-adac, 'ssl://172.17.0.226:4567') connection to peer 8d698aa5-8ffd with addr ssl://172.17.0.23:4567 timed out, no messages seen in PT3S, socket stats: rtt: 1276 rttvar: 1889 rto: 1616000 lost: 1 last_data_recv: 3248 cwnd: 1 last_queued_since: 6030 last_delivered_since: 3247917910 send_queue_length: 1 send_queue_bytes: 212 segment: 0 messages: 1
2022-07-07 0:20:36 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: ssl://172.17.0.23:4567
~~~

Sandeep Yadav (sandeepyadav93) wrote:

Thank you Rabi,

The SSL warning is probably not fatal, because fs001 (which is not an SSL job) is also failing, but I will request someone from #pidone to take a look.

https://logserver.rdoproject.org/66/43466/8/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby/71b4cc2/logs/overcloud-controller-0/var/log/extra/failed_containers.log.txt.gz

~~~
cd59b53e048c 192.168.24.1:8787/tripleowallabycentos8/openstack-cinder-api:0038058798919720edb4da904876798d-updated-20220707062809 kolla_start 7 minutes ago Exited (1) 7 minutes ago cinder_api_db_sync
f1a950530035 192.168.24.1:8787/tripleowallabycentos8/openstack-glance-api:0038058798919720edb4da904876798d-updated-20220707062809 /usr/bin/bootstra... 7 minutes ago Exited (1) 7 minutes ago glance_api_db_sync
db2882c2bb27 192.168.24.1:8787/tripleowallabycentos8/openstack-heat-engine:0038058798919720edb4da904876798d-updated-20220707062809 kolla_start 7 minutes ago Exited (1) 7 minutes ago heat_engine_db_sync
9a58fecdf395 192.168.24.1:8787/tripleowallabycentos8/openstack-keystone:0038058798919720edb4da904876798d-updated-20220707062809 /usr/bin/bootstra... 7 minutes ago Exited (1) 7 minutes ago keystone_db_sync
0ddc3ac6c747 192.168.24.1:8787/tripleowallabycentos8/openstack-neutron-server:0038058798919720edb4da904876798d-updated-20220707062809 kolla_start 7 minutes ago Exited (1) 7 minutes ago neutron_db_sync
~~~

Services are unable to make a connection to MySQL:

https://logserver.rdoproject.org/66/43466/8/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby/71b4cc2/logs/overcloud-controller-0/var/log/extra/errors.txt.gz

~~~
2022-07-07 07:53:23.361 ERROR /var/log/containers/glance/api.log: 17 ERROR glance raise exc.ResourceClosedError("This Connection is closed")
2022-07-07 07:53:23.361 ERROR /var/log/containers/glance/api.log: 17 ERROR glance sqlalchemy.exc.DBAPIError: (sqlalchemy.exc.ResourceClosedError) This Connection is closed
2022-07-07 07:53:23.361 ERROR /var/log/containers/glance/api.log: 17 ERROR glance (Background on this error at: http://sqlalche.me/e/dbapi)
2022-07-07 07:53:23.361 ERROR /var/log/containers/glance/api.log: 17 ERROR glance
~~~

Sandeep Yadav (sandeepyadav93) wrote:

Hello,

Based on the discussion with the #pidone team, this looks like a network issue.

~~~
I had a word with Luca Micinni from pidone. Based on the discussion, these look like random Galera failures and could be network related; the end result is a split brain, so the cluster takes a while to recover. [1]

About the warnings which Rabi pointed out in comment [2]: as per Luca, those could also be network related, i.e. OpenSSL complaining about not receiving the data it was expecting, but we need Damien to confirm that (he is on PTO this week).

[1] https://logserver.rdoproject.org/66/43466/8/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby/71b4cc2/logs/overcloud-controller-2/var/log/containers/mysql/mysqld.log.txt.gz
[2] https://bugs.launchpad.net/tripleo/+bug/1980921/comments/1
~~~
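
If the split-brain theory is right, the per-node Galera status should show it directly. Below is a rough sketch of such a check, again with pymysql; the addresses are the Galera peer IPs from the mysqld.log linked above (those listen on 4567 for replication, so reachability of the MySQL client port on each node, and the credentials, are assumptions made purely for illustration).

~~~
#!/usr/bin/env python3
"""Rough per-node Galera status check for spotting a split brain (sketch)."""
import pymysql

# Galera peer addresses taken from the mysqld.log excerpt above.
NODES = ["172.17.0.226", "172.17.0.23", "172.17.0.203"]
USER = "clustercheck"  # assumption
PASSWORD = "secret"    # assumption


def galera_status(host):
    """Return a dict of the interesting wsrep_* status variables."""
    conn = pymysql.connect(host=host, user=USER, password=PASSWORD)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SHOW GLOBAL STATUS WHERE Variable_name IN "
                "('wsrep_cluster_status', 'wsrep_cluster_size', "
                "'wsrep_local_state_comment')"
            )
            return dict(cur.fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    # A healthy 3-node cluster shows Primary / 3 / Synced on every node;
    # a non-Primary status or a smaller cluster size on some nodes is the
    # split-brain signature described above.
    for node in NODES:
        try:
            print(node, galera_status(node))
        except pymysql.err.OperationalError as exc:
            print(node, "unreachable or not ready:", exc)
~~~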
