Ussuri upgrade jobs fail in CI

Bug #1911020 reported by Sagi (Sergey) Shnaidman
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

Failing jobs:
tripleo-ci-centos-8-undercloud-upgrade-ussuri
https://zuul.openstack.org/builds?job_name=tripleo-ci-centos-8-undercloud-upgrade-ussuri+
tripleo-ci-centos-8-scenario000-multinode-oooq-container-upgrades-ussuri
https://zuul.openstack.org/builds?job_name=tripleo-ci-centos-8-scenario000-multinode-oooq-container-upgrades-ussuri+

Overcloud network endpoint not found when running tempest:
keystoneauth1.exceptions.catalog.EndpointNotFound: internal endpoint for network service in regionOne region not found
https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4e4/769595/1/check/tripleo-ci-centos-8-scenario000-multinode-oooq-container-upgrades-ussuri/4e49baf/job-output.txt

The "overcloud" cloud is not found when running tempest:

 Cloud overcloud was not found.
https://340bd2476776cdcc0b0e-9816e50b961196213d5638e985d3f02c.ssl.cf5.rackcdn.com/769595/1/check/tripleo-ci-centos-8-undercloud-upgrade-ussuri/d917201/job-output.txt

Revision history for this message
wes hayutin (weshayutin) wrote :

This looks like the mariadb connection is fubared.

https://6f2db3aca8a21b7bf2ca-43939889d4a60991ed49de41534ef9b0.ssl.cf5.rackcdn.com/769595/1/gate/tripleo-ci-centos-8-undercloud-upgrade-ussuri/1323868/logs/undercloud/var/log/extra/big-errors.txt.txt

CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
2021-01-09 09:28:29.503 ERROR /var/log/containers/keystone/keystone.log.1: 153 ERROR keystone.server.flask.request_processing.middleware.auth_context oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')

2021-01-09 09:09:18.196 ERROR /var/log/containers/ironic/app.log: 14 ERROR oslo_db.sqlalchemy.engines [req-d76944e1-6b0a-4208-8db3-327fbb60bb2a f297f11b2175441caa3f28589a3d9c36 98d7ebc325ed40ae878d07cb59a9ddaf - default default] Database connection was found disconnected; reconnecting: oslo_db.exception.DBConnectionError: (pymysql.err.InternalError) (1927, 'Connection was killed')
2021-01-09 09:09:18.196 ERROR /var/log/containers/ironic/app.log: 14 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):
2021-01-09 09:09:18.196 ERROR /var/log/containers/ironic/app.log: 14 ERROR oslo_db.sqlalchemy.engines File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1244, in _execute_context
2021-01-09 09:09:18.196 ERROR /var/log/containers/ironic/app.log: 14 ERROR oslo_db.sqlalchemy.engines cursor, statement, parameters, context
2021-01-09 09:09:18.196 ERROR /var/log/containers/ironic/app.log: 14 ERROR oslo_db.sqlalchemy.engines File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
2021-01-09 09:09:18.196 ERROR /var/log/containers/ironic/app.log: 14 ERROR oslo_db.sqlalchemy.engines cursor.execute(statement, parameters)
2021-01-09 09:09:18.196 ERROR /var/log/containers/ironic/app.log: 14 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3.6/site-packages/pymysql/cursors.py", line 165, in execute
2021-01-09 09:09:18.196 ERROR /var/log/containers/ironic/app.log: 14 ERROR oslo_db.sqlalchemy.engines result = self._query(query)
2021-01-09 09:09:18.196 ERROR /var/log/containers/ironic/app.log: 14 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3.6/site-packages/pymysql/cursors.py", line 321, in _query
2021-01-09 09:09:18.196 ERROR /var/log/containers/ironic/app.log: 14 ERROR oslo_db.sqlalchemy.engines conn.query(q)
2021-01-09 09:09:18.196 ERROR /var/log/containers/ironic/app.log: 14 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 860, in query
2021-01-09 09:09:18.196 ERROR /var/log/containers/ironic/app.log: 14 ERROR oslo_db.sqlalchemy.engines self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2021-01-09 09:09:18.196 ERROR /var/log/containers/ironic/app.log: 14 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 1061, in _read_query_result
2021-01-09 09:09:18.196 ERROR /var/log/containers/ironic/app.log: 14 ERROR oslo_db.sqlalchemy.engines result.read()
2021-01-09 09:09:18.196 ERROR /var/log/containers/ironic/app.log: 14 ERROR oslo_db....

Revision history for this message
wes hayutin (weshayutin) wrote :

IMHO this should have failed during the upgrade rather than completing and moving on to tempest.

Changed in tripleo:
milestone: none → wallaby-2
Revision history for this message
wes hayutin (weshayutin) wrote :

In the overcloud upgrade, it looks like the mariadb container is caught by a failing health check:

https://4c7502961c4643b6854e-e3236fcfb41e99e283b71fa476352983.ssl.cf5.rackcdn.com/769595/1/gate/tripleo-ci-centos-8-scenario000-multinode-oooq-container-upgrades-ussuri/203dc2b/logs/subnode-1/var/log/extra/failed_services.txt

Jan 09 09:55:17 centos-8-ovh-bhs1-0022460204 podman[180634]: 2021-01-09 09:55:17.643981633 +0000 UTC m=+0.527802145 container exec cc49c932d9f609a97e0992f586158db5270cf333e2c0488d35cc981f972448fe (image=192.168.24.1:8787/tripleou/centos-binary-mariadb:57e21c8afa1aacd4f03195f409ff827b, name=clustercheck)
Jan 09 09:55:17 centos-8-ovh-bhs1-0022460204 podman[180634]: unhealthy
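
As a rough illustration of how one could surface that state directly on the node (not part of the CI job itself; the play layout and host targeting are assumptions, only the clustercheck container name comes from the log above), a minimal Ansible sketch:

# Hedged sketch: ask podman to run the clustercheck container's built-in
# health check and fail the task when the check reports unhealthy.
- name: Check mariadb clustercheck container health
  hosts: all        # assumed: target the node hosting the clustercheck container
  become: true
  tasks:
    - name: Run the container health check
      ansible.builtin.command: podman healthcheck run clustercheck
      register: clustercheck_health
      changed_when: false
      failed_when: clustercheck_health.rc != 0   # non-zero rc means "unhealthy"
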

Revision history for this message
Alex Schultz (alex-schultz) wrote :

Tempest was switched to use an RPM in https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/770188 and it doesn't work.

Revision history for this message
Marios Andreou (marios-b) wrote :

The revert has been blocked with a Workflow -1 at https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/770188

Apparently Arx is trying to fix this in another way (I was pointed at https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/770357, but there may be others?).

@Arx please update this bug with the fixes you have proposed; it is still blocking ussuri. IMO we should have let the revert merge and then fixed it properly, since this blocks the gates.

Revision history for this message
Arx Cruz (arxcruz) wrote :

Hello,

The problem was a mix-up of variables between validate-tempest and os_tempest.
Because part of the jobs were running with os_tempest and another part (older releases) with validate-tempest, we had both run_tempest and use_os_tempest variables.

When we moved everything to run with os_tempest, I mistakenly assumed that all the jobs were running tempest and set the default use_os_tempest to True. This made the update/upgrade jobs run tempest and fail (which is something we should investigate later).

The patch https://review.opendev.org/c/openstack/tripleo-quickstart/+/770359 fixes this problem: it replaces run_tempest with use_os_tempest and sets it to false on all the featuresets that previously had it set to false.

A follow-up patch will remove the remaining validate-tempest variables and explicitly set use_os_tempest to true on the remaining featuresets, for the sake of documentation.
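
To make that concrete, the shape of the featureset change is roughly as follows; this is an illustrative YAML sketch, not the literal diff from 770359, and the file name is an assumption:

# Hedged sketch of a tripleo-quickstart featureset file
# (e.g. config/general_config/featuresetXXX.yml -- name assumed for illustration).

# Old validate-tempest era toggle, no longer consulted once everything runs
# through os_tempest:
run_tempest: false

# New toggle that os_tempest actually reads; update/upgrade featuresets keep it
# false so tempest is not executed after the upgrade:
use_os_tempest: false
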

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The patches are merged; is this fixed now?

Changed in tripleo:
status: Triaged → Fix Committed
Revision history for this message
wes hayutin (weshayutin) wrote :
Changed in tripleo:
status: Fix Committed → Fix Released