fs001 and fs035 OVB jobs failing tempest - identity/haproxy connection errors

Bug #1971465 reported by Ronelle Landy
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
tripleo | Invalid | Critical | Unassigned |

Bug Description

fs001 and fs035 OVB jobs have been largely unstable for about a month, across multiple releases.
Various errors are being watched (including node provision and overcloud deploy failures), but many tests are failing at the tempest stage with what look like related connection errors.

On CentOS 8 releases (wallaby and train), we see the following errors:

ft1.1: setUpClass (keystone_tempest_plugin.tests.api.identity.v3.test_identity_providers.IndentityProvidersTest)testtools.testresult.real._StringException: Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tempest/test.py", line 182, in setUpClass

....

   'Unexpected status code {0}'.format(resp.status))
tempest.lib.exceptions.IdentityError: Got identity error
Details: Unexpected status code 500

Example logs:

https://logserver.rdoproject.org/55/36255/67/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-wallaby/9967e52/logs/undercloud/var/log/tempest/stestr_results.html.gz

https://logserver.rdoproject.org/62/42462/2/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-train/9da7633/logs/undercloud/var/log/tempest/stestr_results.html.gz

https://logserver.rdoproject.org/91/42491/1/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-train/626ac5d/logs/undercloud/var/log/tempest/stestr_results.html.gz

with related logs:

https://logserver.rdoproject.org/55/36255/67/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-wallaby/9967e52/logs/overcloud-controller-0/var/log/containers/keystone/keystone.log.txt.gz

https://logserver.rdoproject.org/55/36255/67/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-wallaby/9967e52/logs/overcloud-controller-0/var/log/containers/haproxy/haproxy.log.txt.gz

On CentOS 9 releases:

We see different - but possibly related - errors:

https://logserver.rdoproject.org/55/36255/67/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset035-master/498eb76/logs/undercloud/var/log/tempest/stestr_results.html.gz

  File "/usr/lib/python3.9/site-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

These errors are preventing promotions of wallaby and train c8 and delaying the master c9 promotion.
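
To tell the two failure modes apart when going through many tempest results, something like the following could be used. This is only a sketch: the two patterns are copied from the tracebacks quoted above, while the names `SIGNATURES` and `classify_failure` are made up for illustration and are not part of tempest or tripleo-ci.

```python
import re

# Patterns taken from the tracebacks quoted in this report.
# The dictionary keys are hypothetical labels, not official names.
SIGNATURES = {
    "identity_500": re.compile(
        r"IdentityError: Got identity error.*?Unexpected status code 500",
        re.DOTALL,
    ),
    "connection_refused": re.compile(
        r"ConnectionRefusedError: \[Errno 111\] Connection refused"
    ),
}


def classify_failure(log_text):
    """Return the sorted labels of every known signature found in log_text."""
    return sorted(
        name for name, pattern in SIGNATURES.items() if pattern.search(log_text)
    )


sample = (
    "tempest.lib.exceptions.IdentityError: Got identity error\n"
    "Details: Unexpected status code 500\n"
)
print(classify_failure(sample))
```

A result with both labels in the same run would support the idea that the c8 and c9 errors share a cause, but the classifier itself proves nothing about the root cause.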

Revision history for this message
Ronelle Landy (rlandy) wrote :

For additional info: general OVB stats:

job_name, success, failure
periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-wallaby 0 10
periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby 1 13
periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset064-wallaby 16 22
periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset039-wallaby 15 27
periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset035-wallaby 10 29
periodic-tripleo-ci-centos-9-ovb-1ctlr_2comp-featureset020-wallaby 15 31
periodic-tripleo-ci-centos-9-ovb-1ctlr_1comp-featureset002-wallaby 40 7
periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-wallaby 6 40
periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset064-master 6 31
periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset039-master 4 33
periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset035-master 3 42
periodic-tripleo-ci-centos-9-ovb-1ctlr_2comp-featureset020-master 5 43
periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master 7 39
periodic-tripleo-ci-centos-9-ovb-1ctlr_1comp-featureset002-master 19 32
periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039-train 18 10
periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-train 1 19
periodic-tripleo-ci-centos-8-ovb-1ctlr_2comp-featureset020-train 14 16
periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-train 5 25
periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-train 21 7
periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039-victoria 6 7
periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-victoria 10 6
periodic-tripleo-ci-centos-8-ovb-1ctlr_2comp-featureset020-victoria 0 17
periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-victoria 0 13
periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-victoria 2 14
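
The raw counts are easier to compare as failure rates. A minimal sketch, assuming the rows are whitespace-separated as pasted above; only three rows from the table are included here as sample input, and `failure_rates` is a hypothetical helper name.

```python
# Three rows copied from the stats above; paste the full table to rank all jobs.
STATS = """\
job_name, success, failure
periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-wallaby 0 10
periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby 1 13
periodic-tripleo-ci-centos-9-ovb-1ctlr_1comp-featureset002-wallaby 40 7
"""


def failure_rates(text):
    """Parse 'job success failure' rows and sort them by failure rate, worst first."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip the header and anything that is not 'name int int'.
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        name, success, failure = parts[0], int(parts[1]), int(parts[2])
        total = success + failure
        rate = failure / total if total else 0.0
        rows.append((name, success, failure, rate))
    return sorted(rows, key=lambda row: row[3], reverse=True)


for name, ok, bad, rate in failure_rates(STATS):
    print(f"{rate:6.1%}  {ok:3d} ok / {bad:3d} failed  {name}")
```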

Changed in tripleo:
milestone: none → zed-1
importance: Undecided → Critical
status: New → Triaged
tags: added: promotion-blocker
Revision history for this message
Ronelle Landy (rlandy) wrote :

The suspicion is that this is not tempest related; tempest is just revealing the problem.
python-tempestconf runs and completes its setup successfully, so we are looking at the connectivity failures between the controllers and haproxy.
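
One way to check connectivity independently of tempest is a plain TCP connect against the endpoint haproxy is expected to serve. This is a sketch using only the Python standard library; `probe` is a hypothetical helper, and the address and port in the usage comment are placeholders, not values taken from these jobs.

```python
import socket


def probe(host, port, timeout=3.0):
    """Try a TCP connect and return 'open', 'refused', 'timeout' or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Same condition as the [Errno 111] seen in the c9 tempest results.
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


# Hypothetical usage, with a placeholder VIP and port:
#   print(probe("192.0.2.10", 5000))
```

"refused" means the host answered but nothing was listening on that port, while "timeout" points more toward the network path or an overloaded node, so the distinction narrows down where to look.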

Revision history for this message
Ronelle Landy (rlandy) wrote :

Attempting to add swap to overcloud nodes per comments:

<cloudnull> undercloud is showing a pretty heavily used swap
<cloudnull> overcloud controller is showing no swap
<rlandy> empty there
<cloudnull> yeah
<cloudnull> I suspect because none exists
<cloudnull> maybe we can include the swap template on the deployment

https://review.opendev.org/c/openstack/tripleo-quickstart/+/840283
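
To confirm whether a node has swap at all, and how much of it is in use, the `SwapTotal` and `SwapFree` fields of `/proc/meminfo` are enough. A minimal sketch follows; `swap_summary` is a hypothetical helper and the sample values are made up, not taken from the job logs.

```python
def swap_summary(meminfo_text):
    """Parse /proc/meminfo content and return (swap_total_kb, swap_used_kb)."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    total = values.get("SwapTotal", 0)
    free = values.get("SwapFree", 0)
    return total, total - free


# Made-up sample resembling a node without swap.
SAMPLE = "MemTotal:  8000000 kB\nSwapTotal:  0 kB\nSwapFree:  0 kB\n"

total_kb, used_kb = swap_summary(SAMPLE)
if total_kb == 0:
    print("no swap configured")
else:
    print(f"swap used: {used_kb} of {total_kb} kB")
```

On a real node the input would come from reading `/proc/meminfo` instead of `SAMPLE`.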

Revision history for this message
Marios Andreou (marios-b) wrote :

in the latest run of the master centos9 integration line [1]:

fs35 actually passed
fs1 failed tempest.api.volume.test_volumes_snapshots.VolumesSnapshotTestJSON
fs20 failed neutron_tempest_plugin.scenario. various tests

[1] https://review.rdoproject.org/zuul/buildset/86001145b6ee4bbf9d69d6e9345c52d6

Revision history for this message
Marios Andreou (marios-b) wrote :

[EDIT]: I filed this as a new bug with https://bugs.launchpad.net/tripleo/+bug/1971566

Another data point, not sure if it is related to the same root cause (e.g. if this is an environmental/performance issue with the nodes or vexx in general): we are also seeing a lot of package download/mirror issues. Just from my rounds today I found these 4; I am sure there are more, and I saw them yesterday too.

        * https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset039-master/ade50e9/logs/undercloud/home/zuul/undercloud_install.log.txt.gz
        * 2022-05-03 22:17:40.994822 | fa163ec5-0263-12c1-ad00-000000000c63 | FATAL | ensure apache is installed | undercloud | error={"changed": false, "msg": "Failed to download packages: httpd-2.4.51-5.el9.x86_64: Cannot download, all mirrors were already tried without success", "results": []}

        * https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset064-master/69bce4b/logs/undercloud/home/zuul/undercloud_install.log.txt.gz
        * 2022-05-03 22:32:15.611563 | fa163eb8-ebe6-ddfe-5123-000000002790 | TIMING | Wait for puppet host configuration to finish | undercloud | 0:18:44.834319 | 10.41s
        * Error: Error downloading packages:", "<13>May 3 22:32:12 puppet-user: net-snmp-1:5.9.1-7.el9.x86_64: Cannot download, all mirrors were already tried without success", "<13>May 3 22:32:12 puppet-user: Error: /Stage[main]/Snmp/Package[snmpd]/ensure: change from 'purged' to 'present' failed: Execution of '/bin/dnf -d 0 -e 1 -y install net-snmp' returned 1: Error: Error downloading packages:

        * https://logserver.rdoproject.org/openstack-component-security/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset039-security-master/ddf9a0d/logs/undercloud/home/zuul/undercloud_install.log.txt.gz
        * 2022-04-30 11:55:49.289903 | fa163efa-1aa9-264b-8b35-000000000c64 | FATAL | ensure apache is installed | undercloud | error={"changed": false, "msg": "Failed to download packages: apr-util-bdb-1.6.1-20.el9.x86_64: Cannot download, all mirrors were already tried without success", "results": []}

        * https://logserver.rdoproject.org/openstack-component-security/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset039-security-master/d5bcdbe/job-output.txt
        * 2022-05-02 13:26:37.637485 | primary | fatal: [undercloud]: FAILED! => {"changed": false, "msg": "Failed to download packages: python3-virtualenv-20.4.4-1.el9s.noarch: Cannot download, all mirrors were already tried without success", "results": []}
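
To see which packages the mirror failures hit most often across many logs, the failing package names can be pulled out with a regular expression. This is a sketch: the pattern is based only on the "Cannot download, all mirrors were already tried" messages quoted above, the two sample lines are shortened copies of those messages, and `failed_packages` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches '<package>: Cannot download, all mirrors were already tried'.
# The character class allows ':' so epochs such as 'net-snmp-1:5.9.1-7' survive.
PKG_RE = re.compile(
    r"([A-Za-z0-9_.+:-]+): Cannot download, all mirrors were already tried"
)


def failed_packages(lines):
    """Count how often each package name appears in mirror-failure messages."""
    counts = Counter()
    for line in lines:
        counts.update(PKG_RE.findall(line))
    return counts


LINES = [
    'error={"changed": false, "msg": "Failed to download packages: '
    'httpd-2.4.51-5.el9.x86_64: Cannot download, all mirrors were already '
    'tried without success", "results": []}',
    "<13>May 3 22:32:12 puppet-user: net-snmp-1:5.9.1-7.el9.x86_64: Cannot "
    "download, all mirrors were already tried without success",
]

for package, count in failed_packages(LINES).most_common():
    print(count, package)
```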

Revision history for this message
Marios Andreou (marios-b) wrote :

another angle we are trying is to split the tempest tests across two different jobs - fs1 executes all tempest.api tests currently

patch to split the tests there https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/840444

testing @ https://review.rdoproject.org/r/c/testproject/+/42554

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

We hit a bunch of other issues last week and were failing before tempest execution. We are keeping this under observation.

Revision history for this message
Alan Pevec (apevec) wrote :

how does it look now?

Revision history for this message
Marios Andreou (marios-b) wrote :

still not completely stable @Apevec I think we need to keep this around for a bit longer

Revision history for this message
Alan Pevec (apevec) wrote :

closing old promotion-blocker

Changed in tripleo:
status: Triaged → Invalid