tripleo-ci-centos-8-scenario001-standalone is failing in master and wallaby (check/gate/periodic) - tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern

Bug #1940866 reported by Ronelle Landy
This bug affects 1 person
Affects   Status         Importance   Assigned to         Milestone
tripleo   Fix Released   Critical     Francesco Pantano

Bug Description

tripleo-ci-centos-8-scenario001-standalone (master and stable/wallaby)
periodic-tripleo-ci-centos-8-scenario001-standalone-wallaby
periodic-tripleo-ci-centos-8-scenario001-standalone-master

are showing failures in tempest tests:

tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern
and
tempest.scenario.test_snapshot_pattern.TestSnapshotPattern

The pass/fail trends of these tests are linked below:

https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-8-scenario001-standalone&branch=master

https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-8-scenario001-standalone-wallaby

https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-8-scenario001-standalone-master
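
To put a number on the trend behind these links, the build results can be tallied per job. This is only a minimal sketch: it assumes the builds have already been fetched as a list of JSON objects with a `result` field, as the Zuul builds listing provides. The helper name `summarize_builds` is hypothetical and the sample data is invented for illustration.

```python
from collections import Counter


def summarize_builds(builds):
    """Return (count per result, share of non-SUCCESS builds)."""
    # Builds that are still running may have no result yet.
    counts = Counter(b.get("result") or "UNKNOWN" for b in builds)
    total = sum(counts.values())
    failed = total - counts.get("SUCCESS", 0)
    rate = failed / total if total else 0.0
    return counts, rate


# Invented sample data, NOT real results from the jobs above.
JOB = "tripleo-ci-centos-8-scenario001-standalone"
sample_builds = [
    {"job_name": JOB, "result": "SUCCESS"},
    {"job_name": JOB, "result": "FAILURE"},
    {"job_name": JOB, "result": "TIMED_OUT"},
    {"job_name": JOB, "result": "SUCCESS"},
]

counts, rate = summarize_builds(sample_builds)
print(dict(counts), f"non-success rate: {rate:.0%}")
```

Treating every result other than SUCCESS as a failure is a deliberate simplification; a job that fails randomly would show up here as a non-success rate well above zero but below 100%.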

Possibly there are different issues here.

Looking at the following log, the traces are visible:

https://69d5e090e76b3fc6d697-75d4dc47c3578b823b14aaecdf686f09.ssl.cf2.rackcdn.com/805526/1/gate/tripleo-ci-centos-8-scenario001-standalone/d75446c/logs/undercloud/var/log/tempest/stestr_results.html

tempest.scenario.test_volume_boot_pattern:
tempest.lib.exceptions.UnexpectedResponseCode: Unexpected response code received
Details: 503

tempest.scenario.test_snapshot_pattern.TestSnapshotPattern:
tempest.lib.exceptions.UnexpectedResponseCode: Unexpected response code received
Details: 503

Ronelle Landy (rlandy)
Changed in tripleo:
milestone: none → xena-3
importance: Undecided → Critical
status: New → Triaged
tags: added: ci promotion-blocker
Giulio Fidente (gfidente) wrote :

503 seems to be coming from neutron

    Response - Headers: {'cache-control': 'no-cache', 'connection': 'close', 'content-type': 'text/html', 'status': '503', 'content-location': 'http://192.168.24.3:9696/v2.0/floatingips/9222ffe1-b94e-4f62-a8c1-330fbc546493'}

but cinder is also failing to connect to mysql; wonder if this isn't the result of oom killer doing its work?
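
The attribution of the 503 to neutron follows from the port in the `content-location` header. A minimal sketch of that lookup, assuming the conventional default API ports (a deployment may override them); `DEFAULT_API_PORTS` and `service_from_headers` are hypothetical names used only for illustration:

```python
from urllib.parse import urlparse

# Conventional default API ports (assumption: not overridden in this deployment).
DEFAULT_API_PORTS = {
    5000: "keystone",
    8774: "nova",
    8776: "cinder",
    9292: "glance",
    9696: "neutron",
}


def service_from_headers(headers):
    """Guess which service answered, based on the content-location port."""
    url = headers.get("content-location", "")
    port = urlparse(url).port  # None when the URL is missing or has no port
    return DEFAULT_API_PORTS.get(port, "unknown")


# Headers copied from the tempest log quoted above.
headers = {
    "cache-control": "no-cache",
    "connection": "close",
    "content-type": "text/html",
    "status": "503",
    "content-location": "http://192.168.24.3:9696/v2.0/floatingips/"
                        "9222ffe1-b94e-4f62-a8c1-330fbc546493",
}

print(headers["status"], service_from_headers(headers))
```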

yatin (yatinkarel) wrote :

<< Not sure the root cause of the issue yet.

I noticed the failures on https://review.opendev.org/c/openstack/tripleo-heat-templates/+/805029, which failed on multiple rechecks. All of the failures are caused by excessive memory consumption leading to OOM kills and pcs timeouts, which in turn result in tempest or other failures. Inspecting the atop log, I see that the ceph processes are consuming a lot of memory, and it seems https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/803493/ triggered the issue.

The symptoms and timing point to that patch. This can be verified by reverting https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/803493/ (the corresponding depends-on tripleo-common patches may also need to be reverted) and running the scenario001 job multiple times, since the job fails randomly. If the revert confirms the issue, we can go with the revert, and someone from ceph can check why the ceph processes are consuming so much memory.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)
Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart-extras (master)

Change abandoned by "yatin <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/806382
Reason: Being fixed in https://review.opendev.org/c/openstack/tripleo-heat-templates/+/806122

Francesco Pantano (fmount) wrote :

After further investigation of this bug, we found the oom killer being invoked against ceph-mgr multiple times.
Based on the additional tests we ran via [1] and [2], the root cause appears to be a memory issue, which can be fixed via [1].

[1] https://review.opendev.org/q/topic:%22disable_cephadm%22+(status:open%20OR%20status:merged)
[2] https://review.rdoproject.org/r/c/testproject/+/35173
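
One way to confirm repeated oom killer hits on a given process is to count the kernel's "Killed process" reports per process name. This is a rough sketch, not the method used in the investigation above: `count_oom_victims` is a hypothetical helper, the log lines are invented, and the exact wording of the kernel message can vary between kernel versions.

```python
import re
from collections import Counter

# Matches the "Killed process <pid> (<name>)" part of a kernel OOM report.
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def count_oom_victims(lines):
    """Count how many times each process name was killed by the oom killer."""
    victims = Counter()
    for line in lines:
        match = OOM_KILL_RE.search(line)
        if match:
            victims[match.group(2)] += 1
    return victims


# Invented log lines for illustration only, NOT taken from the job logs.
sample_log = [
    "kernel: Out of memory: Killed process 4242 (ceph-mgr) total-vm:2048000kB",
    "kernel: device eth0 entered promiscuous mode",
    "kernel: Out of memory: Killed process 4310 (ceph-mgr) total-vm:2099000kB",
]

for name, hits in count_oom_victims(sample_log).most_common():
    print(f"{name}: killed {hits} time(s)")
```

In practice the input would be the lines of the node's kernel log or journal rather than an in-memory list.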

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/806122
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/733d1bad46bb17b7e68bc61af47819423aaf6efc
Submitter: "Zuul (22348)"
Branch: master

commit 733d1bad46bb17b7e68bc61af47819423aaf6efc
Author: Francesco Pantano <email address hidden>
Date: Thu Aug 26 11:12:34 2021 +0200

    Disable cephadm when ceph is deployed

    After ceph is deployed, which means day1 operations are over
    and the Ceph cluster daemons are up && running, cephadm can
    be paused by running the 'disable_cephadm' playbook.
    This can be triggered at step3 (to ensure all the daemons are
    started) and can be triggered by the DisableCephadm exposed
    parameter.

    Closes-Bug: 1940866
    Depends-On: I5952bb8a7a327e37a39acc95751cd5746fd9a41d
    Change-Id: Idbafdaced85c945d38940be83c1a269c73e4cb6b

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/806528

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/806528
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/1b41f26657a08e2844f7f8633763262b093591b1
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 1b41f26657a08e2844f7f8633763262b093591b1
Author: Francesco Pantano <email address hidden>
Date: Thu Aug 26 11:12:34 2021 +0200

    Disable cephadm when ceph is deployed

    After ceph is deployed, which means day1 operations are over
    and the Ceph cluster daemons are up && running, cephadm can
    be paused by running the 'disable_cephadm' playbook.
    This can be triggered at step3 (to ensure all the daemons are
    started) and can be triggered by the DisableCephadm exposed
    parameter.

    Closes-Bug: 1940866
    Depends-On: I5952bb8a7a327e37a39acc95751cd5746fd9a41d
    Change-Id: Idbafdaced85c945d38940be83c1a269c73e4cb6b
    (cherry picked from commit 733d1bad46bb17b7e68bc61af47819423aaf6efc)

tags: added: in-stable-wallaby
Changed in tripleo:
assignee: nobody → Francesco Pantano (fmount)
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 15.1.0

This issue was fixed in the openstack/tripleo-heat-templates 15.1.0 release.
