Centos-9 wallaby scenario 1 standalone tempest fails with SSHTimeout

Bug #1964133 reported by Marios Andreou
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: John Fulton
Milestone: (none)

Bug Description

At [1][2] the periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby job fails during the standalone deploy with a trace like:

 2022-03-07 13:36:09.261895 | fa163e01-389b-8a12-9130-0000000026ca | FATAL | search triple_run_cephadm_output of cephadm run(s) non-zero return codes | undercloud | error={"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result"}
 2022-03-07 13:36:09.263944 | fa163e01-389b-8a12-9130-0000000026ca | TIMING | tripleo_run_cephadm : search triple_run_cephadm_output of cephadm run(s) non-zero return codes | undercloud | 0:36:04.360107 | 0.03s

It looks like the issue is a timeout waiting for the OSDs to come up [3]:

 2022-03-08 14:10:25,196 p=104898 u=root n=ansible | 2022-03-08 14:10:25.195284 | fa163e84-153e-1b31-161f-0000000002ab | FATAL | Wait for expected number of osds to be running | standalone | error={"attempts": 40, "changed": true, "cmd": "podman run --rm --net=host --ipc=host --volume /etc/ceph:/etc/ceph:z --volume /home/ceph-admin/assimilate_ceph.conf:/home/assimilate_ceph.conf:z --volume /home/ceph-admin/specs/ceph_spec.yaml:/home/ceph_spec.yaml:z --entrypoint ceph 192.168.24.1:8787/ceph/daemon:v6.0.6-stable-6.0-pacific-centos-8-x86_64 --fsid 4b5c8c0a-ff60-454b-a1b4-9747aa737d19 -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring status --format json | jq .osdmap.num_up_osds", "delta": "0:00:00.880216", "end": "2022-03-08 14:10:25.163897", "msg": "", "rc": 0, "start": "2022-03-08 14:10:24.283681", "stderr": "", "stderr_lines": [], "stdout": "0", "stdout_lines": ["0"]}
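
The task above polls `ceph status --format json`, reads `.osdmap.num_up_osds`, and gives up after 40 attempts with the count still at 0. As a rough illustration of that retry loop, here is a minimal Python sketch. The names `num_up_osds`, `wait_for_osds`, and `fetch_status`, the delay, and the "at least `expected`" comparison are assumptions made for this example; this is not the tripleo-ansible implementation.

```python
import json
import time
from typing import Callable


def num_up_osds(status_json: str) -> int:
    """Return .osdmap.num_up_osds from `ceph status --format json` output."""
    return int(json.loads(status_json)["osdmap"]["num_up_osds"])


def wait_for_osds(fetch_status: Callable[[], str], expected: int,
                  attempts: int = 40, delay: float = 0.0) -> bool:
    """Poll until at least `expected` OSDs are up, or attempts run out.

    `fetch_status` is a hypothetical callable that returns the JSON text
    of one `ceph status --format json` invocation.
    """
    for _ in range(attempts):
        if num_up_osds(fetch_status()) >= expected:
            return True
        time.sleep(delay)
    return False
```

With a `fetch_status` that always reports 0 up OSDs, as in the log above, `wait_for_osds` returns `False` once the attempts are exhausted.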

This is a CentOS 9 wallaby promotion blocker.

[1] https://logserver.rdoproject.org/3d/3de35c67026f664c2dc7fca0bce8500664a5a0b0/openstack-periodic-integration-stable1-cs9/periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby/94d57f3/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz
[2] https://logserver.rdoproject.org/openstack-periodic-integration-stable1-cs9/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby/5504f1d/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz
[3] https://logserver.rdoproject.org/openstack-periodic-integration-stable1-cs9/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby/5504f1d/logs/undercloud/home/zuul/tripleo-deploy/standalone-ansible-ttaesph_/cephadm/cephadm_command.log.txt.gz

Changed in tripleo:
assignee: nobody → John Fulton (jfulton-org)
Revision history for this message
John Fulton (jfulton-org) wrote :

Although tripleo-ci-centos-9-scenario001-standalone [0] failed this morning, the failure was in tempest, and all OSDs came up [1] before tempest ran. Is there a relevant difference between these two jobs?

- periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby
- tripleo-ci-centos-9-scenario001-standalone

[0] https://zuul.opendev.org/t/openstack/build/5de6e6e2e2b441fd91382f2b7b2d6ed0/logs

[1] https://d2b5bc52e78cdf005618-65472637883373db40924f3bc9846264.ssl.cf2.rackcdn.com/831860/1/check/tripleo-ci-centos-9-scenario001-standalone/5de6e6e/logs/undercloud/home/zuul/tripleo-deploy/standalone-ansible-dazmxuhp/cephadm/cephadm_command.log

Revision history for this message
John Fulton (jfulton-org) wrote :

As per comment #0, periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby failed at 2022-03-08 14:10 because the OSDs did not start.

However, at 2022-03-08 19:46 we see 2 cinder tempest failures [1], so in that case the OSDs _did_ come up.

Depending on further runs we might need to focus more on the cinder tempest failures.

[1] https://logserver.rdoproject.org/54/36254/71/check/periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby/323904a/logs/undercloud/var/log/tempest/stestr_results.html.gz

Revision history for this message
John Fulton (jfulton-org) wrote :

I'll re-trigger the job with logging to file for ceph and try to correlate cinder tempest failures with anything in the ceph logs via timestamp.

 https://review.opendev.org/c/openstack/tripleo-heat-templates/+/832712
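
One way to do the timestamp correlation described above is to keep only the Ceph log lines that fall within a window around a tempest failure time. The sketch below is illustrative only: `parse_ceph_timestamp`, `lines_near`, and the 60-second default window are hypothetical, and the timestamp layout is assumed to match the `2022-03-08T13:49:20.458+0000` style seen in the Ceph OSD log.

```python
from datetime import datetime, timedelta
from typing import Iterable, List

# Assumed layout of the leading timestamp on each Ceph log line.
CEPH_TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_ceph_timestamp(stamp: str) -> datetime:
    """Parse a timestamp such as 2022-03-08T13:49:20.458+0000."""
    return datetime.strptime(stamp, CEPH_TS_FORMAT)


def lines_near(log_lines: Iterable[str], failure_time: datetime,
               window: timedelta = timedelta(seconds=60)) -> List[str]:
    """Return log lines stamped within `window` of `failure_time`.

    `failure_time` must be timezone-aware; lines whose first field is
    not a parseable timestamp are skipped.
    """
    matches = []
    for line in log_lines:
        stamp = line.split(" ", 1)[0]
        try:
            when = parse_ceph_timestamp(stamp)
        except ValueError:
            continue
        if abs(when - failure_time) <= window:
            matches.append(line)
    return matches
```

The failure time would come from the tempest results, converted to a timezone-aware `datetime` before calling `lines_near`.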

Revision history for this message
Ronelle Landy (rlandy) wrote :

Looks like we are down to tempest failures - deploy is passing now

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)
Revision history for this message
Marios Andreou (marios-b) wrote : Re: Centos-9 wallaby scenario 1 standalone fails running cephadm timeout waiting for OSDs

Posted a test for the related version bump from comment #5 (https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/832756)

test at https://review.rdoproject.org/r/c/testproject/+/40231 for scen1/4 master and wallaby

but we'll need the stable/wallaby cherrypick for https://review.opendev.org/c/openstack/tripleo-common/+/832755 before we can test those wallaby jobs.

Revision history for this message
Francesco Pantano (fmount) wrote :

I went through [1], analyzing the history of the job runs. This issue appeared for the first time on Mar 07 with job [2], and we saw the same result on Mar 08 with job [3].
The other failures (for job runs before Mar 07) are unrelated to the issue reported here: in those runs the Ceph cluster is up && running (including the OSD).
Now, focusing on jobs [2][3], after digging further into the logs I found that the failure is caused by the OSD failing to start [4]:

'''
2022-03-08T13:49:20.458+0000 7f4d84ee5080 -1 OSD::mkfs: ObjectStore::mkfs failed with error (5) Input/output error
2022-03-08T13:49:20.458+0000 7f4d84ee5080 -1 ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-0/: (5) Input/output error
'''
It is still not 100% clear why this is happening (now), but we're starting to move to the new Ceph container based on c8s to see whether the issue is still there.
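
To check other OSD logs for the same kind of failure, a small filter that extracts the trailing `(errno) message` pair from each line can help. This is a sketch under assumptions: `ERRNO_RE` and `find_errno_lines` are hypothetical names, and the pattern only covers lines that end with a parenthesised error number followed by its message, as in the two lines quoted above.

```python
import re
from typing import Iterable, List, Tuple

# Matches a trailing "(5) Input/output error" style suffix.
ERRNO_RE = re.compile(r"\((\d+)\) ([^()]+)$")


def find_errno_lines(log_lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (errno, message, full line) for lines ending in '(N) message'."""
    found = []
    for line in log_lines:
        text = line.rstrip()
        match = ERRNO_RE.search(text)
        if match:
            found.append((int(match.group(1)), match.group(2), text))
    return found
```

Run against the two lines quoted above, this reports error number 5 with the message `Input/output error` for both.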

[1] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby

[2] https://logserver.rdoproject.org/3d/3de35c67026f664c2dc7fca0bce8500664a5a0b0/openstack-periodic-integration-stable1-cs9/periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby/94d57f3/logs/undercloud/home/zuul/tripleo-deploy/standalone-ansible-yj9fjd75/cephadm/cephadm_command.log.txt.gz

[3] https://logserver.rdoproject.org/openstack-periodic-integration-stable1-cs9/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby/5504f1d/logs/undercloud/home/zuul/tripleo-deploy/standalone-ansible-ttaesph_/cephadm/cephadm_command.log.txt.gz

[4] https://logserver.rdoproject.org/openstack-periodic-integration-stable1-cs9/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario001-standalone-wallaby/5504f1d/logs/undercloud/var/log/ceph/4b5c8c0a-ff60-454b-a1b4-9747aa737d19/ceph-osd.0.log.txt.gz

Revision history for this message
Francesco Pantano (fmount) wrote :

We created the reviews [1] to bump the Ceph container from v6.0.6 to v6.0.7, built on top of c8s, and ran tests on RDO [2].
From the logs we can see that for both sc01 and sc04 the Ceph cluster is up && running and the OSD is built properly, with no I/O errors on mkfs.
Additional patches to test tripleo-ci are running [3] and master is green.
After confirming wallaby is green too (at least for Ceph), we can move forward and promote the new container image.

[1] https://review.opendev.org/q/topic:ceph_pacific_promotion
[2] https://review.rdoproject.org/r/c/testproject/+/40231
[3] https://review.opendev.org/q/topic:centos-storage-sig

summary: - Centos-9 wallaby scenario 1 standalone fails running cephadm timeout waiting for OSDs
+ Centos-9 wallaby scenario 1 standalone fails cinder tempest
summary: - Centos-9 wallaby scenario 1 standalone fails cinder tempest
+ Centos-9 wallaby scenario 1 standalone tempest fails with SSHTimeout
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-common/+/832810

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-common/+/832755
Committed: https://opendev.org/openstack/tripleo-common/commit/854cf9fadf1f1b210c5ccb855b6df44e53a668c2
Submitter: "Zuul (22348)"
Branch: master

commit 854cf9fadf1f1b210c5ccb855b6df44e53a668c2
Author: Francesco Pantano <email address hidden>
Date: Wed Mar 9 08:03:45 2022 +0100

    Bump Ceph container daemons to v6.0.7

    This change aligns the new tag with the latest released stable
    content (which is v6.0.7 for the daemon images).

    Related-Bug: #1964133
    Change-Id: I8f3841838cba0befebaf9f8ebe6f0e0923ce4b05

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/tripleo-common/+/832810
Committed: https://opendev.org/openstack/tripleo-common/commit/82eea318b88f7b8785910f04620a7e5c7cb5a22b
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 82eea318b88f7b8785910f04620a7e5c7cb5a22b
Author: Francesco Pantano <email address hidden>
Date: Wed Mar 9 08:03:45 2022 +0100

    Bump Ceph container daemons to v6.0.7

    This change aligns the new tag with the latest released stable
    content (which is v6.0.7 for the daemon images).

    Related-Bug: #1964133
    Change-Id: I8f3841838cba0befebaf9f8ebe6f0e0923ce4b05

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/832756
Committed: https://opendev.org/openstack/tripleo-quickstart-extras/commit/2a77c1ad58b7e37001519e74cc09fbe72a24bbf8
Submitter: "Zuul (22348)"
Branch: master

commit 2a77c1ad58b7e37001519e74cc09fbe72a24bbf8
Author: Francesco Pantano <email address hidden>
Date: Wed Mar 9 08:05:57 2022 +0100

    Bump Ceph container daemons to v6.0.7

    This change aligns the new tag with the latest released stable
    content (which is v6.0.7 for the daemon images).

    Depends-On: I8f3841838cba0befebaf9f8ebe6f0e0923ce4b05
    Related-Bug: #1964133
    Change-Id: Ieb0b9bc590b87925290d4b0dea5d09a29ab71f57

Alan Pevec (apevec)
Changed in tripleo:
status: Triaged → Fix Released