centos9 zed component and integration lines blocked fetching delorean.repo.md5

Bug #1991523 reported by Marios Andreou
This bug affects 1 person
Affects    Status          Importance    Assigned to    Milestone
tripleo    Fix Released    Critical      Unassigned

Bug Description

At [1][2][3] (and *many* other examples), the centos-9-zed-component-******-promote-to-promoted-components and centos-9-zed-promote-promoted-components-to-tripleo-ci-testing jobs are failing very early in the run, specifically while trying to fetch the delorean.repo.md5 hash from the delorean repo:

 2022-10-03 10:02:42.815024 | TASK [get_hash : get commit.yaml file]
 2022-10-03 10:04:27.031219 | primary | ERROR
 2022-10-03 10:04:27.031545 | primary | {
 2022-10-03 10:04:27.031598 | primary | "attempts": 3,
 2022-10-03 10:04:27.031635 | primary | "dest": "/home/zuul/workspace/commit.yaml",
 2022-10-03 10:04:27.031668 | primary | "elapsed": 3,
 2022-10-03 10:04:27.031700 | primary | "msg": "Request failed: <urlopen error [Errno 113] No route to host>",
 2022-10-03 10:04:27.031732 | primary | "url": "https://trunk.rdoproject.org/centos9-zed/component/compute/component-ci-testing/commit.yaml"

The issue is hitting us quite a lot. For reference, the zed pipeline builds are at [4]; at the time of writing, the last two runs at [5] and [6] are blocked on this. It is also intermittent: the testproject zuul reports at [7] contain many examples that were cleared on recheck.

[1] https://logserver.rdoproject.org/openstack-promote-component/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-9-zed-component-compute-promote-to-promoted-components/18e3302/job-output.txt
[2] https://logserver.rdoproject.org/openstack-periodic-integration-zed-centos9/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-9-zed-promote-promoted-components-to-tripleo-ci-testing/1e6e8fa/job-output.txt
[3] https://logserver.rdoproject.org/openstack-periodic-integration-zed-centos9/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-9-zed-promote-promoted-components-to-tripleo-ci-testing/a69b1b6/job-output.txt
[4] https://review.rdoproject.org/zuul/buildsets?pipeline=openstack-periodic-integration-zed-centos9
[5] https://review.rdoproject.org/zuul/buildset/9eb8d151c70741c59f482a8ecfd964d4
[6] https://review.rdoproject.org/zuul/buildset/d04732e4454e4c92b3feef4612b9c220
[7] https://review.rdoproject.org/r/c/testproject/+/45037/6#message-90bf4d44fe1faebbac166520a77e46ca9be9fc02

Alan Pevec (apevec) wrote :

we'll need timestamps of all network failures in order to file a ticket with our cloud provider
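
One way to collect those timestamps is to scan each job-output.txt for the failing message. The following is a minimal sketch, assuming the log lines keep the "timestamp | node | text" layout shown in the bug description; the function name and the `needle` parameter are illustrative and not part of any existing tooling.

```python
import re
from datetime import datetime

# Matches the leading "YYYY-MM-DD HH:MM:SS.ffffff | " prefix of a job-output line.
LINE_RE = re.compile(r"^\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) \| (.*)$")


def network_failure_timestamps(log_text, needle="No route to host"):
    """Return the timestamp of every log line whose text contains `needle`."""
    stamps = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match and needle in match.group(2):
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    return stamps


# Two lines copied from the job output quoted in the bug description.
sample = """\
 2022-10-03 10:04:27.031219 | primary | ERROR
 2022-10-03 10:04:27.031700 | primary | "msg": "Request failed: <urlopen error [Errno 113] No route to host>",
"""

for stamp in network_failure_timestamps(sample):
    print(stamp.isoformat())
```

Running the same function over each downloaded job-output.txt would give one list of failure times per job, which could then be merged for the provider ticket.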

Marios Andreou (marios-b) wrote :

Perhaps we can try bumping the retries/delay on the affected task.

Something like https://review.rdoproject.org/r/c/rdo-infra/ci-config/+/45255 to see if it helps us as a temporary measure.

Marios Andreou (marios-b) wrote :

so we merged https://review.rdoproject.org/r/c/rdo-infra/ci-config/+/45449 bumping the timeout to 5 mins (from ~1.5)

we'll have to keep an eye on it over the next few days [1] - anything in that list with duration ~1 min or so is likely hitting this. Hopefully we should not see it again after the 5th.

But again, this is really a temporary measure: even 1.5 mins should be long enough for the network to recover so we can fetch the dlrn hash.

[1] https://review.rdoproject.org/zuul/buildsets?pipeline=openstack-periodic-integration-zed-centos9
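
For illustration only, the retry-and-delay idea can be sketched in Python as below. This is not the ci-config change itself (the real task is an Ansible task in the get_hash role); the function name, the `retries` and `delay` defaults, and the injectable `fetch` and `sleep` parameters are all assumptions made so the sketch can run without network access.

```python
import time
import urllib.request
from urllib.error import URLError


def fetch_with_retries(url, retries=6, delay=60, fetch=None, sleep=time.sleep):
    """Fetch `url`, retrying on URLError up to `retries` attempts, `delay` seconds apart.

    The default values are illustrative, not the ones used in ci-config.
    """
    if fetch is None:
        # Real network fetch; only used when no stand-in is injected.
        fetch = lambda u: urllib.request.urlopen(u, timeout=30).read()
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except URLError as err:
            last_error = err
            if attempt < retries:
                sleep(delay)  # wait before the next attempt
    raise last_error


# Stand-in fetcher that fails twice with the error from the job log, then succeeds.
calls = []


def flaky(url):
    calls.append(url)
    if len(calls) < 3:
        raise URLError("[Errno 113] No route to host")
    return b"placeholder"  # placeholder body, not real commit.yaml content


body = fetch_with_retries(
    "https://trunk.rdoproject.org/centos9-zed/component/compute/component-ci-testing/commit.yaml",
    fetch=flaky,
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(len(calls), body)
```

With a fixed delay, the longest wait before giving up is roughly (retries - 1) * delay plus the time spent in each request, so a longer window only helps if the route comes back within it.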

Marios Andreou (marios-b) wrote :

So a mixed result... seems like the timeout bump has helped us at least a bit...

I can't see any more of this in the builds at [1] for the promote-promoted-components-to-tripleo-ci-testing job; the last failure is from the 4th (1 min 43 secs, 2022-10-04 13:15:58).

HOWEVER, I have found one example where the timeout bump didn't help and the issue persists, at [2] in the promote-to-promoted-components job:

 2022-10-06 02:27:41.957134 | TASK [get_hash : get commit.yaml file]
 2022-10-06 02:33:06.607446 | primary | ERROR
 2022-10-06 02:33:06.607735 | primary | {
 2022-10-06 02:33:06.607788 | primary | "attempts": 6,
 2022-10-06 02:33:06.607826 | primary | "dest": "/home/zuul/workspace/commit.yaml",
 2022-10-06 02:33:06.607863 | primary | "elapsed": 3,
 2022-10-06 02:33:06.607915 | primary | "msg": "Request failed: <urlopen error [Errno 113] No route to host>",
 2022-10-06 02:33:06.607971 | primary | "url": "https://trunk.rdoproject.org/centos9-zed/component/compute/component-ci-testing/commit.yaml"
 2022-10-06 02:33:06.608008 | primary | }

[1] https://review.rdoproject.org/zuul/buildsets?pipeline=openstack-periodic-integration-zed-centos9
[2] https://logserver.rdoproject.org/openstack-promote-component/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-9-zed-component-compute-promote-to-promoted-components/d54d8c0/job-output.txt
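
As a quick check on how long the task actually waited in each case, the TASK and ERROR timestamps from the two quoted logs can be subtracted. This is a small sketch that assumes the TASK line marks the start of the task and the ERROR line marks when it gave up; `task_duration` is a hypothetical helper name.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def task_duration(start, end):
    """Seconds elapsed between two job-output timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# Timestamps copied from the log in the bug description (3 attempts).
before = task_duration("2022-10-03 10:02:42.815024", "2022-10-03 10:04:27.031219")
# Timestamps copied from the log in this comment (6 attempts).
after = task_duration("2022-10-06 02:27:41.957134", "2022-10-06 02:33:06.607446")

print(round(before), round(after))  # 104 325
```

So the task gave up after about 1 min 44 secs before the change and about 5 min 25 secs after it, which suggests that in this run the host stayed unreachable for the whole extended window.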

Marios Andreou (marios-b) wrote :
Changed in tripleo:
status: Triaged → Fix Released