periodic integration all branches RETRY Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org

Bug #1983817 reported by Marios Andreou
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: (empty)

Bug Description

At [1][2][3] you can see the latest buildsets for all periodic integration lines - multiple jobs are failing with status RETRY. The logs show an error during infra mirror setup e.g. from [4] like:

 2022-08-07 17:14:47.486138 | primary | CentOS Stream 9 - BaseOS 0.0 B/s | 0 B 01:38
 2022-08-07 17:14:47.486217 | primary | Errors during downloading metadata for repository 'baseos':
 2022-08-07 17:14:47.486274 | primary | - Curl error (6): Couldn't resolve host name for http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml [Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org]
 2022-08-07 17:14:47.486327 | primary | Error: Failed to download metadata for repo 'baseos': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried

Logs show this happened yesterday [4] and continues this morning (e.g. [1]). This blocks the c9/master, c9/wallaby, c8/wallaby, and c8/train lines.

[1] https://review.rdoproject.org/zuul/buildset/73812fd92e0f42978a461fd1cb890697
[2] https://review.rdoproject.org/zuul/buildset/1f2562fd8610489cb06d52171ed4db89
[3] https://review.rdoproject.org/zuul/buildset/bab36687352542d79a767437171197dd
[4] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-scenario002-standalone-master/6642351/job-output.txt
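
To confirm which hostnames fail to resolve in a given job-output.txt, the curl error (6) lines quoted above can be scanned mechanically. This is a minimal sketch, not part of tripleo-ci or zuul-jobs: the function name `unresolved_hosts` and the regular expression are assumptions based only on the log format shown in this report.

```python
import re
from collections import Counter

# Assumed pattern: matches the bracketed tail of the dnf/curl error (6)
# lines exactly as they appear in the log excerpts in this report.
UNRESOLVED_RE = re.compile(
    r"Curl error \(6\):.*?\[Could not resolve host: (?P<host>[^\]\s]+)\]"
)


def unresolved_hosts(lines):
    """Count how often each hostname appears in a 'Could not resolve host' error."""
    counts = Counter()
    for line in lines:
        match = UNRESOLVED_RE.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts
```

Passing an open file handle for a downloaded job-output.txt as `lines` should give a per-host count, which makes it easier to compare RETRY jobs across buildsets.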

Marios Andreou (marios-b) wrote :

adding some more info here - the url does resolve at least for me locally, and amoralej also just confirmed the same.

so could be infra related on vexx side? going to try reaching out on irc for now

description: updated
Marios Andreou (marios-b) wrote :

still ongoing. latest example there [1]

  - Curl error (6): Couldn't resolve host name for http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml [Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org]
Error: Failed to download metadata for repo 'baseos': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried

from (still running) openstack-periodic-integration-stable1

[1] https://review.rdoproject.org/zuul/build/77db171b24754221862c7e61805b1195

Alan Pevec (apevec) wrote :

which name resolver is used by CI job?

Marios Andreou (marios-b) wrote :

@Alan

the ones with RETRY have no logs but looking at the other jobs of the same buildset we can see /etc/resolv.conf [1] from green fs39

search ooo.test
nameserver 10.0.0.250

so that must be an internal vexx name server i think

[1] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset039-master/07558e1/logs/undercloud/etc/resolv.conf.txt.gz
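
For comparing resolver settings between jobs, the collected resolv.conf files can be reduced to their nameserver and search entries. The sketch below is illustrative only; `parse_resolv_conf` is a hypothetical helper, and it handles just the two keywords seen in the file quoted above.

```python
def parse_resolv_conf(text):
    """Return (nameservers, search_domains) from resolv.conf-style text."""
    nameservers, search = [], []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines starting with '#' or ';'.
        if not line or line[0] in "#;":
            continue
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            nameservers.append(values[0])
        elif keyword == "search":
            # With several 'search' lines, only the last one is kept.
            search = values
    return nameservers, search
```

Run against the file from the green featureset039 job, this would report 10.0.0.250 as the only nameserver and ooo.test as the search domain.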

Marios Andreou (marios-b) wrote :

This issue is ongoing today with examples from all the lines.

Master [1] Wallaby/9 [2] and Train/8 [3] (only one example in [3])

As before the jobs are in status RETRY with typical runtime of ~8 mins before the error hits

Example from [4]

2022-08-09 01:54:37.052925 | primary | Errors during downloading metadata for repository 'baseos':
2022-08-09 01:54:37.052967 | primary | - Curl error (6): Couldn't resolve host name for http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml [Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org]
2022-08-09 01:54:37.053008 | primary | Error: Failed to download metadata for repo 'baseos': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried

[1] https://review.rdoproject.org/zuul/buildset/c9e4fb9c8c8043e9bcd5e7690908ecd3
[2] https://review.rdoproject.org/zuul/buildset/a1734cf8868f414ca292564e74c570ba
[3] https://review.rdoproject.org/zuul/buildset/946a0f57002b42849e09c27efe101e99
[4] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset064-master/001daa6/job-output.txt

Marios Andreou (marios-b) wrote :

15:54 < guilhermesp___> nameserver 10.0.0.250 -- thats actually not an internal dns on us. We actually dont manage dns and when it comes by default on subnets, it is used to be cloudflare ( 1.1.1.1 )

Marios Andreou (marios-b) wrote :

been digging a bit here but the main problem is we have no/few logs when we actually hit RETRY e.g. at [1] no logs there besides the main out.txt

the failing task is happening *very* early, before any of undercloud or overcloud deployment. In fact this is before we execute any tripleo/quickstart code - it happens during configure-mirrors role invocation [2] which is in zuul-jobs. The failing task specifically is at [3]

This is why I thought 10.0.0.250 came from vexx, but according to guilhermesp on IRC that is not the case.

[1] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset039-wallaby/d2eb845/logs/
[2] https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/configure-mirrors/tasks
[3] https://opendev.org/zuul/zuul-jobs/src/commit/2bc6d7b9c30c27cf9e7a1b53cb8a128a493d89af/roles/configure-mirrors/handlers/main.yaml#L10

Marios Andreou (marios-b) wrote :

the 10.0.0.250 is a red herring - we don't know what name resolution is used when the error hits as we have no logs in those cases.

to clarify a bit because we are confusing two things here (and it is in large part my fault).

1. We DON'T have actual logs from RETRY jobs, so we don't know what is in resolv.conf in those cases.

2. The 10.0.0.250 is from successful jobs, where we've already done all the tripleo undercloud/overcloud config etc.

Marios Andreou (marios-b) wrote :

I think the dns in play for the error nodes might be 1.1.1.1 & 8.8.8.8 cloudflare/google

Even though we don't have access to /etc or other logs in [1], we do have the zuul/ansible host-info, e.g. [2]

  ansible_dns:
    nameservers:
    - 1.1.1.1
    - 8.8.8.8
    search:
    - openstacklocal
    - novalocal

[1] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-standalone-on-multinode-ipa-master/1a8f79c/logs/
[2] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-standalone-on-multinode-ipa-master/1a8f79c/zuul-info/host-info.secondary.yaml
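
Since the host-info files are the only resolver evidence available for the failing nodes, it can help to pull the `nameservers` list out of each one and compare across jobs. The sketch below is a deliberately simple line-based scan that avoids third-party dependencies; `ansible_dns_nameservers` is a hypothetical helper, it assumes the layout shown in the excerpt above, and a real YAML parser would be more robust.

```python
def ansible_dns_nameservers(text):
    """Collect the list items under 'nameservers:' in a host-info style YAML dump."""
    servers = []
    in_list = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "nameservers:":
            in_list = True
        elif in_list and line.startswith("- "):
            servers.append(line[2:].strip())
        elif in_list:
            # Any other line (for example 'search:') ends the list.
            in_list = False
    return servers
```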

Marios Andreou (marios-b) wrote :

noting that this continues today with master [1] wallaby/9 [2] and train lines [3]

it hits inconsistently so in the same build only some jobs (not consistently the same ones) are failing with this

[1] https://review.rdoproject.org/zuul/buildset/8d02c4052a0d46668331ca4c6f63eb2a
[2] https://review.rdoproject.org/zuul/buildset/c580f8e4923c46bd9afa780e994814e7
[3] https://review.rdoproject.org/zuul/buildset/dbeaa6fd208246acbe5c4c2c8bd12f59
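
Because the failure is intermittent, a single lookup proves little; resolving the mirror name several times in a row on an affected (for example held) node gives a better picture. This is a rough sketch under stated assumptions: `probe_resolution` and its parameters are hypothetical, port 80 is chosen only because the mirror URL is plain http, and the `resolver` argument exists solely so the function can be exercised without network access.

```python
import socket
import time


def probe_resolution(host, attempts=5, delay=1.0, resolver=socket.getaddrinfo):
    """Resolve `host` repeatedly and return a list of (attempt, ok, detail) tuples."""
    results = []
    for attempt in range(1, attempts + 1):
        try:
            infos = resolver(host, 80)
            # Each getaddrinfo entry ends with a sockaddr whose first item is the address.
            detail = sorted({info[4][0] for info in infos})
            results.append((attempt, True, detail))
        except socket.gaierror as exc:
            # Name resolution failed on this attempt; keep the error text.
            results.append((attempt, False, str(exc)))
        if attempt < attempts:
            time.sleep(delay)
    return results
```

A mix of successful and failed attempts for mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org on the same node would be consistent with the inconsistent RETRY pattern described here, although it would not by itself say which resolver is at fault.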

Marios Andreou (marios-b) wrote :

a slight improvement in latest runs... none in master [1] one for wallaby/9 [2] but a few for train [3]

I will try and get a node hold but it is tricky as this is inconsistent.

I'll try re-running those same train jobs and see whether it reproduces.

[1] https://review.rdoproject.org/zuul/buildset/cf05ba782561462bb009e489fed49854
[2] https://review.rdoproject.org/zuul/buildset/dfb198613ff344ed8807d849e0c42f56
[3] https://review.rdoproject.org/zuul/buildset/c8650794203747d9aac787f76a8b3749

Marios Andreou (marios-b) wrote :

Per comment #11 above, trying to get the hold with https://review.rdoproject.org/r/c/testproject/+/44513

Marios Andreou (marios-b) wrote :

this issue persists and is seen in the latest runs.

NOTE to avoid confusion we are also seeing NODE_FAILURE in some jobs which is reported as bug/1986502 [1]

The RETRY/dns issue tracked here is happening in latest master [2] and train [3] buildsets (8 jobs hit that in the train run)

[1] https://bugs.launchpad.net/tripleo/+bug/1986502
[2] https://review.rdoproject.org/zuul/buildset/953eacd0246648ada324268a4c21a8e6
[3] https://review.rdoproject.org/zuul/buildset/0a685aa16ac04f0eb43c04c925bfd36b

Sandeep Yadav (sandeepyadav93) wrote :

We are still hitting this randomly:-

https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset035-master/85376db/job-output.txt
~~~

2022-08-18 01:59:36.761993 | LOOP [configure-mirrors : Update yum/dnf cache]
2022-08-18 01:59:37.607884 | primary | 28 files removed
2022-08-18 02:01:16.428508 | primary | Loaded plugins: builddep, changelog, config-manager, copr, debug, debuginfo-install, download, generate_completion_cache, groups-manager, needs-restarting, playground, repoclosure, repodiff, repograph, repomanage, reposync
2022-08-18 02:01:16.428645 | primary | DNF version: 4.12.0
2022-08-18 02:01:16.428680 | primary | cachedir: /var/cache/dnf
2022-08-18 02:01:16.428707 | primary | Making cache files for all metadata files.
2022-08-18 02:01:16.428734 | primary | baseos: has expired and will be refreshed.
2022-08-18 02:01:16.428758 | primary | appstream: has expired and will be refreshed.
2022-08-18 02:01:16.428781 | primary | crb: has expired and will be refreshed.
2022-08-18 02:01:16.428803 | primary | extras-common: has expired and will be refreshed.
2022-08-18 02:01:16.428826 | primary | repo: downloading from remote: baseos
2022-08-18 02:01:16.428849 | primary | error: Curl error (6): Couldn't resolve host name for http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml [Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org] (http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml).
2022-08-18 02:01:16.428873 | primary | error: Curl error (6): Couldn't resolve host name for http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml [Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org] (http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml).
2022-08-18 02:01:16.428897 | primary | error: Curl error (6): Couldn't resolve host name for http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml [Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org] (http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml).
2022-08-18 02:01:16.428930 | primary | error: Curl error (6): Couldn't resolve host name for http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml [Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org] (http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml).
2022-08-18 02:01:16.428957 | primary | CentOS Stream 9 - BaseOS 0.0 B/s | 0 B 01:38
2022-08-18 02:01:16.428982 | primary | Errors during downloading metadata for repository 'baseos':
2022-08-18 02:01:16.429007 | primary | - Cur...


Sandeep Yadav (sandeepyadav93) wrote :
Alan Pevec (apevec) wrote :

That should be resolved based on updates in the VH support ticket, please review job runs since Aug 23 1400 UTC yesterday

Soniya Murlidhar Vyas (svyas) wrote :
daniel.pawlik (daniel-pawlik) wrote :

@svyas it seems that the CentOS mirror was down (or the cloud provider could not reach the host), rather than this being a DNS issue.

Ronelle Landy (rlandy) wrote :

https://support.vexxhost.com/hc/en-us/requests/362706 - comments left there to close this out

Changed in tripleo:
status: Triaged → Fix Released