Bug #1983817 “periodic integration all branches RETRY Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org” : Bugs : tripleo

Marios Andreou (marios-b) wrote on 2022-08-08:

#1

adding some more info here - the url does resolve at least for me locally, and amoralej also just confirmed the same.

so could be infra related on vexx side? going to try reaching out on irc for now

Marios Andreou (marios-b) on 2022-08-08

description:

updated

Marios Andreou (marios-b) wrote on 2022-08-08:

#2

still ongoing. latest example there [1]

- Curl error (6): Couldn't resolve host name for http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml [Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org]
Error: Failed to download metadata for repo 'baseos': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried

from (still running) openstack-periodic-integration-stable1

[1] https://review.rdoproject.org/zuul/build/77db171b24754221862c7e61805b1195

Alan Pevec (apevec) wrote on 2022-08-08:

#3

which name resolver is used by CI job?

Marios Andreou (marios-b) wrote on 2022-08-09:

#4

@Alan

the ones with RETRY have no logs but looking at the other jobs of the same buildset we can see /etc/resolv.conf [1] from green fs39

search ooo.test
nameserver 10.0.0.250

so that must be an internal vexx name server i think

[1] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset039-master/07558e1/logs/undercloud/etc/resolv.conf.txt.gz
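For comparing resolver settings across nodes, a minimal sketch that pulls the nameserver and search entries out of a resolv.conf dump like the one above. The function name parse_resolv_conf is made up for illustration and only the two keywords seen in this file are handled, so treat it as a starting point rather than a full parser.

```python
def parse_resolv_conf(text):
    """Extract 'nameserver' and 'search' entries from resolv.conf-style text."""
    nameservers, search = [], []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines ('#' or ';').
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) > 1:
            nameservers.append(fields[1])
        elif fields[0] == "search":
            # A later 'search' line replaces an earlier one.
            search = fields[1:]
    return {"nameservers": nameservers, "search": search}


# The content quoted above from the green fs39 job.
print(parse_resolv_conf("search ooo.test\nnameserver 10.0.0.250\n"))
```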

Marios Andreou (marios-b) wrote on 2022-08-09:

#5

This issue is ongoing today with examples from all the lines.

Master [1] Wallaby/9 [2] and Train/8 [3] (only one example in [3])

As before the jobs are in status RETRY with typical runtime of ~8 mins before the error hits

Example from [4]

2022-08-09 01:54:37.052925 | primary | Errors during downloading metadata for repository 'baseos':
2022-08-09 01:54:37.052967 | primary | - Curl error (6): Couldn't resolve host name for http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml [Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org]
2022-08-09 01:54:37.053008 | primary | Error: Failed to download metadata for repo 'baseos': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried

[1] https://review.rdoproject.org/zuul/buildset/c9e4fb9c8c8043e9bcd5e7690908ecd3
[2] https://review.rdoproject.org/zuul/buildset/a1734cf8868f414ca292564e74c570ba
[3] https://review.rdoproject.org/zuul/buildset/946a0f57002b42849e09c27efe101e99
[4] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset064-master/001daa6/job-output.txt
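Since the RETRY jobs only leave job-output.txt behind, one way to triage them is to scan that file for the curl error shown above. This is a rough sketch: the regex is written against the exact message format quoted in this bug, and unresolved_hosts is a hypothetical helper name, not something that exists in tripleo-ci.

```python
import re
from collections import Counter

# Matches the dnf/curl message quoted above and captures the host name
# from the "[Could not resolve host: ...]" part.
CURL_DNS_ERROR = re.compile(
    r"Curl error \(6\): Couldn't resolve host name for \S+ "
    r"\[Could not resolve host: ([^\]\s]+)\]"
)


def unresolved_hosts(log_text):
    """Count how often each host appears in a curl 'could not resolve' error."""
    return Counter(CURL_DNS_ERROR.findall(log_text))
```

Feeding it the text of a saved job-output.txt gives a per-host count, which makes it easier to tell whether every failure points at the same mirror name.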

Marios Andreou (marios-b) wrote on 2022-08-09:

#6

15:54 < guilhermesp___> nameserver 10.0.0.250 -- thats actually not an internal dns on us. We actually dont manage dns and when it comes by default on subnets, it is used to be cloudflare ( 1.1.1.1 )

Marios Andreou (marios-b) wrote on 2022-08-09:

#7

been digging a bit here but the main problem is we have no/few logs when we actually hit RETRY e.g. at [1] no logs there besides the main out.txt

the failing task is happening *very* early, before any of undercloud or overcloud deployment. In fact this is before we execute any tripleo/quickstart code - it happens during configure-mirrors role invocation [2] which is in zuul-jobs. The failing task specifically is at [3]

This is why i thought 10.0.0.250 came from vexx but according to guilhermesp on IRC that is not the case.

[1] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset039-wallaby/d2eb845/logs/
[2] https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/configure-mirrors/tasks
[3] https://opendev.org/zuul/zuul-jobs/src/commit/2bc6d7b9c30c27cf9e7a1b53cb8a128a493d89af/roles/configure-mirrors/handlers/main.yaml#L10

Marios Andreou (marios-b) wrote on 2022-08-10:

#8

the 10.0.0.250 is a red herring - we don't know what name resolution is used when the error hits as we have no logs in those cases.

to clarify a bit because we are confusing two things here (and it is in large part my fault).

1. We DON'T have actual logs from RETRY jobs, so we don't know what is in resolv.conf in those cases.

2. The 10.0.0.250 is from successful jobs, where we've already done all the tripleo undercloud/overcloud config etc.

Marios Andreou (marios-b) wrote on 2022-08-10:

#9

I think the dns in play for the error nodes might be 1.1.1.1 & 8.8.8.8 cloudflare/google

Even though we don't have access to /etc or other logs in [1] we do have the zuul/ansible host-info e.g. [2]

  ansible_dns:
    nameservers:
    - 1.1.1.1
    - 8.8.8.8
    search:
    - openstacklocal
    - novalocal

[1] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-standalone-on-multinode-ipa-master/1a8f79c/logs/
[2] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-standalone-on-multinode-ipa-master/1a8f79c/zuul-info/host-info.secondary.yaml
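To collect the same detail from several host-info files, here is a small sketch that reads the nameservers list out of text laid out like the fragment above. It is a deliberately naive line scanner (ansible_nameservers is an invented name) and assumes the simple "key, then dash items" layout shown here; a real YAML parser would be the more robust choice.

```python
def ansible_nameservers(host_info_text):
    """Collect the '- <value>' items listed directly under a 'nameservers:' key."""
    servers, in_block = [], False
    for line in host_info_text.splitlines():
        stripped = line.strip()
        if stripped == "nameservers:":
            in_block = True
        elif in_block and stripped.startswith("- "):
            servers.append(stripped[2:].strip())
        elif in_block:
            # Any other line (e.g. the 'search:' key) ends the list.
            in_block = False
    return servers


fragment = """\
  ansible_dns:
    nameservers:
    - 1.1.1.1
    - 8.8.8.8
    search:
    - openstacklocal
    - novalocal
"""
print(ansible_nameservers(fragment))
```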

Marios Andreou (marios-b) wrote on 2022-08-10:

#10

noting that this continues today with master [1] wallaby/9 [2] and train lines [3]

it hits inconsistently so in the same build only some jobs (not consistently the same ones) are failing with this

[1] https://review.rdoproject.org/zuul/buildset/8d02c4052a0d46668331ca4c6f63eb2a
[2] https://review.rdoproject.org/zuul/buildset/c580f8e4923c46bd9afa780e994814e7
[3] https://review.rdoproject.org/zuul/buildset/dbeaa6fd208246acbe5c4c2c8bd12f59
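Because the failure is intermittent, a single lookup from a held node may not show anything. A possible approach, sketched below, is to repeat the lookup and report the fraction of failures. failure_rate and its parameters are illustrative names; the resolver and sleep arguments exist only so the loop can be exercised without network access, and local caching on the node could hide real failures.

```python
import socket
import time


def failure_rate(host, attempts=20, delay=1.0,
                 resolver=socket.gethostbyname, sleep=time.sleep):
    """Resolve host repeatedly and return the fraction of failed attempts."""
    failures = 0
    for i in range(attempts):
        try:
            resolver(host)
        except OSError:
            # socket.gaierror (name resolution failure) is an OSError subclass.
            failures += 1
        if i + 1 < attempts:
            sleep(delay)
    return failures / attempts
```

On a held node this could be called as failure_rate("mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org") to get a rough idea of how often the lookup fails.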

Marios Andreou (marios-b) wrote on 2022-08-11:

#11

a slight improvement in latest runs... none in master [1] one for wallaby/9 [2] but a few for train [3]

I will try and get a node hold but it is tricky as this is inconsistent.

I'll try re-running those same train jobs, let's see

[1] https://review.rdoproject.org/zuul/buildset/cf05ba782561462bb009e489fed49854
[2] https://review.rdoproject.org/zuul/buildset/dfb198613ff344ed8807d849e0c42f56
[3] https://review.rdoproject.org/zuul/buildset/c8650794203747d9aac787f76a8b3749

Marios Andreou (marios-b) wrote on 2022-08-11:

#12

Per comment #11 above, trying to get the hold with https://review.rdoproject.org/r/c/testproject/+/44513

Marios Andreou (marios-b) wrote on 2022-08-15:

#13

this issue persists and is seen in the latest runs.

NOTE to avoid confusion we are also seeing NODE_FAILURE in some jobs which is reported as bug/1986502 [1]

The RETRY/dns issue tracked here is happening in latest master [2] and train [3] buildsets (8 jobs hit that in the train run)

[1] https://bugs.launchpad.net/tripleo/+bug/1986502
[2] https://review.rdoproject.org/zuul/buildset/953eacd0246648ada324268a4c21a8e6
[3] https://review.rdoproject.org/zuul/buildset/0a685aa16ac04f0eb43c04c925bfd36b

Sandeep Yadav (sandeepyadav93) wrote on 2022-08-18:

#14

We are still hitting this randomly:-

https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset035-master/85376db/job-output.txt
~~~

2022-08-18 01:59:36.761993 | LOOP [configure-mirrors : Update yum/dnf cache]
2022-08-18 01:59:37.607884 | primary | 28 files removed
2022-08-18 02:01:16.428508 | primary | Loaded plugins: builddep, changelog, config-manager, copr, debug, debuginfo-install, download, generate_completion_cache, groups-manager, needs-restarting, playground, repoclosure, repodiff, repograph, repomanage, reposync
2022-08-18 02:01:16.428645 | primary | DNF version: 4.12.0
2022-08-18 02:01:16.428680 | primary | cachedir: /var/cache/dnf
2022-08-18 02:01:16.428707 | primary | Making cache files for all metadata files.
2022-08-18 02:01:16.428734 | primary | baseos: has expired and will be refreshed.
2022-08-18 02:01:16.428758 | primary | appstream: has expired and will be refreshed.
2022-08-18 02:01:16.428781 | primary | crb: has expired and will be refreshed.
2022-08-18 02:01:16.428803 | primary | extras-common: has expired and will be refreshed.
2022-08-18 02:01:16.428826 | primary | repo: downloading from remote: baseos
2022-08-18 02:01:16.428849 | primary | error: Curl error (6): Couldn't resolve host name for http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml [Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org] (http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml).
2022-08-18 02:01:16.428873 | primary | error: Curl error (6): Couldn't resolve host name for http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml [Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org] (http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml).
2022-08-18 02:01:16.428897 | primary | error: Curl error (6): Couldn't resolve host name for http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml [Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org] (http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml).
2022-08-18 02:01:16.428930 | primary | error: Curl error (6): Couldn't resolve host name for http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml [Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org] (http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml).
2022-08-18 02:01:16.428957 | primary | CentOS Stream 9 - BaseOS 0.0 B/s | 0 B 01:38
2022-08-18 02:01:16.428982 | primary | Errors during downloading metadata for repository 'baseos':
2022-08-18 02:01:16.429007 | primary | - Curl error (6): Couldn't resolve host name for http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/BaseOS/x86_64/os/repodata/repomd.xml [Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org]
2022-08-18 02:01:16.429031 | primary | Error: Failed to download metadata for repo 'baseos': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
2022-08-18 02:01:16.445201 |

~~~

Sandeep Yadav (sandeepyadav93) wrote on 2022-08-18:

#15

Please see this buildset: https://review.rdoproject.org/zuul/buildset/4d16c97499b94cbcbcc9a8940c47b127

Many jobs end up in retry with the same issue:

One example:
https://logserver.rdoproject.org/openstack-promote-component/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-9-wallaby-component-clients-promote-to-promoted-components/28a8d7a/job-output.txt

Alan Pevec (apevec) wrote on 2022-08-24:

#16

That should be resolved based on updates in the VH support ticket, please review job runs since Aug 23 1400 UTC yesterday

Soniya Murlidhar Vyas (svyas) wrote on 2022-08-25:

#17

this issue is still happening in periodic-tripleo-ci-centos-9-standalone-full-tempest-scenario-compute-master[1][2][3].

[1] https://logserver.rdoproject.org/openstack-component-compute/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-standalone-full-tempest-scenario-compute-master/8928215/job-output.txt
[2] https://logserver.rdoproject.org/openstack-component-compute/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-standalone-full-tempest-scenario-compute-master/a4d37f4/job-output.txt
[3] https://logserver.rdoproject.org/openstack-component-compute/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-standalone-full-tempest-scenario-compute-master/4e01a07/job-output.txt

daniel.pawlik (daniel-pawlik) wrote on 2022-08-25:

#18

@svyas it seems that the CentOS mirror was down (or the cloud provider could not reach the host), so that looks unrelated to the DNS issue tracked here.
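To tell the two cases apart from a node, something like the sketch below might help: it reports "dns" when the name does not resolve and "unreachable" when it resolves but the TCP connection fails. The function name classify and the port/timeout defaults are assumptions for illustration; the connect argument is only there so the logic can be tried without touching the network.

```python
import socket


def classify(host, port=80, timeout=5.0, connect=socket.create_connection):
    """Return 'dns', 'unreachable' or 'ok' for a host/port pair."""
    try:
        conn = connect((host, port), timeout=timeout)
    except socket.gaierror:
        # Name resolution failed (must be caught before the broader OSError).
        return "dns"
    except OSError:
        # Resolved, but refused, timed out, or otherwise not reachable.
        return "unreachable"
    conn.close()
    return "ok"
```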

Ronelle Landy (rlandy) wrote on 2022-08-29:

#19

https://support.vexxhost.com/hc/en-us/requests/362706 - comments left there to close this out

Changed in tripleo:
status:	Triaged → Fix Released

periodic integration all branches RETRY Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org
