Master/Zed/Wallaby container build jobs are failing with error "Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org"

Bug #1997202 reported by Sandeep Yadav

Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

Description:

Master/Zed/Wallaby CentOS 9 container build jobs are failing with the error "Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org"

The issue started on 19th Nov 2022, as per the Wallaby build history:

https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-build-containers-centos-9-push-wallaby&skip=0

Error Snippet:

Two different errors, [1] and [2], are seen in the logs, both related to DNS resolution:

Curl error (6): Couldn't resolve host name for http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org

Errors during downloading metadata for repository 'quickstart-centos-crb':
  - Curl error (28): Timeout was reached for http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/CRB/x86_64/os/repodata/repomd.xml [Resolving timed out after 30000 milliseconds]

[1]
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-build-containers-centos-9-push-master/f9330ab/logs/container-builds/ced8339a-520e-4c47-8410-922e846e8836/base/os/gnocchi-base/gnocchi-base-build.log

~~~
[MIRROR] python-oslo-cache-lang-3.3.0-0.20221118113846.2bf1bfc.el9.noarch.rpm: Curl error (6): Couldn't resolve host name for http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org:8080/rdo/centos9-master/component/common/ad/a2/ada28b3ac6e6c19688dc6fd6a19630d1cf5e0530_fb321988/python-oslo-cache-lang-3.3.0-0.20221118113846.2bf1bfc.el9.noarch.rpm [Could not resolve host: mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org]
[FAILED] python-oslo-cache-lang-3.3.0-0.20221118113846.2bf1bfc.el9.noarch.rpm: No more mirrors to try - All mirrors were already tried without success

The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'dnf clean packages'.
Error: Error downloading packages:
  python-oslo-cache-lang-3.3.0-0.20221118113846.2bf1bfc.el9.noarch: Cannot download, all mirrors were already tried without success
time="2022-11-20T13:02:37-05:00" level=debug msg="Error building at step {Env:[TRIPLEO_ANSIBLE_REQ=/usr/share/openstack-tripleo-common-containers/container-images/kolla/tripleo-ansible-ee/requirements.yaml PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LANG=en_US.UTF-8 container=oci] Command:run Args:[dnf -y install gnocchi-common python3-rados python3-eventlet httpd librados2 mod_ssl python3-boto3 python3-ldappool python3-mod_wsgi && dnf clean all && rm -rf /var/cache/dnf] Flags:[] Attrs:map[] Message:RUN dnf -y install gnocchi-common python3-rados python3-eventlet httpd librados2 mod_ssl python3-boto3 python3-ldappool python3-mod_wsgi && dnf clean all && rm -rf /var/cache/dnf Original:RUN dnf -y install gnocchi-common python3-rados python3-eventlet httpd librados2 mod_ssl python3-boto3 python3-ldappool python3-mod_wsgi && dnf clean all && rm -rf /var/cache/dnf}: while running runtime: exit status 1"
~~~

[2] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-build-containers-centos-9-push-master/d7ce661/logs/container-builds/4dd7c1f5-dae1-4365-a16d-68a2853d67ae/base/qdrouterd/qdrouterd-build.log
~~~
Errors during downloading metadata for repository 'quickstart-centos-crb':
  - Curl error (28): Timeout was reached for http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos-stream/9-stream/CRB/x86_64/os/repodata/repomd.xml [Resolving timed out after 30000 milliseconds]
Error: Failed to download metadata for repo 'quickstart-centos-crb': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
time="2022-11-20T20:34:29-05:00" level=debug msg="Error building at step {Env:[TRIPLEO_ANSIBLE_REQ=/usr/share/openstack-tripleo-common-containers/container-images/kolla/tripleo-ansible-ee/requirements.yaml PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LANG=en_US.UTF-8 container=oci] Command:run Args:[dnf -y install cyrus-sasl-lib cyrus-sasl-plain qpid-dispatch-router qpid-dispatch-tools && dnf clean all && rm -rf /var/cache/dnf] Flags:[] Attrs:map[] Message:RUN dnf -y install cyrus-sasl-lib cyrus-sasl-plain qpid-dispatch-router qpid-dispatch-tools && dnf clean all && rm -rf /var/cache/dnf Original:RUN dnf -y install cyrus-sasl-lib cyrus-sasl-plain qpid-dispatch-router qpid-dispatch-tools && dnf clean all && rm -rf /var/cache/dnf}: while running runtime: exit status 1"
~~~

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

CentOS 8 container build jobs are passing; only CentOS 9 jobs are affected.

CentOS 8 Wallaby container build job history:

https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-build-containers-centos-8-quay-push-wallaby

CentOS 8 Train container build job history:

https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-build-containers-centos-8-quay-push-train

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

We see the same issue in upstream infra:

https://c6e907431413b276a63b-23a7172f5dc1e58445390ec61883d218.ssl.cf2.rackcdn.com/864994/8/gate/tripleo-ci-centos-9-content-provider/9d99392/logs/container-builds/522fec05-8e3f-4cc7-8901-45f0ca049eed/base/os/horizon/horizon-build.log

~~~
Installed size: 6.8 M
Downloading Packages:
[MIRROR] python3-PyMySQL-0.10.1-6.el9.noarch.rpm: Curl error (6): Couldn't resolve host name for http://mirror.bhs1.ovh.opendev.org/centos-stream/9-stream/AppStream/x86_64/os/Packages/python3-PyMySQL-0.10.1-6.el9.noarch.rpm [Could not resolve host: mirror.bhs1.ovh.opendev.org]
[MIRROR] sscg-3.0.0-5.el9.x86_64.rpm: Curl error (6): Couldn't resolve host name for http://mirror.bhs1.ovh.opendev.org/centos-stream/9-stream/AppStream/x86_64/os/Packages/sscg-3.0.0-5.el9.x86_64.rpm [Could not resolve host: mirror.bhs1.ovh.opendev.org]
[MIRROR] mod_ssl-2.4.53-7.el9.x86_64.rpm: Curl error (6): Couldn't resolve host name for http://mirror.bhs1.ovh.opendev.org/centos-stream/9-stream/AppStream/x86_64/os/Packages/mod_ssl-2.4.53-7.el9.x86_64.rpm [Could not resolve host: mirror.bhs1.ovh.opendev.org]
[MIRROR] sscg-3.0.0-5.el9.x86_64.rpm: Curl error (6): Couldn't resolve host name for http://mirror.bhs1.ovh.opendev.org/centos-stream/9-stream/AppStream/x86_64/os/Packages/sscg-3.0.0-5.el9.x86_64.rpm [Could not resolve host: mirror.bhs1.ovh.opendev.org]
[FAILED] sscg-3.0.0-5.el9.x86_64.rpm: No more mirrors to try - All mirrors were already tried without success

The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'dnf clean packages'.
Error: Error downloading packages:
  sscg-3.0.0-5.el9.x86_64: Cannot download, all mirrors were already tried without success
~~~

Another example:

https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_2ea/864469/1/gate/tripleo-ci-centos-9-content-provider/2ea04fe/logs/container-builds/3b86b709-7cd1-4929-bbc8-7ce5c35137a4/base/os/glance-api/glance-api-build.log

Revision history for this message
Takashi Kajinami (kajinamit) wrote :

So it seems there are two patterns of failure:

1. In OVH, resolv.conf has DNS servers set, but DNS name resolution still fails.

2. In rax/rackcdn, resolv.conf has no DNS servers set.
 https://zuul.opendev.org/t/openstack/build/22500697c5244d31b0687057040cf1af

Both patterns occur only occasionally, not consistently. We probably want to monitor the status before determining whether the problem is a clear blocker.
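A quick way to tell the two patterns apart from the collected logs is to check whether the captured resolv.conf contains any nameserver entries at all. A minimal diagnostic sketch (the function name and default path are hypothetical, not part of any job):

```shell
# Hedged sketch: distinguish the two failure patterns above by checking
# whether a captured resolv.conf contains any nameserver entries.
classify_resolv() {
    # Pattern 1: servers are set but resolution still fails (OVH case).
    # Pattern 2: no servers are set at all (rax/rackcdn case).
    if grep -q '^nameserver' "$1" 2>/dev/null; then
        echo "resolvers-present"
    else
        echo "no-resolvers"
    fi
}

# Example: classify the local resolv.conf
classify_resolv /etc/resolv.conf
```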

Revision history for this message
Clark Boylan (cboylan) wrote :

Don't confuse where we store job build logs with where jobs execute. We use OVH and rax/rackcdn-based Swift storage for build logs; jobs running in Rackspace may upload to OVH Swift and vice versa. All this to say that the location you see in the log URL should not have any impact on job behavior that may be build-provider specific. You need to check the build node names in the logs directly to determine where they ran.

Revision history for this message
Clark Boylan (cboylan) wrote :

Noted this on IRC but recording it here as well. At the beginning of the job we record the Ansible host info. This host info shows that 127.0.0.1 is properly set as a DNS resolver: https://zuul.opendev.org/t/openstack/build/22500697c5244d31b0687057040cf1af/log/zuul-info/host-info.primary.yaml#187-189. But then, after the job has failed and the job records logs, it grabs the resolv.conf file, which shows no resolvers are set: https://zuul.opendev.org/t/openstack/build/22500697c5244d31b0687057040cf1af/log/logs/undercloud/etc/resolv.conf.

This points to something updating the DNS configuration on the node while the job is running, and when it does so it appears to break DNS. For example, in the links above we see no resolver is set, possibly because NetworkManager is attempting to set that info via DHCP but Rackspace nodes do not do DHCP. Elsewhere, DHCP may be overriding DNS resolvers to use local cloud resolvers which are overwhelmed by the number of requests? In any case, this appears to be something in the job itself modifying the configuration the node starts with when the job begins.

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

I have proposed a test patch[1] to configure NetworkManager not to overwrite resolv.conf; with it, /etc/resolv.conf is no longer overwritten during the job run.[2]

But we still face the same DNS issue during the container image build.[3]

[1] https://review.opendev.org/c/openstack/tripleo-ci/+/865290/6/playbooks/tripleo-ci/test.yml#16

[2] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_e4d/865290/6/check/tripleo-ci-centos-9-content-provider/e4d0b10/logs/undercloud/etc/resolv.conf

~~~
nameserver 127.0.0.1
~~~

[3] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_e4d/865290/6/check/tripleo-ci-centos-9-content-provider/e4d0b10/logs/container-builds/0b349a4b-84f3-48f5-bd5a-b02eca881b57/base/os/horizon/horizon-build.log

~~~
[MIRROR] openstack-dashboard-23.1.0-0.20221107095346.c319118.el9.noarch.rpm: Curl error (28): Timeout was reached for http://mirror.mtl01.iweb.opendev.org:8080/rdo/centos9-master/component/ui/d8/5b/d85b61155590d61edbdc326dcd0e070c15f86fcb_6b89ae7d/openstack-dashboard-23.1.0-0.20221107095346.c319118.el9.noarch.rpm [Resolving timed out after 30000 milliseconds]
[FAILED] python-oslo-concurrency-lang-5.0.1-0.20220913081342.01cf2ff.el9.noarch.rpm: No more mirrors to try - All mirrors were already tried without success

The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'dnf clean packages'.
Error: Error downloading packages:
  python-oslo-concurrency-lang-5.0.1-0.20220913081342.01cf2ff.el9.noarch: Cannot download, all mirrors were already tried without success
~~~
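For reference, configuring NetworkManager not to overwrite resolv.conf is commonly done with a drop-in file; a minimal sketch under that assumption (the file name is hypothetical, and the actual test patch may use a different mechanism):

```ini
# /etc/NetworkManager/conf.d/99-no-resolv.conf (hypothetical file name)
[main]
# Do not let NetworkManager generate or touch /etc/resolv.conf
dns=none
rc-manager=unmanaged
```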

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

Adding one observation:

Since Friday 18th Nov, container builds in the periodic promotion pipeline running on the Vexxhost cloud have been taking longer than usual.

Those jobs used to finish within ~40 min; now they are taking longer than 1 hour.[1]

I noticed that podman/buildah/netavark and some other RPMs were bumped since Friday. Attaching a diff of the RPMs.[2]

[1] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-build-containers-ubi-9-quay-push-master&result=SUCCESS&skip=0
[2] https://www.diffchecker.com/9gfEnYH1/

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

Hello,

We tried excluding the latest netavark-1.3.* and aardvark-dns-1.3.* (as it is a dependency), and the container build is back to its original time of ~40 min (without the exclude it was taking ~20 min more):

https://review.rdoproject.org/zuul/build/6f230ae82e8a4f60aa7e38472de27572

We found a recent netavark issue[1] which may be affecting our jobs; the fix for this issue is not in netavark-1.3.*.

[1] https://github.com/containers/netavark/issues/491
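The exclude described above can be expressed as a dnf configuration fragment; a minimal sketch only, since the exact mechanism the job uses to apply the exclude may differ:

```ini
# /etc/dnf/dnf.conf (sketch; the job may apply the exclude elsewhere)
[main]
exclude=netavark-1.3.* aardvark-dns-1.3.*
```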

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

Thanks to Rabi for the patch[1] that switches to the host network for building containers (we already use the host network for running containers).

The master patch has merged and the Zed cherry-pick[2] is up.

[1] https://review.opendev.org/c/openstack/tripleo-common/+/865116
[2] https://review.opendev.org/c/openstack/tripleo-common/+/865391
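The idea behind the fix can be sketched in command-line form; note that tripleo-common drives buildah through its own wrapper rather than this exact command, and the image name and build context here are hypothetical:

```shell
# Sketch: build on the host network so the build skips the
# netavark-managed bridge (and its DNS path) entirely.
use_host_net=1
net_args=""
[ "$use_host_net" = "1" ] && net_args="--net host"

# Image tag "mycontainer" and context "." are placeholders.
build_cmd="buildah bud $net_args -t mycontainer ."
echo "$build_cmd"
```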

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :
Changed in tripleo:
status: Triaged → Fix Released