I have dug deeper into one of the instances of this issue: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_280/733407/1/check/tripleo-ci-centos-7-containers-multinode-queens/2803d89/logs/undercloud/var/log/extra/logstash.txt
In this case, we have a job running on a VM on OVH, using mirror.bhs1.ovh.opendev.org as the AFS mirror. It fails on puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch at 22:37:09. Looking at the Apache access logs, and trying to find that file accessed from that mirror, we get the following instances for June 3rd:
158.69.73.218 - - [03/Jun/2020:02:35:47 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 304 - "-" "urlgrabber/3.10 yum/3.4.3"
158.69.73.218 - - [03/Jun/2020:04:05:18 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 304 - "-" "urlgrabber/3.10 yum/3.4.3"
158.69.73.218 - - [03/Jun/2020:13:58:15 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 200 256812 "-" "urlgrabber/3.10 yum/3.4.3"
158.69.73.218 - - [03/Jun/2020:22:40:59 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 304 - "-" "urlgrabber/3.10 yum/3.4.3"
158.69.73.218 - - [03/Jun/2020:23:46:47 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 304 - "-" "urlgrabber/3.10 yum/3.4.3"
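For reference, here is a minimal sketch of how entries like the ones above could be pulled out of the access log with a script. This is my own illustration, not something used in the investigation; the log path is an assumption, and the package name is the one from the failure above:

import re

LOG_FILE = "/var/log/apache2/mirror_access.log"   # assumed path, adjust per mirror host
PACKAGE = "puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm"

# Common/combined log format: client, identity, user, timestamp, request, status, bytes.
LINE_RE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+)'
)

with open(LOG_FILE) as fh:
    for line in fh:
        if PACKAGE not in line:
            continue
        m = LINE_RE.match(line)
        if m:
            # 304 means the client revalidated an existing copy; 200 means a full transfer.
            print(m.group("ts"), m.group("status"), m.group("bytes"))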
So when yum failed to install, and later when it worked, it was receiving the same file from the AFS mirror. My theory is that something in the Apache cache did not work as expected, leading to a corrupted file being served. In that job, retrying worked fine and masked the error.
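One way to test that theory would be to fetch the same RPM from the mirror and from the upstream source and compare digests. A minimal sketch follows; it is my own addition, and both base URLs are assumptions that would need to be adjusted to the actual vhost/port serving /centos7-queens and to the real upstream location:

import hashlib
import urllib.request

RPM_PATH = ("/centos7-queens/d3/b5/"
            "d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/"
            "puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm")
MIRROR = "http://mirror.bhs1.ovh.opendev.org" + RPM_PATH   # assumed base URL
UPSTREAM = "https://trunk.rdoproject.org" + RPM_PATH        # assumed upstream base URL

def sha256_of(url: str) -> str:
    """Stream the file at url and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with urllib.request.urlopen(url) as resp:
        for chunk in iter(lambda: resp.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

mirror_sum = sha256_of(MIRROR)
upstream_sum = sha256_of(UPSTREAM)
print("mirror:  ", mirror_sum)
print("upstream:", upstream_sum)
print("match" if mirror_sum == upstream_sum else "MISMATCH - possible cache corruption")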
Is there a way to implement a similar retry mechanism for the jobs that are currently affected by the issue? It seems to be a common caching issue, happening on both OpenDev and RDO Infra.
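As a starting point, something along these lines is the kind of retry I have in mind: re-run the yum transaction a few times, cleaning the local cache between attempts so a corrupted download is fetched again. This is only a sketch of the idea, not an existing OpenDev/RDO mechanism:

import subprocess
import time

def yum_install_with_retries(package: str, attempts: int = 3, delay: int = 30) -> bool:
    for attempt in range(1, attempts + 1):
        result = subprocess.run(["yum", "-y", "install", package])
        if result.returncode == 0:
            return True
        print(f"attempt {attempt} failed (rc={result.returncode}); cleaning cache and retrying")
        # Drop cached packages and metadata so the next attempt downloads fresh copies.
        subprocess.run(["yum", "clean", "all"])
        time.sleep(delay)
    return False

if __name__ == "__main__":
    ok = yum_install_with_retries("puppet-tripleo")
    raise SystemExit(0 if ok else 1)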