Comment 5 for bug 1882664

Javier Peña (jpena-c) wrote:

I have dug deeper into one of the instances of this issue: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_280/733407/1/check/tripleo-ci-centos-7-containers-multinode-queens/2803d89/logs/undercloud/var/log/extra/logstash.txt

In this case, the job runs on an OVH VM and uses mirror.bhs1.ovh.opendev.org as its AFS mirror. The install fails on puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch at 22:37:09. Searching that mirror's Apache access logs for requests for that file, we find the following entries for June 3rd (a sketch of how this lookup can be scripted follows the excerpts):

158.69.73.218 - - [03/Jun/2020:02:35:47 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 304 - "-" "urlgrabber/3.10 yum/3.4.3"

158.69.73.218 - - [03/Jun/2020:04:05:18 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 304 - "-" "urlgrabber/3.10 yum/3.4.3"

158.69.73.218 - - [03/Jun/2020:13:58:15 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 200 256812 "-" "urlgrabber/3.10 yum/3.4.3"

158.69.73.218 - - [03/Jun/2020:22:40:59 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 304 - "-" "urlgrabber/3.10 yum/3.4.3"

158.69.73.218 - - [03/Jun/2020:23:46:47 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 304 - "-" "urlgrabber/3.10 yum/3.4.3"
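For reference, this kind of lookup can be scripted. Below is a minimal Python sketch; the log path is an assumption for illustration, since the actual vhost log location on the mirror may differ.

import re

# Hypothetical log path, for illustration only.
LOG_PATH = "/var/log/apache2/mirror_access.log"
PACKAGE = "puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm"

# Minimal Apache combined-log pattern: client IP, timestamp, request, status, size.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

with open(LOG_PATH) as log:
    for line in log:
        if PACKAGE not in line:
            continue
        m = LINE_RE.match(line)
        if m:
            print(m.group("ts"), m.group("status"), m.group("size"), m.group("req"))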

So yum was served the same file from the AFS mirror both when the install failed and when it later succeeded. My theory is that the Apache cache on the mirror misbehaved and delivered a corrupted copy of the file. In that job, the retry succeeded and masked the error.
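One way to test this theory on an affected node would be to hash the RPM that yum received and compare it against the digest published in the repo metadata. A minimal sketch, where the cache path and expected digest are placeholders rather than values taken from the job logs:

import hashlib

# Hypothetical values for illustration; the real expected digest would come from
# the repository metadata (repodata/*-primary.xml), not be hardcoded here.
RPM_PATH = "/var/cache/yum/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm"
EXPECTED_SHA256 = "<digest from repodata>"

def sha256sum(path, chunk_size=1024 * 1024):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

actual = sha256sum(RPM_PATH)
print("OK" if actual == EXPECTED_SHA256 else "MISMATCH: got " + actual)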

Is there a way to implement a similar retry mechanism for the jobs currently affected by this issue? It looks like a common caching problem, seen on both the OpenDev and RDO infrastructure.
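As a sketch of what such a retry could look like at the package-install step: this is not how the affected jobs currently install packages, and the command and timings are assumptions. It also drops the locally cached copy before retrying, in case the download itself was corrupted.

import subprocess
import time

def install_with_retries(packages, attempts=3, delay=30):
    """Retry 'yum install' a few times, clearing cached packages between tries."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(
            ["yum", "-y", "install"] + list(packages),
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return True
        print("Attempt %d failed: %s" % (attempt, result.stderr.strip()[:200]))
        # Drop any locally cached copies before retrying, in case the failure
        # was caused by a corrupted download.
        subprocess.run(["yum", "clean", "packages"], check=False)
        time.sleep(delay)
    return False

if __name__ == "__main__":
    install_with_retries(["puppet-tripleo"])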