error: unpacking of archive failed on file /usr/share/ansible/plugins/modules/pacemaker_cluster.py;5eded785: cpio: open failed - Inappropriate ioctl for device

Bug #1882664 reported by Pooja Jadhav
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: yatin

Bug Description

The CentOS 8 master job is failing with the error log below:

2020-06-09 00:28:33.381 |
2020-06-09 00:28:33.437 | Installing : pacemaker-remote-2.0.2-3.el8_1.2.x86_64 302/317
2020-06-09 00:28:33.468 | Running scriptlet: pacemaker-remote-2.0.2-3.el8_1.2.x86_64 302/317
2020-06-09 00:28:33.481 | Installing : ipa-client-4.8.0-10.module_el8.1.0+255+aa2bdb39. 303/317
2020-06-09 00:28:33.499 | Running scriptlet: ipa-client-4.8.0-10.module_el8.1.0+255+aa2bdb39. 303/317
2020-06-09 00:28:33.499 | Installing : ansible-pacemaker-1.0.4-0.20200417100509.5847167 304/317
2020-06-09 00:28:33.499 | Error unpacking rpm package ansible-pacemaker-1.0.4-0.20200417100509.5847167.el8.noarch
2020-06-09 00:28:33.503 |
2020-06-09 00:28:33.504 | Installing : crudini-0.9.3-1.el8.noarch 305/317
2020-06-09 00:28:33.504 | error: unpacking of archive failed on file /usr/share/ansible/plugins/modules/pacemaker_cluster.py;5eded785: cpio: open failed - Inappropriate ioctl for device
2020-06-09 00:28:33.504 | error: ansible-pacemaker-1.0.4-0.20200417100509.5847167.el8.noarch: install failed
2020-06-09 00:28:33.504 |

It failed to install ansible-pacemaker-1.0.4-0.20200417100509.5847167.el8.noarch

Reference links:

https://review.rdoproject.org/zuul/builds?pipeline=openstack-periodic-master&job_name=periodic-tripleo-centos-8-buildimage-overcloud-full-master

https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-8-buildimage-overcloud-full-master/419668a/build.log

https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-8-buildimage-overcloud-full-master/068b878/build.log

Revision history for this message
chandan kumar (chkumar246) wrote :
Revision history for this message
Michele Baldessari (michele) wrote :

$ curl -O https://trunk.rdoproject.org/centos8/component/tripleo/current/ansible-pacemaker-1.0.4-0.20200417100509.5847167.el8.noarch.rpm
  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
100 19552 100 19552 0 0 34362 0 --:--:-- --:--:-- --:--:-- 34301
$ rpm2cpio ansible-pacemaker-1.0.4-0.20200417100509.5847167.el8.noarch.rpm|cpio -id
78 blocks

The package extracts cleanly here, so it looks like either the download got corrupted or the VM's disk was acting up?

Revision history for this message
Alfredo Moralejo (amoralej) wrote :

After some investigation, here is what we know so far. We've found several occurrences of this issue:

9-jun https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_b14/729465/37/gate/tripleo-ci-centos-7-containers-multinode-train/b14d06f/logs/undercloud/var/log/tripleo-container-image-prepare.log

9-jun https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_091/733471/3/gate/tripleo-ci-centos-7-containers-multinode-train/0915037/logs/undercloud/var/log/tripleo-container-image-prepare.log

10-jun https://675ed5a5fc3b18055111-587e1fde8c10362d45d985729e2fba7d.ssl.cf1.rackcdn.com/732618/2/check/tripleo-buildimage-overcloud-full-centos-8/5600fa5/build.log

Looking in logstash:

http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22unpacking%20of%20archive%20failed%20on%5C%22

Some findings:

- The first occurrence I see is on 2020-06-03 22:37:09.
- A similar error happens in different jobs: not only in image build jobs but also in container build ones, and even in other regular oooq jobs. We don't see it in non-oooq jobs, although that may just be due to the lack of p-o-i and packstack job logs in logstash.
- It affects different packages, although some seem more likely to hit it than others.
- It affects both CentOS 7 and CentOS 8.
- It affects jobs running in different cloud providers.
- In most cases the jobs have retry logic: after failing to install, they retry the same package from the same repo and it succeeds, so the job finishes fine and the error goes unnoticed. Image build jobs have no retry logic, which is what makes the failures more apparent there.
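The retry behavior described in the last point can be sketched as a small shell wrapper. This is purely illustrative (the helper name is made up; the real oooq retry logic lives in the job playbooks and may differ):

```shell
#!/bin/sh
# Hypothetical retry wrapper illustrating the behavior described above:
# re-run the given command up to N times, succeeding on the first pass.
retry_install() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed, retrying..." >&2
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example (image build jobs have no equivalent of this):
# retry_install 3 yum -y install ansible-pacemaker
```

With such a wrapper a transiently corrupted download is simply re-fetched and installed on the next attempt, which is exactly how the error stays unnoticed in jobs that retry.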

Similar errors have been observed in fedora infra and were related to packages caching:

https://pagure.io/koji/issue/1418
https://pagure.io/koji/issue/290

Has there been some recent change to repo mirroring or caching in infra that could be related to this issue?

Revision history for this message
Javier Peña (jpena-c) wrote :

I have dug deeper into one of the instances of this issue: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_280/733407/1/check/tripleo-ci-centos-7-containers-multinode-queens/2803d89/logs/undercloud/var/log/extra/logstash.txt

In this case, we have a job running on a VM at OVH, using mirror.bhs1.ovh.opendev.org as the AFS mirror. It fails on puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch at 22:37:09. Looking at the Apache access logs for that file on that mirror, we get the following hits for June 3rd:

158.69.73.218 - - [03/Jun/2020:02:35:47 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 304 - "-" "urlgrabber/3.10 yum/3.4.3"

158.69.73.218 - - [03/Jun/2020:04:05:18 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 304 - "-" "urlgrabber/3.10 yum/3.4.3"

158.69.73.218 - - [03/Jun/2020:13:58:15 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 200 256812 "-" "urlgrabber/3.10 yum/3.4.3"

158.69.73.218 - - [03/Jun/2020:22:40:59 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 304 - "-" "urlgrabber/3.10 yum/3.4.3"

158.69.73.218 - - [03/Jun/2020:23:46:47 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 304 - "-" "urlgrabber/3.10 yum/3.4.3"

So both when yum failed to install the package and later when it worked, it was receiving the same file from the AFS mirror. My theory is that something in the Apache cache may not have worked as expected, leading to a corrupted file. In that job, retrying worked fine and masked the error.
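One way to confirm this kind of cache corruption on a live node would be to compare the downloaded file's sha256 with the checksum published in the repository metadata (repodata/primary.xml carries one per package). A minimal sketch with a hypothetical helper name (`check_rpm`); the file names are only examples:

```shell
#!/bin/sh
# Hypothetical helper: flag a cached/downloaded RPM whose sha256 does
# not match the checksum published in the repository metadata.
check_rpm() {
  file=$1
  expected=$2
  actual=$(sha256sum "$file" | awk '{print $1}')
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file matches the published checksum"
  else
    echo "CORRUPT: $file ($actual != $expected)" >&2
    return 1
  fi
}

# rpm itself can also verify a package's internal digests:
# rpm -K --nosignature puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm
```

A mismatch here would point at the mirror/cache rather than at rpm or the VM's disk.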

Is there a way to implement a similar retry mechanism for the jobs that are currently affected? This looks like a common caching issue, happening on both the OpenDev and RDO infrastructures.

Revision history for this message
yatin (yatinkarel) wrote :

Posting the findings relevant to the overcloud-full image; the container image issue can be debugged separately:

- The image is built at a temporary location under /tmp (/tmp/dib_build.xxxxx/mnt).

- systemd-tmpfiles is configured to clean up temporary files as shown below (only pasting the relevant paths), starting 15 minutes after the system boots [1].
$ sudo grep "d$" /usr/lib/tmpfiles.d/tmp.conf
q /tmp 1777 root root 10d
q /var/tmp 1777 root root 30d

- In CI, packages are being installed around that time during image building, so the old directories get cleaned up by the tmpfiles configuration. The issue is random: if the cleanup runs before image building starts, or during its initial phase, the issue is not seen.

- We tried a workaround, https://review.rdoproject.org/r/#/c/27998/5/playbooks/tmp.yaml, which seems to work fine. Until we have a proper solution we can apply the workaround in the job.
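The cleanup can also be suppressed declaratively. As a sketch of one possible shape for such a workaround (this is not necessarily what the tmp.yaml playbook does), a tmpfiles.d drop-in can exclude the dib build tree from cleaning:

```
# /etc/tmpfiles.d/dib-build.conf (illustrative drop-in, not the actual
# workaround): the 'x' type tells systemd-tmpfiles to skip matching
# paths during cleanup; globs are allowed.
x /tmp/dib_build.*
```

Per tmpfiles.d(5), paths matched by an 'x' line are ignored during cleaning regardless of the age rules configured for /tmp.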

All of the bugs below are due to the same issue:
https://bugs.launchpad.net/tripleo/+bug/1873770
https://bugs.launchpad.net/tripleo/+bug/1867602
https://bugs.launchpad.net/tripleo/+bug/1879766

[1] systemctl cat systemd-tmpfiles-clean.timer | grep OnBootSec
OnBootSec=15min

Revision history for this message
yatin (yatinkarel) wrote :

<< We tried a workaround, https://review.rdoproject.org/r/#/c/27998/5/playbooks/tmp.yaml, which seems to work fine. Until we have a proper solution we can apply the workaround in the job.
The workaround https://review.rdoproject.org/r/#/c/28041/ has been applied to the periodic jobs until we have a proper fix for the issue. Let's see how the results go in the next periodic pipeline runs for ussuri and master.

Changed in tripleo:
status: Triaged → In Progress
assignee: nobody → yatin (yatinkarel)
Revision history for this message
yatin (yatinkarel) wrote :

<<< The workaround https://review.rdoproject.org/r/#/c/28041/ has been applied to the periodic jobs until we have a proper fix for the issue. Let's see how the results go in the next periodic pipeline runs for ussuri and master.

The results are good, so the workaround has been moved to the upstream jobs (https://review.opendev.org/#/c/738469) and reverted from RDO (https://review.rdoproject.org/r/#/c/28290/).

Changed in tripleo:
milestone: victoria-1 → victoria-3
Changed in tripleo:
status: In Progress → Fix Released