error: unpacking of archive failed on file /usr/share/ansible/plugins/modules/pacemaker_cluster.py;5eded785: cpio: open failed - Inappropriate ioctl for device

Bug #1882664 reported by Pooja Jadhav
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: yatin

Bug Description

The CentOS 8 master job is failing with the error log below:

2020-06-09 00:28:33.381 |
2020-06-09 00:28:33.437 | Installing : pacemaker-remote-2.0.2-3.el8_1.2.x86_64 302/317
2020-06-09 00:28:33.468 | Running scriptlet: pacemaker-remote-2.0.2-3.el8_1.2.x86_64 302/317
2020-06-09 00:28:33.481 | Installing : ipa-client-4.8.0-10.module_el8.1.0+255+aa2bdb39. 303/317
2020-06-09 00:28:33.499 | Running scriptlet: ipa-client-4.8.0-10.module_el8.1.0+255+aa2bdb39. 303/317
2020-06-09 00:28:33.499 | Installing : ansible-pacemaker-1.0.4-0.20200417100509.5847167 304/317
2020-06-09 00:28:33.499 | Error unpacking rpm package ansible-pacemaker-1.0.4-0.20200417100509.5847167.el8.noarch
2020-06-09 00:28:33.503 |
2020-06-09 00:28:33.504 | Installing : crudini-0.9.3-1.el8.noarch 305/317
2020-06-09 00:28:33.504 | error: unpacking of archive failed on file /usr/share/ansible/plugins/modules/pacemaker_cluster.py;5eded785: cpio: open failed - Inappropriate ioctl for device
2020-06-09 00:28:33.504 | error: ansible-pacemaker-1.0.4-0.20200417100509.5847167.el8.noarch: install failed
2020-06-09 00:28:33.504 |

It failed to install ansible-pacemaker-1.0.4-0.20200417100509.5847167.el8.noarch

Reference links:

https://review.rdoproject.org/zuul/builds?pipeline=openstack-periodic-master&job_name=periodic-tripleo-centos-8-buildimage-overcloud-full-master

https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-8-buildimage-overcloud-full-master/419668a/build.log

https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-8-buildimage-overcloud-full-master/068b878/build.log

Revision history for this message
chandan kumar (chkumar246) wrote :
Revision history for this message
Michele Baldessari (michele) wrote :

$ curl -O https://trunk.rdoproject.org/centos8/component/tripleo/current/ansible-pacemaker-1.0.4-0.20200417100509.5847167.el8.noarch.rpm
  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
100 19552 100 19552 0 0 34362 0 --:--:-- --:--:-- --:--:-- 34301
$ rpm2cpio ansible-pacemaker-1.0.4-0.20200417100509.5847167.el8.noarch.rpm|cpio -id
78 blocks

The package extracts cleanly here, so it looks like either the download got corrupted or the VM's disk was acting up?

Revision history for this message
Alfredo Moralejo (amoralej) wrote :

After some investigation, here is what we know so far. We've found several occurrences of this issue:

9-jun https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_b14/729465/37/gate/tripleo-ci-centos-7-containers-multinode-train/b14d06f/logs/undercloud/var/log/tripleo-container-image-prepare.log

9-jun https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_091/733471/3/gate/tripleo-ci-centos-7-containers-multinode-train/0915037/logs/undercloud/var/log/tripleo-container-image-prepare.log

10-jun https://675ed5a5fc3b18055111-587e1fde8c10362d45d985729e2fba7d.ssl.cf1.rackcdn.com/732618/2/check/tripleo-buildimage-overcloud-full-centos-8/5600fa5/build.log

Looking in logstash:

http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22unpacking%20of%20archive%20failed%20on%5C%22

Some findings:

- The first occurrence I see is on 2020-06-03 22:37:09.
- A similar error happens in different jobs: not only in image build jobs but also in container build ones, and even in other regular oooq jobs. We don't see it in non-oooq jobs, although that may just be due to the lack of p-o-i and packstack job logs in logstash.
- It affects different packages, although some seem more likely to hit it than others.
- It affects both CentOS 7 and CentOS 8.
- It affects jobs running in different cloud providers.
- In most cases the jobs have retry logic: after failing to install, they retry the same package from the same repo and it succeeds, so the job finishes fine and the error goes unnoticed. Image build jobs have no retry logic, which is what makes the failures more apparent there.
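The retry behavior described in the last point can be sketched as a small shell wrapper. This is purely illustrative (the helper name is made up; the real oooq retry logic lives in the job playbooks and may differ):

```shell
#!/bin/sh
# Hypothetical retry wrapper illustrating the behavior described above:
# re-run the given command up to N times, succeeding on the first pass.
retry_install() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed, retrying..." >&2
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example (image build jobs have no equivalent of this):
# retry_install 3 yum -y install ansible-pacemaker
```

With such a wrapper a transiently corrupted download is simply re-fetched and installed on the next attempt, which is exactly how the error stays unnoticed in jobs that retry.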

Similar errors have been observed in fedora infra and were related to packages caching:

https://pagure.io/koji/issue/1418
https://pagure.io/koji/issue/290

Has there been some recent change to repo mirroring or caching in infra that could be related to this issue?

Revision history for this message
Javier Peña (jpena-c) wrote :

I have dug deeper into one of the instances of this issue: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_280/733407/1/check/tripleo-ci-centos-7-containers-multinode-queens/2803d89/logs/undercloud/var/log/extra/logstash.txt

In this case, we have a job running on a VM at OVH, using mirror.bhs1.ovh.opendev.org as the AFS mirror. It fails on puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch at 22:37:09. Looking at the Apache access logs for that file on that mirror, we get the following hits for June 3rd:

158.69.73.218 - - [03/Jun/2020:02:35:47 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 304 - "-" "urlgrabber/3.10 yum/3.4.3"

158.69.73.218 - - [03/Jun/2020:04:05:18 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 304 - "-" "urlgrabber/3.10 yum/3.4.3"

158.69.73.218 - - [03/Jun/2020:13:58:15 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 200 256812 "-" "urlgrabber/3.10 yum/3.4.3"

158.69.73.218 - - [03/Jun/2020:22:40:59 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 304 - "-" "urlgrabber/3.10 yum/3.4.3"

158.69.73.218 - - [03/Jun/2020:23:46:47 +0000] "GET /centos7-queens/d3/b5/d3b551e5f5f77e6304f2d82ef35d95816cece3f8_749827fe/puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm HTTP/1.1" 304 - "-" "urlgrabber/3.10 yum/3.4.3"

So both when yum failed to install the package and later when it worked, it was receiving the same file from the AFS mirror. My theory is that something in the Apache cache may not have worked as expected, leading to a corrupted file. In that job, retrying worked fine and masked the error.
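One way to confirm this kind of cache corruption on a live node would be to compare the downloaded file's sha256 with the checksum published in the repository metadata (repodata/primary.xml carries one per package). A minimal sketch with a hypothetical helper name (`check_rpm`); the file names are only examples:

```shell
#!/bin/sh
# Hypothetical helper: flag a cached/downloaded RPM whose sha256 does
# not match the checksum published in the repository metadata.
check_rpm() {
  file=$1
  expected=$2
  actual=$(sha256sum "$file" | awk '{print $1}')
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file matches the published checksum"
  else
    echo "CORRUPT: $file ($actual != $expected)" >&2
    return 1
  fi
}

# rpm itself can also verify a package's internal digests:
# rpm -K --nosignature puppet-tripleo-8.6.0-0.20200530225056.6124a33.el7.noarch.rpm
```

A mismatch here would point at the mirror/cache rather than at rpm or the VM's disk.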

Is there a way to implement a similar retry mechanism for the jobs that are currently affected? This looks like a common caching issue, happening on both the OpenDev and RDO infrastructures.

Revision history for this message
yatin (yatinkarel) wrote :

Posting the findings relevant to the overcloud-full image; the container image issue can be debugged separately:

- The image is built at a temporary location under /tmp (/tmp/dib_build.xxxxx/mnt).

- systemd-tmpfiles is configured to clean up temporary files as shown below (only pasting the relevant paths), starting 15 minutes after the system boots [1].
$ sudo grep "d$" /usr/lib/tmpfiles.d/tmp.conf
q /tmp 1777 root root 10d
q /var/tmp 1777 root root 30d

- In CI, packages are being installed around that time during image building, so the old directories get cleaned up by the tmpfiles configuration. The issue is random: if the cleanup runs before image building starts, or during its initial phase, the issue is not seen.

- We tried a workaround, https://review.rdoproject.org/r/#/c/27998/5/playbooks/tmp.yaml, which seems to work fine. Until we have a proper solution we can apply the workaround in the job.
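The cleanup can also be suppressed declaratively. As a sketch of one possible shape for such a workaround (this is not necessarily what the tmp.yaml playbook does), a tmpfiles.d drop-in can exclude the dib build tree from cleaning:

```
# /etc/tmpfiles.d/dib-build.conf (illustrative drop-in, not the actual
# workaround): the 'x' type tells systemd-tmpfiles to skip matching
# paths during cleanup; globs are allowed.
x /tmp/dib_build.*
```

Per tmpfiles.d(5), paths matched by an 'x' line are ignored during cleaning regardless of the age rules configured for /tmp.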

All of the bugs below are due to the same issue:
https://bugs.launchpad.net/tripleo/+bug/1873770
https://bugs.launchpad.net/tripleo/+bug/1867602
https://bugs.launchpad.net/tripleo/+bug/1879766

[1] systemctl cat systemd-tmpfiles-clean.timer | grep OnBootSec
OnBootSec=15min

Revision history for this message
yatin (yatinkarel) wrote :

<< We tried a workaround, https://review.rdoproject.org/r/#/c/27998/5/playbooks/tmp.yaml, which seems to work fine. Until we have a proper solution we can apply the workaround in the job.
The workaround https://review.rdoproject.org/r/#/c/28041/ has been applied to the periodic jobs until we have a proper fix for the issue. Let's see how the results go in the next periodic pipeline runs for ussuri and master.

Changed in tripleo:
status: Triaged → In Progress
assignee: nobody → yatin (yatinkarel)
Revision history for this message
yatin (yatinkarel) wrote :

<<< The workaround https://review.rdoproject.org/r/#/c/28041/ has been applied to the periodic jobs until we have a proper fix for the issue. Let's see how the results go in the next periodic pipeline runs for ussuri and master.

The results are good, so the workaround has been moved to the upstream jobs (https://review.opendev.org/#/c/738469) and reverted from RDO (https://review.rdoproject.org/r/#/c/28290/).

Changed in tripleo:
milestone: victoria-1 → victoria-3
Changed in tripleo:
status: In Progress → Fix Released