podman 1.5.1 race makes some jobs fail unexpectedly in CI

Bug #1856324 reported by Damien Ciabrini
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Damien Ciabrini

Bug Description

The version of podman that we currently consume in CI [1] has a race
that manifests under heavy disk I/O load. When that race occurs and a
container <C> exits, podman does not clean up everything in the
container storage as expected, so it becomes impossible to create a
new container <C2> with the same name as <C>.

For example, we see the following race manifest while
container-puppet.py regenerates the configuration files of
all services, after having already run once successfully [2]:

2019-12-12 20:15:20 | "2019-12-12 20:14:45,542 ERROR: 40368 -- ['/usr/bin/podman', 'run', '--user', '0', '--name', u'container-puppet-memcached', '--env', u'PUPPET_TAGS=file,file_line,concat,augeas,cron,file', '--env', u'NAME=memcached', '--env', u'HOSTNAME=centos-7-rax-ord-0013361579', '--env', 'NO_ARCHIVE=', '--env', 'STEP=6', '--env', 'NET_HOST=true', '--env', 'DEBUG=True', '--volume', '/etc/localtime:/etc/localtime:ro', '--volume', '/tmp/tmp9FcIfj:/etc/config.pp:ro', '--volume', '/etc/puppet/:/tmp/puppet-etc/:ro', '--volume', '/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume', '/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume', '/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume', '/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume', '/var/lib/config-data:/var/lib/config-data/:rw', '--volume', '/var/lib/container-puppet/puppetlabs/facter.conf:/etc/puppetlabs/facter/facter.conf:ro', '--volume', '/var/lib/container-puppet/puppetlabs/:/opt/puppetlabs/:ro', '--volume', '/dev/log:/dev/log:rw', '--rm', '--log-driver', 'k8s-file', '--log-opt', u'path=/var/log/containers/stdouts/container-puppet-memcached.log', '--security-opt', 'label=disable', '--volume', '/usr/share/openstack-puppet/modules/:/usr/share/openstack-puppet/modules/:ro', '--entrypoint', '/var/lib/container-puppet/container-puppet.sh', '--net', 'host', '--volume', '/etc/hosts:/etc/hosts:ro', '--volume', '/var/lib/container-puppet/container-puppet.sh:/var/lib/container-puppet/container-puppet.sh:ro', u'192.168.24.1:8787/tripleotrain/centos-binary-memcached:45b91c3f09b20dcdbf685ff265919a7a8906698d_1b3fd27f-updated-20191212184057'] run failed after Error: error creating container storage: the container name \"container-puppet-memcached\" is already in use by \"fd8614e0e22782af1921bb5a08b41398248c9b5d2dfd1b9113f66352097f9db8\". 
You have to remove that container to be able to reuse that name.: that name is already in use",

After the first successful run, no container should remain, because
container-puppet creates them with "--rm". Still, sometimes a new
container with the same name cannot be recreated, even though no
container with that name shows up in "podman ps -a" [3].

To work around this race, podman has a special command, "podman rm
--storage", that can force-clean any leftover content in the
container-storage layer [4]. But even then, the command sometimes
fails, because the overlayfs mount created with the original
container was improperly unmounted on container exit. For instance,
with debug mode enabled in podman:

time="2019-12-13T09:54:53Z" level=debug msg="New container created \"52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230\""
time="2019-12-13T09:54:53Z" level=debug msg="container \"52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230\" has CgroupParent \"machine.slice/libpod-52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230.scope\""
time="2019-12-13T09:54:53Z" level=debug msg="Not attaching to stdin"
time="2019-12-13T09:54:53Z" level=debug msg="overlay: mount_data=lowerdir=/var/lib/containers/storage/overlay/l/PXENLYOWV4LLIA72N6XBX7B6BZ:/var/lib/containers/storage/overlay/l/2LFI74ZECHEFJCII3OGJGWQ2NH:/var/lib/containers/storage/overlay/l/SKCDFS7CPZHE5UVCWPH66IMW5Q:/var/lib/containers/storage/overlay/l/UV2DAVP6DOLCTTOQ2QC37WOKPS:/var/lib/containers/storage/overlay/l/3RKD3IBFKCXALJZP3D4OQ7QWRB:/var/lib/containers/storage/overlay/l/EKWSVNAGJD3ZNLTMAFC7R3QHJD,upperdir=/var/lib/containers/storage/overlay/d96d4811c6ec0ebea8623153b59fd16666dead08aa7b17dcb3de0247a3b7b9d3/diff,workdir=/var/lib/containers/storage/overlay/d96d4811c6ec0ebea8623153b59fd16666dead08aa7b17dcb3de0247a3b7b9d3/work,context=\"system_u:object_r:container_file_t:s0:c177,c448\""
time="2019-12-13T09:54:53Z" level=debug msg="mounted container \"52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230\" at \"/var/lib/containers/storage/overlay/d96d4811c6ec0ebea8623153b59fd16666dead08aa7b17dcb3de0247a3b7b9d3/merged\""
time="2019-12-13T09:54:53Z" level=debug msg="Created root filesystem for container 52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230 at /var/lib/containers/storage/overlay/d96d4811c6ec0ebea8623153b59fd16666dead08aa7b17dcb3de0247a3b7b9d3/merged"
time="2019-12-13T09:54:53Z" level=debug msg="adding entry to /etc/passwd for non existent default user"
time="2019-12-13T09:54:53Z" level=debug msg="/etc/system-fips does not exist on host, not mounting FIPS mode secret"
time="2019-12-13T09:54:53Z" level=warning msg="User mount overriding libpod mount at \"/etc/hosts\""
time="2019-12-13T09:54:53Z" level=debug msg="Setting CGroups for container 52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230 to machine.slice:libpod:52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230"
time="2019-12-13T09:54:53Z" level=debug msg="reading hooks from /usr/share/containers/oci/hooks.d"
time="2019-12-13T09:54:53Z" level=debug msg="added hook /usr/share/containers/oci/hooks.d/oci-systemd-hook.json"
time="2019-12-13T09:54:53Z" level=debug msg="hook oci-systemd-hook.json did not match"
time="2019-12-13T09:54:53Z" level=debug msg="reading hooks from /etc/containers/oci/hooks.d"
time="2019-12-13T09:54:53Z" level=debug msg="Created OCI spec for container 52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230 at /var/lib/containers/storage/overlay-containers/52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230/userdata/config.json"
time="2019-12-13T09:54:53Z" level=debug msg="/usr/bin/conmon messages will be logged to syslog"
# conmon was launched, which then spawns runc. runc gets a single path where everything should be overlaid and bind-mounted;
# this appears to be the path passed after -b.
# There must also be a config.json file that specifies further settings for runc, since presumably not everything is passed via args.
time="2019-12-13T09:54:53Z" level=debug msg="running conmon: /usr/bin/conmon" args="[--api-version 1 -s -c 52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230 -u 52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230 -r /usr/bin/runc -b /var/lib/containers/storage/overlay-containers/52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230/userdata -p /var/run/containers/storage/overlay-containers/52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230/userdata/pidfile -l k8s-file:/var/log/containers/stdouts/container-puppet-heat_api.log --exit-dir /var/run/libpod/exits --socket-dir-path /var/run/libpod/socket --log-level debug --syslog --conmon-pidfile /var/run/containers/storage/overlay-containers/52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /var/run/containers/storage --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /var/run/libpod --exit-command-arg --runtime --exit-command-arg runc --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg --syslog --exit-command-arg true --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230]"
time="2019-12-13T09:54:53Z" level=info msg="Running conmon under slice machine.slice and unitName libpod-conmon-52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230.scope"
time="2019-12-13T09:54:53Z" level=debug msg="Received: -1"
time="2019-12-13T09:54:53Z" level=debug msg="Cleaning up container 52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230"
time="2019-12-13T09:54:53Z" level=debug msg="Network is already cleaned up, skipping..."
time="2019-12-13T09:54:53Z" level=debug msg="[graphdriver] trying provided driver \"overlay\""
time="2019-12-13T09:54:53Z" level=debug msg="cached value indicated that overlay is supported"
time="2019-12-13T09:54:53Z" level=debug msg="cached value indicated that metacopy is not being used"
time="2019-12-13T09:54:53Z" level=debug msg="backingFs=xfs, projectQuotaSupported=false, useNativeDiff=true, usingMetacopy=false"
time="2019-12-13T09:54:53Z" level=debug msg="Failed to unmount d96d4811c6ec0ebea8623153b59fd16666dead08aa7b17dcb3de0247a3b7b9d3 overlay: /var/lib/containers/storage/overlay/d96d4811c6ec0ebea8623153b59fd16666dead08aa7b17dcb3de0247a3b7b9d3/merged - invalid argument"
time="2019-12-13T09:54:53Z" level=debug msg="Failed to remove mountpoint d96d4811c6ec0ebea8623153b59fd16666dead08aa7b17dcb3de0247a3b7b9d3 overlay: /var/lib/containers/storage/overlay/d96d4811c6ec0ebea8623153b59fd16666dead08aa7b17dcb3de0247a3b7b9d3/merged - device or resource busy"

Both conditions can currently impact container-puppet (and probably
paunch) and cause spurious failures in CI jobs.
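For reference, a caller could apply the "podman rm --storage" workaround along these lines. This is a minimal sketch, not the actual container-puppet/paunch code; the `retry_create` helper is hypothetical, and the `PODMAN` override exists only to make the sketch testable:

```shell
# retry_create NAME IMAGE...: run a container by name; if the run fails
# (e.g. "name is already in use"), force-clean any leftover container
# storage and retry once. PODMAN defaults to the real podman binary.
retry_create() {
  name=$1; shift
  if ! "${PODMAN:-podman}" run --rm --name "$name" "$@"; then
    # Force-remove storage that survived the racy cleanup [4].
    "${PODMAN:-podman}" rm --storage "$name" || true
    "${PODMAN:-podman}" run --rm --name "$name" "$@"
  fi
}
```

Note that this retries on any run failure; a real caller would first match the "name is already in use" error. And even then, as the debug log above shows, "podman rm --storage" itself can fail while the stale overlay mount is still busy.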

[1] podman-1.5.1-3.el7.x86_64
[2] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_6f3/698580/1/gate/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/6f39b33/logs/undercloud/home/zuul/overcloud_update_run_Controller.log.txt.gz
[3] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_6f3/698580/1/gate/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/6f39b33/logs/undercloud/var/log/extra/podman/podman_allinfo.log.txt.gz
[4] https://github.com/containers/libpod/pull/3330

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.opendev.org/698962

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to paunch (master)

Fix proposed to branch: master
Review: https://review.opendev.org/698999

Changed in tripleo:
assignee: Damien Ciabrini (dciabrin) → Luke Short (ekultails)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.opendev.org/699737

Changed in tripleo:
assignee: Luke Short (ekultails) → Damien Ciabrini (dciabrin)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/699803

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Damien Ciabrini (<email address hidden>) on branch: master
Review: https://review.opendev.org/698962
Reason: abandoned in favour of Ibf4aab00c43597681ca27c876d9a6725b0a3a1f0 that is shared with paunch

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/699737
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=acd176e811d39aad1d5cbc64ab41e46dd825daf9
Submitter: Zuul
Branch: master

commit acd176e811d39aad1d5cbc64ab41e46dd825daf9
Author: Damien Ciabrini <email address hidden>
Date: Wed Dec 18 18:29:19 2019 +0100

    Limit concurrency in container-puppet to image pull

    Due to a race in podman 1.6.3, when containers are created in
    parallel under high disk IO, their associated overlayFS mount point
    might become invalid, and container-puppet would fail to run.

    In order to avoid the bug, run all docker-puppet containers
    sequentially to avoid concurrent container creation. Will still
    pull the images concurrently, so the time loss due to lack of
    parallelism is only limited to puppet runs.

    Change-Id: I29a9ffc621bcbd5e1b09490c5fe09a363286bb8b
    Closes-Bug: #1856324
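The shape of the merged mitigation can be sketched as follows. This is a hedged illustration only, not the actual container-puppet.py code; `deploy`, `pull_image`, and `run_container` are hypothetical names:

```python
from concurrent.futures import ThreadPoolExecutor

def deploy(configs, pull_image, run_container, workers=4):
    """Pull all images concurrently, then run containers one at a time.

    Image pulls are safe to parallelize, so most of the time saved by
    concurrency is kept; only container creation is serialized, which
    avoids the overlay storage race triggered by parallel creation.
    """
    # Phase 1: pull each distinct image concurrently.
    images = {c["image"] for c in configs}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(pull_image, images))
    # Phase 2: create and run containers strictly sequentially.
    return [run_container(c) for c in configs]
```
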

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Damien Ciabrini (<email address hidden>) on branch: master
Review: https://review.opendev.org/699803
Reason: Newer podman doesn't suffer that race, so no need to make things more complex. Abandoning this review accordingly

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 12.1.0

This issue was fixed in the openstack/tripleo-heat-templates 12.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on paunch (master)

Change abandoned by Luke Short (<email address hidden>) on branch: master
Review: https://review.opendev.org/698999
Reason: As we transition CI from CentOS 7.7 (Podman 1.4.4-4) to CentOS 8.1 (Podman 1.6.4-2) this should no longer be an issue. Production deployments on CentOS 7 should only be using docker.
