The version of podman that we currently consume in CI [1] has a race
that manifests under heavy disk I/O load. When the race triggers, a
container <C> exits but podman does not clean up everything as
expected in the container storage, so it becomes impossible to
create a new container <C2> with the same name as <C>.
For example, we see the following race manifesting while
container-puppet.py regenerates the configuration files of
all services after it ran once successfully [2]:
2019-12-12 20:15:20 | "2019-12-12 20:14:45,542 ERROR: 40368 -- ['/usr/bin/podman', 'run', '--user', '0', '--name', u'container-puppet-memcached', '--env', u'PUPPET_TAGS=file,file_line,concat,augeas,cron,file', '--env', u'NAME=memcached', '--env', u'HOSTNAME=centos-7-rax-ord-0013361579', '--env', 'NO_ARCHIVE=', '--env', 'STEP=6', '--env', 'NET_HOST=true', '--env', 'DEBUG=True', '--volume', '/etc/localtime:/etc/localtime:ro', '--volume', '/tmp/tmp9FcIfj:/etc/config.pp:ro', '--volume', '/etc/puppet/:/tmp/puppet-etc/:ro', '--volume', '/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume', '/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume', '/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume', '/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume', '/var/lib/config-data:/var/lib/config-data/:rw', '--volume', '/var/lib/container-puppet/puppetlabs/facter.conf:/etc/puppetlabs/facter/facter.conf:ro', '--volume', '/var/lib/container-puppet/puppetlabs/:/opt/puppetlabs/:ro', '--volume', '/dev/log:/dev/log:rw', '--rm', '--log-driver', 'k8s-file', '--log-opt', u'path=/var/log/containers/stdouts/container-puppet-memcached.log', '--security-opt', 'label=disable', '--volume', '/usr/share/openstack-puppet/modules/:/usr/share/openstack-puppet/modules/:ro', '--entrypoint', '/var/lib/container-puppet/container-puppet.sh', '--net', 'host', '--volume', '/etc/hosts:/etc/hosts:ro', '--volume', '/var/lib/container-puppet/container-puppet.sh:/var/lib/container-puppet/container-puppet.sh:ro', u'192.168.24.1:8787/tripleotrain/centos-binary-memcached:45b91c3f09b20dcdbf685ff265919a7a8906698d_1b3fd27f-updated-20191212184057'] run failed after Error: error creating container storage: the container name \"container-puppet-memcached\" is already in use by \"fd8614e0e22782af1921bb5a08b41398248c9b5d2dfd1b9113f66352097f9db8\". 
You have to remove that container to be able to reuse that name.: that name is already in use",
After the first successful run, no container should remain, because
container-puppet.py creates them with "--rm". Still, sometimes one
cannot recreate a new container with the same name, even when no
container with that name shows up in "podman ps -a" [3].
To work around this race, podman has a dedicated command, "podman rm
--storage", that can force-clean any leftover content in the
container-storage layer [4]. But even then, the command sometimes
fails because the overlayfs mount created with the original
container was improperly unmounted on container exit. For instance,
with debug mode enabled in podman:
time="2019-12-13T09:54:53Z" level=debug msg="New container created \"52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230\""
time="2019-12-13T09:54:53Z" level=debug msg="container \"52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230\" has CgroupParent \"machine.slice/libpod-52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230.scope\""
time="2019-12-13T09:54:53Z" level=debug msg="Not attaching to stdin"
time="2019-12-13T09:54:53Z" level=debug msg="overlay: mount_data=lowerdir=/var/lib/containers/storage/overlay/l/PXENLYOWV4LLIA72N6XBX7B6BZ:/var/lib/containers/storage/overlay/l/2LFI74ZECHEFJCII3OGJGWQ2NH:/var/lib/containers/storage/overlay/l/SKCDFS7CPZHE5UVCWPH66IMW5Q:/var/lib/containers/storage/overlay/l/UV2DAVP6DOLCTTOQ2QC37WOKPS:/var/lib/containers/storage/overlay/l/3RKD3IBFKCXALJZP3D4OQ7QWRB:/var/lib/containers/storage/overlay/l/EKWSVNAGJD3ZNLTMAFC7R3QHJD,upperdir=/var/lib/containers/storage/overlay/d96d4811c6ec0ebea8623153b59fd16666dead08aa7b17dcb3de0247a3b7b9d3/diff,workdir=/var/lib/containers/storage/overlay/d96d4811c6ec0ebea8623153b59fd16666dead08aa7b17dcb3de0247a3b7b9d3/work,context=\"system_u:object_r:container_file_t:s0:c177,c448\""
time="2019-12-13T09:54:53Z" level=debug msg="mounted container \"52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230\" at \"/var/lib/containers/storage/overlay/d96d4811c6ec0ebea8623153b59fd16666dead08aa7b17dcb3de0247a3b7b9d3/merged\""
time="2019-12-13T09:54:53Z" level=debug msg="Created root filesystem for container 52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230 at /var/lib/containers/storage/overlay/d96d4811c6ec0ebea8623153b59fd16666dead08aa7b17dcb3de0247a3b7b9d3/merged"
time="2019-12-13T09:54:53Z" level=debug msg="adding entry to /etc/passwd for non existent default user"
time="2019-12-13T09:54:53Z" level=debug msg="/etc/system-fips does not exist on host, not mounting FIPS mode secret"
time="2019-12-13T09:54:53Z" level=warning msg="User mount overriding libpod mount at \"/etc/hosts\""
time="2019-12-13T09:54:53Z" level=debug msg="Setting CGroups for container 52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230 to machine.slice:libpod:52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230"
time="2019-12-13T09:54:53Z" level=debug msg="reading hooks from /usr/share/containers/oci/hooks.d"
time="2019-12-13T09:54:53Z" level=debug msg="added hook /usr/share/containers/oci/hooks.d/oci-systemd-hook.json"
time="2019-12-13T09:54:53Z" level=debug msg="hook oci-systemd-hook.json did not match"
time="2019-12-13T09:54:53Z" level=debug msg="reading hooks from /etc/containers/oci/hooks.d"
time="2019-12-13T09:54:53Z" level=debug msg="Created OCI spec for container 52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230 at /var/lib/containers/storage/overlay-containers/52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230/userdata/config.json"
time="2019-12-13T09:54:53Z" level=debug msg="/usr/bin/conmon messages will be logged to syslog"
# conmon was launched; it then spawns runc. runc receives a single
# bundle path (the one after -b) where everything is overlaid and
# bind-mounted. The bundle presumably also contains a config.json
# that carries the rest of the runtime configuration for runc, since
# not everything can be passed via args.
time="2019-12-13T09:54:53Z" level=debug msg="running conmon: /usr/bin/conmon" args="[--api-version 1 -s -c 52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230 -u 52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230 -r /usr/bin/runc -b /var/lib/containers/storage/overlay-contai ers/52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230/userdata -p /var/run/containers/storage/overlay-containers/52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230/userdata/pidfile -l k8s-file:/var/log/containers/stdouts/container-puppet-heat_api.log --exit-dir /var/run/libpod/exits --socket-dir-path /var/run/libpod/socket --log-level debug --syslog --conmon-pidfile /var/run/containers/storage/overlay-containers/52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /var/run/containers/storage --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /var/run/libpod --exit-command-arg --runtime --exit-command-arg runc --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg --syslog --exit-command-arg true --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230]"
time="2019-12-13T09:54:53Z" level=info msg="Running conmon under slice machine.slice and unitName libpod-conmon-52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230.scope"
time="2019-12-13T09:54:53Z" level=debug msg="Received: -1"
time="2019-12-13T09:54:53Z" level=debug msg="Cleaning up container 52086fca7a2ef9b46ab2a0d07184b888946bfc158232ca8b6c2c9a89b576a230"
time="2019-12-13T09:54:53Z" level=debug msg="Network is already cleaned up, skipping..."
time="2019-12-13T09:54:53Z" level=debug msg="[graphdriver] trying provided driver \"overlay\""
time="2019-12-13T09:54:53Z" level=debug msg="cached value indicated that overlay is supported"
time="2019-12-13T09:54:53Z" level=debug msg="cached value indicated that metacopy is not being used"
time="2019-12-13T09:54:53Z" level=debug msg="backingFs=xfs, projectQuotaSupported=false, useNativeDiff=true, usingMetacopy=false"
time="2019-12-13T09:54:53Z" level=debug msg="Failed to unmount d96d4811c6ec0ebea8623153b59fd16666dead08aa7b17dcb3de0247a3b7b9d3 overlay: /var/lib/containers/storage/overlay/d96d4811c6ec0ebea8623153b59fd16666dead08aa7b17dcb3de0247a3b7b9d3/merged - invalid argument"
time="2019-12-13T09:54:53Z" level=debug msg="Failed to remove mountpoint d96d4811c6ec0ebea8623153b59fd16666dead08aa7b17dcb3de0247a3b7b9d3 overlay: /var/lib/containers/storage/overlay/d96d4811c6ec0ebea8623153b59fd16666dead08aa7b17dcb3de0247a3b7b9d3/merged - device or resource busy"
Both conditions can currently impact container-puppet (and probably
paunch) and cause spurious failures in CI jobs.
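Until podman itself is fixed, callers such as container-puppet.py could
paper over the race by retrying after a storage cleanup. A hypothetical
sketch (the helper name and the injectable runner are illustrative;
only the "podman rm --storage" command comes from podman):

```python
import subprocess

def run_with_storage_cleanup(cmd, name, runner=None, max_retries=2):
    """Run a podman command; if it fails with the 'name is already in
    use' race, force-clean leftover storage with 'podman rm --storage'
    and retry.

    'runner' takes an argv list, returns (returncode, stderr), and is
    injectable for testing; by default it shells out via subprocess.
    """
    if runner is None:
        def runner(args):
            proc = subprocess.run(args, capture_output=True, text=True)
            return proc.returncode, proc.stderr
    rc, err = runner(cmd)
    for _ in range(max_retries):
        if rc == 0 or "already in use" not in err:
            break
        runner(["podman", "rm", "--storage", name])
        rc, err = runner(cmd)
    return rc
```

This only masks the symptom for containers created with "--rm"; it does
not address the improperly unmounted overlayfs, which still requires a
manual unmount when "podman rm --storage" itself fails.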
[1] podman-1.5.1-3.el7.x86_64
[2] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_6f3/698580/1/gate/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/6f39b33/logs/undercloud/home/zuul/overcloud_update_run_Controller.log.txt.gz
[3] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_6f3/698580/1/gate/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/6f39b33/logs/undercloud/var/log/extra/podman/podman_allinfo.log.txt.gz
[4] https://github.com/containers/libpod/pull/3330
Fix proposed to branch: master
Review: https://review.opendev.org/698962