Dying containers during undercloud-upgrade

Bug #1845935 reported by Jose Luis Franco
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
High
Unassigned

Bug Description

It's being observed in the master patches jobs that the undercloud-upgrade job is failing because it can bring up the new tripleomaster containers it tries to upgrade too.

Job logs: https://storage.bhs1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_9cd/681647/3/check/tripleo-ci-centos-7-containerized-undercloud-upgrades/9cdb462/

2019-09-27 06:28:20 | TASK [Debug output for task: Run container-puppet tasks (generate config) during step 1] ***
2019-09-27 06:28:20 | Friday 27 September 2019 06:28:20 +0000 (0:13:30.037) 0:28:12.335 ******
2019-09-27 06:28:21 | fatal: [undercloud]: FAILED! => {
2019-09-27 06:28:21 | "failed_when_result": true,
2019-09-27 06:28:21 | "outputs.stdout_lines | default([]) | union(outputs.stderr_lines | default([]))": [
2019-09-27 06:28:21 | "2019-09-27 06:14:52,045 INFO: 153808 -- Running container-puppet",
2019-09-27 06:28:21 | "2019-09-27 06:14:52,049 INFO: 153808 -- Service compilation completed.",
2019-09-27 06:28:21 | "2019-09-27 06:14:52,050 INFO: 153808 -- Starting multiprocess configuration steps. Using 6 processes.",
2019-09-27 06:28:21 | "2019-09-27 06:14:52,080 INFO: 153850 -- Starting configuration of heat_api using image docker.io/tripleomaster/centos-binary-heat-api:current-tripleo",
2019-09-27 06:28:21 | "2019-09-27 06:14:52,081 INFO: 153851 -- Starting configuration of ironic_api using image docker.io/tripleomaster/centos-binary-ironic-api:current-tripleo",
2019-09-27 06:28:21 | "2019-09-27 06:14:52,082 INFO: 153852 -- Starting configuration of swift_ringbuilder using image docker.io/tripleomaster/centos-binary-swift-proxy-server:current-tripleo",
2019-09-27 06:28:21 | "2019-09-27 06:14:52,083 INFO: 153854 -- Starting configuration of ironic using image docker.io/tripleomaster/centos-binary-ironic-pxe:current-tripleo",
2019-09-27 06:28:21 | "2019-09-27 06:14:52,083 INFO: 153853 -- Starting configuration of mysql using image docker.io/tripleomaster/centos-binary-mariadb:current-tripleo",
2019-09-27 06:28:21 | "2019-09-27 06:14:52,085 INFO: 153855 -- Starting configuration of barbican using image docker.io/tripleomaster/centos-binary-barbican-api:current-tripleo",
2019-09-27 06:28:21 | "2019-09-27 06:14:52,352 INFO: 153855 -- Removing container: container-puppet-barbican-quz0jq0y",
2019-09-27 06:28:21 | "2019-09-27 06:14:52,373 INFO: 153851 -- Removing container: container-puppet-ironic_api-8gzzfkrr",
2019-09-27 06:28:21 | "2019-09-27 06:14:52,385 INFO: 153852 -- Removing container: container-puppet-swift_ringbuilder-sw42djxe",
2019-09-27 06:28:21 | "2019-09-27 06:14:52,399 INFO: 153854 -- Removing container: container-puppet-ironic-zwr1mod0",
2019-09-27 06:28:21 | "2019-09-27 06:14:52,435 INFO: 153853 -- Removing container: container-puppet-mysql-hvhbvx6r",
2019-09-27 06:28:21 | "2019-09-27 06:14:52,456 INFO: 153850 -- Removing container: container-puppet-heat_api-7zzpyw4w",
2019-09-27 06:28:21 | "2019-09-27 06:14:53,466 INFO: 153855 -- Pulling image: docker.io/tripleomaster/centos-binary-barbican-api:current-tripleo",
2019-09-27 06:28:21 | "2019-09-27 06:14:53,478 INFO: 153852 -- Pulling image: docker.io/tripleomaster/centos-binary-swift-proxy-server:current-tripleo",
2019-09-27 06:28:21 | "2019-09-27 06:14:53,503 INFO: 153851 -- Pulling image: docker.io/tripleomaster/centos-binary-ironic-api:current-tripleo",
2019-09-27 06:28:21 | "2019-09-27 06:14:53,530 INFO: 153853 -- Pulling image: docker.io/tripleomaster/centos-binary-mariadb:current-tripleo",
2019-09-27 06:28:21 | "2019-09-27 06:14:53,549 INFO: 153854 -- Pulling image: docker.io/tripleomaster/centos-binary-ironic-pxe:current-tripleo",
2019-09-27 06:28:21 | "2019-09-27 06:14:53,612 INFO: 153850 -- Pulling image: docker.io/tripleomaster/centos-binary-heat-api:current-tripleo",
2019-09-27 06:28:21 | "2019-09-27 06:17:59,483 ERROR: 153853 -- ['/usr/bin/podman', 'run', '--user', 'root', '--name', u'container-puppet-mysql-hvhbvx6r', '--env', u'PUPPET_TAGS=file,file_line,concat,augeas,cron,file', '--env', u'NAME=mysql', '--env', u'HOSTNAME=undercloud', '--env', 'NO_ARCHIVE=', '--env', 'STEP=6', '--env', 'NET_HOST=true', '--env', 'DEBUG=False', '--volume', '/etc/localtime:/etc/localtime:ro', '--volume', '/tmp/tmp7Cpyif:/etc/config.pp:ro', '--volume', '/etc/puppet/:/tmp/puppet-etc/:ro', '--volume', '/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume', '/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume', '/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume', '/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume', '/var/lib/config-data:/var/lib/config-data/:rw', '--volume', '/var/lib/container-puppet/puppetlabs/facter.conf:/etc/puppetlabs/facter/facter.conf:ro', '--volume', '/var/lib/container-puppet/puppetlabs/:/opt/puppetlabs/:ro', '--volume', '/dev/log:/dev/log:rw', '--rm', '--log-driver', 'k8s-file', '--log-opt', u'path=/var/log/containers/stdouts/container-puppet-mysql-hvhbvx6r.log', '--security-opt', 'label=disable', '--volume', '/usr/share/openstack-puppet/modules/:/usr/share/openstack-puppet/modules/:ro', '--entrypoint', '/var/lib/container-puppet/container-puppet.sh', '--net', 'host', '--volume', '/etc/hosts:/etc/hosts:ro', '--volume', '/var/lib/container-puppet/container-puppet.sh:/var/lib/container-puppet/container-puppet.sh:ro', u'docker.io/tripleomaster/centos-binary-mariadb:current-tripleo'] run failed after + mkdir -p /etc/puppet",

When folling up the container container-puppet-mysql-hvhbvx6r in the system logs we can see the following:

Sep 27 06:16:43 undercloud.localdomain podman[158411]: 2019-09-27 06:16:43.637386442 +0000 UTC m=+0.478621202 container create 26fd6bca4c121c0dd42c540c9e3ddf3a550c6ed62da985ddd3750a8e58cfbd33 (image=docker.io/tripleomaster/centos-binary-mariadb:current-tripleo, name=container-puppet-mysql-hvhbvx6r)
Sep 27 06:16:43 undercloud.localdomain systemd[1]: Started libpod-conmon-26fd6bca4c121c0dd42c540c9e3ddf3a550c6ed62da985ddd3750a8e58cfbd33.scope.
Sep 27 06:16:44 undercloud.localdomain systemd[1]: Started libcontainer container 26fd6bca4c121c0dd42c540c9e3ddf3a550c6ed62da985ddd3750a8e58cfbd33.
Sep 27 06:16:45 undercloud.localdomain podman[158411]: 2019-09-27 06:16:45.141450029 +0000 UTC m=+1.982684733 container init 26fd6bca4c121c0dd42c540c9e3ddf3a550c6ed62da985ddd3750a8e58cfbd33 (image=docker.io/tripleomaster/centos-binary-mariadb:current-tripleo, name=container-puppet-mysql-hvhbvx6r)
Sep 27 06:16:45 undercloud.localdomain podman[158411]: 2019-09-27 06:16:45.191456641 +0000 UTC m=+2.032691937 container start 26fd6bca4c121c0dd42c540c9e3ddf3a550c6ed62da985ddd3750a8e58cfbd33 (image=docker.io/tripleomaster/centos-binary-mariadb:current-tripleo, name=container-puppet-mysql-hvhbvx6r)
Sep 27 06:16:45 undercloud.localdomain podman[158411]: 2019-09-27 06:16:45.191704809 +0000 UTC m=+2.032940285 container attach
Sep 27 06:17:55 undercloud.localdomain podman[158411]: 2019-09-27 06:17:55.956049756 +0000 UTC m=+72.797284334 container died 26fd6bca4c121c0dd42c540c9e3ddf3a550c6ed62da985ddd3750a8e58cfbd33 (image=docker.io/tripleomaster/centos-binary-mariadb:current-tripleo, name=container-puppet-mysql-hvhbvx6r)
Sep 27 06:17:56 undercloud.localdomain podman[158411]: 2019-09-27 06:17:56.440265698 +0000 UTC m=+73.281500375 container remove 26fd6bca4c121c0dd42c540c9e3ddf3a550c6ed62da985ddd3750a8e58cfbd33 (image=docker.io/tripleomaster/centos-binary-mariadb:current-tripleo, name=container-puppet-mysql-hvhbvx6r)

And when checking if any of the tripleomaster containers were started, we can't see any in the podman ps output:
https://storage.bhs1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_9cd/681647/3/check/tripleo-ci-centos-7-containerized-undercloud-upgrades/9cdb462/logs/undercloud/var/log/extra/podman/podman_allinfo.log.txt.gz

Revision history for this message
Jose Luis Franco (jfrancoa) wrote :

While trying to narrow down in which patch did the job start failing, this one looked like a potential one: https://review.opendev.org/681647 . Tried to revert it to confirm if that's the case: https://review.opendev.org/#/c/685672/

Changed in tripleo:
milestone: none → train-rc1
status: New → Triaged
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Some observations:

in the system journal:
Sep 27 05:57:34 undercloud.localdomain podman[117006]: 2019-09-27 05:57:34.34739525 +0000 UTC m=+1.943270211 container died 5451dfb68fa4509f0f6199d046288a9421ccdd8a82a7ae56865f83f5c4c9447d (image=docker.io/tripleostein/centos-binary-mariadb:current-tripleo, name=mysql)
Sep 27 05:57:39 undercloud.localdomain podman[117163]: 2019-09-27 05:57:39.299614887 +0000 UTC m=+0.190275281 container died 5451dfb68fa4509f0f6199d046288a9421ccdd8a82a7ae56865f83f5c4c9447d (image=docker.io/tripleostein/centos-binary-mariadb:current-tripleo, name=mysql)

in podman inspect:

+ podman inspect mysql
[
    {
        "Id": "5451dfb68fa4509f0f6199d046288a9421ccdd8a82a7ae56865f83f5c4c9447d",
...
        "State": {
            "OciVersion": "1.0.1-dev",
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 117615,
            "ConmonPid": 117591,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2019-09-27T05:57:40.788290976Z",
            "FinishedAt": "2019-09-27T05:57:39.058498867Z",

This shows the failures logged in the journal prepended the normal execution of the container later, and the most recent failure event 2019-09-27 05:57:39.299614887 seems related to that was logged as FinishedAt.

There is also totally a 131 "container died" events had been logged for that example CI job.

Changed in tripleo:
milestone: train-rc1 → ussuri-1
Changed in tripleo:
milestone: ussuri-1 → ussuri-2
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-2 → ussuri-3
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-3 → ussuri-rc3
Revision history for this message
Jose Luis Franco (jfrancoa) wrote :

Issue already fixed, CI jobs passing.

Changed in tripleo:
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.