RHEL8 scenario 1 standalone deployment failed with "The following containers failed validations and were not started: collectd" for master and train

Bug #1856278 reported by chandan kumar
This bug affects 1 person
Affects: tripleo
Importance: Critical
Assigned to: Cédric Jeanneret

Bug Description

The Master/Train scenario 1 standalone deployment failed on RHEL8 with the following error:
http://logs.rdoproject.org/openstack-periodic-latest-released/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario001-standalone-train/384d9a9/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

2019-12-12 22:56:45 | "+ command -v python3",
2019-12-12 22:56:45 | "+ python3 /container-config-scripts/nova_wait_for_compute_service.py",
2019-12-12 22:56:45 | "The following containers failed validations and were not started: collectd"

Looking at the paunch log: http://logs.rdoproject.org/openstack-periodic-latest-released/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario001-standalone-train/384d9a9/logs/undercloud/var/log/paunch.log.txt.gz

2019-12-12 22:56:43.366 102944 DEBUG paunch [ ] Completed $ podman run --name ceilometer_gnocchi_upgrade --label config_id=tripleo_step5 --label container_name=ceilometer_gnocchi_upgrade --label managed_by=tripleo-Standalone --label config_data={"command": ["/usr/bin/bootstrap_host_exec", "ceilometer_agent_central", "su ceilometer -s /bin/bash -c 'for n in {1..10}; do /usr/bin/ceilometer-upgrade && exit 0 || sleep 30; done; exit 1'"], "detach": false, "healthcheck": {"test": "/openstack/healthcheck"}, "image": "192.168.24.1:8787/tripleotrain/rhel-binary-ceilometer-central:a8589c8a36e9984c5744c00a528d12bfe2c33e59_39b0634f-updated-20191212171530", "net": "host", "privileged": false, "start_order": 99, "user": "root", "volumes": ["/etc/hosts:/etc/hosts:ro", "/etc/localtime:/etc/localtime:ro", "/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro", "/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro", "/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro", "/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro", "/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro", "/dev/log:/dev/log", "/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro", "/etc/puppet:/etc/puppet:ro", "/var/lib/config-data/ceilometer/etc/ceilometer/:/etc/ceilometer/:ro", "/var/log/containers/ceilometer:/var/log/ceilometer:z"]} --conmon-pidfile=/var/run/ceilometer_gnocchi_upgrade.pid --log-driver k8s-file --log-opt path=/var/log/containers/stdouts/ceilometer_gnocchi_upgrade.log --net=host --privileged=false --user=root --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro 
--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro --volume=/etc/puppet:/etc/puppet:ro --volume=/var/lib/config-data/ceilometer/etc/ceilometer/:/etc/ceilometer/:ro --volume=/var/log/containers/ceilometer:/var/log/ceilometer:z --cpuset-cpus=0,1,2,3,4,5,6,7 192.168.24.1:8787/tripleotrain/rhel-binary-ceilometer-central:a8589c8a36e9984c5744c00a528d12bfe2c33e59_39b0634f-updated-20191212171530 /usr/bin/bootstrap_host_exec ceilometer_agent_central su ceilometer -s /bin/bash -c 'for n in {1..10}; do /usr/bin/ceilometer-upgrade && exit 0 || sleep 30; done; exit 1'
2019-12-12 22:56:43.366 102944 INFO paunch [ ] stdout:
2019-12-12 22:56:43.366 102944 INFO paunch [ ] stderr:
2019-12-12 22:56:43.366 102944 ERROR paunch [ ] The following containers failed validations and were not started: collectd

We are seeing the same issue on master as well:

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario001-standalone-master/c7ec105/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

and

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario001-standalone-master/c7ec105/logs/undercloud/var/log/paunch.log.txt.gz

https://review.opendev.org/#/c/697666/ and https://review.opendev.org/#/c/698570/1 were added to paunch to fix this bug: https://bugs.launchpad.net/tripleo/+bug/1855444

Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

2019-12-12 22:56:05.161 102944 DEBUG paunch [ ] Running container: collectd
2019-12-12 22:56:05.228 102944 DEBUG paunch [ ] $ podman ps -a --filter label=container_name=collectd --filter label=config_id=tripleo_step5 --format {{.Names}}
2019-12-12 22:56:05.328 102944 DEBUG paunch [ ] b''
2019-12-12 22:56:05.328 102944 DEBUG paunch [ ] b''
2019-12-12 22:56:05.328 102944 WARNING paunch [ ] Did not find container with "['podman', 'ps', '-a', '--filter', 'label=container_name=collectd', '--filter', 'label=config_id=tripleo_step5', '--format', '{{.Names}}']" - retrying without config_id
2019-12-12 22:56:05.328 102944 DEBUG paunch [ ] $ podman ps -a --filter label=container_name=collectd --format {{.Names}}
2019-12-12 22:56:05.432 102944 DEBUG paunch [ ] b''
2019-12-12 22:56:05.432 102944 DEBUG paunch [ ] b''
2019-12-12 22:56:05.432 102944 WARNING paunch [ ] Did not find container with "['podman', 'ps', '-a', '--filter', 'label=container_name=collectd', '--format', '{{.Names}}']"
2019-12-12 22:56:05.433 102944 DEBUG paunch [ ] Start container collectd as collectd.
2019-12-12 22:56:05.434 102944 DEBUG paunch [ ] Path seperator found in volume (/var/log/journal), but did not exist on the file system
2019-12-12 22:56:05.434 102944 ERROR paunch [ ] /var/log/journal is not a valid volume source
2019-12-12 22:56:05.434 102944 DEBUG paunch [ ] Validations failed. Skipping container: collectd

So apparently it wants to bind-mount /var/log/journal, but that path doesn't exist on the host filesystem?!

Sounds related to:
2742319ba7 (Martin Magr 2019-09-13 14:46:33 +0200 634) - /var/log/journal:/var/log/journal:ro

https://review.opendev.org/682039
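
The validation seen in the log above can be approximated with a short Python sketch. This is not paunch's actual code: the function name and the "src:dst[:opts]" parsing are assumptions made for illustration. It only shows the idea that a volume source which looks like a host path must exist before the container is started.

```python
import os


def invalid_volume_sources(volumes):
    """Return the host paths from 'src:dst[:opts]' entries that do not exist.

    Hypothetical helper, not part of paunch.
    """
    missing = []
    for volume in volumes:
        source = volume.split(":", 1)[0]
        # Only sources containing a path separator are treated as host
        # paths; a bare name would be a named volume and is not checked.
        if os.sep in source and not os.path.exists(source):
            missing.append(source)
    return missing


if __name__ == "__main__":
    collectd_volumes = ["/var/log/journal:/var/log/journal:ro"]
    for path in invalid_volume_sources(collectd_volumes):
        print(f"{path} is not a valid volume source")
```

On a host without /var/log/journal, the collectd volume list would yield one invalid source, which matches the "Validations failed. Skipping container: collectd" outcome.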

Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

So, quick summary:
although the patch I pointed to was merged weeks ago, it's still the root cause.

A new patch landed recently that enforces strict failure when the source of a bind-mount does not exist on the host:
https://review.opendev.org/697666

This last patch brings to light the absence of the /var/log/journal directory on the system (coincidentally, I'm deploying an undercloud against OSP-16 (Train) on a RHEL-8 node, and indeed, that directory is absent).

Changed in tripleo:
assignee: nobody → Martin Mágr (mmagr)
status: Confirmed → In Progress
tags: added: train-backport-potential
Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

Some more info:
- /var/log/journal should be created by the systemd package
- journald is configured with "auto" Storage (the default)
- since we're using a VM, I suspect the VM image build drops all the logs upon creation and removes the /var/log/journal directory in order to provide a clean env (which makes sense)

The "auto" Storage means:
- if /var/log/journal exists, it will switch to "persistent"
- if it doesn't, it will use /run/log/journal, which is volatile (removed upon reboot)
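
The two rules above can be written down as a small Python sketch. It only mirrors the decision as described in this comment; the function name and its parameters are hypothetical, and the real logic lives inside journald itself.

```python
import os


def journald_auto_storage(persistent_dir="/var/log/journal",
                          volatile_dir="/run/log/journal"):
    """Return (mode, path) as described for Storage=auto.

    Hypothetical helper for illustration only.
    """
    # The persistent location is used only if the directory already exists.
    if os.path.isdir(persistent_dir):
        return ("persistent", persistent_dir)
    # Otherwise the journal stays in the volatile location, lost on reboot.
    return ("volatile", volatile_dir)


if __name__ == "__main__":
    mode, path = journald_auto_storage()
    print(f"journald would use {mode} storage at {path}")
```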

In order to move to persistent storage, we just need to ensure /var/log/journal exists with the correct rights (and setype); journald will switch to it immediately. We can even sync the volatile journal to the persistent one using "journalctl --flush", but it might take some time.
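
As a rough illustration of that step, here is a minimal Python sketch that creates the directory and optionally calls "journalctl --flush". The function name, the 0o755 mode and the flush parameter are assumptions made for this example; ownership and the SELinux type (setype) are deliberately not handled, and this is not the change that was merged in tripleo-heat-templates.

```python
import os
import subprocess


def ensure_journal_dir(path="/var/log/journal", mode=0o755, flush=False):
    """Create the journal directory if needed; return True if it was created.

    Hypothetical helper: the mode is an assumption, and ownership and
    SELinux labelling are left out.
    """
    existed = os.path.isdir(path)
    os.makedirs(path, exist_ok=True)
    os.chmod(path, mode)
    if flush:
        # Ask journald to move the volatile journal to persistent storage.
        subprocess.run(["journalctl", "--flush"], check=True)
    return not existed


if __name__ == "__main__":
    # Needs root on a real host; flush is left off by default.
    created = ensure_journal_dir()
    print("created" if created else "already present")
```

The call is idempotent, so it can safely run on hosts where the directory is already present.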

More to read here:
https://www.golinuxcloud.com/enable-persistent-logging-in-systemd-journald/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/698914

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Martin Mágr (<email address hidden>) on branch: master
Review: https://review.opendev.org/698889
Reason: Abandoning in favor of https://review.opendev.org/#/c/698914

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/698914
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=9cbfdfa1466ca38e7dd9f3df800993b5e2f2571d
Submitter: Zuul
Branch: master

commit 9cbfdfa1466ca38e7dd9f3df800993b5e2f2571d
Author: Cédric Jeanneret <email address hidden>
Date: Fri Dec 13 14:52:22 2019 +0100

    Ensure /var/log/journal exists as soon as possible

    It might happen the /var/log/journal directory doesn't exist on a host,
    especially if this host is a VM (the image creation usually involves
    deep cleaning of the logs).

    This absence might lead to log loss with the default journald
    configuration, which uses the "auto" Storage.
    This "auto" means that:
    - if /var/log/journal exists, journald will use it, and it will be
    persistent
    - if /var/log/journal doesn't exist, journald will use a volatile
    location, /run/log/journal, which is dropped upon system reboot

    Since logs are important, and "old" logs might be useful after a reboot,
    it's better to ensure we have persistent storage for journald.

    Related-Bug: #1856278
    Change-Id: I93dcc57aff63b91dab475b0c114b278324434e41

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/699161

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.opendev.org/699425

Changed in tripleo:
assignee: Martin Mágr (mmagr) → Cédric Jeanneret (cjeanner)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/train)

Change abandoned by Cédric Jeanneret (Tengu) (<email address hidden>) on branch: stable/train
Review: https://review.opendev.org/699161
Reason: The backport of this change will be better:
https://review.opendev.org/699425

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/699438

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/699425
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=f7a35600655d4c4744425b87bb41fbdeceb9d3bb
Submitter: Zuul
Branch: master

commit f7a35600655d4c4744425b87bb41fbdeceb9d3bb
Author: Cédric Jeanneret <email address hidden>
Date: Tue Dec 17 15:17:29 2019 +0100

    Create /var/log/journal directory during step-0

    While I93dcc57aff63b91dab475b0c114b278324434e41 did the right thing, it
    didn't do it in the right place.

    This patch adds the creation without removing the misplaced code, so
    that we can do an easy backport to Train.
    A follow-up will be the actual revert of the other change.

    Change-Id: Id937389e4eda5c1fc4634ab14695390a09300468
    Closes-Bug: #1856278

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/train)

Change abandoned by Cédric Jeanneret (Tengu) (<email address hidden>) on branch: stable/train
Review: https://review.opendev.org/699438
Reason: Gate is RED

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/699438
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=88492a9656a47011f3dd87ec1a387ae97465d340
Submitter: Zuul
Branch: stable/train

commit 88492a9656a47011f3dd87ec1a387ae97465d340
Author: Cédric Jeanneret <email address hidden>
Date: Tue Dec 17 15:17:29 2019 +0100

    Create /var/log/journal directory during step-0

    While I93dcc57aff63b91dab475b0c114b278324434e41 did the right thing, it
    didn't do it in the right place.

    This patch adds the creation without removing the misplaced code, so
    that we can do an easy backport to Train.
    A follow-up will be the actual revert of the other change.

    Change-Id: Id937389e4eda5c1fc4634ab14695390a09300468
    Closes-Bug: #1856278
    (cherry picked from commit f7a35600655d4c4744425b87bb41fbdeceb9d3bb)

tags: added: in-stable-train
Revision history for this message
Marios Andreou (marios-b) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.3.1

This issue was fixed in the openstack/tripleo-heat-templates 11.3.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 12.1.0

This issue was fixed in the openstack/tripleo-heat-templates 12.1.0 release.
