centos-7-containerized-undercloud-upgrades fails in "Start or restart systemd services"

Bug #1877449 reported by wes hayutin
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Emilien Macchi

Bug Description

2020-05-07 19:41:18 | TASK [tripleo-container-manage : Start or restart systemd services] ************
2020-05-07 19:41:18 | Thursday 07 May 2020 19:41:18 +0000 (0:00:00.259) 0:27:29.574 **********
2020-05-07 19:41:19 | fatal: [undercloud]: FAILED! => {"msg": "The conditional check 'systemd_service_enable.status.Result == \"success\"' failed. The error was: error while evaluating conditional (systemd_service_enable.status.Result == \"success\"): 'dict object' has no attribute 'status'"}

https://da888ef809002d84bac8-a2fbaff4aeaa6e4af27d13483a4b50b0.ssl.cf1.rackcdn.com/726125/1/check/tripleo-ci-centos-7-containerized-undercloud-upgrades/1c50a18/logs/undercloud/home/zuul/undercloud_upgrade.log

https://66f2724215a8c9ca9bcc-67113804333a37bb92970c493e5932c8.ssl.cf5.rackcdn.com/726209/1/check/tripleo-ci-centos-7-containerized-undercloud-upgrades/7018864/logs/undercloud/home/zuul/undercloud_upgrade.log
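
The traceback blows up because the registered variable has no `status` attribute yet when the conditional is evaluated. A minimal Python analogue (illustrative only, not TripleO code; the dict shapes are assumptions mirroring the registered task result) shows the same failure mode and the defensive check:

```python
# Illustrative analogue of the Ansible "until" condition; the variable
# names mirror the task's registered result, but the dicts are made up here.
def strict_check(result):
    # Mirrors: systemd_service_enable.status.Result == "success"
    # Raises if "status" has not been registered yet.
    return result["status"]["Result"] == "success"

def relaxed_check(result):
    # Mirrors the eventual fix: first check that "status" is defined.
    return "status" in result and result["status"].get("Result") == "success"

# Before systemd has reported back, the registered result has no "status" key.
early = {"changed": False}
late = {"changed": True, "status": {"Result": "success"}}

try:
    strict_check(early)
except KeyError as exc:
    # Analogous to: 'dict object' has no attribute 'status'
    print(f"strict check failed early: missing key {exc}")

print(relaxed_check(early))  # safe: returns False so the task can retry
print(relaxed_check(late))   # returns True once the service has started
```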

=========== ERROR HERE ================================================
https://da888ef809002d84bac8-a2fbaff4aeaa6e4af27d13483a4b50b0.ssl.cf1.rackcdn.com/726125/1/check/tripleo-ci-centos-7-containerized-undercloud-upgrades/1c50a18/logs/undercloud/var/log/extra/journal.txt

May 07 19:41:19 undercloud.localdomain systemd[1]: Starting haproxy container...
May 07 19:41:19 undercloud.localdomain podman[200998]: Error: unable to find container haproxy: no container with name or ID haproxy found: no such container
May 07 19:41:19 undercloud.localdomain systemd[1]: tripleo_haproxy.service: control process exited, code=exited status=125
May 07 19:41:19 undercloud.localdomain systemd[1]: Failed to start haproxy container.
May 07 19:41:19 undercloud.localdomain systemd[1]: Unit tripleo_haproxy.service entered failed state.
May 07 19:41:19 undercloud.localdomain systemd[1]: tripleo_haproxy.service failed.

Revision history for this message
wes hayutin (weshayutin) wrote :

192.168.24.1:8787/tripleostein/centos-binary-haproxy 687919e2bdd9e558da8af67c434a9c0e068aa4e5_9aca342f 540d9c702b25 10 days ago 692 MB

listed in https://da888ef809002d84bac8-a2fbaff4aeaa6e4af27d13483a4b50b0.ssl.cf1.rackcdn.com/726125/1/check/tripleo-ci-centos-7-containerized-undercloud-upgrades/1c50a18/logs/undercloud/var/log/extra/podman/podman_allinfo.log

Changed in tripleo:
assignee: nobody → Emilien Macchi (emilienm)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/726277

Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/726278

Revision history for this message
Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

Hi,

Considering this and https://bugs.launchpad.net/tripleo/+bug/1876893, I think it would make sense to add tripleo-ci-centos-7-containerized-undercloud-upgrades to the list of jobs for tripleo-ansible. WDYT?

Revision history for this message
Emilien Macchi (emilienm) wrote :

So I found a few issues by investigating that patch:

* systemd fails to start haproxy container because it can't find it:

May 08 06:03:45 undercloud.localdomain podman[222695]: Error: unable to find container haproxy: no container with name or ID haproxy found: no such container
May 08 06:03:45 undercloud.localdomain systemd[1]: tripleo_haproxy.service: control process exited, code=exited status=125
May 08 06:03:45 undercloud.localdomain systemd[1]: Failed to start haproxy container.

* If Ansible is too slow at executing the systemd task, it'll fail with:

dict object' has no attribute 'status'

* A bunch of transient systemd service failures; maybe run systemctl reset-failed.

Revision history for this message
Emilien Macchi (emilienm) wrote :

So... I spent the day investigating it...
The issue we hit is in podman 1.5.1, which is now what is shipped in Train CI.

The issue was already encountered here: https://bugs.launchpad.net/tripleo/+bug/1856324

Now looking into why we get this version of Podman...
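
For reference, the Podman version on the node can be confirmed with standard commands (a quick check to run on the undercloud, not output taken from the job logs):

```shell
# Query the packaged and runtime Podman versions on the undercloud node.
rpm -q podman     # shows the installed RPM, e.g. a 1.5.1 build on Train CI
podman version    # reports the client version details
```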

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/726277
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=0c0f904ef56d23535fa4137d2ae3005a87ecb614
Submitter: Zuul
Branch: master

commit 0c0f904ef56d23535fa4137d2ae3005a87ecb614
Author: Emilien Macchi <email address hidden>
Date: Thu May 7 23:19:18 2020 -0400

    podman/systemd: relax the "until" condition

    On slow systems, it's possible that systemd takes more time than usual
    to execute a task from Ansible (e.g. service restart); so Ansible
    doesn't have yet the registered facts from systemd.

    To make sure that Ansible doesn't fail with:
    dict object' has no attribute 'status'

    We first check if status is defined.

    Change-Id: Ie73cecc115c87fe452a90892755a1df5b3d894a7
    Closes-Bug: #1877449
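
The shape of the change, as described in the commit message, is roughly the following. This is a sketch, not the exact task from tripleo-ansible; the task layout, retry counts, and variable names are assumptions:

```yaml
# Hypothetical sketch of the relaxed retry condition; names are assumed.
- name: Start or restart systemd services
  systemd:
    name: "tripleo_{{ container_sysd_name }}.service"
    state: restarted
    enabled: true
  register: systemd_service_enable
  retries: 5
  delay: 5
  # Before: systemd_service_enable.status.Result == "success"
  # After: guard against the status attribute not being registered yet,
  # so a slow systemd response triggers a retry instead of a hard failure.
  until: >-
    systemd_service_enable.status is defined and
    systemd_service_enable.status.Result == "success"
```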

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (stable/train)

Reviewed: https://review.opendev.org/726278
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=c8dc1a7829ff9ed7dfd101973bb2a0f3d33e2aee
Submitter: Zuul
Branch: stable/train

commit c8dc1a7829ff9ed7dfd101973bb2a0f3d33e2aee
Author: Emilien Macchi <email address hidden>
Date: Thu May 7 23:19:18 2020 -0400

    podman/systemd: relax the "until" condition

    On slow systems, it's possible that systemd takes more time than usual
    to execute a task from Ansible (e.g. service restart); so Ansible
    doesn't have yet the registered facts from systemd.

    To make sure that Ansible doesn't fail with:
    dict object' has no attribute 'status'

    We first check if status is defined.

    Change-Id: Ie73cecc115c87fe452a90892755a1df5b3d894a7
    Closes-Bug: #1877449
    (cherry picked from commit 0c0f904ef56d23535fa4137d2ae3005a87ecb614)

tags: added: in-stable-train
Revision history for this message
Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

The problem has moved but still persists:

2020-05-11 09:36:37 | TASK [tripleo-container-manage : Start or restart systemd services] ************
2020-05-11 09:36:37 | Monday 11 May 2020 09:36:37 +0000 (0:00:00.228) 0:23:42.966 ************
2020-05-11 09:36:38 | FAILED - RETRYING: Start or restart systemd services (5 retries left).
2020-05-11 09:36:43 | FAILED - RETRYING: Start or restart systemd services (4 retries left).
2020-05-11 09:36:48 | FAILED - RETRYING: Start or restart systemd services (3 retries left).
2020-05-11 09:36:54 | FAILED - RETRYING: Start or restart systemd services (2 retries left).
2020-05-11 09:36:59 | FAILED - RETRYING: Start or restart systemd services (1 retries left).
2020-05-11 09:37:04 | failed: [undercloud] (item=haproxy) => {"ansible_loop_var": "container_sysd_name", "attempts": 5, "changed": false, "container_sysd_name": "haproxy", "msg": "Unable to start service tripleo_haproxy.service: Job for tripleo_haproxy.service failed because start of the service was attempted too often. See \"systemctl status tripleo_haproxy.service\" and \"journalctl -xe\" for details.\nTo force a start use \"systemctl reset-failed tripleo_haproxy.service\" followed by \"systemctl start tripleo_haproxy.service\" again.\n"}
2020-05-11 09:37:05 | FAILED - RETRYING: Start or restart systemd services (5 retries left).
2020-05-11 09:37:10 | FAILED - RETRYING: Start or restart systemd services (4 retries left).
2020-05-11 09:37:16 | FAILED - RETRYING: Start or restart systemd services (3 retries left).
2020-05-11 09:37:21 | FAILED - RETRYING: Start or restart systemd services (2 retries left).
2020-05-11 09:37:27 | FAILED - RETRYING: Start or restart systemd services (1 retries left).
2020-05-11 09:37:32 | failed: [undercloud] (item=keepalived) => {"ansible_loop_var": "container_sysd_name", "attempts": 5, "changed": false, "container_sysd_name": "keepalived", "msg": "Unable to start service tripleo_keepalived.service: Job for tripleo_keepalived.service failed because start of the service was attempted too often. See \"systemctl status tripleo_keepalived.service\" and \"journalctl -xe\" for details.\nTo force a start use \"systemctl reset-failed tripleo_keepalived.service\" followed by \"systemctl start tripleo_keepalived.service\" again.\n"}
2020-05-11 09:37:33 | FAILED - RETRYING: Start or restart systemd services (5 retries left).
2020-05-11 09:37:38 | FAILED - RETRYING: Start or restart systemd services (4 retries left).
2020-05-11 09:37:44 | FAILED - RETRYING: Start or restart systemd services (3 retries left).
2020-05-11 09:37:49 | FAILED - RETRYING: Start or restart systemd services (2 retries left).
2020-05-11 09:37:55 | FAILED - RETRYING: Start or restart systemd services (1 retries left).
2020-05-11 09:38:00 | failed: [undercloud] (item=rabbitmq) => {"ansible_loop_var": "container_sysd_name", "attempts": 5, "changed": false, "container_sysd_name": "rabbitmq", "msg": "Unable to start service tripleo_rabbitmq.service: Job for tripleo_rabbitmq.service failed because start of the service was attempted too often. See \"systemctl status tripleo_rabbitmq.service\" and \"journalctl -xe\" for details.\nTo force a start use \"system...
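
The journal hint quoted in the failure message spells out the manual recovery. As a sketch for one of the affected units (to be run on the undercloud node):

```shell
# Inspect why the unit hit systemd's start rate limit, then clear the
# failed state and retry, as the error message above suggests.
systemctl status tripleo_haproxy.service
journalctl -xe -u tripleo_haproxy.service
systemctl reset-failed tripleo_haproxy.service
systemctl start tripleo_haproxy.service
```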


Revision history for this message
Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

So https://review.opendev.org/#/c/724782/ failed with the error mentioned above [1]; can we make tripleo-ci-centos-7-containerized-undercloud-upgrades non-voting in tripleo-upgrade for Stein too?

[1] https://zuul.opendev.org/t/openstack/build/540dd82f1d06487898478335a87cb988/log/logs/undercloud/home/zuul/undercloud_upgrade.log#5025

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-upgrade (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/726849

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-upgrade (stable/train)

Reviewed: https://review.opendev.org/726849
Committed: https://git.openstack.org/cgit/openstack/tripleo-upgrade/commit/?id=99fde33707fcd0e53b415eb8a47da2300f56f065
Submitter: Zuul
Branch: stable/train

commit 99fde33707fcd0e53b415eb8a47da2300f56f065
Author: Sofer Athlan-Guyot <email address hidden>
Date: Mon May 11 15:38:20 2020 +0200

    [train only] containerized-undercloud-upgrades -> NV.

    To be reverted when the related bug is closed.

    Change-Id: I8d6fab1f79223073403f4b265bdb25cda5a09f97
    Partial-Bug: #1877449

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-upgrade (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/730629

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-ansible 0.6.0

This issue was fixed in the openstack/tripleo-ansible 0.6.0 release.
