centos-7-containerized-undercloud-upgrades fails in "Start or restart systemd services"

Bug #1877449 reported by wes hayutin
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Emilien Macchi

Bug Description

2020-05-07 19:41:18 | TASK [tripleo-container-manage : Start or restart systemd services] ************
2020-05-07 19:41:18 | Thursday 07 May 2020 19:41:18 +0000 (0:00:00.259) 0:27:29.574 **********
2020-05-07 19:41:19 | fatal: [undercloud]: FAILED! => {"msg": "The conditional check 'systemd_service_enable.status.Result == \"success\"' failed. The error was: error while evaluating conditional (systemd_service_enable.status.Result == \"success\"): 'dict object' has no attribute 'status'"}

https://da888ef809002d84bac8-a2fbaff4aeaa6e4af27d13483a4b50b0.ssl.cf1.rackcdn.com/726125/1/check/tripleo-ci-centos-7-containerized-undercloud-upgrades/1c50a18/logs/undercloud/home/zuul/undercloud_upgrade.log

https://66f2724215a8c9ca9bcc-67113804333a37bb92970c493e5932c8.ssl.cf5.rackcdn.com/726209/1/check/tripleo-ci-centos-7-containerized-undercloud-upgrades/7018864/logs/undercloud/home/zuul/undercloud_upgrade.log
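
The traceback blows up because the registered variable has no `status` attribute yet when the conditional is evaluated. A minimal Python analogue (illustrative only, not TripleO code; the dict shapes are assumptions mirroring the registered task result) shows the same failure mode and the defensive check:

```python
# Illustrative analogue of the Ansible "until" condition; the variable
# names mirror the task's registered result, but the dicts are made up here.
def strict_check(result):
    # Mirrors: systemd_service_enable.status.Result == "success"
    # Raises if "status" has not been registered yet.
    return result["status"]["Result"] == "success"

def relaxed_check(result):
    # Mirrors the eventual fix: first check that "status" is defined.
    return "status" in result and result["status"].get("Result") == "success"

# Before systemd has reported back, the registered result has no "status" key.
early = {"changed": False}
late = {"changed": True, "status": {"Result": "success"}}

try:
    strict_check(early)
except KeyError as exc:
    # Analogous to: 'dict object' has no attribute 'status'
    print(f"strict check failed early: missing key {exc}")

print(relaxed_check(early))  # safe: returns False so the task can retry
print(relaxed_check(late))   # returns True once the service has started
```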

=========== ERROR HERE ================================================
https://da888ef809002d84bac8-a2fbaff4aeaa6e4af27d13483a4b50b0.ssl.cf1.rackcdn.com/726125/1/check/tripleo-ci-centos-7-containerized-undercloud-upgrades/1c50a18/logs/undercloud/var/log/extra/journal.txt

May 07 19:41:19 undercloud.localdomain systemd[1]: Starting haproxy container...
May 07 19:41:19 undercloud.localdomain podman[200998]: Error: unable to find container haproxy: no container with name or ID haproxy found: no such container
May 07 19:41:19 undercloud.localdomain systemd[1]: tripleo_haproxy.service: control process exited, code=exited status=125
May 07 19:41:19 undercloud.localdomain systemd[1]: Failed to start haproxy container.
May 07 19:41:19 undercloud.localdomain systemd[1]: Unit tripleo_haproxy.service entered failed state.
May 07 19:41:19 undercloud.localdomain systemd[1]: tripleo_haproxy.service failed.

Revision history for this message
wes hayutin (weshayutin) wrote :

192.168.24.1:8787/tripleostein/centos-binary-haproxy 687919e2bdd9e558da8af67c434a9c0e068aa4e5_9aca342f 540d9c702b25 10 days ago 692 MB

listed in https://da888ef809002d84bac8-a2fbaff4aeaa6e4af27d13483a4b50b0.ssl.cf1.rackcdn.com/726125/1/check/tripleo-ci-centos-7-containerized-undercloud-upgrades/1c50a18/logs/undercloud/var/log/extra/podman/podman_allinfo.log

Changed in tripleo:
assignee: nobody → Emilien Macchi (emilienm)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/726277

Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/726278

Revision history for this message
Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

Hi,

Considering this and https://bugs.launchpad.net/tripleo/+bug/1876893, I think it would make sense to add tripleo-ci-centos-7-containerized-undercloud-upgrades to the list of jobs for tripleo-ansible. WDYT?

Revision history for this message
Emilien Macchi (emilienm) wrote :

So I found a few issues by investigating that patch:

* systemd fails to start haproxy container because it can't find it:

May 08 06:03:45 undercloud.localdomain podman[222695]: Error: unable to find container haproxy: no container with name or ID haproxy found: no such container
May 08 06:03:45 undercloud.localdomain systemd[1]: tripleo_haproxy.service: control process exited, code=exited status=125
May 08 06:03:45 undercloud.localdomain systemd[1]: Failed to start haproxy container.

* If Ansible is too slow at executing the systemd task, it'll fail with:

dict object' has no attribute 'status'

* A bunch of transient systemd service failures; maybe run systemctl reset-failed.

Revision history for this message
Emilien Macchi (emilienm) wrote :

So... I spent the day investigating it...
The issue we hit is in podman 1.5.1, which is now what is shipped in Train CI.

The issue was already encountered here: https://bugs.launchpad.net/tripleo/+bug/1856324

Now looking into why we get this version of Podman...
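
For reference, the Podman version on the node can be confirmed with standard commands (a quick check to run on the undercloud, not output taken from the job logs):

```shell
# Query the packaged and runtime Podman versions on the undercloud node.
rpm -q podman     # shows the installed RPM, e.g. a 1.5.1 build on Train CI
podman version    # reports the client version details
```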

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/726277
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=0c0f904ef56d23535fa4137d2ae3005a87ecb614
Submitter: Zuul
Branch: master

commit 0c0f904ef56d23535fa4137d2ae3005a87ecb614
Author: Emilien Macchi <email address hidden>
Date: Thu May 7 23:19:18 2020 -0400

    podman/systemd: relax the "until" condition

    On slow systems, it's possible that systemd takes more time than usual
    to execute a task from Ansible (e.g. service restart); so Ansible
    doesn't have yet the registered facts from systemd.

    To make sure that Ansible doesn't fail with:
    dict object' has no attribute 'status'

    We first check if status is defined.

    Change-Id: Ie73cecc115c87fe452a90892755a1df5b3d894a7
    Closes-Bug: #1877449
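
The shape of the change, as described in the commit message, is roughly the following. This is a sketch, not the exact task from tripleo-ansible; the task layout, retry counts, and variable names are assumptions:

```yaml
# Hypothetical sketch of the relaxed retry condition; names are assumed.
- name: Start or restart systemd services
  systemd:
    name: "tripleo_{{ container_sysd_name }}.service"
    state: restarted
    enabled: true
  register: systemd_service_enable
  retries: 5
  delay: 5
  # Before: systemd_service_enable.status.Result == "success"
  # After: guard against the status attribute not being registered yet,
  # so a slow systemd response triggers a retry instead of a hard failure.
  until: >-
    systemd_service_enable.status is defined and
    systemd_service_enable.status.Result == "success"
```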

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (stable/train)

Reviewed: https://review.opendev.org/726278
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=c8dc1a7829ff9ed7dfd101973bb2a0f3d33e2aee
Submitter: Zuul
Branch: stable/train

commit c8dc1a7829ff9ed7dfd101973bb2a0f3d33e2aee
Author: Emilien Macchi <email address hidden>
Date: Thu May 7 23:19:18 2020 -0400

    podman/systemd: relax the "until" condition

    On slow systems, it's possible that systemd takes more time than usual
    to execute a task from Ansible (e.g. service restart); so Ansible
    doesn't have yet the registered facts from systemd.

    To make sure that Ansible doesn't fail with:
    dict object' has no attribute 'status'

    We first check if status is defined.

    Change-Id: Ie73cecc115c87fe452a90892755a1df5b3d894a7
    Closes-Bug: #1877449
    (cherry picked from commit 0c0f904ef56d23535fa4137d2ae3005a87ecb614)

tags: added: in-stable-train
Revision history for this message
Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

The problem has moved but still persists:

2020-05-11 09:36:37 | TASK [tripleo-container-manage : Start or restart systemd services] ************
2020-05-11 09:36:37 | Monday 11 May 2020 09:36:37 +0000 (0:00:00.228) 0:23:42.966 ************
2020-05-11 09:36:38 | FAILED - RETRYING: Start or restart systemd services (5 retries left).
2020-05-11 09:36:43 | FAILED - RETRYING: Start or restart systemd services (4 retries left).
2020-05-11 09:36:48 | FAILED - RETRYING: Start or restart systemd services (3 retries left).
2020-05-11 09:36:54 | FAILED - RETRYING: Start or restart systemd services (2 retries left).
2020-05-11 09:36:59 | FAILED - RETRYING: Start or restart systemd services (1 retries left).
2020-05-11 09:37:04 | failed: [undercloud] (item=haproxy) => {"ansible_loop_var": "container_sysd_name", "attempts": 5, "changed": false, "container_sysd_name": "haproxy", "msg": "Unable to start service tripleo_haproxy.service: Job for tripleo_haproxy.service failed because start of the service was attempted too often. See \"systemctl status tripleo_haproxy.service\" and \"journalctl -xe\" for details.\nTo force a start use \"systemctl reset-failed tripleo_haproxy.service\" followed by \"systemctl start tripleo_haproxy.service\" again.\n"}
2020-05-11 09:37:05 | FAILED - RETRYING: Start or restart systemd services (5 retries left).
2020-05-11 09:37:10 | FAILED - RETRYING: Start or restart systemd services (4 retries left).
2020-05-11 09:37:16 | FAILED - RETRYING: Start or restart systemd services (3 retries left).
2020-05-11 09:37:21 | FAILED - RETRYING: Start or restart systemd services (2 retries left).
2020-05-11 09:37:27 | FAILED - RETRYING: Start or restart systemd services (1 retries left).
2020-05-11 09:37:32 | failed: [undercloud] (item=keepalived) => {"ansible_loop_var": "container_sysd_name", "attempts": 5, "changed": false, "container_sysd_name": "keepalived", "msg": "Unable to start service tripleo_keepalived.service: Job for tripleo_keepalived.service failed because start of the service was attempted too often. See \"systemctl status tripleo_keepalived.service\" and \"journalctl -xe\" for details.\nTo force a start use \"systemctl reset-failed tripleo_keepalived.service\" followed by \"systemctl start tripleo_keepalived.service\" again.\n"}
2020-05-11 09:37:33 | FAILED - RETRYING: Start or restart systemd services (5 retries left).
2020-05-11 09:37:38 | FAILED - RETRYING: Start or restart systemd services (4 retries left).
2020-05-11 09:37:44 | FAILED - RETRYING: Start or restart systemd services (3 retries left).
2020-05-11 09:37:49 | FAILED - RETRYING: Start or restart systemd services (2 retries left).
2020-05-11 09:37:55 | FAILED - RETRYING: Start or restart systemd services (1 retries left).
2020-05-11 09:38:00 | failed: [undercloud] (item=rabbitmq) => {"ansible_loop_var": "container_sysd_name", "attempts": 5, "changed": false, "container_sysd_name": "rabbitmq", "msg": "Unable to start service tripleo_rabbitmq.service: Job for tripleo_rabbitmq.service failed because start of the service was attempted too often. See \"systemctl status tripleo_rabbitmq.service\" and \"journalctl -xe\" for details.\nTo force a start use \"system...
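
The journal hint quoted in the failure message spells out the manual recovery. As a sketch for one of the affected units (to be run on the undercloud node):

```shell
# Inspect why the unit hit systemd's start rate limit, then clear the
# failed state and retry, as the error message above suggests.
systemctl status tripleo_haproxy.service
journalctl -xe -u tripleo_haproxy.service
systemctl reset-failed tripleo_haproxy.service
systemctl start tripleo_haproxy.service
```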


Revision history for this message
Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

So https://review.opendev.org/#/c/724782/ failed with the error mentioned above [1]; can we make tripleo-ci-centos-7-containerized-undercloud-upgrades non-voting in tripleo-upgrade for Stein too?

[1] https://zuul.opendev.org/t/openstack/build/540dd82f1d06487898478335a87cb988/log/logs/undercloud/home/zuul/undercloud_upgrade.log#5025

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-upgrade (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/726849

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-upgrade (stable/train)

Reviewed: https://review.opendev.org/726849
Committed: https://git.openstack.org/cgit/openstack/tripleo-upgrade/commit/?id=99fde33707fcd0e53b415eb8a47da2300f56f065
Submitter: Zuul
Branch: stable/train

commit 99fde33707fcd0e53b415eb8a47da2300f56f065
Author: Sofer Athlan-Guyot <email address hidden>
Date: Mon May 11 15:38:20 2020 +0200

    [train only] containerized-undercloud-upgrades -> NV.

    To be reverted when the related bug is closed.

    Change-Id: I8d6fab1f79223073403f4b265bdb25cda5a09f97
    Partial-Bug: #1877449

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-upgrade (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/730629

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-ansible 0.6.0

This issue was fixed in the openstack/tripleo-ansible 0.6.0 release.
