install_tempest randomly fails in CI

Bug #1923728 reported by Gregory Thiemonge
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
devstack
Fix Released
High
Unassigned

Bug Description

It looks like the install_tempest task has recently started failing in many jobs:

In an Octavia job for instance:

https://zuul.opendev.org/t/openstack/build/a2537ed31b314349a156fc411648b5d1

Installation fails with:

++ lib/tempest:install_tempest:725 : PROJECT_VENV["tempest"]=/opt/stack/tempest/.tox/tempest
++ lib/tempest:install_tempest:726 : rm -f /tmp/tempest_u_c_m.tqXALtSUxe
++ lib/tempest:install_tempest:727 : popd
~/devstack
[Async install_tempest:22905]: finished install_tempest with result 127 in 43 seconds
Stopping async wait due to error: install_tempest
++ inc/async:async_wait:166 : return 127
+ extras.d/80-tempest.sh:source:1 : exit_trap

Using logstash/kibana, many jobs are affected (select "Last 30d"):

http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22finished%20install_tempest%20with%20result%20127%5C%22

I think this recent commit in devstack introduced this regression: "Default to parallel execution"
https://review.opendev.org/c/openstack/devstack/+/783311

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Uh-oh, seems the testing was not thorough enough. I wonder if Dan would be able to debug this.

Changed in devstack:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

It seems that all failing runs have the following warning/error line (wait: pid XXX is not a child of this shell):

2021-04-14 00:59:12.086309 | controller | ++ extras.d/80-tempest.sh:source:20 : async_wait install_tempest
2021-04-14 00:59:12.146380 | controller | [Async install_tempest:22905]: Waiting for completion of install_tempest (1 other jobs running)
2021-04-14 00:59:12.149490 | controller | /opt/stack/devstack/inc/async: line 134: wait: pid 22905 is not a child of this shell
2021-04-14 00:59:12.154288 | controller | ++ lib/tempest:install_tempest:705 : git_clone https://opendev.org/openstack/tempest.git /opt/stack/tempest master

I haven't seen that line in successful runs

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to devstack (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/devstack/+/786250

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to devstack (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/devstack/+/786330

Changed in devstack:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to devstack (master)

Reviewed: https://review.opendev.org/c/openstack/devstack/+/786330
Committed: https://opendev.org/openstack/devstack/commit/aa5c38727b314b03cd7ab69612435aa206bd5e2c
Submitter: "Zuul (22348)"
Branch: master

commit aa5c38727b314b03cd7ab69612435aa206bd5e2c
Author: Dan Smith <email address hidden>
Date: Wed Apr 14 14:27:32 2021 -0700

    Work around CHILD_MAX bash limitation for async

    Apparently bash (via POSIX) only guarantees a small (32ish) number of
    children can be started and their statuses retrieved at any given
    point. On larger jobs with lots of plugins and additional work, we
    may go over that limit, especially for long-lived children, such
    as the install_tempest task.

    This works around that issue by creating a fifo for each child at
    spawn time. When the child is complete, it will block on a read
    against that fifo (and thus not exit). When the parent goes to wait
    on the child, it first writes to that fifo, unblocking the child so
    that it can exit near the time we go to wait.

    Closes-Bug: #1923728
    Change-Id: Id755bdb1e7f1664ec08742d034c174e87a3d2902

Changed in devstack:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to devstack (master)

Reviewed: https://review.opendev.org/c/openstack/devstack/+/786250
Committed: https://opendev.org/openstack/devstack/commit/51e384554b4653a05abea435432431cdca4728fb
Submitter: "Zuul (22348)"
Branch: master

commit 51e384554b4653a05abea435432431cdca4728fb
Author: Dan Smith <email address hidden>
Date: Wed Apr 14 07:23:10 2021 -0700

    Add some debug to async_wait failures

    This dumps some data in the case where we fail to wait for a child
    pid to help debug what is going on. This also cleans up a few review
    comments from the actual fix.

    Change-Id: I7b58ce0cf2b41bdffa448973edb4c992fe5f730c
    Related-Bug: #1923728

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.