Deployment doesn't stop and fail when startup containers finish with RC1

Bug #1878074 reported by Emilien Macchi
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Emilien Macchi

Bug Description

We only verify the return code of the Puppet containers.
However, many containers just run a command (e.g. db sync) and then exit.
Right now we have no mechanism to catch an error when they terminate.

We should wait for them to finish and make sure they return an expected code, or fail otherwise, as we already do for the Puppet containers.
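
The behaviour requested above can be sketched in Python as follows. This is only an illustration of the idea, not the tripleo-ansible implementation: `wait_and_check` and the `inspect` callable are hypothetical names, the `Running` and `ExitCode` keys are assumed to mirror the container state a runtime would report, and the container names are made up.

```python
import time
from typing import Callable, Dict, Iterable, List, Sequence


def wait_and_check(
    names: Iterable[str],
    inspect: Callable[[str], Dict[str, object]],
    valid_exit_codes: Sequence[int] = (0,),
    retries: int = 10,
    delay: float = 1.0,
) -> List[str]:
    """Wait for each container to exit, then return a list of failure messages."""
    failures = []
    for name in names:
        for _ in range(retries):
            # `inspect` is a hypothetical callable returning the container state.
            state = inspect(name)
            if not state["Running"]:
                break
            time.sleep(delay)
        else:
            # The loop never hit `break`: the container did not finish in time.
            failures.append(f"{name}: still running after {retries} checks")
            continue
        if state["ExitCode"] not in valid_exit_codes:
            failures.append(f"{name}: exited with code {state['ExitCode']}")
    return failures


# Canned states standing in for real inspect output (illustrative names).
fake_states = {
    "nova_db_sync": {"Running": False, "ExitCode": 1},
    "keystone_db_sync": {"Running": False, "ExitCode": 0},
}
failures = wait_and_check(fake_states, fake_states.__getitem__, delay=0)
print(failures)
```

A caller would abort the deployment whenever the returned list is non-empty, which is the "fail otherwise" part of the request.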

Changed in tripleo:
milestone: none → ussuri-rc3
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/726927

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/726933

Changed in tripleo:
status: Triaged → In Progress
Changed in tripleo:
assignee: Emilien Macchi (emilienm) → Alex Schultz (alex-schultz)
Changed in tripleo:
assignee: Alex Schultz (alex-schultz) → Emilien Macchi (emilienm)
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/726933
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=504033868672580b3af2b9d1e5d771a8233add9d
Submitter: Zuul
Branch: master

commit 504033868672580b3af2b9d1e5d771a8233add9d
Author: Emilien Macchi <email address hidden>
Date: Mon May 11 14:54:15 2020 -0400

    container_manage: catch more containers with wrong return code

    - helpers/haskey: add an excluded_keys argument. It allows returning the
      configs that have a given attribute while excluding those that also
      have certain other attributes. The use case here is that some
      container configs have both "command" and "action". We want to use
      that filter to build a list of containers whose return code has to be
      checked, which is not the case for the containers with "action" in
      their configs, since they are used for "podman exec" configs (and
      there is nothing to check in the output of podman inspect).

    - check_exit_code: change the list of containers to check the exit code
      to include all the containers with a "command" but not "action".
      It should cover all the containers which are used to run some
      non-services things like db_sync etc.

    - molecule: change the fedora_bis and fedora_three containers to run
      short sleep so we can actually test that change against these
      containers and also on the first deployment of fedora_bis and
      fedora_three, we'll check their return code.

    Change-Id: I466a57bd788e02c32b1efb0ac0223684f0d39393
    Closes-Bug: #1878074
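
The filtering described in the first two bullets of the commit message can be illustrated with a simplified Python stand-in. It is not the actual helpers/haskey code: `has_key` is a hypothetical function, the configs are assumed to be a plain mapping of container name to config dict, and the sample container names and values are invented.

```python
from typing import Any, Dict, Iterable

ContainerConfigs = Dict[str, Dict[str, Any]]


def has_key(
    configs: ContainerConfigs,
    attribute: str,
    excluded_keys: Iterable[str] = (),
) -> ContainerConfigs:
    """Keep configs that define `attribute` and none of `excluded_keys`."""
    excluded = tuple(excluded_keys)
    return {
        name: config
        for name, config in configs.items()
        if attribute in config and not any(key in config for key in excluded)
    }


# Illustrative configs: a one-shot command, an exec-style config, a service.
configs = {
    "nova_db_sync": {"image": "nova-api", "command": "nova-manage db sync"},
    "nova_wait": {"action": "exec", "command": ["nova_api", "true"]},
    "nova_api": {"image": "nova-api"},
}
to_check = has_key(configs, "command", excluded_keys=("action",))
print(sorted(to_check))
```

Only the container that has "command" and no "action" remains, so it is the only one whose exit code would be checked.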

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/728552

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (stable/train)

Reviewed: https://review.opendev.org/728552
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=8150aba8a14157223deb460c848c1a86f7c76e61
Submitter: Zuul
Branch: stable/train

commit 8150aba8a14157223deb460c848c1a86f7c76e61
Author: Emilien Macchi <email address hidden>
Date: Mon May 11 14:54:15 2020 -0400

    container_manage: catch more containers with wrong return code

    - helpers/haskey: add an excluded_keys argument. It allows returning the
      configs that have a given attribute while excluding those that also
      have certain other attributes. The use case here is that some
      container configs have both "command" and "action". We want to use
      that filter to build a list of containers whose return code has to be
      checked, which is not the case for the containers with "action" in
      their configs, since they are used for "podman exec" configs (and
      there is nothing to check in the output of podman inspect).

    - check_exit_code: change the list of containers to check the exit code
      to include all the containers with a "command" but not "action".
      It should cover all the containers which are used to run some
      non-services things like db_sync etc.

    - molecule: change the fedora_bis and fedora_three containers to run
      short sleep so we can actually test that change against these
      containers and also on the first deployment of fedora_bis and
      fedora_three, we'll check their return code.

    Change-Id: I466a57bd788e02c32b1efb0ac0223684f0d39393
    Closes-Bug: #1878074
    (cherry picked from commit 504033868672580b3af2b9d1e5d771a8233add9d)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/729234

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/726927
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=de4dc46ea801e6f868d07cd513977cd9992cbe81
Submitter: Zuul
Branch: master

commit de4dc46ea801e6f868d07cd513977cd9992cbe81
Author: Emilien Macchi <email address hidden>
Date: Mon May 11 14:51:26 2020 -0400

    Configure valid_exit_code for startup containers

    Containers managed during the deploy tasks can also fail after being
    run (e.g. nova db sync); right now we don't catch that, and the
    deployment keeps moving on.

    We catch errors from puppet containers but not from startup
    containers.
    This patch is a first step in that direction: for containers that
    have exited, only 0 is accepted as a valid return code.

    A patch in tripleo-ansible will follow to check these containers,
    which must have "command" in their configs.

    Change-Id: I43e42df53b10fc99ca8e0fd8d7a30768e895e91f
    Related-Bug: #1878074
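
As a rough illustration of what accepting only 0 means for the deployment, the Python sketch below stops with an error as soon as any exited container reports another code. The `valid_exit_code` name comes from the commit title; treating it as a sequence of integers, along with `ContainerExitError`, `assert_valid_exit_codes` and the container names, is an assumption made for this example.

```python
from typing import Dict, Sequence


class ContainerExitError(RuntimeError):
    """Raised when an exited container returned an unexpected code."""


def assert_valid_exit_codes(
    exit_codes: Dict[str, int],
    valid_exit_code: Sequence[int] = (0,),  # assumed shape: list of accepted codes
) -> None:
    """Raise if any container exited with a code outside `valid_exit_code`."""
    bad = {name: code for name, code in exit_codes.items() if code not in valid_exit_code}
    if bad:
        details = ", ".join(f"{name}={code}" for name, code in sorted(bad.items()))
        raise ContainerExitError(f"unexpected exit codes: {details}")


try:
    # Illustrative exit codes for two startup containers.
    assert_valid_exit_codes({"nova_db_sync": 1, "glance_db_sync": 0})
except ContainerExitError as err:
    message = str(err)
    print(message)
```

Raising instead of logging is the point of the change: the deployment should stop rather than keep moving on.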

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/731054

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/729234
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=b4dffb9428465dd659ab8074e537af2230a65570
Submitter: Zuul
Branch: stable/train

commit b4dffb9428465dd659ab8074e537af2230a65570
Author: Emilien Macchi <email address hidden>
Date: Mon May 11 14:51:26 2020 -0400

    Configure valid_exit_code for startup containers

    Containers managed during the deploy tasks can also fail after being
    run (e.g. nova db sync); right now we don't catch that, and the
    deployment keeps moving on.

    We catch errors from puppet containers but not from startup
    containers.
    This patch is a first step in that direction: for containers that
    have exited, only 0 is accepted as a valid return code.

    A patch in tripleo-ansible will follow to check these containers,
    which must have "command" in their configs.

    Change-Id: I43e42df53b10fc99ca8e0fd8d7a30768e895e91f
    Related-Bug: #1878074
    (cherry picked from commit de4dc46ea801e6f868d07cd513977cd9992cbe81)

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/ussuri)

Reviewed: https://review.opendev.org/731054
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=4ab32733a16e4661181abd59c7cad9f1a69288ca
Submitter: Zuul
Branch: stable/ussuri

commit 4ab32733a16e4661181abd59c7cad9f1a69288ca
Author: Emilien Macchi <email address hidden>
Date: Mon May 11 14:51:26 2020 -0400

    Configure valid_exit_code for startup containers

    Containers managed during the deploy tasks can also fail after being
    run (e.g. nova db sync); right now we don't catch that, and the
    deployment keeps moving on.

    We catch errors from puppet containers but not from startup
    containers.
    This patch is a first step in that direction: for containers that
    have exited, only 0 is accepted as a valid return code.

    A patch in tripleo-ansible will follow to check these containers,
    which must have "command" in their configs.

    Change-Id: I43e42df53b10fc99ca8e0fd8d7a30768e895e91f
    Related-Bug: #1878074
    (cherry picked from commit de4dc46ea801e6f868d07cd513977cd9992cbe81)

tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-ansible 0.6.0

This issue was fixed in the openstack/tripleo-ansible 0.6.0 release.
