MaxFailPercentage: undercloud can be included for an overcloud deploy failure

Bug #1889212 reported by Emilien Macchi
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
tripleo | Triaged | Medium | Emilien Macchi |

Bug Description

When MaxFailPercentage is set to a given percentage, we tolerate that share of overcloud nodes failing during the deployment.
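
As a rough illustration of that tolerance, the check can be sketched as below. This assumes the threshold behaves like Ansible's max_fail_percentage, where the run is aborted only once the failed share exceeds the value (equal is still tolerated). The function name and the way node counts are passed in are hypothetical and are not taken from TripleO.

```python
def exceeds_max_fail(failed: int, total: int, max_fail_percentage: float) -> bool:
    """Return True when the share of failed nodes is above the tolerated percentage.

    Assumption: the threshold must be exceeded, not merely reached.
    """
    if total <= 0:
        raise ValueError("total must be a positive node count")
    if failed < 0 or failed > total:
        raise ValueError("failed must be between 0 and total")
    # Compare failed/total*100 > max_fail_percentage without dividing,
    # so that integer inputs are not affected by floating point rounding.
    return failed * 100 > max_fail_percentage * total
```

With 10 compute nodes and a value of 10, one failed node is tolerated and two are not.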

However, some playbooks are executed from the Undercloud, so if an overcloud node is down, the playbook reports the error against the Undercloud node.

Example:

FATAL | Discovering nova hosts | undercloud -> 192.168.24.18 | error={"changed": false, "cmd": ["podman", "exec", "nova_compute", "nova-manage", "cell_v2", "discover_hosts", "--by-service"], "delta": "0:00:00.223708", "end": "2020-07-27 22:22:26.422824", "msg": "non-zero return code", "rc": 125, "start": "2020-07-27 22:22:26.199116", "stderr": "Error: no container with name or ID nova_compute found: no such container", "stderr_lines": ["Error: no container with name or ID nova_compute found: no such container"], "stdout": "", "stdout_lines": []}

192.168.24.18 is the compute node that is down.
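
To make the attribution problem concrete, here is a minimal sketch that splits the host field of a line shaped like the one above into the host the failure is reported against and the address the task actually targeted. It assumes the fields are separated by " | " and that the task name itself contains no such separator; both function names are hypothetical helpers, not part of Ansible or TripleO.

```python
from typing import Optional, Tuple


def host_field_from_fatal(line: str) -> str:
    """Return the host field of a 'FATAL | task | host | error=...' line."""
    # Assumed layout: status, task name, host, error payload.
    parts = line.split(" | ", 3)
    if len(parts) < 4 or parts[0] != "FATAL":
        raise ValueError("not a FATAL line in the expected layout")
    return parts[2]


def split_host_field(host_field: str) -> Tuple[str, Optional[str]]:
    """Split 'reporting -> target' into (reporting host, target or None)."""
    reporting, arrow, target = host_field.partition("->")
    return reporting.strip(), (target.strip() if arrow else None)
```

For the example line, the reporting host is "undercloud" and the target is "192.168.24.18", which is why the failure ends up counted against the Undercloud.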

The deployment then prints this confusing summary:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ State Information ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~ Number of nodes which did not deploy successfully: 2 ~~~~~~~~~~~~~~~~~
 This or these node(s) failed to deploy: overcloud-novacompute-0, undercloud
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When deploying the overcloud, we should not count the source deploy host as a failed node when a playbook run from it fails, and we should make sure it does not appear in the state information.
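
The requested behaviour could be sketched as a simple filter over the list of failed hosts. This is only an illustration of the idea, not the fix that was merged: the function name is hypothetical, and the deploy host is assumed to appear in the failed list under the inventory name "undercloud", as in the summary above.

```python
from typing import Iterable, List


def overcloud_failures(failed_hosts: Iterable[str],
                       deploy_host: str = "undercloud") -> List[str]:
    """Drop the deploy host from the failed hosts, keeping the original order."""
    return [host for host in failed_hosts if host != deploy_host]


# Usage with the two hosts from the state information above.
failed = overcloud_failures(["overcloud-novacompute-0", "undercloud"])
```

Only the filtered list would then be counted against MaxFailPercentage and shown in the state information.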

Changed in tripleo:
milestone: none → victoria-1
milestone: victoria-1 → victoria-2
importance: Undecided → Medium
status: New → Triaged
Changed in tripleo:
milestone: victoria-2 → victoria-3
Changed in tripleo:
assignee: nobody → Emilien Macchi (emilienm)
tags: added: train-backport-potential ussuri-backport-potential
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ansible (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/743549

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to python-tripleoclient (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/743556

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/743549
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=f1969830e095401040be66bf245d91d20a08b221
Submitter: Zuul
Branch: master

commit f1969830e095401040be66bf245d91d20a08b221
Author: Emilien Macchi <email address hidden>
Date: Tue Jul 28 10:04:07 2020 -0400

    tripleo_states: change wording

    Change the wording to replace "This or these node(s) failed to deploy"
    by "The following node(s) had failures:"; failures can happen at a
    different level (not necessarily deploy). Update the wording to avoid
    any confusion.

    Change-Id: I80041738df05dbe0da678efa91e861390ad4657e
    Related-Bug: #1889212

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ansible (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/743775

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ansible (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/743776

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-ansible (stable/ussuri)

Reviewed: https://review.opendev.org/743775
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=584add0ca2a3ee3ac2e9de807de0766ff9a72381
Submitter: Zuul
Branch: stable/ussuri

commit 584add0ca2a3ee3ac2e9de807de0766ff9a72381
Author: Emilien Macchi <email address hidden>
Date: Tue Jul 28 10:04:07 2020 -0400

    tripleo_states: change wording

    Change the wording to replace "This or these node(s) failed to deploy"
    by "The following node(s) had failures:"; failures can happen at a
    different level (not necessarily deploy). Update the wording to avoid
    any confusion.

    Change-Id: I80041738df05dbe0da678efa91e861390ad4657e
    Related-Bug: #1889212
    (cherry picked from commit f1969830e095401040be66bf245d91d20a08b221)

tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-ansible (stable/train)

Reviewed: https://review.opendev.org/743776
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=8d8de17fedac73ee6804e5f2e9a2e22ca30aaf78
Submitter: Zuul
Branch: stable/train

commit 8d8de17fedac73ee6804e5f2e9a2e22ca30aaf78
Author: Emilien Macchi <email address hidden>
Date: Tue Jul 28 10:04:07 2020 -0400

    tripleo_states: change wording

    Change the wording to replace "This or these node(s) failed to deploy"
    by "The following node(s) had failures:"; failures can happen at a
    different level (not necessarily deploy). Update the wording to avoid
    any confusion.

    Change-Id: I80041738df05dbe0da678efa91e861390ad4657e
    Related-Bug: #1889212
    (cherry picked from commit f1969830e095401040be66bf245d91d20a08b221)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Related fix merged to python-tripleoclient (master)

Reviewed: https://review.opendev.org/743556
Committed: https://git.openstack.org/cgit/openstack/python-tripleoclient/commit/?id=9dec1b2e334e110b03f951c9fd3480f6c3dc8e11
Submitter: Zuul
Branch: master

commit 9dec1b2e334e110b03f951c9fd3480f6c3dc8e11
Author: Emilien Macchi <email address hidden>
Date: Tue Jul 28 10:35:14 2020 -0400

    overcloud_deploy: move horizon url/rc files before config-download

    When a deployment fails, we run the playbooks to generate horizon URL &
    RC files anyway. However it is confusing to have them at the end, after
    the actual trace and an operator with a small screen won't see the
    actual errors easily.

    Let's just move these actions before the config download execution,
    which has no impact anyway; but will improve logging a lot.

    Change-Id: I70bbc40f8e5eb709d9f0f608e936a818e082918b
    Related-Bug: #1889212

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to python-tripleoclient (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/747075

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to python-tripleoclient (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/747076

OpenStack Infra (hudson-openstack) wrote : Related fix merged to python-tripleoclient (stable/ussuri)

Reviewed: https://review.opendev.org/747075
Committed: https://git.openstack.org/cgit/openstack/python-tripleoclient/commit/?id=1ffc56ed97ec8878a255c96baa16652ef825cf9d
Submitter: Zuul
Branch: stable/ussuri

commit 1ffc56ed97ec8878a255c96baa16652ef825cf9d
Author: Emilien Macchi <email address hidden>
Date: Tue Jul 28 10:35:14 2020 -0400

    overcloud_deploy: move horizon url/rc files before config-download

    When a deployment fails, we run the playbooks to generate horizon URL &
    RC files anyway. However it is confusing to have them at the end, after
    the actual trace and an operator with a small screen won't see the
    actual errors easily.

    Let's just move these actions before the config download execution,
    which has no impact anyway; but will improve logging a lot.

    Change-Id: I70bbc40f8e5eb709d9f0f608e936a818e082918b
    Related-Bug: #1889212
    (cherry picked from commit 9dec1b2e334e110b03f951c9fd3480f6c3dc8e11)

OpenStack Infra (hudson-openstack) wrote : Related fix merged to python-tripleoclient (stable/train)

Reviewed: https://review.opendev.org/747076
Committed: https://git.openstack.org/cgit/openstack/python-tripleoclient/commit/?id=fa0129963b463d7d3da3e7950c92f25d449cf0f7
Submitter: Zuul
Branch: stable/train

commit fa0129963b463d7d3da3e7950c92f25d449cf0f7
Author: Emilien Macchi <email address hidden>
Date: Tue Jul 28 10:35:14 2020 -0400

    overcloud_deploy: move horizon url/rc files before config-download

    Note: this is an unclean backport.

    When a deployment fails, we run the playbooks to generate horizon URL &
    RC files anyway. However it is confusing to have them at the end, after
    the actual trace and an operator with a small screen won't see the
    actual errors easily.

    Let's just move these actions before the config download execution,
    which has no impact anyway; but will improve logging a lot.

    Change-Id: I70bbc40f8e5eb709d9f0f608e936a818e082918b
    Related-Bug: #1889212
    (cherry picked from commit 9dec1b2e334e110b03f951c9fd3480f6c3dc8e11)

Changed in tripleo:
milestone: victoria-3 → wallaby-1
Changed in tripleo:
milestone: wallaby-1 → wallaby-2
Changed in tripleo:
milestone: wallaby-2 → wallaby-3
Changed in tripleo:
milestone: wallaby-3 → wallaby-rc1
Changed in tripleo:
milestone: wallaby-rc1 → xena-1
Changed in tripleo:
milestone: xena-1 → xena-2
Changed in tripleo:
milestone: xena-2 → xena-3