Introspection failures are ignored

Bug #1733303 reported by Dmitry Tantsur
This bug affects 1 person
Affects   Status         Importance   Assigned to       Milestone
tripleo   Fix Released   Critical     Dougal Matthews

Bug Description

Introspection is part of the default flow and must be tested by the CI. At least two feature sets are supposed to run it (see https://docs.openstack.org/tripleo-quickstart/latest/feature-configuration.html), but they don't, and there are no traces of introspection in https://github.com/openstack-infra/tripleo-ci/blob/master/toci-quickstart/playbooks/ovb.yml.

UPD: introspection is carefully hidden in the unrelated overcloud-prep-images playbook. The bug is even worse, though: the jobs report success even for failed introspections, see http://logs.openstack.org/00/521100/1/check-tripleo/legacy-tripleo-ci-centos-7-ovb-ha-oooq/c14746b/logs/undercloud/home/zuul/overcloud_prep_images.log.txt.gz#_2017-11-18_12_02_32

Dmitry Tantsur (divius)
summary: - Introspection is no longer run in the CI
+ Introspection failures are ignored
description: updated
Attila Darazs (adarazs) wrote :

I see that the script that runs the introspection is run with "set -eux":

https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/overcloud-prep-images/templates/overcloud-prep-images.sh.j2#L3

Is it the expected behavior for "openstack overcloud node introspect --all-manageable" to exit with zero when the introspection fails?

https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/overcloud-prep-images/templates/overcloud-prep-images.sh.j2#L142
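
To illustrate why "set -e" alone cannot help here, the following is a minimal Python sketch, not the actual CI script. It assumes a hypothetical stand-in command (FAKE_CLI) that prints a failure message but exits with status 0; `check=True` plays a role comparable to "set -e", raising only on a non-zero exit status.

```python
import subprocess
import sys

# Hypothetical stand-in for a CLI that reports a failure in its output
# but still exits with status 0. It is not the real openstack client.
FAKE_CLI = "print('Retry limit reached with 4 nodes still failing introspection')"


def run_checked(code: str) -> subprocess.CompletedProcess:
    """Run a child Python process; raise only on a non-zero exit status."""
    return subprocess.run(
        [sys.executable, "-c", code],
        check=True,            # comparable in spirit to `set -e`
        capture_output=True,
        text=True,
    )


# No exception is raised: the failure text is in the output,
# but the exit status is 0, so the caller sees success.
result = run_checked(FAKE_CLI)

# A child that really exits non-zero is caught by the same wrapper.
try:
    run_checked("import sys; sys.exit(1)")
    caught = False
except subprocess.CalledProcessError:
    caught = True
```

The wrapper only ever sees the exit status, so any fix has to make the command itself exit non-zero on failure.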

Changed in tripleo:
status: Confirmed → Triaged
Dmitry Tantsur (divius) wrote :

No, it's not expected. Nor is it expected for 'overcloud node provide' to succeed on failure.

Dmitry Tantsur (divius) wrote :

Relevant extract:

 2017-11-18 12:02:32 | Retry limit reached with 4 nodes still failing introspection
 2017-11-18 12:02:32 | Nodes introspected successfully.
 2017-11-18 12:02:32 | Introspection completed.
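
The contradiction in this extract can be detected mechanically. The sketch below is a hypothetical log check, not something the CI does today; the function names and the regular expression are assumptions based only on the three lines quoted above.

```python
import re

# The three lines quoted in the extract above.
LOG_EXTRACT = """\
2017-11-18 12:02:32 | Retry limit reached with 4 nodes still failing introspection
2017-11-18 12:02:32 | Nodes introspected successfully.
2017-11-18 12:02:32 | Introspection completed.
"""

# Assumed message format, taken from the extract only.
FAILURE_RE = re.compile(
    r"Retry limit reached with (\d+) nodes still failing introspection"
)
SUCCESS_TEXT = "Nodes introspected successfully."


def failing_node_count(log_text: str) -> int:
    """Return the number of failing nodes reported in the log, or 0."""
    match = FAILURE_RE.search(log_text)
    return int(match.group(1)) if match else 0


def is_contradictory(log_text: str) -> bool:
    """True when the log reports both failing nodes and overall success."""
    return failing_node_count(log_text) > 0 and SUCCESS_TEXT in log_text
```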

Dougal Matthews (d0ugal)
Changed in tripleo:
assignee: nobody → Dougal Matthews (d0ugal)
Dougal Matthews (d0ugal) wrote :

tl;dr - s/status: ERROR/status: FAILED/

The issue here is that the introspect workflow finishes in the SUCCESS state despite the errors. This is where it happens in the logs:

http://logs.openstack.org/00/521100/1/check-tripleo/legacy-tripleo-ci-centos-7-ovb-ha-oooq/c14746b/logs/undercloud/var/log/mistral/engine.log.txt.gz#_2017-11-18_12_02_33_225

The problem is here: https://github.com/openstack/tripleo-common/blob/master/workbooks/baremetal.yaml#L604

It should be "status: FAILED". As that is wrong, this condition doesn't get triggered: https://github.com/openstack/tripleo-common/blob/master/workbooks/baremetal.yaml#L635

Because that trigger isn't hit, the parent workflow introspect_manageable_nodes thinks it completed successfully. This is why we get conflicting messages: a failure report followed by a success report.
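
The mismatch can be modelled in a few lines. This is a simplified Python sketch of the logic described above, not the Mistral workbook itself; child_message and parent_outcome are hypothetical names, and the only facts taken from the report are the two status strings and the FAILED check in the parent.

```python
def child_message(failing_nodes: int, failure_status: str) -> dict:
    """Build the status message a child workflow would send (hypothetical)."""
    if failing_nodes > 0:
        return {
            "status": failure_status,
            "message": f"Retry limit reached with {failing_nodes} nodes "
                       "still failing introspection",
        }
    return {"status": "SUCCESS", "message": "Introspection completed."}


def parent_outcome(message: dict) -> str:
    """Parent check: only the exact status 'FAILED' counts as a failure."""
    if message["status"] == "FAILED":
        return "FAILED"
    return "SUCCESS"


# Before the fix: the child sends "ERROR", the parent's check never fires.
before_fix = parent_outcome(child_message(4, "ERROR"))

# After the fix: the child sends "FAILED", and the parent notices.
after_fix = parent_outcome(child_message(4, "FAILED"))
```

With four failing nodes, before_fix evaluates to "SUCCESS" and after_fix to "FAILED", which matches the behaviour change described in the tl;dr.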

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.openstack.org/521496

Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/521496
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=7a18b486a18cf179a29be113ed90d5129c59c00e
Submitter: Zuul
Branch: master

commit 7a18b486a18cf179a29be113ed90d5129c59c00e
Author: Dougal Matthews <email address hidden>
Date: Mon Nov 20 11:30:14 2017 +0000

    Correct the failed status in the baremetal workflow

    Two tasks in the introspect workflow have incorrectly been sending
    "ERROR" statuses when they should have sent "FAILED". This then meant
    the workflow appeared to finish without errors (or failures). This is
    primarily a problem for the introspect_manageable_nodes workflow which
    then can't detect errors and reports that everything was successful.

    Change-Id: I34a91dd14bb19775ad62271def6ecb66398c84db
    Closes-Bug: #1733303

Changed in tripleo:
status: In Progress → Fix Released
tags: added: pike-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/521934

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/pike)

Reviewed: https://review.openstack.org/521934
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=2491e49c3b9f37a8756c311f513d31963d3150f6
Submitter: Zuul
Branch: stable/pike

commit 2491e49c3b9f37a8756c311f513d31963d3150f6
Author: Dougal Matthews <email address hidden>
Date: Mon Nov 20 11:30:14 2017 +0000

    Correct the failed status in the baremetal workflow

    Two tasks in the introspect workflow have incorrectly been sending
    "ERROR" statuses when they should have sent "FAILED". This then meant
    the workflow appeared to finish without errors (or failures). This is
    primarily a problem for the introspect_manageable_nodes workflow which
    then can't detect errors and reports that everything was successful.

    Change-Id: I34a91dd14bb19775ad62271def6ecb66398c84db
    Closes-Bug: #1733303
    (cherry picked from commit 7a18b486a18cf179a29be113ed90d5129c59c00e)

tags: added: in-stable-pike
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 8.2.0

This issue was fixed in the openstack/tripleo-common 8.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 7.6.6

This issue was fixed in the openstack/tripleo-common 7.6.6 release.
