Introspection workflow can get stuck in a loop of "Unhandled workflow error"

Bug #1779097 reported by Dougal Matthews
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: wes hayutin
Milestone: (none)

Bug Description

Seen here: https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/db23aa3/undercloud/home/jenkins/overcloud_prep_images.log.txt.gz#_2018-06-28_01_57_12

Extract from the output:

2018-06-28 00:33:19 | + openstack overcloud node introspect --all-manageable
2018-06-28 00:33:22 | Waiting for messages on queue 'tripleo' with no timeout.
2018-06-28 01:57:12 | Waiting for introspection to finish...
2018-06-28 01:57:12 | Introspection of node 26e85889-10a9-4ad3-8392-0ee341b2f657 timed out.
2018-06-28 01:57:12 | Introspection of node 6c44bb91-00cf-4363-8787-b29138794395 timed out.
2018-06-28 01:57:12 | Retrying 2 nodes that failed introspection. Attempt 1 of 3
2018-06-28 01:57:12 | Introspection of node 26e85889-10a9-4ad3-8392-0ee341b2f657 timed out.
2018-06-28 01:57:12 | Introspection of node 6c44bb91-00cf-4363-8787-b29138794395 timed out.
2018-06-28 01:57:12 | Retrying 2 nodes that failed introspection. Attempt 2 of 3
2018-06-28 01:57:12 | Introspection of node 6c44bb91-00cf-4363-8787-b29138794395 timed out.
2018-06-28 01:57:12 | Introspection of node 26e85889-10a9-4ad3-8392-0ee341b2f657 timed out.
2018-06-28 01:57:12 | Retrying 2 nodes that failed introspection. Attempt 3 of 3
2018-06-28 01:57:12 | Introspection of node 6c44bb91-00cf-4363-8787-b29138794395 timed out.
2018-06-28 01:57:12 | Introspection of node 26e85889-10a9-4ad3-8392-0ee341b2f657 timed out.
2018-06-28 01:57:12 | Retry limit reached with 2 nodes still failing introspection
2018-06-28 01:57:12 | Unhandled workflow error
2018-06-28 01:57:12 | Unhandled workflow error
2018-06-28 01:57:12 | Unhandled workflow error
2018-06-28 01:57:12 | Unhandled workflow error
2018-06-28 01:57:12 | Unhandled workflow error
2018-06-28 01:57:12 | Unhandled workflow error
2018-06-28 01:57:12 | Unhandled workflow error
2018-06-28 01:57:12 | Unhandled workflow error
2018-06-28 01:57:12 | Unhandled workflow error
2018-06-28 01:57:12 | Unhandled workflow error
2018-06-28 01:57:12 | Unhandled workflow error
2018-06-28 01:57:12 | Unhandled workflow error
2018-06-28 01:57:12 | Unhandled workflow error
2018-06-28 01:57:12 | Unhandled workflow error

... and that continues until the job is killed.

Tags: workflows
Dougal Matthews (d0ugal)
summary: Introspection workflow can get stuck in a loop of "Unhandled workflow
- errors"
+ error"
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.openstack.org/578763

Changed in tripleo:
status: Confirmed → In Progress
Changed in tripleo:
assignee: Dougal Matthews (d0ugal) → wes hayutin (weshayutin)
Revision history for this message
Dougal Matthews (d0ugal) wrote :

This is critical, as it can get Mistral stuck in an infinite loop.

Changed in tripleo:
importance: Critical → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/578763
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=bb35200060c7c4aafdae3d95f00bb0c94f6259f5
Submitter: Zuul
Branch: master

commit bb35200060c7c4aafdae3d95f00bb0c94f6259f5
Author: Dougal Matthews <email address hidden>
Date: Thu Jun 28 11:23:00 2018 +0100

    Remove the unhandled_error task from baremetal.yaml

    With the recent changes in workflow messaging, it is now possible that
    we will get stuck in a loop of unhandled errors. Removing the unhandled_error
    task gives us the sanest logs in a failure scenario.

    This happens because:
    1. The workflow sends an error message, which causes the messaging
       workflow to error. It ends in an error status so that the parent
       workflow doesn't need to fail itself when sending errors.
    2. The unhandled_error task is triggered because the message sending
       workflow fails. GOTO 1.

    The task-default error handler was added to these workflows due to their
    complexity. However, searching Launchpad doesn't turn up any cases of it
    being reported, and the generic message doesn't seem particularly useful.
    Therefore, the safest option is to remove it.

    Further investigation here might be worthwhile, as a reliable generic
    error handler could be valuable. Work in Mistral to send events to a
    Zaqar queue is underway, and that will likely be a better solution
    (clients can look for workflow failures in the events).

    Closes-Bug: 1779097
    Depends-On: I5f574923bb3f38b8f71a002e643bf89b52069adc
    Change-Id: I388c4b93de473778cc11f580a80426539aeed7e2
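
To illustrate the loop described in the commit message, here is a minimal, hypothetical Mistral v2 sketch of the pattern being removed. The workbook, workflow, and task names (baremetal_sketch, introspect_sketch, send_message_sketch, unhandled_error) are illustrative only and do not reproduce the real tripleo-common baremetal.yaml or messaging workflows; the point is just that a task-defaults on-error handler which reports failures through a sub-workflow that deliberately ends in ERROR will keep re-triggering itself.

    ---
    version: '2.0'
    name: tripleo.baremetal_sketch.v1        # hypothetical workbook name
    workflows:

      introspect_sketch:
        # Any task that errors falls through to unhandled_error.
        task-defaults:
          on-error:
            - unhandled_error
        tasks:
          introspect:
            # Stand-in for the real introspection task; std.fail simulates
            # the timeout, so the default on-error transition fires.
            action: std.fail

          unhandled_error:
            # Reports the failure through a messaging sub-workflow. That
            # sub-workflow finishes in ERROR on purpose when it carries an
            # error payload, so this task errors as well, which re-triggers
            # task-defaults on-error -> unhandled_error -> ... indefinitely.
            workflow: send_message_sketch
            input:
              status: FAILED
              message: Unhandled workflow error

      send_message_sketch:
        # Stand-in for the messaging workflow: it ends in ERROR when the
        # payload is a failure, so callers reporting errors don't have to
        # fail themselves explicitly.
        input:
          - status
          - message
        tasks:
          notify:
            action: std.echo output=<% $.message %>
            on-success:
              - fail: <% $.status = "FAILED" %>

Dropping the unhandled_error task, as the change above does, breaks the cycle: a failure in the messaging sub-workflow then simply ends the parent workflow instead of re-entering the handler.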

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 9.2.0

This issue was fixed in the openstack/tripleo-common 9.2.0 release.
