Convergence: stacks can get stuck IN_PROGRESS

Bug #1703043 reported by Zane Bitter
This bug affects 2 people
Affects: OpenStack Heat
Status: Fix Released
Importance: High
Assigned to: Zane Bitter
Milestone: (none listed)

Bug Description

When checking a resource, we don't handle exceptions raised by the actual Resource's convergence create/update/delete methods, other than ResourceFailure and a few others used for flow control. We also don't handle any exceptions that occur after that, e.g. when collecting data for the SyncPoint. If such an exception occurs, it prevents SyncPoints from being updated and/or new resource checks from being triggered, but does not signal that the stack has failed. The stack therefore remains stuck IN_PROGRESS, at least until it times out, but I suspect permanently.

These exceptions are not even logged, because they occur in an RPC 'cast' call - so nothing is listening at the other end.

Prior to the fix for bug 1492433, we had a more or less constant stream of bugs where the stack would get stuck IN_PROGRESS. Although the scope for such bugs is smaller in a single check_resource call, we run the same risk as we did then.
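
The failure mode described above can be illustrated with a toy model. This is only a sketch, not Heat code: `Stack`, `cast` and `check_resource` here are hypothetical stand-ins, and `cast` simply mimics a fire-and-forget call whose outcome nobody observes.

```python
class Stack:
    """Hypothetical stand-in for a stack; only tracks a status string."""

    def __init__(self):
        self.status = "IN_PROGRESS"


def cast(func, *args):
    """Toy fire-and-forget dispatch (assumption: models an RPC 'cast').

    The caller gets no result back, and an exception raised by func is
    silently dropped because nothing is listening for the outcome.
    """
    try:
        func(*args)
    except Exception:
        pass  # no log entry, no status change


def check_resource(stack, fail):
    """Toy resource check that either completes the stack or blows up."""
    if fail:
        raise RuntimeError("unexpected error while checking resource")
    stack.status = "COMPLETE"


stuck = Stack()
cast(check_resource, stuck, True)
print(stuck.status)  # the status was never moved out of IN_PROGRESS
```

In this model the failed check leaves the status untouched, which is the same shape as the stuck stack in the report.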

Zane Bitter (zaneb)
Changed in heat:
status: New → Triaged
importance: Undecided → High
milestone: none → pike-3
Changed in heat:
assignee: nobody → Zane Bitter (zaneb)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/481757
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=33a16aa7a808f3f1a9fc9faf2a8b1017a8bcbbbe
Submitter: Jenkins
Branch: master

commit 33a16aa7a808f3f1a9fc9faf2a8b1017a8bcbbbe
Author: Zane Bitter <email address hidden>
Date: Mon Jul 10 13:48:01 2017 -0400

    Log unhandled exceptions in worker

    RPC calls to the worker use 'cast', so nothing is listening to find out the
    result. If an exception occurs we will never hear about it. This change
    logs such unhandled exceptions as errors.

    Change-Id: I51365a9dee8fd4eff85e77d3e42bf33be814a22c
    Partial-Bug: #1703043
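
One way to get the logging this commit message describes is a decorator around the worker entry point. The following is a minimal sketch under assumptions: the names `log_exceptions` and `check_resource` are illustrative and are not claimed to match the actual Heat change.

```python
import functools
import logging

LOG = logging.getLogger(__name__)


def log_exceptions(func):
    """Log any exception escaping func at ERROR level, then re-raise it."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # LOG.exception records the message plus the current traceback.
            LOG.exception("Unhandled error in %s", func.__name__)
            raise

    return wrapper


@log_exceptions
def check_resource(resource_id):
    """Hypothetical worker method that fails unexpectedly."""
    raise ValueError("cannot check resource %s" % resource_id)
```

Re-raising after logging keeps the original behaviour for any caller that does listen, while making sure the error reaches the log even when nobody does.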

Zane Bitter (zaneb)
tags: added: convergence-bugs
Zane Bitter (zaneb) wrote :

Bug 1681416 provides some examples of how to reproduce.

Changed in heat:
assignee: Zane Bitter (zaneb) → nobody
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/487222

Changed in heat:
assignee: nobody → Zane Bitter (zaneb)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/487652

Rico Lin (rico-lin)
Changed in heat:
milestone: pike-3 → pike-rc1
Rabi Mishra (rabi)
Changed in heat:
milestone: pike-rc1 → pike-rc2
Rico Lin (rico-lin)
Changed in heat:
milestone: pike-rc2 → pike-rc1
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/487222
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=ec1801931e13bb31e4fd0efd819b7cb8f4348c35
Submitter: Jenkins
Branch: master

commit ec1801931e13bb31e4fd0efd819b7cb8f4348c35
Author: Zane Bitter <email address hidden>
Date: Tue Jul 25 16:04:23 2017 -0400

    Mark stack failed when exception raised in resource check

    If we get an unexpected exception when checking a resource, try to clean
    up by marking the stack as FAILED. The graph traversal will stop if we
    can't propagate any more RPC messages, so without this the stack would
    be stuck IN_PROGRESS indefinitely.

    Change-Id: I56ecfa7a9a328d1435c1f34ab14e56effb81bb21
    Closes-Bug: #1703043
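
The approach in this commit message can be sketched as a catch-all around the resource check that flips the stack to FAILED. This is an illustration under assumptions, not the merged patch: `Stack`, `mark_failed` and `check_resource` are hypothetical names, and the real change may treat expected flow-control exceptions differently.

```python
import logging

LOG = logging.getLogger(__name__)


class Stack:
    """Hypothetical stand-in for a stack with a status and a reason."""

    def __init__(self):
        self.status = "IN_PROGRESS"
        self.reason = ""

    def mark_failed(self, reason):
        self.status = "FAILED"
        self.reason = reason


def check_resource(stack, do_check):
    """Run do_check; on an unexpected exception, try to fail the stack."""
    try:
        do_check()
    except Exception as exc:
        try:
            stack.mark_failed("Unexpected error: %s" % exc)
        except Exception:
            # Clean-up is best effort; if it fails too, at least log it.
            LOG.exception("Could not mark stack as FAILED")


stack = Stack()
check_resource(stack, lambda: 1 / 0)
print(stack.status)  # FAILED
```

Because the traversal cannot continue once the check has raised, marking the stack FAILED here is what gives the user a terminal state instead of an indefinite IN_PROGRESS.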

Changed in heat:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 9.0.0.0rc1

This issue was fixed in the openstack/heat 9.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/493721

OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/493723

OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/ocata)

Reviewed: https://review.openstack.org/493721
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=ddc4c0da5c90a814846b8f94ed5e5e9ee0d9382a
Submitter: Jenkins
Branch: stable/ocata

commit ddc4c0da5c90a814846b8f94ed5e5e9ee0d9382a
Author: Zane Bitter <email address hidden>
Date: Tue Jul 25 16:04:23 2017 -0400

    Mark stack failed when exception raised in resource check

    If we get an unexpected exception when checking a resource, try to clean
    up by marking the stack as FAILED. The graph traversal will stop if we
    can't propagate any more RPC messages, so without this the stack would
    be stuck IN_PROGRESS indefinitely.

    Change-Id: I56ecfa7a9a328d1435c1f34ab14e56effb81bb21
    Closes-Bug: #1703043
    (cherry picked from commit ec1801931e13bb31e4fd0efd819b7cb8f4348c35)

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/487652
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=b4be5f7d8a8ee3a06651cab9f828d541f0f5e220
Submitter: Jenkins
Branch: stable/ocata

commit b4be5f7d8a8ee3a06651cab9f828d541f0f5e220
Author: Zane Bitter <email address hidden>
Date: Mon Jul 10 13:48:01 2017 -0400

    Log unhandled exceptions in worker

    RPC calls to the worker use 'cast', so nothing is listening to find out the
    result. If an exception occurs we will never hear about it. This change
    logs such unhandled exceptions as errors.

    Change-Id: I51365a9dee8fd4eff85e77d3e42bf33be814a22c
    Partial-Bug: #1703043
    (cherry picked from commit 33a16aa7a808f3f1a9fc9faf2a8b1017a8bcbbbe)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 8.0.4

This issue was fixed in the openstack/heat 8.0.4 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/newton)

Reviewed: https://review.openstack.org/493723
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=8446f07d504f006fccc7664365e91cbfa9ae0731
Submitter: Jenkins
Branch: stable/newton

commit 8446f07d504f006fccc7664365e91cbfa9ae0731
Author: Zane Bitter <email address hidden>
Date: Tue Jul 25 16:04:23 2017 -0400

    Mark stack failed when exception raised in resource check

    If we get an unexpected exception when checking a resource, try to clean
    up by marking the stack as FAILED. The graph traversal will stop if we
    can't propagate any more RPC messages, so without this the stack would
    be stuck IN_PROGRESS indefinitely.

    Change-Id: I56ecfa7a9a328d1435c1f34ab14e56effb81bb21
    Closes-Bug: #1703043
    (cherry picked from commit ec1801931e13bb31e4fd0efd819b7cb8f4348c35)

tags: added: in-stable-newton