After updating a stack stuck IN_PROGRESS, resources will be permanently stuck IN_PROGRESS

Bug #1570576 reported by Zane Bitter
This bug affects 1 person
Affects: OpenStack Heat
Status: Fix Released
Importance: High
Assigned to: Tanvir Talukder
Milestone: (none shown)

Bug Description

If an engine dies in the middle of an update, often both the stack and one or more resources within it will be left in an UPDATE_IN_PROGRESS state. If the user then attempts another update, Heat will be able to steal the lock from the dead engine, but the update itself will fail because the resource is in an IN_PROGRESS state. So far, so good. The problem is that this also sets the state of the stack to UPDATE_FAILED, but *without* resetting the state of any resources it contains (unlike the reset_stack_status task that runs at startup to reset any zombie stacks).

This means that from this point on, the user can attempt to update the stack all they like, but it will never succeed because of the resources inside that are stuck IN_PROGRESS. Also, there is no way to resolve the situation: restarting heat-engine won't help because reset_stack_status looks only at *stacks* that are IN_PROGRESS, and the stack may no longer be in that state.
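
The sequence described above can be reproduced with a toy model. This is only a sketch under stated assumptions: the Stack and Resource classes and the update() and reset_stack_status() functions here are hypothetical stand-ins written for illustration, not Heat's actual code. They only mimic the behaviour described in this report.

```python
from dataclasses import dataclass, field

IN_PROGRESS = "IN_PROGRESS"
FAILED = "FAILED"
COMPLETE = "COMPLETE"


@dataclass
class Resource:  # hypothetical stand-in, not Heat's Resource class
    name: str
    status: str = COMPLETE


@dataclass
class Stack:  # hypothetical stand-in, not Heat's Stack class
    name: str
    status: str = COMPLETE
    resources: list = field(default_factory=list)


def update(stack):
    """Toy update: fails if any resource is still IN_PROGRESS."""
    if any(r.status == IN_PROGRESS for r in stack.resources):
        # The behaviour reported in this bug: only the stack is marked
        # FAILED; the resources keep their stale IN_PROGRESS status.
        stack.status = FAILED
        return False
    stack.status = COMPLETE
    return True


def reset_stack_status(stacks):
    """Toy startup task: only considers stacks that are IN_PROGRESS."""
    for stack in stacks:
        if stack.status != IN_PROGRESS:
            continue  # a FAILED stack with zombie resources is skipped
        stack.status = FAILED
        for res in stack.resources:
            if res.status == IN_PROGRESS:
                res.status = FAILED


# An engine died mid-update: the stack and one resource are IN_PROGRESS.
stack = Stack("demo", IN_PROGRESS,
              [Resource("server", IN_PROGRESS), Resource("port", COMPLETE)])

first_try = update(stack)      # fails and marks only the stack FAILED
reset_stack_status([stack])    # simulated engine restart: stack is skipped
second_try = update(stack)     # fails again, and will keep failing
stuck = [r.name for r in stack.resources if r.status == IN_PROGRESS]
```

In this model the resource named "server" stays IN_PROGRESS no matter how many times update() or reset_stack_status() is called, which is the dead end the report describes.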

Zane Bitter (zaneb) wrote :

Two notable circumstances where this would come up:

1. An engine dies and isn't restarted
2. bug 1570569

Changed in heat:
status: New → Triaged
importance: Undecided → High
Changed in heat:
assignee: nobody → Bathri Ajay Raj (bathri-s)
Zane Bitter (zaneb)
Changed in heat:
milestone: none → newton-1
Zane Bitter (zaneb)
Changed in heat:
assignee: Bathri Ajay Raj (bathri-s) → nobody
Rabi Mishra (rabi)
Changed in heat:
milestone: newton-1 → newton-2
Thomas Herve (therve)
Changed in heat:
milestone: newton-2 → newton-3
Thomas Herve (therve)
Changed in heat:
milestone: newton-3 → next
Zane Bitter (zaneb) wrote :

Moving this to rc1 because I've heard a bunch of reports of what I think is this bug (or something closely related) happening to people, particularly when they run out of file descriptors on the database.

Changed in heat:
milestone: next → newton-rc1
Changed in heat:
assignee: nobody → Tanvir Talukder (tanvirt16)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/366979

Changed in heat:
status: Triaged → In Progress
Zane Bitter (zaneb)
Changed in heat:
assignee: Tanvir Talukder (tanvirt16) → nobody
status: In Progress → Triaged
Thomas Herve (therve)
Changed in heat:
milestone: newton-rc1 → ocata-1
Changed in heat:
assignee: nobody → Tanvir Talukder (tanvirt16)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (master)

Change abandoned by Tanvir Talukder (<email address hidden>) on branch: master
Review: https://review.openstack.org/366979
Reason: Workaround already in place

OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/378987

Changed in heat:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/386741

OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (master)

Change abandoned by Tanvir Talukder (<email address hidden>) on branch: master
Review: https://review.openstack.org/378987
Reason: Abandoning due to problems resolving merge conflict. New patch set is located here: https://review.openstack.org/#/c/386741/

Rabi Mishra (rabi)
Changed in heat:
milestone: ocata-1 → ocata-2
huangtianhua (huangtianhua) wrote :

Now we provide two ways to reset the status:
1. Restart the engine service, which resets the stack status to *_FAILED along with any resources it contains that are in progress.
2. Use the heat-manage command to reset the stack status to *_FAILED along with any resources it contains that are in progress.

So I'm not sure what problem this bug is tracing?

Rabi Mishra (rabi)
Changed in heat:
milestone: ocata-2 → ocata-3
Zane Bitter (zaneb) wrote :

I corrected the typo I made in the description which rendered it nonsensical. Sorry about that!

The problem that this bug is tracing is that the stack can get set to UPDATE_FAILED (by a subsequent update) while the resources are still *_IN_PROGRESS. After that, restarting heat-engine won't help because it only looks at stacks that are *_IN_PROGRESS.

I believe it's correct that the heat-manage command will now be able to reset the resource statuses. That still requires an admin to intervene though.

description: updated
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/386741
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=d6a90cc6ac1f49286b1c6a53f934d60a579da9bf
Submitter: Jenkins
Branch: master

commit d6a90cc6ac1f49286b1c6a53f934d60a579da9bf
Author: Tanvir Talukder <email address hidden>
Date: Wed Jan 4 11:27:04 2017 -0600

    Fix for resources stuck in progress after engine crash

    When a stack is IN_PROGRESS and an UPDATE or RESTORE is called
    after an engine crash, we set status of the stack and all of its
    IN_PROGRESS resources to FAILED

    Change-Id: Ia3adbfeff16c69719f9e5365657ab46a0932ec9b
    Closes-Bug: #1570576
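
The behaviour described in the commit message can be sketched as follows. This is a minimal illustration only: the Stack and Resource classes and the fail_stale_stack() helper are hypothetical names invented for this example and are not taken from the merged change. The sketch assumes the dead engine's lock has already been taken over by the caller.

```python
from dataclasses import dataclass, field

IN_PROGRESS = "IN_PROGRESS"
FAILED = "FAILED"
COMPLETE = "COMPLETE"


@dataclass
class Resource:  # hypothetical stand-in, not Heat's Resource class
    name: str
    status: str = COMPLETE


@dataclass
class Stack:  # hypothetical stand-in, not Heat's Stack class
    name: str
    status: str = COMPLETE
    resources: list = field(default_factory=list)


def fail_stale_stack(stack):
    """Mark a stale IN_PROGRESS stack and its IN_PROGRESS resources FAILED.

    Returns the names of the resources whose status was reset.
    """
    if stack.status != IN_PROGRESS:
        return []  # nothing stale to clean up
    stack.status = FAILED
    reset = []
    for res in stack.resources:
        if res.status == IN_PROGRESS:
            # Reset the resource together with the stack, so that a later
            # update is not blocked by a zombie IN_PROGRESS resource.
            res.status = FAILED
            reset.append(res.name)
    return reset


# State left behind by a crashed engine.
crashed = Stack("demo", IN_PROGRESS,
                [Resource("server", IN_PROGRESS), Resource("port", COMPLETE)])
reset_names = fail_stale_stack(crashed)
```

The point of the sketch is that the stack and its resources are moved to FAILED in the same step, so no resource is left IN_PROGRESS once the stack itself has left that state.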

Changed in heat:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 8.0.0.0b3

This issue was fixed in the openstack/heat 8.0.0.0b3 development milestone.
