Need a way to retry failed operations

Bug #1160052 reported by Clint Byrum on 2013-03-25
This bug affects 7 people
Affects: OpenStack Heat
Status: Fix Released
Importance: High
Assigned to: Zane Bitter

Bug Description

Consider this scenario:

* Start create on a large, expensive stack
* The FINAL resource in the graph, a WaitCondition, fails due to a timeout caused by temporary downtime of some external resource

Currently in Heat, you have to *delete* the entire stack and try the create again.

I believe that update stack should be possible in this scenario, and uncreated resources should retry create on update, rather than refuse to update.

Steven Hardy (shardy) wrote :

I can understand why this could be useful, but have the following concerns:

- Seems like a corner case, in which case the current behaviour is fine?

- When a resource is in CREATE_FAILED state, the state is unknown, so the only thing we can do is delete it and re-create it (unless we add logic to all resource handle_update which figures out if the failure is recoverable, which seems potentially complex). This is equivalent to mapping resource CREATE_FAILED state to UPDATE_REPLACE in parser.Stack::update(), so we'd need to see if that will work with rollback.

- I see the argument for not deleting everything, and I guess it may be fairly simple with our current serialized resource creation strategy, but what happens when we move to parallel resource creation, is being able to re-start stack creation from a partially created state going to make things much more difficult?
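
The mapping described in the second concern above (treating a resource in CREATE_FAILED as if it needed replacement on update) can be sketched roughly as follows. This is only an illustration of the idea, not Heat's code: `Action`, `plan_update` and the use of `None` for a never-created resource are hypothetical names chosen for the example.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical per-resource actions for an update that retries a failed create."""
    KEEP = "keep"        # resource is already in the desired state
    CREATE = "create"    # resource was never created
    REPLACE = "replace"  # state unknown after a failure: delete, then re-create


def plan_update(resource_states):
    """Map each resource's last known state to an action.

    resource_states: dict of resource name -> state string, or None when the
    resource was never created (an assumption made for this sketch).
    """
    plan = {}
    for name, state in resource_states.items():
        if state is None:
            plan[name] = Action.CREATE
        elif state == "CREATE_FAILED":
            # The real state is unknown, so the safe choice is to replace it.
            plan[name] = Action.REPLACE
        else:
            plan[name] = Action.KEEP
    return plan
```

For a stack where only the final WaitCondition failed, such a plan would keep every other resource and replace just the failed one. Whether this interacts safely with rollback is exactly the open question raised above.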

Excerpts from Steven Hardy's message of 2013-03-26 16:44:59 UTC:
> I can understand why this could be useful, but have the following
> concerns:
>
> - Seems like a corner case, in which case the current behaviour is fine?
>

Corner cases are where you are pushing Heat to do something rare. I
don't think mirrors containing packages being slow for a brief period
of time is all that rare. That situation would break a stack which
has a WaitCondition at the end. So I reject the notion that this is a
corner case. Things fail, and that should not cause a whole stack to
be invalidated.

> - When a resource is in CREATE_FAILED state, the state is unknown, so
> the only thing we can do is delete it and re-create it (unless we add
> logic to all resource handle_update which figures out if the failure is
> recoverable, which seems potentially complex). This is equivalent to
> mapping resource CREATE_FAILED state to UPDATE_REPLACE in
> parser.Stack::update(), so we'd need to see if that will work with
> rollback.
>

I am fine with deleting the failed resource, just not the failed stack.
The failure is, in theory, isolated to those resources that
failed to create, so delete those, and try again from there.

We would have to think through the problem though, as the WaitCondition
that fails is really not the problem; the problem is further up the
stack. This needs further thought, but I think there is an answer that
isn't "start over from 0".

> - I see the argument for not deleting everything, and I guess it may be
> fairly simple with our current serialized resource creation strategy,
> but what happens when we move to parallel resource creation, is being
> able to re-start stack creation from a partially created state going to
> make things much more difficult?
>

I don't think it makes things difficult at all. We will simply be running
through the same exact graph, but the create and active steps will be
instant because the desired state is already reached. When we get to
a resource that is missing, we carry on. We have to do that in updates
anyway since resources may be added as part of the update.
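
A minimal sketch of the graph walk described above, assuming a plain dependency dictionary rather than Heat's real data structures. The names `retry_create`, `create`, `delete` and the state strings "COMPLETE" and "FAILED" are hypothetical; the topological ordering comes from Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter


def retry_create(dependencies, states, create, delete):
    """Walk the whole graph again, doing work only where it is still needed.

    dependencies: resource name -> set of names it depends on.
    states: resource name -> "COMPLETE" or "FAILED"; a missing key means the
            resource was never created.
    create, delete: callables taking a resource name (assumed to raise on error).
    Returns the names that were (re)created, in dependency order.
    """
    recreated = []
    for name in TopologicalSorter(dependencies).static_order():
        state = states.get(name)
        if state == "COMPLETE":
            continue  # desired state already reached, so this step is instant
        if state == "FAILED":
            delete(name)  # only the failed resource is removed, not the stack
        create(name)
        states[name] = "COMPLETE"
        recreated.append(name)
    return recreated
```

With a chain such as network, server, wait_condition where only the last entry is marked "FAILED", the walk visits all three but deletes and re-creates just the WaitCondition. As noted above, that alone does not address the case where the real problem is further up the stack.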

Assigning to reporter so he can post a patch ;)

Changed in heat:
status: New → Triaged
importance: Undecided → Low
assignee: nobody → Clint Byrum (clint-fewbar)
milestone: none → havana-1
Steven Hardy (shardy) on 2013-05-01
Changed in heat:
milestone: havana-1 → havana-2
Steven Hardy (shardy) on 2013-06-19
Changed in heat:
milestone: havana-2 → havana-3
Steven Hardy (shardy) on 2013-09-03
Changed in heat:
milestone: havana-3 → havana-rc1
importance: Low → Medium
Steven Hardy (shardy) on 2013-09-24
Changed in heat:
milestone: havana-rc1 → icehouse-1
summary: - Need a way to retry creation
+ Need a way to retry failed operations

Excerpts from Steven Hardy's message of 2013-03-26 16:44:59 UTC:
> I can understand why this could be useful, but have the following
> concerns:
>
> - Seems like a corner case, in which case the current behaviour is fine?
>

Rate limits will make this very much not a corner case.

Changed in heat:
importance: Medium → High
Changed in heat:
milestone: icehouse-1 → icehouse-2
Clint Byrum (clint-fewbar) wrote :

Marking in progress as the blueprint is starting to be implemented.

Changed in heat:
status: Triaged → In Progress
Changed in heat:
milestone: icehouse-2 → icehouse-3
Thierry Carrez (ttx) on 2014-03-05
Changed in heat:
milestone: icehouse-3 → icehouse-rc1
Changed in heat:
milestone: icehouse-rc1 → next
Changed in heat:
milestone: next → juno-1
Victor HU (huruifeng) on 2014-04-03
Changed in heat:
status: In Progress → Confirmed
Victor HU (huruifeng) on 2014-04-03
Changed in heat:
status: Confirmed → In Progress
information type: Public → Public Security
Victor HU (huruifeng) on 2014-04-03
information type: Public Security → Public
Thierry Carrez (ttx) on 2014-06-11
Changed in heat:
milestone: juno-1 → juno-2

Fix proposed to branch: master
Review: https://review.openstack.org/102397

Changed in heat:
assignee: Clint Byrum (clint-fewbar) → Steve Baker (steve-stevebaker)
Charles Crouch (ccrouch) wrote :

From Clint: "https://review.openstack.org/102397 should not actually be linked with bug 1160052. The bug is about being able to retry when the unexpected happens. The patch is about avoiding a particular unexpected state altogether."

Changed in heat:
assignee: Steve Baker (steve-stevebaker) → Jason Dunsmore (jasondunsmore)
Steven Hardy (shardy) on 2014-07-21
Changed in heat:
milestone: juno-2 → juno-3
Changed in heat:
assignee: Jason Dunsmore (jasondunsmore) → Steve Baker (steve-stevebaker)
Zane Bitter (zaneb) wrote :

This is basically the same as the linked blueprint update-failure-recovery.

Changed in heat:
assignee: Steve Baker (steve-stevebaker) → Zane Bitter (zaneb)

Reviewed: https://review.openstack.org/112938
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=c2ffae8cd0c6a9e376063edaba1a57c379d9ecfa
Submitter: Jenkins
Branch: master

commit c2ffae8cd0c6a9e376063edaba1a57c379d9ecfa
Author: Zane Bitter <email address hidden>
Date: Tue Aug 26 18:37:09 2014 -0400

    Allow an update after a failure

    Change-Id: I41ce08c33780642c31b81763032d6c089e903c48
    Implements: blueprint update-failure-recovery
    Closes-bug: #1160052

Changed in heat:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2014-09-05
Changed in heat:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2014-10-16
Changed in heat:
milestone: juno-3 → 2014.2