Bug #1160052 “Need a way to retry failed operations” : Bugs : OpenStack Heat

Steven Hardy (shardy) wrote on 2013-03-26:

#1

I can understand why this could be useful, but have the following concerns:

- Seems like a corner case, in which case the current behaviour is fine?

- When a resource is in CREATE_FAILED state, the state is unknown, so the only thing we can do is delete it and re-create it (unless we add logic to all resource handle_update which figures out if the failure is recoverable, which seems potentially complex). This is equivalent to mapping resource CREATE_FAILED state to UPDATE_REPLACE in parser.Stack::update(), so we'd need to see if that will work with rollback.

- I see the argument for not deleting everything, and I guess it may be fairly simple with our current serialized resource creation strategy, but what happens when we move to parallel resource creation, is being able to re-start stack creation from a partially created state going to make things much more difficult?
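Editorial illustration of the second concern above: a minimal sketch, not Heat code, of treating a CREATE_FAILED resource as a replacement on retry. The `Action` enum and `plan_retry` function are hypothetical names introduced only for this example. The state strings follow the ones used in this thread, and anything without a recorded state is assumed to have never been attempted.

```python
from enum import Enum


class Action(Enum):
    KEEP = "keep"        # resource already exists, leave it alone
    REPLACE = "replace"  # state unknown after failure: delete, then re-create
    CREATE = "create"    # never attempted, create normally


def plan_retry(resource_states):
    """Map each resource's last recorded state to a retry action.

    resource_states: dict of resource name -> state string (or None).
    This is a hypothetical helper, not part of Heat's API.
    """
    plan = {}
    for name, state in resource_states.items():
        if state == "CREATE_COMPLETE":
            plan[name] = Action.KEEP
        elif state == "CREATE_FAILED":
            # The failed resource may be half-built, so no in-place recovery
            # is attempted; it is scheduled for delete and re-create.
            plan[name] = Action.REPLACE
        else:
            plan[name] = Action.CREATE
    return plan
```

Under these assumptions only the failed resource is torn down; the rest of the stack is left in place, which is the behaviour the bug asks for.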

Clint Byrum (clint-fewbar) wrote on 2013-03-26: Re: [Bug 1160052] Re: Need a way to retry creation

#2

Excerpts from Steven Hardy's message of 2013-03-26 16:44:59 UTC:
> I can understand why this could be useful, but have the following
> concerns:
>
> - Seems like a corner case, in which case the current behaviour is fine?
>

Corner cases are where you are pushing Heat to do something rare. I
don't think mirrors containing packages being slow for a brief period
of time is all that rare. That situation would break a stack which
has a WaitCondition at the end. So I reject the notion that this is a
corner case. Things fail, and that should not cause a whole stack to
be invalidated.

> - When a resource is in CREATE_FAILED state, the state is unknown, so
> the only thing we can do is delete it and re-create it (unless we add
> logic to all resource handle_update which figures out if the failure is
> recoverable, which seems potentially complex). This is equivalent to
> mapping resource CREATE_FAILED state to UPDATE_REPLACE in
> parser.Stack::update(), so we'd need to see if that will work with
> rollback.
>

Am fine with deleting the failed resource. Not the failed stack
though. The failure is, in theory, isolated to those resources that
failed to create, so delete those, and try again from there.

We would have to think through the problem though, as the WaitCondition
that fails is really not the problem; the problem is further up the
stack. This needs further thought, but I think there is an answer that
isn't "start over from 0".

> - I see the argument for not deleting everything, and I guess it may be
> fairly simple with our current serialized resource creation strategy,
> but what happens when we move to parallel resource creation, is being
> able to re-start stack creation from a partially created state going to
> make things much more difficult?
>

I don't think it makes things difficult at all. We will simply be running
through the same exact graph, but the create and active steps will be
instant because the desired state is already reached. When we get to
a resource that is missing, we carry on. We have to do that in updates
anyway since resources may be added as part of the update.
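
Editorial illustration of the graph walk described above: a rough sketch, assuming the stack is a dependency graph and each resource has a recorded state. The `converge` function, its arguments, and the `create` callback are hypothetical and are not Heat's implementation. The walk here is serial for simplicity; it uses `graphlib` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter


def converge(dependencies, states, create):
    """Walk the dependency graph, creating only what is not yet complete.

    dependencies: dict of resource name -> set of names it depends on.
    states: dict of resource name -> last recorded state, updated in place.
    create: callable taking a resource name; raises on failure.
    Returns the list of resources that were (re)created on this pass.
    """
    created = []
    for name in TopologicalSorter(dependencies).static_order():
        if states.get(name) == "CREATE_COMPLETE":
            # Desired state already reached, so this step is effectively instant.
            continue
        try:
            create(name)
        except Exception:
            # Record the failure so that a later pass can retry from here.
            states[name] = "CREATE_FAILED"
            raise
        states[name] = "CREATE_COMPLETE"
        created.append(name)
    return created
```

For example, with a network and server already complete and a failed wait condition at the end, a second pass touches only the wait condition:

```python
deps = {"server": {"network"}, "wait": {"server"}}
states = {"network": "CREATE_COMPLETE", "server": "CREATE_COMPLETE", "wait": "CREATE_FAILED"}
converge(deps, states, create=lambda name: None)
```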

Steven Hardy (shardy) wrote on 2013-04-22: Re: Need a way to retry creation

#3

Assigning to reporter so he can post a patch ;)

Changed in heat:
status:	New → Triaged
importance:	Undecided → Low
assignee:	nobody → Clint Byrum (clint-fewbar)
milestone:	none → havana-1

Steven Hardy (shardy) on 2013-05-01

Changed in heat:
milestone:	havana-1 → havana-2

Steven Hardy (shardy) on 2013-06-19

Changed in heat:
milestone:	havana-2 → havana-3

Steven Hardy (shardy) on 2013-09-03

Changed in heat:
milestone:	havana-3 → havana-rc1
importance:	Low → Medium

Steven Hardy (shardy) on 2013-09-24

Changed in heat:
milestone:	havana-rc1 → icehouse-1

Clint Byrum (clint-fewbar) on 2013-10-07

summary:

- Need a way to retry creation
+ Need a way to retry failed operations

Cody A.W. Somerville (cody-somerville) wrote on 2013-10-07:

#4

Excerpts from Steven Hardy's message of 2013-03-26 16:44:59 UTC:
> I can understand why this could be useful, but have the following
> concerns:
>
> - Seems like a corner case, in which case the current behaviour is fine?
>

Rate limits will make this very much not a corner case.

Steve Baker (steve-stevebaker) on 2013-10-07

Changed in heat:
importance:	Medium → High

Steve Baker (steve-stevebaker) on 2013-12-04

Changed in heat:
milestone:	icehouse-1 → icehouse-2

Clint Byrum (clint-fewbar) wrote on 2013-12-31:

#5

Marking in progress as the blueprint is starting to be implemented.

Changed in heat:
status:	Triaged → In Progress

Steve Baker (steve-stevebaker) on 2014-01-14

Changed in heat:
milestone:	icehouse-2 → icehouse-3

Thierry Carrez (ttx) on 2014-03-05

Changed in heat:
milestone:	icehouse-3 → icehouse-rc1

Steve Baker (steve-stevebaker) on 2014-03-24

Changed in heat:
milestone:	icehouse-rc1 → next

Steve Baker (steve-stevebaker) on 2014-04-01

Changed in heat:
milestone:	next → juno-1

Victor HU (huruifeng) on 2014-04-03

Changed in heat:
status:	In Progress → Confirmed

Victor HU (huruifeng) on 2014-04-03

Changed in heat:
status:	Confirmed → In Progress
information type:	Public → Public Security

Victor HU (huruifeng) on 2014-04-03

information type:

Public Security → Public

Thierry Carrez (ttx) on 2014-06-11

Changed in heat:
milestone:	juno-1 → juno-2

OpenStack Infra (hudson-openstack) wrote on 2014-06-25: Fix proposed to heat (master)

#6

Fix proposed to branch: master
Review: https://review.openstack.org/102397

Changed in heat:
assignee:	Clint Byrum (clint-fewbar) → Steve Baker (steve-stevebaker)

Charles Crouch (ccrouch) wrote on 2014-07-10:

#7

From Clint: "https://review.openstack.org/102397 should not actually be linked with bug 1160052. The bug is about being able to retry when the unexpected happens. The patch is about avoiding a particular unexpected state altogether."

OpenStack Infra (hudson-openstack) on 2014-07-17

Changed in heat:
assignee:	Steve Baker (steve-stevebaker) → Jason Dunsmore (jasondunsmore)

Steven Hardy (shardy) on 2014-07-21

Changed in heat:
milestone:	juno-2 → juno-3

OpenStack Infra (hudson-openstack) on 2014-07-21

Changed in heat:
assignee:	Jason Dunsmore (jasondunsmore) → Steve Baker (steve-stevebaker)

Zane Bitter (zaneb) wrote on 2014-08-11:

#8

This is basically the same as the linked blueprint update-failure-recovery.

Changed in heat:
assignee:	Steve Baker (steve-stevebaker) → Zane Bitter (zaneb)

OpenStack Infra (hudson-openstack) wrote on 2014-08-31: Fix merged to heat (master)

#9

Reviewed: https://review.openstack.org/112938
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=c2ffae8cd0c6a9e376063edaba1a57c379d9ecfa
Submitter: Jenkins
Branch: master

commit c2ffae8cd0c6a9e376063edaba1a57c379d9ecfa
Author: Zane Bitter <email address hidden>
Date: Tue Aug 26 18:37:09 2014 -0400

    Allow an update after a failure

    Change-Id: I41ce08c33780642c31b81763032d6c089e903c48
    Implements: blueprint update-failure-recovery
    Closes-bug: #1160052

Changed in heat:
status:	In Progress → Fix Committed

Thierry Carrez (ttx) on 2014-09-05

Changed in heat:
status:	Fix Committed → Fix Released

Thierry Carrez (ttx) on 2014-10-16

Changed in heat:
milestone:	juno-3 → 2014.2
