can't cancel wedged stack-create

Bug #1211276 reported by Robert Collins
This bug affects 3 people
Affects: OpenStack Heat
Status: Fix Released
Importance: High
Assigned to: Jason Dunsmore

Bug Description

My Heat API + engine node got rebooted just after I ran stack-create.

After bringing it back up I have:
+--------------------------------------+------------+--------------------+----------------------+
| id                                   | stack_name | stack_status       | creation_time        |
+--------------------------------------+------------+--------------------+----------------------+
| f981af3d-9b9f-46bf-be81-95f48091ab77 | overcloud  | CREATE_IN_PROGRESS | 2013-08-12T11:03:46Z |
+--------------------------------------+------------+--------------------+----------------------+

but no nova instances for it (they got hosed) and it's not starting new ones; and I can't delete it:

 heat stack-delete f981af3d-9b9f-46bf-be81-95f48091ab77
Traceback (most recent call last):
  File "/opt/stack/venvs/heat/local/lib/python2.7/site-packages/eventlet/wsgi.py", line 384, in handle_one_response
    result = self.application(self.environ, start_response)
  File "/opt/stack/venvs/heat/local/lib/python2.7/site-packages/webob/dec.py", line 130, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/opt/stack/venvs/heat/local/lib/python2.7/site-packages/webob/dec.py", line 195, in call_func
    return self.func(req, *args, **kwargs)
  File "/opt/stack/venvs/heat/local/lib/python2.7/site-packages/heat/common/wsgi.py", line 307, in __call__
    response = req.get_response(self.application)
  File "/opt/stack/venvs/heat/local/lib/python2.7/site-packages/webob/request.py", line 1296, in send
    application, catch_exc_info=False)
  File "/opt/stack/venvs/heat/local/lib/python2.7/site-packages/webob/request.py", line 1260, in call_application
    app_iter = application(self.environ, start_response)
  File "/opt/stack/venvs/heat/local/lib/python2.7/site-packages/keystoneclient/middleware/auth_token.py", line 461, in __call__
    return self.app(env, start_response)
  File "/opt/stack/venvs/heat/local/lib/python2.7/site-packages/webob/dec.py", line 130, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/opt/stack/venvs/heat/local/lib/python2.7/site-packages/webob/dec.py", line 195, in call_func
    return self.func(req, *args, **kwargs)
  File "/opt/stack/venvs/heat/local/lib/python2.7/site-packages/heat/common/wsgi.py", line 307, in __call__
    response = req.get_response(self.application)
  File "/opt/stack/venvs/heat/local/lib/python2.7/site-packages/webob/request.py", line 1296, in send
    application, catch_exc_info=False)
  File "/opt/stack/venvs/heat/local/lib/python2.7/site-packages/webob/request.py", line 1260, in call_application
    app_iter = application(self.environ, start_response)
  File "/opt/stack/venvs/heat/local/lib/python2.7/site-packages/webob/dec.py", line 144, in __call__
    return resp(environ, start_response)
  File "/opt/stack/venvs/heat/local/lib/python2.7/site-packages/routes/middleware.py", line 131, in __call__
    response = self.app(environ, start_response)
  File "/opt/stack/venvs/heat/local/lib/python2.7/site-packages/webob/dec.py", line 144, in __call__
    return resp(environ, start_response)
  File "/opt/stack/venvs/heat/local/lib/python2.7/site-packages/webob/dec.py", line 130, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/opt/stack/venvs/heat/local/lib/python2.7/site-packages/webob/dec.py", line 195, in call_func
    return self.func(req, *args, **kwargs)
  File "/opt/stack/venvs/heat/local/lib/python2.7/site-packages/heat/common/wsgi.py", line 603, in __call__
    raise translate_exception(err, request.best_match_language())
ActionInProgress_Remote: Stack overcloud already has an action (CREATE) in progress

Steve Baker (steve-stevebaker) wrote :

This behavior was introduced by https://review.openstack.org/#/c/39760/

I would have preferred that this change had an exception for the delete action, so that a delete will be attempted no matter what state the stack is in.

Changed in heat:
milestone: none → havana-3
importance: Undecided → High
status: New → Confirmed
assignee: nobody → Jason Dunsmore (jasondunsmore)
Randall Burt (randall-burt) wrote :

Actually, can we add a "force" option to the delete? Deletes can get stuck for other reasons. I'm concerned about adding exceptions because this was one of the primary use cases for multi-engine (preventing multiple concurrent stack actions). Perhaps another alternative is to "sync" or "restart" the current action, since, in an ideal world, the reported issue would be recoverable by just continuing the stack create after the reboot.

Robert Collins (lifeless) wrote :

What if nova simply doesn't have enough resources? I think users should always be able to say 'hey, enough, stop what you're doing', whether that's an update, a delete, or a create.

Randall Burt (randall-burt) wrote :

Sounds fine to me (re "cancel" operation). That lets you recover both from Heat server problems and from miscalculations on the user's part.

Steven Hardy (shardy) wrote :

IMO we don't need additional states (sync, restart, cancel, etc.); we just need to allow the delete to be (re)attempted regardless of the stack state, and fail quietly if we find some stack resources are already deleted (which AFAIK we already do).

Randall Burt (randall-burt) wrote :

shardy, agreed, but the issue is that once a stack is in the process of doing something, we prevent other operations until it's done or failed. When the engine processing a stack operation crashes, this leaves the stack(s) in an IN_PROGRESS state and we prevent any DELETE, RESUME, or whatever. I propose just saying "STOP" and having that be the single exception to any IN_PROGRESS work, rather than adding exceptions for every operation. If there are ways to handle this multi-engine "sticky" issue using oslo RPC or another mechanism, I'm just not familiar enough with them to offer an alternative.

Steve Baker (steve-stevebaker) wrote :

I remember discussing the sticky issue during the Havana session, but it looks like we completely failed to capture this in the etherpad ;)

My understanding of the channels approach was as follows:
- An extra RPC call is added which responds with which engine to use for a given action request. Whichever engine responds to this request just returns whatever the database says (so the database needs to store which engine is acting on any in-progress actions).
- The API server sends the action request to the engine specified in the previous call. If there is no response, then the engine must have died and the request should just go to the next round-robin engine.
- An engine that receives an action call can interrupt the current in-progress action if necessary (for delete only?).
- Non-action requests always just use the round-robin calls; the state of stacks will eventually be consistent.

There may be a reason why this approach wouldn't work, but it might be worth investigating.
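
A minimal sketch of the routing described in this comment, for illustration only. It is not Heat code: `Engine`, `ApiServer`, `which_engine`, and the `owner_by_stack` dict (standing in for the database column that records which engine is acting on a stack) are all hypothetical names, and plain method calls stand in for RPC.

```python
import itertools


class Engine:
    """Hypothetical stand-in for a heat-engine process."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

    def handle(self, stack_id, action):
        # A dead engine is modelled as a call that never gets a response.
        if not self.alive:
            raise ConnectionError(f"{self.name} did not respond")
        return f"{self.name} handled {action} on {stack_id}"


class ApiServer:
    def __init__(self, engines, owner_by_stack):
        self.engines = {e.name: e for e in engines}
        self._round_robin = itertools.cycle(engines)
        # Assumption: the database records which engine owns each
        # in-progress action; a dict plays that role here.
        self.owner_by_stack = owner_by_stack

    def which_engine(self, stack_id):
        # The extra RPC call: any engine can answer it, because the
        # answer is simply whatever the database says.
        return self.owner_by_stack.get(stack_id)

    def send_action(self, stack_id, action):
        owner = self.which_engine(stack_id)
        if owner is not None:
            try:
                return self.engines[owner].handle(stack_id, action)
            except ConnectionError:
                pass  # the recorded owner must have died; fall through
        # No owner, or no response: try the next round-robin engines.
        for _ in range(len(self.engines)):
            engine = next(self._round_robin)
            try:
                result = engine.handle(stack_id, action)
            except ConnectionError:
                continue
            self.owner_by_stack[stack_id] = engine.name
            return result
        raise RuntimeError("no engine available")
```

In this sketch a stack whose recorded owner has died is handed to whichever live engine comes next in the rotation, which is the case reported in this bug.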

Jason Dunsmore (jasondunsmore) wrote :

Steve, here's my tentative plan:

- API sends action to engine (round-robin)
- Engine checks status of stack
- IF stack status == "INPROGRESS":
  + Broadcast "Stop working on stack X" message to all engines
  + Wait for response
  + IF no response:
    * Force update stack status to "ORPHANED"
- Proceed with operation

I expect most of the work to be implementing a new "STOP" action and a new "ORPHANED" status.

Thoughts?

Jason Dunsmore (jasondunsmore) wrote :

After discussing this with Randall, I think we can omit the "ORPHANED" status. So it'll be:

- API sends action to engine (round-robin)
- Engine checks status of stack
- IF stack status == "INPROGRESS":
  + Broadcast "Stop working on stack X" message to all engines
  + Wait for response (0.5 sec timeout)
- Proceed with operation
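
The simplified plan above could be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed patch: threads and a `queue.Queue` stand in for the RPC broadcast, and `Engine`, `broadcast_stop`, and `perform_action` are hypothetical names. A status is treated as in progress when it ends in `IN_PROGRESS`, as in `CREATE_IN_PROGRESS`.

```python
import queue
import threading


class Engine:
    """Hypothetical engine; `working_on` holds stacks it is acting on."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.working_on = set()

    def on_stop(self, stack_id, replies):
        # A dead engine never sees the broadcast, so it never replies.
        if self.alive and stack_id in self.working_on:
            self.working_on.discard(stack_id)
            replies.put(self.name)


def broadcast_stop(engines, stack_id, timeout=0.5):
    """Send "stop working on stack X" to all engines.

    Returns the name of the engine that acknowledged, or None if nobody
    answered within `timeout` seconds.
    """
    replies = queue.Queue()
    for engine in engines:
        threading.Thread(target=engine.on_stop,
                         args=(stack_id, replies)).start()
    try:
        return replies.get(timeout=timeout)
    except queue.Empty:
        return None


def perform_action(engines, stack_status, stack_id, action, timeout=0.5):
    stopped_by = None
    if stack_status[stack_id].endswith("IN_PROGRESS"):
        stopped_by = broadcast_stop(engines, stack_id, timeout)
    # Proceed with the operation whether or not anyone answered.
    stack_status[stack_id] = f"{action}_IN_PROGRESS"
    return stopped_by
```

When the engine that owned the stack is dead, the broadcast simply times out and the new action proceeds, which is what makes the wedged stack recoverable.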

Steve Baker (steve-stevebaker) wrote :

#9 works for me, although there is a minor chance of a race where another engine starts an action after the INPROGRESS check.

How about this variant:
- API sends action to engine (round-robin)
- Engine checks status of stack
- Broadcast "Stop working on stack X" message to all engines
- IF stack status == "INPROGRESS":
  + Wait for response (0.5 sec timeout)
- Proceed with operation

So broadcast the stop regardless, and only wait for a response if it is known that another action was in progress.

Zane Bitter (zaneb) wrote :

This is only for the delete, right? That sounds reasonable. 0.5s sounds a little bit short for a timeout waiting for a response. Also, if there *is* a response, you should wait up to a few seconds for the status to change to not IN_PROGRESS any more, and if that times out repeat from the 'Broadcast "Stop working on stack X"' step.
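
The retry behaviour suggested here might look like the sketch below. The two callables are hypothetical hooks: `broadcast_stop()` is assumed to return True when some engine acknowledged the stop, and `get_status()` to return the stack's current status. The `max_rounds` bound is an added assumption so the loop cannot spin forever; the comment itself does not give one.

```python
import time


def stop_and_wait(broadcast_stop, get_status, status_timeout=3.0,
                  poll_interval=0.1, max_rounds=3):
    """Return True when it is safe to proceed, False if we gave up.

    broadcast_stop() -> bool  (hypothetical) True if an engine acknowledged
    get_status() -> str       (hypothetical) current stack status
    """
    for _ in range(max_rounds):
        if not broadcast_stop():
            # No response: nobody is working on the stack any more.
            return True
        # An engine answered: give it a few seconds to leave IN_PROGRESS.
        deadline = time.monotonic() + status_timeout
        while time.monotonic() < deadline:
            if not get_status().endswith("IN_PROGRESS"):
                return True
            time.sleep(poll_interval)
        # The status did not change in time: repeat from the broadcast.
    return False
```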

OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/42398

Changed in heat:
status: Confirmed → In Progress
Jason Dunsmore (jasondunsmore) wrote :

Thanks for the feedback. Here's the latest plan:

Any time a stack goes into "INPROGRESS", start a stack listener on
topic <stack_name>. Stack listener should just respond affirmatively
to any queries. Stop the stack listener whenever the action is done.

- API sends action to engine (round-robin)
- Send "Working on stack?" message on <stack_name> topic
- IF an engine responded with "Yes":
  + Raise "Invalid state" exception
- ELSE:
  + Proceed with operation

Zane, the above logic seems to make sense for each of the following
scenarios:

DELETE during CREATE
UPDATE during CREATE
DELETE during UPDATE
UPDATE during DELETE
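
A rough sketch of the stack-listener idea, assuming an in-memory `MessageBus` in place of the real RPC layer. `MessageBus`, `InvalidState`, `run_action`, and `kill_engine` are hypothetical names used only for illustration.

```python
class InvalidState(Exception):
    """Raised when a live engine is still working on the stack."""


class MessageBus:
    """Hypothetical in-memory stand-in for per-stack RPC topics."""

    def __init__(self):
        self._listeners = {}  # topic (stack name) -> engine name

    def start_listener(self, topic, engine_name):
        self._listeners[topic] = engine_name

    def stop_listener(self, topic):
        self._listeners.pop(topic, None)

    def kill_engine(self, engine_name):
        # A crashed engine takes its listeners down with it.
        self._listeners = {topic: name
                           for topic, name in self._listeners.items()
                           if name != engine_name}

    def ask_working_on(self, topic):
        # "Working on stack?": answered "yes" only by a live listener.
        return topic in self._listeners


def run_action(bus, stack_name, engine_name, work):
    if bus.ask_working_on(stack_name):
        raise InvalidState(f"Stack {stack_name} has an action in progress")
    # Nobody answered: proceed, listening on the topic for the duration.
    bus.start_listener(stack_name, engine_name)
    try:
        return work()
    finally:
        bus.stop_listener(stack_name)
```

Because the listener lives only as long as the engine process, a stack left in an in-progress state by a crashed engine no longer blocks a later action in this model.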

Zane Bitter (zaneb) wrote :

No issues with DELETE during CREATE or DELETE during UPDATE.

As the current update code stands, or even taking into account foreseeable future plans, I don't think that UPDATE during CREATE will do the Right Thing, and UPDATE during DELETE is almost certainly unsafe.

Jason Dunsmore (jasondunsmore) wrote :

Zane,

I agree, "UPDATE during CREATE" and "UPDATE during DELETE" should not be allowed. With the above logic, those actions would only be allowed if no engine responded. Does that make sense to you? When an engine dies in the middle of an action, the stack's state will remain unknown until another action is attempted.

OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/42398
Committed: http://github.com/openstack/heat/commit/6362a81b56bf7d436ee220b5c16d153ac1c7d4a4
Submitter: Jenkins
Branch: master

commit 6362a81b56bf7d436ee220b5c16d153ac1c7d4a4
Author: Jason Dunsmore <email address hidden>
Date: Fri Aug 16 13:05:38 2013 -0500

    Revert "Implement an "Action in progress" error."

    This reverts commit faf984cbb5e62980e025191450a02d30aae8e02d.

    Multi-engine support is being re-designed to account for the use-case in
    bug #1211276.

    Fixes bug #1211276

    Change-Id: Ief89dd6a9c5014752db8065037dd1dd03efe789e

Changed in heat:
status: In Progress → Fix Committed
Changed in heat:
status: Fix Committed → In Progress
Changed in heat:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/43173

Zane Bitter (zaneb) wrote :

IMO once you've asked Heat to delete a stack, if the heat-engine then dies there's not only no safe way to do an update on it later, there's not even a valid use case for attempting to support that. You already asked for it to be deleted, no do-overs. The only valid operation should be to try to delete it again.

There's clearly a valid use case for wanting to do an update after a create fails, but I am fairly certain that it will not do the Right Thing at the moment. So, unless you have tested it pretty thoroughly and found that it works, I think for now we should prohibit it and add a blueprint to support that later.

Jason Dunsmore (jasondunsmore) wrote :

Zane,

The way this is implemented in https://review.openstack.org/#/c/43173/, if the stack is IN_PROGRESS but the engine doesn't respond confirming that it's still alive, we'll fall back on the current single-engine behavior. Here's a table showing exactly what this patch affects:
http://paste.openstack.org/raw/44891/

In the case of "UPDATE during DELETE", a "State invalid for UPDATE" error would be raised.

Thierry Carrez (ttx)
Changed in heat:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in heat:
milestone: havana-3 → 2013.2