Stack stuck in DELETE_FAILED due to resource deleted outside of Heat

Bug #1700830 reported by Ben Nemec
This bug affects 1 person
Affects: OpenStack Heat
Status: Fix Released
Importance: Medium
Assigned to: Zane Bitter

Bug Description

I will start off by saying I realize deleting Heat-managed resources outside of Heat is not a good thing to be doing. However, it does happen and it used to be okay: the resource was already gone, so when you deleted the stack it was basically a no-op. Now the stack is stuck in the following state:

| stack_status        | DELETE_FAILED |
| stack_status_reason | Resource DELETE failed: NotFound: resources.baremetal_env.resources.openstack_baremetal_servers.resources[0].resources.baremetal_server: Unable to find port with name or id '3fbd424e-3d62-4ea8-9233-3a6beb2c4cdc' |

That port was indeed deleted by an external script, so the error is not incorrect. It does cause a problem though in that now I can't do anything with the stack. Delete fails as above, and trying to update to recreate the missing resource isn't allowed either after the deletion has started.

I hacked the code a bit to expose the underlying exception, and it looks like this is happening in the translation framework. I'll attach the traceback to the bug.

Ben Nemec (bnemec) wrote :

Oh, I forgot to mention this is on a devstack install from a few weeks ago. Specifically, commit 93a446a9650c9e7e35a73353840694ae01e73280. It doesn't look to me like anything has changed since then that would affect this, but I could be wrong.

Changed in heat:
assignee: nobody → huangtianhua (huangtianhua)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/480471

Changed in heat:
status: New → In Progress
Rico Lin (rico-lin)
Changed in heat:
importance: Undecided → Medium
milestone: none → pike-3
huangtianhua (huangtianhua) wrote :

@Ben Nemec,
would you please provide the template and more traceback logs?

Rico Lin (rico-lin)
Changed in heat:
milestone: pike-3 → pike-rc1
Zane Bitter (zaneb) wrote :

> However, it does happen and it used to be okay

It's not a thing that is broadly OK or not OK. Getting a NotFound exception when you're trying to delete a resource should be considered success, but it's up to every resource type to correctly catch and handle that. If it's not then that is definitely a bug. It would be helpful to know exactly which resource type is failing and see the traceback - the traceback from the parent stack isn't all that useful.
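To illustrate the point about treating NotFound on delete as success, here is a minimal sketch. It is not Heat's actual resource code: `NotFound`, `FakePortClient`, and `handle_delete` are hypothetical stand-ins invented for this example.

```python
class NotFound(Exception):
    """Stand-in for a client library's 'resource does not exist' error."""


class FakePortClient:
    """Hypothetical in-memory client; not a real Neutron API."""

    def __init__(self, ports):
        self.ports = set(ports)

    def delete_port(self, port_id):
        if port_id not in self.ports:
            raise NotFound(f"Unable to find port with name or id '{port_id}'")
        self.ports.remove(port_id)


def handle_delete(client, port_id):
    """Delete a port, treating 'already gone' as a successful delete."""
    if port_id is None:
        # The resource was never created, so there is nothing to remove.
        return "nothing to delete"
    try:
        client.delete_port(port_id)
    except NotFound:
        # The goal of a delete is that the resource no longer exists,
        # so an externally deleted resource counts as success.
        return "already deleted"
    return "deleted"
```

With this shape, deleting a port that an external script already removed ends in a success state rather than propagating the error up to the stack.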

Changed in heat:
assignee: huangtianhua (huangtianhua) → Zane Bitter (zaneb)
Ben Nemec (bnemec) wrote :

Sorry for the delay getting back to this. I've been meaning to dig further but I keep getting distracted (and then I was on PTO for a week).

I've attached the heat-engine logs from an attempt to delete the stack. I still haven't really tried to track down the problem, but maybe they'll provide a clue to someone else. I'll try to do some more investigation myself and see if I can get more debug information.

Ben Nemec (bnemec) wrote :

And after adding some debug log statements, the stack deleted successfully. o.O

I also tested with the proposed patch and on a fresh stack it deleted correctly even though the port had been deleted externally again.

So unless this is something you want to pursue I guess I'm okay with leaving it at that. If someone does have a stack that got into the bad state, they can probably clean it up by just retrying the delete a few times, and stacks with this patch applied should never have the problem in the first place.
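The "retry the delete a few times" workaround could be scripted roughly as follows. This is only a sketch: `delete_once` is a hypothetical callable that issues one delete request and returns the resulting stack status string, and nothing here guarantees that retrying will clear a given stuck stack.

```python
import time


def retry_delete(delete_once, attempts=3, delay=0.0):
    """Call delete_once() until it reports DELETE_COMPLETE or attempts run out.

    Returns a (attempts_used, last_status) tuple.
    """
    status = None
    for attempt in range(1, attempts + 1):
        status = delete_once()
        if status == "DELETE_COMPLETE":
            return attempt, status
        # Pause before retrying so the engine has time to settle.
        time.sleep(delay)
    return attempts, status
```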

Zane Bitter (zaneb) wrote :

OK I was wrong, it looks like that is indeed the traceback from the stack we're interested in, and that's all we get.

Indications from the initial issue are that we're failing to catch errors in properties translation (in particular, we're failing to catch *Neutron* NotFound errors, possibly because we're only looking for *Nova* NotFound errors).
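The suspected failure mode can be sketched like this. The exception classes and function names are hypothetical stand-ins rather than Heat's translation code or the real client exceptions; the sketch only shows how catching one service's NotFound lets another service's NotFound escape.

```python
class NovaNotFound(Exception):
    """Stand-in for a compute client's not-found error."""


class NeutronNotFound(Exception):
    """Stand-in for a network client's not-found error."""


def resolve_port(port_id, known_ports):
    """Hypothetical lookup performed while translating a property."""
    if port_id not in known_ports:
        raise NeutronNotFound(
            f"Unable to find port with name or id '{port_id}'")
    return known_ports[port_id]


def translate_narrow(port_id, known_ports):
    """Only tolerates the compute error, so the network error propagates."""
    try:
        return resolve_port(port_id, known_ports)
    except NovaNotFound:
        return None


def translate_tolerant(port_id, known_ports,
                       ignorable=(NovaNotFound, NeutronNotFound)):
    """Tolerates not-found errors from every client listed in `ignorable`."""
    try:
        return resolve_port(port_id, known_ports)
    except ignorable:
        return None
```

In the narrow version, a port deleted out of band raises during translation, before any delete handler gets a chance to treat the missing resource as already deleted.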

OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/480471
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=fecd7a11a8f4ceb43ce14bdc91dc86725a28a478
Submitter: Jenkins
Branch: master

commit fecd7a11a8f4ceb43ce14bdc91dc86725a28a478
Author: huangtianhua <email address hidden>
Date: Wed Jul 5 17:14:27 2017 +0800

    Do not disassociate floating ip again

    The floating ips will be detached from server when delete
    the server. So we don't have to detach those floating ips
    explicitly before calling server delete.

    Change-Id: I7f85d7f12c58872d790ebfe565931985099ec846
    Partial-Bug: #1700830

Rabi Mishra (rabi)
Changed in heat:
milestone: pike-rc1 → pike-rc2
Rico Lin (rico-lin)
Changed in heat:
milestone: pike-rc2 → queens-1
Rico Lin (rico-lin)
Changed in heat:
milestone: queens-1 → queens-2
Zane Bitter (zaneb) wrote :

I think this was fixed by https://review.openstack.org/480471

Changed in heat:
milestone: queens-2 → queens-1
status: In Progress → Fix Released