Instances stuck in deleting task_state require n-cpu restart to remove
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | New | Undecided | Unassigned |
Bug Description
Bug 1248563 "Instance deletion is prevented when another component locks up" provided a partial fix https:/
When doing Tempest 3rd party CI runs we see instances fail to build (could be a scheduling/resource problem, a timeout, or something else), then get stuck in the deleting task_state and never get cleaned up.
The patch even says:
"Dealing with delete requests that never got executed is not in scope of this change and will be submitted separately."
That's the bug reported here. For example, this is several hours after our Tempest run finished:
http://
There is also some history after patch 55444 merged: we had this revert of a revert https:/
https:/
So there is a lot of half-baked code here, and I haven't been able to get a response from Stan on bug 1248563. Basically it boils down to this: the original change 55444 depended on some later changes working, and those were ultimately reverted because of race conditions that caused failures in the gate.
I would propose that at least for icehouse-rc1 we get the original patch reverted since it's not a complete solution and introduces another bug.
Further, this was the patch to cleanup instances stuck in 'deleting' task_state:
https://review.openstack.org/#/c/55660/
So the workaround here is that you have to restart the compute service, which is not an ideal solution.
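
To make the missing piece concrete, here is a rough sketch of the kind of periodic cleanup that would avoid the restart: find instances that have sat in the deleting task_state longer than some timeout and retry the delete. This is not nova's code and not what change 55660 does. The `Instance` class, `find_stuck_deleting`, `retry_stuck_deletes`, the `delete_fn` callback and the timeout value are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, List, Optional


@dataclass
class Instance:
    # Hypothetical stand-in for an instance record; not nova's real model.
    uuid: str
    task_state: Optional[str]
    updated_at: datetime


def find_stuck_deleting(instances: Iterable[Instance],
                        now: datetime,
                        timeout: timedelta) -> List[Instance]:
    """Return instances whose task_state has been 'deleting' longer than timeout."""
    return [
        inst for inst in instances
        if inst.task_state == "deleting" and now - inst.updated_at > timeout
    ]


def retry_stuck_deletes(instances: Iterable[Instance],
                        now: datetime,
                        timeout: timedelta,
                        delete_fn: Callable[[Instance], None]) -> List[str]:
    """Re-issue the delete for each stuck instance and return their UUIDs.

    delete_fn is an assumed callback that performs the actual delete;
    in a real service it would need to be safe to call more than once.
    """
    retried = []
    for inst in find_stuck_deleting(instances, now, timeout):
        delete_fn(inst)
        retried.append(inst.uuid)
    return retried
```

Run on a timer inside the compute service, something along these lines would pick up delete requests that never got executed, instead of relying on the cleanup that only happens at service startup.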