Instance deletes should be asynchronous with host availability
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Expired | Undecided | Unassigned |
Bug Description
Description: When a host goes down and instances on that host are deleted while it is down, the delete request should be tracked in the database so that when the host comes back up it can scrub the request against the status of the active instance and delete it if necessary. The overall goal is for the delete to be processed so that any attached volumes return to an available state instead of being left in-use, attached to a non-existent instance.
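The reconciliation described above could be sketched roughly as follows. This is an illustrative, in-memory model only; the class and method names are hypothetical and not nova's actual API. A delete issued while the host is down is recorded rather than dropped, and replayed when the host recovers so the volume becomes available again.

```python
# Hypothetical sketch of the proposed behavior: delete requests received
# while a host is down are persisted, then scrubbed against the live
# instances once the host recovers, freeing attached volumes.
# All names here are illustrative, not part of nova.

class ReconcilingCompute:
    def __init__(self):
        self.instances = {}           # instance_id -> {"volumes": [...]}
        self.volumes = {}             # volume_id -> "in-use" | "available"
        self.pending_deletes = set()  # delete requests tracked in the DB
        self.host_up = True

    def boot(self, instance_id, volume_id):
        self.instances[instance_id] = {"volumes": [volume_id]}
        self.volumes[volume_id] = "in-use"

    def delete(self, instance_id):
        if not self.host_up:
            # Host is down: record the request instead of letting it hang.
            self.pending_deletes.add(instance_id)
            return "202 Accepted"
        self._destroy(instance_id)
        return "204 No Content"

    def _destroy(self, instance_id):
        info = self.instances.pop(instance_id, None)
        if info:
            # Release volumes so they are no longer stuck in-use.
            for vol in info["volumes"]:
                self.volumes[vol] = "available"

    def host_recovered(self):
        # Scrub recorded deletes against instances still present.
        self.host_up = True
        for instance_id in list(self.pending_deletes):
            self._destroy(instance_id)
            self.pending_deletes.discard(instance_id)
```

Under this model, a delete issued during an outage returns immediately, the volume stays in-use only until the host returns, and recovery frees it without operator intervention.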
This is being experienced while using Cloud Foundry. When the host goes down, Bosh attempts to re-create the instance and re-associate any previously known volumes.
This has been tested on the following tag:
10.1.18
If this were not an issue, I would expect the delete request to go through while the host was down, and the system itself to clean up the mess once the host came back online, freeing up the volume for use immediately.
(This is going to be tested in Liberty as well and further notes will be attached to the bug report)
To help clarify:
Bosh is a tool used to deploy Cloud Foundry. It has a feature called the resurrector that watches VMs and, if contact is lost, attempts to recreate them.
Currently, this is what happens:
- a physical host with a given instance goes down
- bosh notices the instance missing, issues a delete request
- if the delete request is issued within 1-2 minutes of the host going offline, the delete request never completes and bosh gives up trying to recreate the instance
- if the delete request is issued after that time, the delete request completes and bosh recreates the instance, but fails to reattach the volume to the new instance because the volume is still attached to the old (deleted) instance.
The fabric should accept the delete request, mark any attached volumes as detached, and return something like 204, handling any actual cleanup on physical nodes asynchronously.
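The API-facing half of that suggestion could look something like the sketch below. This is an assumption-laden illustration, not nova code: `handle_delete`, the `db` dict shape, and `cleanup_queue` are all hypothetical. The point is that volumes are marked available in the database immediately, the client gets a 204, and physical teardown on the (possibly offline) host is deferred to a worker.

```python
import queue

# Illustrative sketch (hypothetical names, not nova's API): accept the
# delete, detach volumes in the database right away, return 204, and
# queue the physical cleanup to run asynchronously.

cleanup_queue = queue.Queue()

def handle_delete(instance_id, db):
    # Free attached volumes immediately so they can be reused,
    # even though the host may be unreachable.
    for vol in db["attachments"].pop(instance_id, []):
        db["volumes"][vol] = "available"
    db["instances"][instance_id] = "deleting"
    # Actual teardown on the physical node happens later.
    cleanup_queue.put(instance_id)
    return 204
```

With this flow, Bosh's recreate would succeed: by the time it tries to reattach the volume, the volume is already available rather than attached to the deleted instance.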