Instances deletes should be asynchronous with host availability

Bug #1626702 reported by Darren Carpenter
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description: When a host goes down and instances on that host are deleted while it is down, the delete request should be tracked in the database so that when the host comes back up it can reconcile the request against the state of the running instance and complete the delete if necessary. The overall goal is for the delete to be processed so that any attached volumes return to an 'available' state instead of being left 'in-use' (attached to a non-existent instance).

This is being experienced while using Cloud Foundry. When the host goes down, Bosh attempts to re-create the instance along with associating any previous known volumes.

This has been tested on the following tag:
10.1.18

Were this not an issue, I would expect the delete request to go through while the host was down, and the system itself to clean up the mess once the host comes back online, freeing the volume for use immediately.

(This is going to be tested in Liberty as well and further notes will be attached to the bug report)

Revision history for this message
Craig Bookwalter (craigbookwalter) wrote :

to help clarify:

bosh is a tool used to deploy Cloud Foundry. It has a feature called the resurrector that watches VMs and, if contact is lost, attempts to recreate them.

Currently, this is what happens:
- a physical host with a given instance goes down
- bosh notices the instance missing, issues a delete request
- if the delete request is issued within 1-2 minutes of the host going offline, the delete request never completes and bosh gives up trying to recreate the instance
- if the delete request is issued after that time, the delete request completes and bosh recreates the instance, but fails to reattach the volume to the new instance because the volume is still attached to the old (deleted) instance.

The fabric should accept the delete request, mark any attached volumes as detached, and return something like an HTTP 204, handling any actual cleanup on the physical nodes asynchronously.
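The flow described above could be sketched roughly as follows. This is an illustrative toy model of the proposed behaviour, not nova code; the class and field names are made up for the example:

```python
# Toy sketch of the proposed async-delete flow: accept the delete
# immediately, flip attached volumes back to 'available' in the DB, and
# queue host-side cleanup to be reconciled when the host returns.
from dataclasses import dataclass, field


@dataclass
class Volume:
    id: str
    status: str = "in-use"


@dataclass
class Instance:
    id: str
    host: str
    volumes: list = field(default_factory=list)


class ComputeAPI:
    def __init__(self):
        # Delete requests to reconcile when the downed host comes back.
        self.pending_cleanup = []

    def delete(self, instance, host_is_up):
        if not host_is_up:
            # "Local delete": detach volumes in the database right away
            # so they can be reused, and remember the cleanup work.
            for vol in instance.volumes:
                vol.status = "available"
            self.pending_cleanup.append(instance.id)
        return 204  # accepted; physical cleanup happens asynchronously


api = ComputeAPI()
vol = Volume("af657eb4")
inst = Instance("edd84425", host="uclacpu102", volumes=[vol])
print(api.delete(inst, host_is_up=False), vol.status)  # 204 available
```

The key point is that the API returns immediately and the volume becomes reusable, with the physical-node cleanup deferred.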

Revision history for this message
Matt Riedemann (mriedem) wrote :

Without logs I'm having a hard time seeing what's blowing up here, or in what version. This would be better if you could provide recreate steps using devstack with a more recent version of nova, since liberty is going to be end of life in less than 2 months.

Also, by 'bosh recreates the instance', do you mean it evacuates the instance from the downed host or creates a new instance using the same image/flavor/volume/etc?

By the way, when the compute host is down during a delete request, the compute API code goes down a 'local delete' path which should detach the volume here:

https://github.com/openstack/nova/blob/43826e458eefc4157b45d8a04422cbdcdec4f7ff/nova/compute/api.py#L1939

In that case the volume should be changed from 'available' to 'in-use'.

There have been known races in the local delete code in the compute API especially with volumes attached but work has gone into fixing that problem in the newton release of nova.
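As a minimal sketch of what that local-delete path has to do for attached volumes (illustrative names only, not nova's actual internals, and assuming a cinder-like volume API):

```python
# Rough sketch of the compute API's "local delete" responsibility when
# the compute host is down: delete the instance record and detach its
# volumes in Cinder without contacting the unreachable host.
class FakeVolumeAPI:
    """Stand-in for the Cinder-facing volume API."""

    def __init__(self):
        self.detached = []

    def detach(self, volume_id):
        # In real deployments this call is what should move the volume
        # from 'in-use' back to 'available'.
        self.detached.append(volume_id)


def local_delete(instance, bdms, volume_api):
    """Delete the instance record and detach its volumes, skipping any
    block-device mapping that has no volume attached."""
    for bdm in bdms:
        if bdm.get("volume_id"):
            volume_api.detach(bdm["volume_id"])
    instance["deleted"] = True
```

If the detach step races or fails silently, you get exactly the symptom reported here: a deleted instance with a volume still marked 'in-use'.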

Changed in nova:
status: New → Incomplete
Revision history for this message
Matt Riedemann (mriedem) wrote :

Sorry, correction in comment 2, the volume should be detached and go from 'in-use' to 'available', I had those state transitions backward.

Revision history for this message
Craig Bookwalter (craigbookwalter) wrote :

we're working on reproducing on a current fabric

'bosh recreate' == deletes the vm, creates new VM with same image/flavor/(usually) IP, attaches any volumes that used to be attached to the old VM

we can also grab OpenStack logs -- which ones are best to grab?

Revision history for this message
Augustina Ragwitz (auggy) wrote :

nova-api would be a good start. If you think volumes could be an issue, then cinder-api as well.

tags: removed: 10.1.18
Revision history for this message
George Nieto (gnieto) wrote :

Below are the nova / cinder logs as requested:
VOLUME ID: af657eb4-bdca-4d2c-8829-606e3a16e598
INSTANCE ID: edd84425-1a8f-4542-97c2-397374bfd521
OPENSTACK RELEASE: JUNO
################################################
################################################

NOVA_API:
2016-10-03 15:22:07.619 41709 WARNING nova.compute.api [req-39abba41-8023-447b-9964-abddcf936d25 None] [instance: edd84425-1a8f-4542-97c2-397374bfd521] instance's host uclacpu102 is down, deleting from database
2016-10-03 15:22:08.688 41709 INFO nova.osapi_compute.wsgi.server [req-39abba41-8023-447b-9964-abddcf936d25 None] 10.81.203.19,10.81.200.11 "DELETE /v2/ceea62f29f424833b4f7418765f9414f/servers/edd84425-1a8f-4542-97c2-397374bfd521 HTTP/1.1" status: 204 len: 198 time: 1.2727830
2016-10-03 15:22:10.667 41716 INFO nova.osapi_compute.wsgi.server [-] 10.81.200.11 "OPTIONS / HTTP/1.0" status: 200 len: 429 time: 0.0009520
2016-10-03 15:22:11.096 41716 INFO nova.osapi_compute.wsgi.server [req-0958a414-fb17-49d1-a87f-59b7c3b8969d None] 10.55.220.105,10.81.200.11 "GET /v2/ HTTP/1.1" status: 200 len: 620 time: 0.0075588
2016-10-03 15:22:11.379 41716 INFO nova.osapi_compute.wsgi.server [req-28323d74-6810-48ff-90eb-a82f19685dd2 None] 10.55.220.105,10.81.200.11 "GET /v2/ceea62f29f424833b4f7418765f9414f/servers?name=edd84425-1a8f-4542-97c2-397374bfd521 HTTP/1.1" status: 200 len: 206 time: 0.0451400
2016-10-03 15:22:12.505 41716 INFO nova.osapi_compute.wsgi.server [req-62ccd0e8-f8aa-4b22-95e5-0baec584b465 None] 10.81.203.19,10.81.200.11 "GET /v2/ceea62f29f424833b4f7418765f9414f/flavors/detail.json HTTP/1.1" status: 200 len: 6866 time: 0.0270731
2016-10-03 15:22:13.078 41709 INFO nova.osapi_compute.wsgi.server [req-8c3a68fd-ca69-4f41-85c1-b645247d1796 None] 10.81.203.19,10.81.200.11 "GET /v2/ceea62f29f424833b4f7418765f9414f/servers/fd27b75d-379c-47d7-8bdd-c0fd1a52fa5b.json HTTP/1.1" status: 200 len: 1522 time: 0.0836358
2016-10-03 15:22:14.258 41713 INFO nova.api.openstack.wsgi [req-8d09882e-d77c-4d19-b42d-bfa8700ea676 None] HTTP exception thrown: Instance could not be found
2016-10-03 15:22:14.260 41713 INFO nova.osapi_compute.wsgi.server [req-8d09882e-d77c-4d19-b42d-bfa8700ea676 None] 10.55.220.105,10.81.200.11 "GET /v2/ceea62f29f424833b4f7418765f9414f/servers/edd84425-1a8f-4542-97c2-397374bfd521 HTTP/1.1" status: 404 len: 286 time: 0.0586329
2016-10-03 15:22:17.179 41717 INFO nova.osapi_compute.wsgi.server [req-ba990ff3-80cb-4246-b987-a03a8baba908 None] 10.55.220.105,10.81.200.11 "GET /v2/ceea62f29f424833b4f7418765f9414f HTTP/1.1" status: 404 len: 259 time: 0.2219639
2016-10-03 15:22:17.430 41716 INFO nova.osapi_compute.wsgi.server [req-6aa74a7d-7e8b-4e8e-ae7d-ab0c26fc7cc1 None] 10.55.220.105,10.81.200.11 "GET /v2/ceea62f29f424833b4f7418765f9414f/servers?name=edd84425-1a8f-4542-97c2-397374bfd521 HTTP/1.1" status: 200 len: 206 time: 0.0584500
2016-10-03 15:22:18.473 41716 INFO nova.osapi_compute.wsgi.server [req-7cfbc38c-6be6-4cd5-ae5b-9c392c4f9bee None] 10.81.203.19,10.81.200.11 "DELETE /v2/ceea62f29f424833b4f7418765f9414f/servers/fd27b75d-379c-47d7-8bdd-c0fd1a52fa5b HTTP/1.1" status: 204 len: 198 time: 0.2282281
2016-10-03 15:22:20.191 41711 INFO nova.osapi_compute.wsgi.serve...

Revision history for this message
melanie witt (melwitt) wrote :

Note that the Nova version in question here is Juno, which has been EOL since 2015-12-07.

I would expect the recreate steps for the current Nova version would be something like:

  1. Create an instance
  2. Kill the nova-compute process
  3. Delete the instance
  4. Check whether volumes are 'in-use' or 'available'

This may no longer be a bug present in the current version of Nova.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired
Revision history for this message
Craig Bookwalter (craigbookwalter) wrote :

We have reproduced this bug in Liberty using the above steps, slightly modified:

1. Create an instance
2. Kill the nova-compute process
=> 3. Wait until the compute shows as down on nova service-list
4. Delete the instance
5. Check whether volumes are 'in-use' or 'available'

The delete request will go through, and the volume will show as still attached to the deleted instance.
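Step 5 above can be automated with a small polling helper. This is a hedged sketch; `get_status` stands in for however you query Cinder for the volume's status (for example via `openstack volume show`):

```python
# Poll the volume's status after the delete and report whether it
# returned to 'available'. `get_status` is a caller-supplied callable so
# the helper stays independent of any particular client library.
import time


def wait_for_available(get_status, timeout=60, interval=5):
    """Return True once get_status() reports 'available'; return False
    after `timeout` seconds, i.e. the volume stayed 'in-use', which is
    the failure described in this report."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_status() == "available":
            return True
        time.sleep(interval)
    return False
```

On a healthy fabric this returns True shortly after the delete; with the bug present, the volume stays attached to the deleted instance and the helper times out.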

Revision history for this message
Craig Bookwalter (craigbookwalter) wrote :

Attaching nova-compute logs for the above

Changed in nova:
status: Expired → New
Revision history for this message
Craig Bookwalter (craigbookwalter) wrote :

I realize Liberty is also end of life; we will reproduce on a later fabric ASAP.

Revision history for this message
Andrey Volkov (avolkov) wrote :

Can't reproduce this behaviour on master, see http://ix.io/nR9. Please provide additional details.

Changed in nova:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired