Race condition in nova compute during snapshot

Bug #1639914 reported by Srinivas Sakhamuri
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

Creating a snapshot and then immediately deleting the instance seems to cause a race condition. I was able to reproduce it on the latest devstack, installed on 8 November.

It can be reproduced with the following commands:

1. nova boot --flavor m1.large --image 6d4259ce-5873-42cb-8cbe-9873f069c149 testinstance

id | bef22f9b-ade4-48a1-86c4-b9a007897eb3

2. nova image-create bef22f9b-ade4-48a1-86c4-b9a007897eb3 testinstance-snap ; nova delete bef22f9b-ade4-48a1-86c4-b9a007897eb3
Request to delete server bef22f9b-ade4-48a1-86c4-b9a007897eb3 has been accepted.
3. nova image-list doesn't show the snapshot

4. nova list doesn't show the instance

The nova-compute log indicates a race condition while the commands in step 2 are executing:

<182>1 2016-10-28T14:46:41.830208+00:00 hyper1 nova-compute 30056 - [40521 levelname="INFO" component="nova-compute" funcname="nova.compute.manager" request_id="req-e9e4e899-e2a7-4bf8-bdf1-c26f5634cfda" user="51fa0172fbdf495e89132f7f4574e750" tenant="00ead348c5f9475f8940ab29cd767c5e" instance="[instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] " lineno="/usr/lib/python2.7/site-packages/nova/compute/manager.py:2249"] nova.compute.manager Terminating instance
<183>1 2016-10-28T14:46:42.057653+00:00 hyper1 nova-compute 30056 - [40521 levelname="DEBUG" component="nova-compute" funcname="nova.compute.manager" request_id="req-1c4cf749-a6a8-46af-b331-f70dc1e9f364" user="51fa0172fbdf495e89132f7f4574e750" tenant="00ead348c5f9475f8940ab29cd767c5e" instance="[instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] " lineno="/usr/lib/python2.7/site-packages/nova/compute/manager.py:420"] nova.compute.manager Cleaning up image ae9ebf4b-7dd6-4615-816f-c2f3c7c08530 decorated_function /usr/lib/python2.7/site-packages/nova/compute/manager.py:420
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] Traceback (most recent call last):
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 416, in decorated_function
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] *args, **kwargs)
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3038, in snapshot_instance
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] task_states.IMAGE_SNAPSHOT)
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3068, in _snapshot_instance
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] update_task_state)
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 1447, in snapshot
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] guest.save_memory_state()
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 363, in save_memory_state
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] self._domain.managedSave(0)
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 183, in doit
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] result = proxy_call(self._autowrap, f, *args, **kwargs)
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 141, in proxy_call
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] rv = execute(f, *args, **kwargs)
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 122, in execute
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] six.reraise(c, e, tb)
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 80, in tworker
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] rv = meth(*args, **kwargs)
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1397, in managedSave
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] if ret == -1: raise libvirtError ('virDomainManagedSave() failed', dom=self)
30056 TRACE nova.compute.manager [instance: bef22f9b-ade4-48a1-86c4-b9a007897eb3] libvirtError: operation failed: domain is no longer running

Nova compute should make sure the save is completed before attempting to delete the domain.
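
A minimal sketch of the kind of guard being asked for is below. This is illustrative only, not nova code: the helper names (wait_for_snapshot_to_finish, terminate_with_guard, get_task_state, destroy_domain) are hypothetical, and the task-state strings are simply hard-coded here to mirror nova's snapshot-related states.

# Illustrative sketch only (not nova code): wait for an in-flight snapshot to
# finish before tearing down the libvirt domain, so that managedSave() does
# not fail with "domain is no longer running". Helper names are hypothetical.
import time

SNAPSHOT_TASK_STATES = ('image_snapshot', 'image_pending_upload', 'image_uploading')

def wait_for_snapshot_to_finish(get_task_state, timeout=300, interval=2):
    """Block until the instance's task_state leaves the snapshot states."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_task_state() not in SNAPSHOT_TASK_STATES:
            return True
        time.sleep(interval)
    return False

def terminate_with_guard(instance, get_task_state, destroy_domain):
    # Let any in-flight snapshot complete (or time out) before destroying the
    # domain; only then is it safe to stop the guest.
    wait_for_snapshot_to_finish(get_task_state)
    destroy_domain(instance)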

description: updated
Revision history for this message
Srinivas Sakhamuri (srinivas-sakhamuri) wrote :

2016-11-08 11:02:45.770 ERROR root [req-a0f277d3-487d-44b2-99be-8bbe45b5fc4a demo demo] Original exception being dropped: ['Traceback (most recent call last):\n', ' File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 137, in launch\n return self._domain.createWithFlags(flags)\n', ' File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 186, in doit\n result = proxy_call(self._autowrap, f, *args, **kwargs)\n', ' File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 144, in proxy_call\n rv = execute(f, *args, **kwargs)\n', ' File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 125, in execute\n six.reraise(c, e, tb)\n', ' File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker\n rv = meth(*args, **kwargs)\n', ' File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1065, in createWithFlags\n if ret == -1: raise libvirtError (\'virDomainCreateWithFlags() failed\', dom=self)\n', "libvirtError: Domain not found: no domain with matching uuid '1cf5f735-b97d-40f6-8cef-2af3e3056442' (instance-00000002)\n"]
2016-11-08 11:02:45.940 DEBUG nova.compute.manager [req-a0f277d3-487d-44b2-99be-8bbe45b5fc4a demo demo] [instance: 1cf5f735-b97d-40f6-8cef-2af3e3056442] Cleaning up image 0d345c39-131e-4bee-a4d2-9ff66a867b2a from (pid=7272) decorated_function /opt/stack/nova/nova/compute/manager.py:236
2016-11-08 11:02:45.940 TRACE nova.compute.manager [instance: 1cf5f735-b97d-40f6-8cef-2af3e3056442] Traceback (most recent call last):
2016-11-08 11:02:45.940 TRACE nova.compute.manager [instance: 1cf5f735-b97d-40f6-8cef-2af3e3056442] File "/opt/stack/nova/nova/compute/manager.py", line 232, in decorated_function
2016-11-08 11:02:45.940 TRACE nova.compute.manager [instance: 1cf5f735-b97d-40f6-8cef-2af3e3056442] *args, **kwargs)
2016-11-08 11:02:45.940 TRACE nova.compute.manager [instance: 1cf5f735-b97d-40f6-8cef-2af3e3056442] File "/opt/stack/nova/nova/compute/manager.py", line 3051, in snapshot_instance
2016-11-08 11:02:45.940 TRACE nova.compute.manager [instance: 1cf5f735-b97d-40f6-8cef-2af3e3056442] task_states.IMAGE_SNAPSHOT)
2016-11-08 11:02:45.940 TRACE nova.compute.manager [instance: 1cf5f735-b97d-40f6-8cef-2af3e3056442] File "/opt/stack/nova/nova/compute/manager.py", line 3080, in _snapshot_instance
2016-11-08 11:02:45.940 TRACE nova.compute.manager [instance: 1cf5f735-b97d-40f6-8cef-2af3e3056442] update_task_state)
2016-11-08 11:02:45.940 TRACE nova.compute.manager [instance: 1cf5f735-b97d-40f6-8cef-2af3e3056442] File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 1548, in snapshot
2016-11-08 11:02:45.940 TRACE nova.compute.manager [instance: 1cf5f735-b97d-40f6-8cef-2af3e3056442] state, instance)
2016-11-08 11:02:45.940 TRACE nova.compute.manager [instance: 1cf5f735-b97d-40f6-8cef-2af3e3056442] File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 1586, in _snapshot_domain
2016-11-08 11:02:45.940 TRACE nova.compute.manager [instance: 1cf5f735-b97d-40f6-8cef-2af3e3056442] guest = self._create_domain(domain=virt_dom)
2016-11-08 11:02:45.940 TRACE nova.compute.manager [instance: 1cf5f735-b97d-40f6-8cef-2af3e3056442] File "/opt/stack/nova/n...


Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

What package versions are these?

And more importantly, are you able to reproduce this with the latest upstream release, Newton?

Also, please see this for reference: https://wiki.openstack.org/wiki/BugFilingRecommendations

Changed in nova:
status: New → Incomplete
Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Err, I scrolled too fast and missed your "able to reproduce it on the latest devstack, installed on 8 November" comment. Disregard comment #3. (Sorry for the noise there.)

Changed in nova:
status: Incomplete → New
Rajesh Tailor (ratailor)
Changed in nova:
assignee: nobody → Rajesh Tailor (ratailor)
Changed in nova:
status: New → In Progress
Revision history for this message
Rajesh Tailor (ratailor) wrote :

I couldn't reproduce the issue by following the steps mentioned above.

In my case, the image reached the active state and the instance was deleted successfully.

Since it is a race condition, it would be difficult to reproduce.

Changed in nova:
assignee: Rajesh Tailor (ratailor) → nobody
Changed in nova:
status: In Progress → New
Revision history for this message
Andrey Volkov (avolkov) wrote :

I believe this is expected behavior. In the nova code there is a special case: if something goes wrong during the snapshot, the image will be deleted.

https://github.com/openstack/nova/blob/lm_claims/nova/compute/api.py#L2730
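
For context, the special case referred to here is a decorator-style wrapper around the snapshot handler that deletes the just-created glance image whenever the snapshot raises (this is what produces the "Cleaning up image ..." lines in the traces above). A rough sketch of that pattern, not the actual nova source, assuming an image_api object with a delete(context, image_id) method:

import functools
import logging

LOG = logging.getLogger(__name__)

def delete_image_on_error(function):
    """If the wrapped snapshot call raises, delete the half-built image."""
    @functools.wraps(function)
    def decorated_function(self, context, image_id, instance, *args, **kwargs):
        try:
            return function(self, context, image_id, instance, *args, **kwargs)
        except Exception:
            LOG.debug("Cleaning up image %s", image_id)
            try:
                self.image_api.delete(context, image_id)
            except Exception:
                LOG.exception("Error while trying to clean up image %s", image_id)
            raise
    return decorated_function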

Changed in nova:
status: New → Invalid
Revision history for this message
Srinivas Sakhamuri (srinivas-sakhamuri) wrote :

Yes, the "something happening" here is the deletion of the server via the API after confirmation that the image has been created; in fact nova-compute is still working on the snapshot when it receives the delete request, so it deletes the server and aborts the snapshot creation. This scenario was uncovered while testing with rally (rally uses the nova API):

https://github.com/openstack/rally/blob/master/rally/plugins/openstack/scenarios/nova/servers.py#L281-L282

So I don't think this is an invalid test case; there needs to be some guard in compute to handle these conflicting operations.
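
The client-side sequence rally drives is essentially snapshot-then-delete back to back, which is what opens the race window on the compute node. A rough sketch using python-novaclient (the session argument is assumed to be an authenticated keystoneauth1 session; setup omitted):

from novaclient import client as nova_client

def snapshot_then_delete_immediately(sess, server_id):
    """Mimic the rally scenario: request a snapshot, then delete right away."""
    nova = nova_client.Client('2.1', session=sess)
    # The compute node may still be running managedSave()/uploading the image
    # when the delete request arrives -- this is the race window.
    image_id = nova.servers.create_image(server_id, 'testinstance-snap')
    nova.servers.delete(server_id)
    return image_id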

Changed in nova:
status: Invalid → New
Revision history for this message
Matt Riedemann (mriedem) wrote :

Srinivas - the compute API never blocks a delete request unless the server is locked. So by design you can attempt to delete a server whenever it's unlocked (and if you're an admin you can bypass the locked state too). We aren't going to put a conditional on the delete API such that you can't delete the server while it's being snapshotted.

At this point I'm not sure what you're looking for as far as a bug or fix. As noted by Andrey, the compute manager will clean up the snapshot image in glance if the server was deleted during the snapshot:

https://github.com/openstack/nova/blame/15.0.0/nova/compute/manager.py#L3141

https://github.com/openstack/nova/blob/15.0.0/nova/compute/manager.py#L222
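
Given that, the practical way to keep the snapshot is to handle it client-side: wait for the image to go active before issuing the delete. A hedged sketch using python-novaclient and python-glanceclient (an authenticated keystoneauth1 session is assumed; error handling kept minimal):

import time

from glanceclient import client as glance_client
from novaclient import client as nova_client

def snapshot_then_delete_safely(sess, server_id, snap_name, timeout=600, interval=5):
    """Snapshot a server and delete it only once the image is active."""
    nova = nova_client.Client('2.1', session=sess)
    glance = glance_client.Client('2', session=sess)

    image_id = nova.servers.create_image(server_id, snap_name)
    deadline = time.time() + timeout
    while glance.images.get(image_id)['status'] != 'active':
        if time.time() > deadline:
            raise RuntimeError('timed out waiting for image %s' % image_id)
        time.sleep(interval)
    # Only now has the compute node finished uploading, so deleting the
    # server can no longer abort the snapshot.
    nova.servers.delete(server_id)
    return image_id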

Changed in nova:
status: New → Won't Fix