Deleting instance files sometimes fails on NFS

Bug #1040110 reported by Lothar Gesslein
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Michael Still
Milestone: 2013.2

Bug Description

Sometimes, on instance deletion, the following error is logged:

2012-08-22 11:16:10 AUDIT nova.compute.manager [req-ba764e1d-36e0-4479-988e-7dc15fe47c29 bdb0bf5eb88b47f7a8fa8de5123e48ee d4bf9021295a4573aa10ca4900211856] [instance: 9ea4c222-e36d-468a-85b3-4aafd24eff9a] Terminating instance
2012-08-22 11:16:14 WARNING nova.virt.libvirt.connection [req-ba764e1d-36e0-4479-988e-7dc15fe47c29 bdb0bf5eb88b47f7a8fa8de5123e48ee d4bf9021295a4573aa10ca4900211856] [instance: 9ea4c222-e36d-468a-85b3-4aafd24eff9a] Error from libvirt during saved instance removal. Code=3 Error=this function is not supported by the connection driver: virDomainHasManagedSaveImage
2012-08-22 11:16:14 INFO nova.virt.libvirt.connection [req-ba764e1d-36e0-4479-988e-7dc15fe47c29 bdb0bf5eb88b47f7a8fa8de5123e48ee d4bf9021295a4573aa10ca4900211856] [instance: 9ea4c222-e36d-468a-85b3-4aafd24eff9a] Deleting instance files /var/lib/nova/instances/f9226d17-1bfe-4819-b440-939e7635ed50/instance-0000000c
2012-08-22 11:16:14 ERROR nova.rpc.amqp [req-ba764e1d-36e0-4479-988e-7dc15fe47c29 bdb0bf5eb88b47f7a8fa8de5123e48ee d4bf9021295a4573aa10ca4900211856] Exception during message handling
2012-08-22 11:16:14 TRACE nova.rpc.amqp Traceback (most recent call last):
2012-08-22 11:16:14 TRACE nova.rpc.amqp File "/usr/lib64/python2.6/site-packages/nova/rpc/amqp.py", line 253, in _process_data
2012-08-22 11:16:14 TRACE nova.rpc.amqp rval = node_func(context=ctxt, **node_args)
2012-08-22 11:16:14 TRACE nova.rpc.amqp File "/usr/lib64/python2.6/site-packages/nova/exception.py", line 114, in wrapped
2012-08-22 11:16:14 TRACE nova.rpc.amqp return f(*args, **kw)
2012-08-22 11:16:14 TRACE nova.rpc.amqp File "/usr/lib64/python2.6/site-packages/nova/compute/manager.py", line 162, in decorated_function
2012-08-22 11:16:14 TRACE nova.rpc.amqp function(self, context, instance_uuid, *args, **kwargs)
2012-08-22 11:16:14 TRACE nova.rpc.amqp File "/usr/lib64/python2.6/site-packages/nova/compute/manager.py", line 186, in decorated_function
2012-08-22 11:16:14 TRACE nova.rpc.amqp sys.exc_info())
2012-08-22 11:16:14 TRACE nova.rpc.amqp File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
2012-08-22 11:16:14 TRACE nova.rpc.amqp self.gen.next()
2012-08-22 11:16:14 TRACE nova.rpc.amqp File "/usr/lib64/python2.6/site-packages/nova/compute/manager.py", line 180, in decorated_function
2012-08-22 11:16:14 TRACE nova.rpc.amqp return function(self, context, instance_uuid, *args, **kwargs)
2012-08-22 11:16:14 TRACE nova.rpc.amqp File "/usr/lib64/python2.6/site-packages/nova/compute/manager.py", line 769, in terminate_instance
2012-08-22 11:16:14 TRACE nova.rpc.amqp do_terminate_instance()
2012-08-22 11:16:14 TRACE nova.rpc.amqp File "/usr/lib64/python2.6/site-packages/nova/utils.py", line 945, in inner
2012-08-22 11:16:14 TRACE nova.rpc.amqp retval = f(*args, **kwargs)
2012-08-22 11:16:14 TRACE nova.rpc.amqp File "/usr/lib64/python2.6/site-packages/nova/compute/manager.py", line 762, in do_terminate_instance
2012-08-22 11:16:14 TRACE nova.rpc.amqp self._delete_instance(context, instance)
2012-08-22 11:16:14 TRACE nova.rpc.amqp File "/usr/lib64/python2.6/site-packages/nova/compute/manager.py", line 740, in _delete_instance
2012-08-22 11:16:14 TRACE nova.rpc.amqp self._shutdown_instance(context, instance, 'Terminating')
2012-08-22 11:16:14 TRACE nova.rpc.amqp File "/usr/lib64/python2.6/site-packages/nova/compute/manager.py", line 709, in _shutdown_instance
2012-08-22 11:16:14 TRACE nova.rpc.amqp block_device_info)
2012-08-22 11:16:14 TRACE nova.rpc.amqp File "/usr/lib64/python2.6/site-packages/nova/virt/libvirt/connection.py", line 515, in destroy
2012-08-22 11:16:14 TRACE nova.rpc.amqp cleanup=True)
2012-08-22 11:16:14 TRACE nova.rpc.amqp File "/usr/lib64/python2.6/site-packages/nova/virt/libvirt/connection.py", line 509, in _destroy
2012-08-22 11:16:14 TRACE nova.rpc.amqp self._cleanup(instance)
2012-08-22 11:16:14 TRACE nova.rpc.amqp File "/usr/lib64/python2.6/site-packages/nova/virt/libvirt/connection.py", line 524, in _cleanup
2012-08-22 11:16:14 TRACE nova.rpc.amqp shutil.rmtree(target)
2012-08-22 11:16:14 TRACE nova.rpc.amqp File "/usr/lib64/python2.6/shutil.py", line 221, in rmtree
2012-08-22 11:16:14 TRACE nova.rpc.amqp onerror(os.rmdir, path, sys.exc_info())
2012-08-22 11:16:14 TRACE nova.rpc.amqp File "/usr/lib64/python2.6/shutil.py", line 219, in rmtree
2012-08-22 11:16:14 TRACE nova.rpc.amqp os.rmdir(path)
2012-08-22 11:16:14 TRACE nova.rpc.amqp OSError: [Errno 39] Directory not empty: '/var/lib/nova/instances/f9226d17-1bfe-4819-b440-939e7635ed50/instance-0000000c'
2012-08-22 11:16:14 TRACE nova.rpc.amqp
2012-08-22 11:16:15 INFO nova.virt.libvirt.connection [-] [instance: 9ea4c222-e36d-468a-85b3-4aafd24eff9a] Instance destroyed successfully.

The instance stays in the "deleting" task_state, and another API call then deletes it successfully. We store instance files on an NFS mount, and shutil.rmtree has a known problem on NFS (Google is full of reports) when the files to be deleted are still held open by someone: the NFS client renames the open file to .nfs0000... and automatically deletes it once it is no longer open. Since there are indeed .nfs000... files in the directory, rmdir fails.
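
For illustration, here is a minimal sketch of that failure mode. It is hypothetical: the /mnt/nfs path and file names are made up, and it assumes it is run against an NFS mount with silly-rename behaviour; on a local filesystem the delete simply succeeds.

import os
import shutil

# Illustrative paths only; /mnt/nfs is assumed to be an NFS mount.
instance_dir = "/mnt/nfs/instance-example"
if not os.path.isdir(instance_dir):
    os.makedirs(instance_dir)

disk_path = os.path.join(instance_dir, "disk")
with open(disk_path, "w") as f:
    f.write("fake disk image")

# Keep a handle open, standing in for the hypervisor (or a slow NFS unlink ack).
held_open = open(disk_path, "r")
try:
    # The NFS client silly-renames the still-open file to .nfs0000..., so the
    # directory is not empty when rmtree() reaches os.rmdir() and it raises
    # OSError: [Errno 39] Directory not empty.
    shutil.rmtree(instance_dir)
except OSError as exc:
    print("rmtree failed: %s" % exc)
    print("leftover entries: %s" % os.listdir(instance_dir))
finally:
    held_open.close()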

The following is shortened output of a repeating directory listing, where you can see how the image files are first renamed and then disappear.

drwxr-xr-x 2 root root 4096 Aug 21 10:44 .
drwxr-xr-x 167 root root 28672 Aug 22 10:35 ..
-rw-rw---- 1 root root 0 Aug 21 10:43 console.log
-rw-r--r-- 1 root root 23622320128 Aug 22 11:16 disk
-rw-r--r-- 1 root root 34359738368 Aug 21 10:44 disk.swap
-rw-r--r-- 1 root root 1689 Aug 21 10:43 libvirt.xml
total 6074632
drwxr-xr-x 2 root root 4096 Aug 21 10:44 .
drwxr-xr-x 167 root root 28672 Aug 22 10:35 ..
-rw-rw---- 1 root root 0 Aug 21 10:43 console.log
-rw-r--r-- 1 root root 23622320128 Aug 22 11:16 disk
-rw-r--r-- 1 root root 34359738368 Aug 21 10:44 disk.swap
-rw-r--r-- 1 root root 1689 Aug 21 10:43 libvirt.xml
total 6074628
drwxr-xr-x 2 root root 4096 Aug 22 11:16 .
drwxr-xr-x 167 root root 28672 Aug 22 10:35 ..
-rw-r--r-- 1 root root 23622320128 Aug 22 11:16 .nfs0000000000004f8400000001
-rw-r--r-- 1 root root 34359738368 Aug 21 10:44 .nfs0000000000004f8600000002
total 6074628
drwxr-xr-x 2 root root 4096 Aug 22 11:16 .
drwxr-xr-x 167 root root 28672 Aug 22 10:35 ..
-rw-r--r-- 1 root root 23622320128 Aug 22 11:16 .nfs0000000000004f8400000001
-rw-r--r-- 1 root root 34359738368 Aug 21 10:44 .nfs0000000000004f8600000002
total 6074628
drwxr-xr-x 2 root root 4096 Aug 22 11:16 .
drwxr-xr-x 167 root root 28672 Aug 22 10:35 ..
-rw-r--r-- 1 root root 34359738368 Aug 21 10:44 .nfs0000000000004f8600000002
total 52
drwxr-xr-x 2 root root 4096 Aug 22 11:16 .
drwxr-xr-x 167 root root 28672 Aug 22 10:35 ..
-rw-r--r-- 1 root root 34359738368 Aug 21 10:44 .nfs0000000000004f8600000002
total 52
drwxr-xr-x 2 root root 4096 Aug 22 11:16 .
drwxr-xr-x 167 root root 28672 Aug 22 10:35 ..
total 36
drwxr-xr-x 2 root root 4096 Aug 22 11:16 .
drwxr-xr-x 167 root root 28672 Aug 22 10:35 ..
total 36

Tags: ops
Mark McLoughlin (markmc) wrote :

These .nfs files are apparently created when you delete a file on NFS that is still held open by a process:

http://librelist.com/browser//libgit2/2012/4/3/file-locking-in-libgit2/#7a19b8c37c5235ddfd60ff8a137c94ef

So, we need to figure out why the file is still open.

This is interesting:

  Error from libvirt during saved instance removal. Code=3 Error=this function is not supported by the connection driver: virDomainHasManagedSaveImage

and could indicate another problem.

What exact versions of OpenStack components are you using? What version of libvirt and what distro?

Could you enable debugging (debug=true in nova.conf) and attach the full compute log?

Changed in nova:
status: New → Incomplete
Lothar Gesslein (lgesslein) wrote :

We are running nova 2012.1.2, libvirt 0.9.6, the hypervisor is Xen (which I think is the only cause of the "function is not supported" message), and all of this is happening on SLES 11 SP2. lsof +D does not list open files on the instance storage path while the VM is running; I was not yet quick enough to capture which process has open file handles while terminating. Here is the full debug log:

2012-08-27 10:14:07 DEBUG nova.compute.manager [req-572dadf2-15be-45f4-87db-2c3c20f4f36f None None] [instance: 4783ef04-e521-451b-9ace-743c5ecc326c] Checking state from (pid=21989) _get_power_state /usr/lib64/python2.6/site-packages/nova/compute/manager.py:282
2012-08-27 10:14:07 DEBUG nova.virt.libvirt.connection [req-572dadf2-15be-45f4-87db-2c3c20f4f36f None None] Connecting to libvirt: xen:/// from (pid=21989) _get_connection /usr/lib64/python2.6/site-packages/nova/virt/libvirt/connection.py:296
2012-08-27 10:14:13 DEBUG nova.compute.manager [req-572dadf2-15be-45f4-87db-2c3c20f4f36f None None] [instance: 4783ef04-e521-451b-9ace-743c5ecc326c] Current state is 2, state in DB is 2. from (pid=21989) init_host /usr/lib64/python2.6/site-packages/nova/compute/manager.py:250
2012-08-27 10:14:13 DEBUG nova.compute.manager [req-572dadf2-15be-45f4-87db-2c3c20f4f36f None None] [instance: 6fdb2052-0c07-4371-a5e9-900ea9c783dc] Checking state from (pid=21989) _get_power_state /usr/lib64/python2.6/site-packages/nova/compute/manager.py:282
2012-08-27 10:14:13 DEBUG nova.compute.manager [req-572dadf2-15be-45f4-87db-2c3c20f4f36f None None] [instance: 6fdb2052-0c07-4371-a5e9-900ea9c783dc] Current state is 2, state in DB is 1. from (pid=21989) init_host /usr/lib64/python2.6/site-packages/nova/compute/manager.py:250
2012-08-27 10:14:13 DEBUG nova.compute.manager [req-572dadf2-15be-45f4-87db-2c3c20f4f36f None None] [instance: bf1fa3e8-cc82-444e-9205-7e65834aa540] Checking state from (pid=21989) _get_power_state /usr/lib64/python2.6/site-packages/nova/compute/manager.py:282
2012-08-27 10:14:13 DEBUG nova.compute.manager [req-572dadf2-15be-45f4-87db-2c3c20f4f36f None None] [instance: bf1fa3e8-cc82-444e-9205-7e65834aa540] Current state is 2, state in DB is 1. from (pid=21989) init_host /usr/lib64/python2.6/site-packages/nova/compute/manager.py:250
2012-08-27 10:14:13 DEBUG nova.compute.manager [req-572dadf2-15be-45f4-87db-2c3c20f4f36f None None] [instance: f98f88ad-6f7a-4109-adb8-c5376f6d639d] Checking state from (pid=21989) _get_power_state /usr/lib64/python2.6/site-packages/nova/compute/manager.py:282
2012-08-27 10:14:13 DEBUG nova.compute.manager [req-572dadf2-15be-45f4-87db-2c3c20f4f36f None None] [instance: f98f88ad-6f7a-4109-adb8-c5376f6d639d] Current state is 2, state in DB is 1. from (pid=21989) init_host /usr/lib64/python2.6/site-packages/nova/compute/manager.py:250
2012-08-27 10:14:13 DEBUG nova.compute.manager [req-572dadf2-15be-45f4-87db-2c3c20f4f36f None None] [instance: 2d814f2b-5739-408b-b85d-1366853700a6] Checking state from (pid=21989) _get_power_state /usr/lib64/python2.6/site-packages/nova/compute/manager.py:282
2012-08-27 10:14:14 DEBUG nova.compute.manager [req-572dadf2-15be-45f4-87db-2c3c20f4f36f None None] [instance: 2d814f2b-5739-408b-b85d-13668537...

Lothar Gesslein (lgesslein) wrote :

From reading the kernel NFS client code, I gather that NFS renames not only open files to .nfs000.., but every file it is about to unlink. Files that are not held open disappear so quickly that it normally does not matter and you never see the renamed file. I'm guessing that in our case the files are not actually open, but NFS just takes a while to send back an ack for the RPC call.

Lothar Gesslein (lgesslein) wrote :

This is the change I have tested now; it seems to work around the issue reliably so far:

--- connection.py.orig 2012-08-27 12:10:47.000000000 +0200
+++ /usr/lib64/python2.6/site-packages/nova/virt/libvirt/connection.py 2012-08-27 13:02:31.000000000 +0200
@@ -215,6 +215,13 @@
 def _get_eph_disk(ephemeral):
     return 'disk.eph' + str(ephemeral['num'])

+def _rmtree_wait_nfs(function, path, excinfo):
+    LOG.warning(_("Could not delete %s, exception: %s\n"), path, excinfo)
+    while any(".nfs" in element for element in os.listdir(path)):
+        LOG.warning(_(".nfs files found, waiting 2 seconds"))
+        time.sleep(2)
+    shutil.rmtree(path)
+

 class LibvirtConnection(driver.ComputeDriver):

@@ -521,7 +528,7 @@
         if FLAGS.libvirt_type == 'lxc':
             disk.destroy_container(self.container)
         if os.path.exists(target):
-            shutil.rmtree(target)
+            shutil.rmtree(target, onerror=_rmtree_wait_nfs)

     def get_volume_connector(self, instance):
         if not self._initiator:
@@ -538,7 +545,7 @@
         target = os.path.join(self.get_instances_path(instance),
                               instance['name'] + "_resize")
         if os.path.exists(target):
-            shutil.rmtree(target)
+            shutil.rmtree(target, onerror=_rmtree_wait_nfs)

     def volume_driver_method(self, method_name, connection_info,
                              *args, **kwargs):

Mark McLoughlin (markmc) wrote :

Ok, thanks for all the info. Hopefully we can figure out something better than the workaround you're using.

Changed in nova:
status: Incomplete → Confirmed
importance: Undecided → Medium
Michael Still (mikal) wrote :

I think we could resolve this by adding a periodic task that retries deletes, with each file that isn't deleted added to a queue for that periodic task to process. However, it's way too late in grizzly to implement something like that, so I'm going to assign this bug to myself so I remember to revisit it in Havana.
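
For context, a rough, hypothetical sketch of that idea might look like the following. It is illustrative only, not the change that eventually merged, and names such as delete_instance_files and retry_pending_deletes are made up.

import os
import shutil
import threading

# Paths whose deletion failed (e.g. because .nfs* files were still present)
# are queued here and retried by a periodic task instead of failing terminate.
_pending_deletes = set()
_lock = threading.Lock()


def delete_instance_files(path):
    """Try to remove the instance directory; queue it for retry on failure."""
    try:
        shutil.rmtree(path)
    except OSError:
        with _lock:
            _pending_deletes.add(path)


def retry_pending_deletes():
    """Periodic task: retry every queued deletion, dropping ones that succeed."""
    with _lock:
        paths = list(_pending_deletes)
    for path in paths:
        if not os.path.exists(path):
            with _lock:
                _pending_deletes.discard(path)
            continue
        try:
            shutil.rmtree(path)
        except OSError:
            continue  # still busy (e.g. .nfs* files present), try next period
        with _lock:
            _pending_deletes.discard(path)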

Changed in nova:
assignee: nobody → Michael Still (mikalstill)
Michael Still (mikal)
Changed in nova:
milestone: none → havana-1
Michael Still (mikal)
tags: added: ops
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/26315

Changed in nova:
status: Confirmed → In Progress
Thierry Carrez (ttx)
Changed in nova:
milestone: havana-1 → havana-2
Changed in nova:
milestone: havana-2 → havana-3
Michael Still (mikal) wrote :

Sorry, this bug was fixed by I895869e18f92ed57bd2f8c1683cf9a398d88b680, but that wasn't linked properly to this bug.

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: havana-3 → 2013.2
Yang Yu (yuyangbj) wrote :

It looks like this bug is not fixed by change I895869e18f92ed57bd2f8c1683cf9a398d88b680. There is still an exception thrown when using an NFS backend, and there is also a bug reported by another stacker: https://bugs.launchpad.net/nova/+bug/1248019
