I've been looking at the following logfile:
http://logs.openstack.org/23/119523/9/check/check-tempest-dsvm-neutron-full/41505f7/logs/screen-n-cpu.txt.gz?level=TRACE#_2014-09-22_19_34_39_385
We can see a hot-unplug operation start for a cinder volume 6f88ab3e-9edd-4fbf-857e-2d6d87f98e39 at 19:32:32:
2014-09-22 19:32:32.963 AUDIT nova.compute.manager [req-840df9bc-aee9-48d9-baa3-d58392954b0d ServerRescueNegativeTestJSON-161189933 ServerRescueNegativeTestJSON-667937194] [instance: 7b20e76a-48a0-4dd4-b2b6-bb3b1056b711] Detach volume 6f88ab3e-9edd-4fbf-857e-2d6d87f98e39 from mountpoint /dev/vdb
At 19:32:52 the periodic task starts to update resources:
2014-09-22 19:32:52.747 31561 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager.update_available_resource run_periodic_tasks /opt/stack/new/nova/nova/openstack/common/periodic_task.py:193
The libvirt get_available_resource method calls _get_instance_disk_info on each running VM.
This method queries the libvirt guest XML to get a list of disks present on the VM, queries the database to get a list of attached cinder volumes, and then filters the known cinder volumes out of the list of disks taken from the XML.
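To make the shape of that concrete, here is a minimal sketch of the pattern, assuming simplified data structures (the helper name and BDM fields are illustrative, not Nova's exact code):

    from lxml import etree

    def get_instance_disk_info(guest_xml, block_device_mapping):
        # Volumes Nova currently believes are attached, from the DB.
        volume_targets = set()
        for bdm in block_device_mapping:
            volume_targets.add(bdm['mount_device'])   # e.g. '/dev/vdb'

        # Disks libvirt currently reports in the guest XML.
        disk_info = []
        doc = etree.fromstring(guest_xml)
        for disk in doc.findall('./devices/disk'):
            target = disk.find('target').get('dev')   # e.g. 'vdb'
            source = disk.find('source')
            path = source.get('file') or source.get('dev')
            # Filter out any disk the DB snapshot identifies as a
            # cinder volume; everything left is treated as a local
            # disk and later probed for its size.
            if '/dev/' + target in volume_targets:
                continue
            disk_info.append(path)
        return disk_info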
Nonetheless at 19:32:53 we see the code try to run blockdev on the iSCSI LUN with serial number 6f88ab3e-9edd-4fbf-857e-2d6d87f98e39:
2014-09-22 19:32:53.025 31561 DEBUG nova.openstack.common.processutils [-] Running cmd (subprocess): sudo nova-rootwrap /etc/nova/rootwrap.conf blockdev --getsize64 /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.2010-10.org.openstack:volume-6f88ab3e-9edd-4fbf-857e-2d6d87f98e39-lun-1 execute /opt/stack/new/nova/nova/openstack/common/processutils.py:161
This path is the Cinder volume we just started to hot-unplug at 19:32:32 (i.e. 20 seconds earlier).
What this says is that from Nova's POV the block device is no longer attached, but from libvirt's XML point of view the block device is still attached. Because the volume is already missing from Nova's DB snapshot, the filter fails to exclude it, and the periodic task goes on to probe it with blockdev as if it were a local disk.
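For reference, the size probe behind the rootwrap invocation in the log boils down to something like this (a hedged sketch; the helper name is illustrative, though nova.utils.execute with run_as_root=True is the mechanism that produces the "sudo nova-rootwrap ..." command line above):

    from nova import utils

    def get_block_device_size(path):
        # Runs: sudo nova-rootwrap /etc/nova/rootwrap.conf \
        #           blockdev --getsize64 <path>
        out, _err = utils.execute('blockdev', '--getsize64', path,
                                  run_as_root=True)
        return int(out.strip())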
At 19:34:39 the blockdev command times out:
2014-09-22 19:34:39.385 31561 ERROR nova.openstack.common.periodic_task [-] Error during ComputeManager.update_available_resource: Unexpected error while running command.
Command: sudo nova-rootwrap /etc/nova/rootwrap.conf blockdev --getsize64 /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.2010-10.org.openstack:volume-6f88ab3e-9edd-4fbf-857e-2d6d87f98e39-lun-1
Exit code: 1
Stdout: u''
Stderr: u'blockdev: cannot open /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.2010-10.org.openstack:volume-6f88ab3e-9edd-4fbf-857e-2d6d87f98e39-lun-1: No such device or address\n'
This indicates that the iSCSI volume has now been removed, presumably because the earlier cinder detach operation has succeeded.
It is conceivable that if the 'blockdev' command tries to access the iSCSI LUN at precisely the wrong moment, it might hang while the iSCSI connection is torn down, rather than getting the 'No such device or address' error.
So to me the root cause of the problem here is the _get_instance_disk_info method. The way it is written, querying the libvirt XML and then filtering out the cinder volumes Nova knows about, is clearly racy. If we can rewrite the method to avoid that race, then we'd in turn avoid running the blockdev command on cinder volumes, and thus avoid the error and the hangs/delays.
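One possible direction, sketched under the assumption that local instance disks are file-backed (<source file=.../>) while cinder volumes appear as block devices (<source dev=.../>), an assumption that would not hold for e.g. LVM-backed local storage, is to classify disks purely from the XML snapshot, so there is no separately fetched, and possibly newer, DB list to race against:

    from lxml import etree

    def get_instance_disk_info(guest_xml):
        disk_info = []
        doc = etree.fromstring(guest_xml)
        for disk in doc.findall('./devices/disk'):
            source = disk.find('source')
            if source is None:
                continue
            path = source.get('file')
            if path is None:
                # <source dev=.../> is a block device, i.e. an attached
                # volume: skip it so blockdev is never run on a LUN that
                # may be mid-detach.
                continue
            disk_info.append(path)
        return disk_info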