reboot of a nova instance appears to leave attached volumes in a bad state

Bug #1440201 reported by Jay Bryant
This bug affects 3 people
Affects                    Status     Importance  Assigned to        Milestone
Cinder                     Confirmed  Medium      Unassigned
OpenStack Compute (nova)   Confirmed  Undecided   Mitsuhiro Tanino

Bug Description

Steps to recreate:

1) attach a volume to a VM
2) reboot VM
3) detach the volume from the VM; right after the volume status changes from "In-use" to "available", attach the volume again. The error shown in the logs below occurs and the volume can no longer be attached to the VM. Note that if, in step 3, you detach the volume from the VM, wait 3 minutes or so, and then attach the volume again, it attaches successfully.

This is being seen in Kilo using LVM/iSCSI.

Looking at the logs, it appears that something happens during the reboot that leaves the iSCSI volume in a bad state. Right after the reboot I see:

2015-04-03 14:42:33.319 19415 INFO nova.compute.manager [req-4995e5ea-c845-4b5b-a6d4-8a214e0ea87f 157e5fddcca340a2a9ec6cd5a1216b4f df004f3072784994870e46ee997ea70f - - -] [instance: 0235ba32-323d-4f59-91ea-2f2f7ef2bd04] Detach volume 3cc769bb-d585-4812-a8fd-2888fedda58d from mountpoint /dev/vdb
2015-04-03 14:42:56.268 19415 INFO nova.compute.manager [req-0037930a-9d20-4b8c-a150-74b0b4411530 157e5fddcca340a2a9ec6cd5a1216b4f df004f3072784994870e46ee997ea70f - - -] [instance: 0235ba32-323d-4f59-91ea-2f2f7ef2bd04] Attaching volume 3cc769bb-d585-4812-a8fd-2888fedda58d to /dev/vdb
2015-04-03 14:43:01.647 19415 ERROR nova.virt.libvirt.driver [req-0037930a-9d20-4b8c-a150-74b0b4411530 157e5fddcca340a2a9ec6cd5a1216b4f df004f3072784994870e46ee997ea70f - - -] [instance: 0235ba32-323d-4f59-91ea-2f2f7ef2bd04] Failed to attach volume at mountpoint: /dev/vdb

After this I start seeing the following in the logs:

2015-04-03 14:43:03.157 19415 INFO nova.scheduler.client.report [req-0037930a-9d20-4b8c-a150-74b0b4411530 157e5fddcca340a2a9ec6cd5a1216b4f df004f3072784994870e46ee997ea70f - - -] Compute_service record updated for ('devo-n02-kvm.rch.kstart.ibm.com', 'devo-n02-kvm.rch.kstart.ibm.com')
2015-04-03 14:43:03.638 19415 ERROR oslo_messaging.rpc.dispatcher [req-0037930a-9d20-4b8c-a150-74b0b4411530 157e5fddcca340a2a9ec6cd5a1216b4f df004f3072784994870e46ee997ea70f - - -] Exception during message handling: The supplied device (vdb) is busy.
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher executor_callback))
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher executor_callback)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 130, in _do_dispatch
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher result = func(ctxt, **new_args)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 430, in decorated_function
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/exception.py", line 88, in wrapped
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher payload)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 82, in __exit__
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/exception.py", line 71, in wrapped
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher return f(self, context, *args, **kw)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 314, in decorated_function
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher LOG.warning(msg, e, instance_uuid=instance_uuid)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 82, in __exit__
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 290, in decorated_function
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 342, in decorated_function
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher kwargs['instance'], e, sys.exc_info())
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 82, in __exit__
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 330, in decorated_function
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4719, in attach_volume
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher do_attach_volume(context, instance, driver_bdm)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 431, in inner
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher return f(*args, **kwargs)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 82, in __exit__
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4714, in do_attach_volume
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher return self._attach_volume(context, instance, driver_bdm)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4737, in _attach_volume
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher self.volume_api.unreserve_volume(context, bdm.volume_id)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 82, in __exit__
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4729, in _attach_volume
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher do_check_attach=False, do_driver_attach=True)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/virt/block_device.py", line 48, in wrapped
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher ret_val = method(obj, context, *args, **kwargs)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/virt/block_device.py", line 258, in attach
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher connector)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 82, in __exit__
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/virt/block_device.py", line 249, in attach
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher device_type=self['device_type'], encryption=encryption)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 1062, in attach_volume
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher raise exception.DeviceIsBusy(device=disk_dev)
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher DeviceIsBusy: The supplied device (vdb) is busy.
2015-04-03 14:43:03.638 19415 TRACE oslo_messaging.rpc.dispatcher

After this, the update_available_resource periodic task starts failing:

2015-04-03 14:45:02.507 19415 ERROR nova.openstack.common.periodic_task [req-fe14e38c-8f29-46b0-884b-3b24f7ff0397 - - - - -] Error during ComputeManager.update_available_resource: Unexpected error while running command.
Command: sudo nova-rootwrap /etc/nova/rootwrap.conf blockdev --getsize64 /dev/disk/by-path/ip-9.114.170.202:3260-iscsi-iqn.2010-10.org.openstack:volume-3cc769bb-d585-4812-a8fd-2888fedda58d-lun-0
Exit code: 1
Stdout: u''
Stderr: u'blockdev: cannot open /dev/disk/by-path/ip-9.114.170.202:3260-iscsi-iqn.2010-10.org.openstack:volume-3cc769bb-d585-4812-a8fd-2888fedda58d-lun-0: No such device or address\n'
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task Traceback (most recent call last):
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py", line 224, in run_periodic_tasks
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task task(self, context)
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6166, in update_available_resource
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task rt.update_available_resource(context)
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 317, in update_available_resource
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task resources = self.driver.get_available_resource(self.nodename)
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4872, in get_available_resource
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task disk_over_committed = self._get_disk_over_committed_size_total()
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6054, in _get_disk_over_committed_size_total
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task self._get_instance_disk_info(dom.name(), xml))
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6007, in _get_instance_disk_info
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task dk_size = lvm.get_volume_size(path)
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/lvm.py", line 172, in get_volume_size
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task run_as_root=True)
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/utils.py", line 55, in execute
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task return utils.execute(*args, **kwargs)
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/nova/utils.py", line 206, in execute
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task return processutils.execute(*cmd, **kwargs)
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py", line 224, in execute
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task cmd=sanitized_cmd)
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task ProcessExecutionError: Unexpected error while running command.
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task Command: sudo nova-rootwrap /etc/nova/rootwrap.conf blockdev --getsize64 /dev/disk/by-path/ip-9.114.170.202:3260-iscsi-iqn.2010-10.org.openstack:volume-3cc769bb-d585-4812-a8fd-2888fedda58d-lun-0
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task Exit code: 1
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task Stdout: u''
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task Stderr: u'blockdev: cannot open /dev/disk/by-path/ip-9.114.170.202:3260-iscsi-iqn.2010-10.org.openstack:volume-3cc769bb-d585-4812-a8fd-2888fedda58d-lun-0: No such device or address\n'
2015-04-03 14:45:02.507 19415 TRACE nova.openstack.common.periodic_task

The file under /dev/disk/by-path/ exists, but you get the 'No such device or address' error when attempting to access it.

In /var/log/messages I see this:

Apr 3 14:43:00 devo-n02-kvm kernel: [731664.038599] connection8:0: detected conn error (1020)
Apr 3 14:43:01 devo-n02-kvm iscsid: connection18:0 is operational after recovery (7 attempts)
Apr 3 14:43:01 devo-n02-kvm iscsid: conn 0 login rejected: initiator failed authorization with target
Apr 3 14:43:01 devo-n02-kvm libvirtd[61554]: operation failed: target vdb already exists
Apr 3 14:43:02 devo-n02-kvm kernel: [731665.358832] connection18:0: detected conn error (1020)
Apr 3 14:43:03 devo-n02-kvm iscsid: Kernel reported iSCSI connection 18:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Apr 3 14:43:04 devo-n02-kvm iscsid: conn 0 login rejected: initiator failed authorization with target
Apr 3 14:43:04 devo-n02-kvm kernel: [731667.948507] connection18:0: detected conn error (1020)

I am wondering if this might be related to this bug: https://bugs.dogfood.paddev.net/nova/+bug/1367189. Are the CHAP credentials lost with the reboot, or something like that?

Tags: lvm
Jay Bryant (jsbryant)
Changed in cinder:
importance: Undecided → Medium
Jay Bryant (jsbryant)
tags: added: lvm
Mitsuhiro Tanino (mitsuhiro-tanino) wrote :

Hi, Jay,

I confirmed that this bug happens in my devstack environment using the LVM/iSCSI driver.
From my investigation of this issue, it might be a bug in libvirt.

When Nova calls detach_volume() at step (3), it asks libvirt to remove the block device information from the instance's
XML and then calls disconnect_volume() to log out of the iSCSI session.

In disconnect_volume(), Nova gathers the block device list from every VM's XML via _get_all_block_devices(), and
if any block devices still reference the iSCSI session, Nova does not log out of that session.

Normally libvirt removes the block device info from the instance's XML quickly, but in this case I found the block device
info was still present when disconnect_volume() ran. I suspect rebooting the VM via libvirt interferes with the removal of
the block device info, so the iSCSI session logout was skipped and, as a result, the iSCSI session remained.

After that, if the user tries to attach the volume again, Cinder creates a new CHAP username and password and passes them to Nova.
Nova tries to use the new credentials, but because the old iSCSI session still exists the connection fails,
and you see many iscsid errors in /var/log/messages.

I think this affects not only LVM but also other iSCSI backend storage drivers.
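
To make the decision path concrete, here is a minimal Python sketch of the logout check described above; it is not the actual Nova code, and the function name is made up for illustration:

    def should_logout_iscsi_session(connection, host_device):
        # Device paths gathered from every guest's XML definition.
        devices = connection._get_all_block_devices()
        # Only log out of the iSCSI session when no guest references the device.
        # If libvirt has not yet dropped the device from the rebooted instance's
        # XML, this check fails and the stale session is left behind.
        return host_device not in devices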

Changed in nova:
assignee: nobody → Mitsuhiro Tanino (mitsuhiro-tanino)
Jay Bryant (jsbryant)
Changed in nova:
status: New → Confirmed
Changed in cinder:
status: New → Confirmed
Jay Bryant (jsbryant) wrote :

Mitsuhiro,

Thank you for confirming this. Did you have to do anything special to get this to recreate in devstack? I haven't had any luck getting it to recreate in my devstack environment.

So you are thinking we are hitting some sort of timing issue between when libvirt removes the block device info and when disconnect_volume() is called?

I wonder if we could add a retry in disconnect_volume around 'devices = self.connection._get_all_block_devices()': put a sleep in and retry for the cases where it does still return the device, as in the sketch below.
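
Something along these lines, purely as a sketch of the idea (the helper name and retry parameters are hypothetical, not a proposed patch):

    import time

    def wait_for_device_release(connection, host_device, retries=5, delay=2):
        # Re-read the block device list a few times before giving up,
        # in case libvirt just has not updated the rebooted instance's XML yet.
        for _ in range(retries):
            if host_device not in connection._get_all_block_devices():
                return True      # no guest references the device: logout is safe
            time.sleep(delay)
        return False             # still referenced after retries: skip the logout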

The other question is whether there is something we can do to detect and disconnect the orphaned iSCSI session when this situation is encountered at attach time. Looking at _connect_to_iscsi_portal, I guess that would be easier said than done, given that we don't have access to the XML indicating which volumes are actually in use.

Will keep trying to recreate here so I can play around with this more. Let me know what you figure out.

Thanks for your help!

Mitsuhiro Tanino (mitsuhiro-tanino) wrote :

Hi Jay,

Thank you for your reply.

>> I haven't had any luck getting it to recreate in my devstack environment.

I followed your recreation steps, like so:
(1) create VM
(2) create Volume
(3) attach Volume to the VM
(4) "nova reboot instance_id" and then execute "nova volume-detach" immediately and several times until the detach is executed without error. Sometimes I saw following error.

    ERROR (Conflict): Cannot 'detach_volume' instance 1234 while it is in task_state reboot_started (HTTP 409) (Request-ID: req-c8884047-13aa-474e-9a18-a101deea9278)

(5) "nova volume-attach" again.

I think this bug is a duplicate of https://bugs.launchpad.net/nova/+bug/1374508.

>>So you are thinking we are hitting some sort of timing issue between when the libvirt removes the block device info and when disconnect_volume() is called?

Yes.
The point is that virt_dom.detachDeviceFlags(xml, flags) in detach_volume() appears to succeed, but in fact the detach does not take effect when we detach the volume immediately after rebooting the VM.
In my debug test, I could still see the device information when I added debug output after virt_dom.detachDeviceFlags(xml, flags) was called.

As git-hary mentioned in #1374508, I'm also guessing there is a race within libvirt between the reboot (or unrescue, or other libvirt operations) and the detach.
I tried the same debug test: if I put a "sleep 10" before virt_dom.detachDeviceFlags(xml, flags), the detach works properly; however, if I put the sleep 10 after virt_dom.detachDeviceFlags(xml, flags), it has no effect either. If we could get some status of the VM to judge whether to add the sleep,
But I checked: the power status was "Running" and instance.task_state was None. These statuses are not

---- snip ---
In nova.virt.libvirt.driver.LibvirtDriver.detach_volume if I put in a sleep before virt_dom.detachDeviceFlags(xml, flags) the detach appears to work properly however if I sleep after that line it does not appear to have any effect.
---- snip ---
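
For reference, the sleep-placement experiment can be pictured like this; it is only a sketch of where the sleep went relative to the libvirt call, not the real LibvirtDriver.detach_volume body:

    import time

    def detach_with_sleep(virt_dom, xml, flags, sleep_before):
        if sleep_before:
            time.sleep(10)                       # detach then takes effect reliably
        virt_dom.detachDeviceFlags(xml, flags)   # reports success in both cases
        if not sleep_before:
            time.sleep(10)                       # no effect: device info remains in XML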

>>I wonder if we could add a retry in disconnect_volume around 'devices = self.connection._get_all_block_devices()' to put a sleep in and retry for cases where that does actually return devices.

Adding a sleep around _get_all_block_devices() is the same situation as adding the sleep after virt_dom.detachDeviceFlags(xml, flags) above, so in my understanding it might not help with this issue.

>>The other question is if there is something we can do to detect and disconnect the orphaned iscsi session when this situation is encountered at attach time. Looking at _connect_to_iscsi_portal, I guess that be easier said than done given that we don't have access to the XML indicating what volumes are actually being used.

It might be possible to log out of the orphaned session during the next volume attach phase, but that needs further investigation.
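
As a rough illustration of that idea only (not tested, and rootwrap/error handling are omitted), the attach path could look for an existing session to the target IQN and drop it before logging in with the new credentials:

    import subprocess

    def logout_stale_session(target_iqn):
        try:
            sessions = subprocess.check_output(['iscsiadm', '-m', 'session']).decode()
        except subprocess.CalledProcessError:
            return  # iscsiadm exits non-zero when there are no active sessions
        if target_iqn in sessions:
            # A session left over from the failed detach is still up: log it out
            # so the new CHAP credentials from Cinder can be used for the login.
            subprocess.call(['iscsiadm', '-m', 'node', '-T', target_iqn, '--logout'])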

Thanks,
Mitsuhiro Tanino

Revision history for this message
Mitsuhiro Tanino (mitsuhiro-tanino) wrote :

Oops, I posted before I finished.

If we could get some VM status to use for judging whether we should add the sleep or not, it might be a workaround.
But when I checked the power status and task state, the power status was "Running" and instance.task_state was "None". These statuses are not helpful for deciding when to add the sleep...
