LibvirtISCSIVolumeDriver: device size mismatch when LUN is reused
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Medium | Jason Dillaman |
Bug Description
Short problem summary:
=======
When a LUN id is reused by the SCSI provider, the compute node may see a device size mismatch. The host may report to the guest the size of the volume previously mapped to this LUN id, not the size of the volume that is mapped there now. This happens with SCSI providers that use one target with many LUNs (e.g. Netapp).
Detailed problem description:
=======
The OpenStack iSCSI client's disconnect_volume() calls iscsiadm with --logout only if nothing else is using LUNs from that target. Otherwise it does nothing, and the device stays in place:
# ls -l /dev/disk/
lrwxrwxrwx. 1 root root 9 Feb 1 11:06 /dev/disk/
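The disconnect behaviour described above can be sketched as follows. This is an illustration only, not nova's actual code: the function names are hypothetical, and the by-path naming pattern `ip-<portal>-iscsi-<iqn>-lun-<n>` is an assumption about how udev names these symlinks.

```python
def target_still_in_use(target_iqn, by_path_names):
    """Return True if any by-path entry still refers to a LUN on target_iqn.

    Assumes entries look like 'ip-<portal>-iscsi-<iqn>-lun-<n>'.
    """
    marker = "-iscsi-%s-lun-" % target_iqn
    return any(marker in name for name in by_path_names)


def disconnect_action(target_iqn, remaining_by_path_names):
    """Pick the action taken on disconnect, as described in this report."""
    if target_still_in_use(target_iqn, remaining_by_path_names):
        # Other LUNs on the target are in use: nothing is done, so the
        # device node of the detached LUN is left behind.
        return "noop"
    # Last LUN on the target: corresponds to iscsiadm with --logout.
    return "logout"
```

With a multi-LUN target, the "noop" branch is the common case, which is why the stale device survives until the LUN id is reused.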
Later, nova-volume unmaps the LUN from the initiator, and this device becomes invalid. Example "sanlun" output:
# sanlun lun show
controller(7mode)/ device host lun
vserver(Cmode) lun-pathname filename adapter protocol size mode
-------
1081809-413161-N2 <unknown> /dev/sdg host7 iSCSI 7
At some point, a different volume needs to be made available to the same compute node. The remote SCSI provider may choose to recycle an unused LUN id. From the client's point of view, a different OpenStack volume is then visible under the same target and LUN id as before. After nova-volume has completed the LUN mapping, nova-compute's connect_volume() is called. Note that, at this point, the iSCSI session to the target is up and the device symlink (/dev/disk/
Access to the re-mapped device produces the following kernel messages:
Feb 1 11:06:45 prod-cmp10 kernel: sd 7:0:0:1: [sdh] Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.
Feb 1 11:06:45 prod-cmp10 kernel: sd 7:0:0:0: [sdg] Result: hostbyte=DID_OK driverbyte=
Feb 1 11:06:45 prod-cmp10 kernel: sd 7:0:0:0: [sdg] Sense Key : Illegal Request [current]
Feb 1 11:06:45 prod-cmp10 kernel: Info fld=0x0
For some reason, the kernel reports the warning on the device that did NOT change ("sdh" rather than "sdg"). Possibly a bug in the Linux iSCSI client?
This issue affects SCSI systems whose targets have multiple LUNs (e.g. Netapp). The OpenStack LVM/tgtd backend is not affected, because it uses multiple targets with a single LUN each: when the LUN becomes unused, the driver closes the whole session.
Steps to reproduce:
================
1) create three volumes with different sizes (1, 2, 3GB)
# euca-describe-
VOLUME vol-00000551 1 na.dev-netapp available 2013-02-
VOLUME vol-00000552 2 na.dev-netapp available 2013-02-
VOLUME vol-00000553 3 na.dev-netapp available 2013-02-
2) attach volumes 3G, 2G to an instance
compute node# virsh domblklist i-000005dc
Target Source
-------
...
vdc /dev/disk/
vdd /dev/disk/
instance# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
...
vdc 252:32 0 3G 0 disk
vdd 252:48 0 2G 0 disk
3) detach volume 3G (LUN0 becomes unused)
Device still exists
# ls -l /dev/disk/by-path/
...
lrwxrwxrwx. 1 root root 9 Feb 1 10:46 ip-172.
lrwxrwxrwx. 1 root root 9 Feb 1 10:44 ip-172.
# sanlun lun show
controller(7mode)/ device host lun
vserver(Cmode) lun-pathname filename adapter protocol size mode
-------
1081809-413161-N2 /vol/OpenStack_
1081809-413161-N2 <unknown> /dev/sdg host7 iSCSI 7
4) attach volume 1G to the same instance (LUN0 is reused for different volume)
Expected result:
Instance can see new 1G device attached
Actual result:
Instance is reporting the size to be 3G.
Host OS is also reporting 3G. SCSI tools report correct size (1G).
instance# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
...
vdc 252:32 0 3G 0 disk
vdd 252:48 0 2G 0 disk
compute# virsh domblklist i-000005dc
Target Source
-------
...
vdc /dev/disk/
vdd /dev/disk/
compute# ls -l /dev/disk/
lrwxrwxrwx. 1 root root 9 Feb 1 10:47 /dev/disk/
lrwxrwxrwx. 1 root root 9 Feb 1 10:47 /dev/disk/
compute# lsblk
...
sdg 8:96 0 3G 0 disk
sdh 8:112 0 2G 0 disk
compute# sanlun lun show
controller(7mode)/ device host lun
vserver(Cmode) lun-pathname filename adapter protocol size mode
-------
1081809-413161-N2 /vol/OpenStack_
1081809-413161-N2 /vol/OpenStack_
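The mismatch shown above (the host block layer still reporting 3G for a LUN that now holds a 1G volume) could be detected with a check along these lines. This is a sketch under stated assumptions: the helper names are hypothetical, the expected size would have to come from the volume service, and the kernel's view is read from `/sys/block/<dev>/size`, which is expressed in 512-byte sectors.

```python
import os

SECTOR_BYTES = 512  # /sys/block/<dev>/size counts 512-byte sectors


def kernel_size_bytes(device_name, sysfs_root="/sys/block"):
    """Return the size the kernel currently reports for a block device."""
    size_file = os.path.join(sysfs_root, device_name, "size")
    with open(size_file) as f:
        return int(f.read().strip()) * SECTOR_BYTES


def size_is_stale(device_name, expected_bytes, sysfs_root="/sys/block"):
    """True when the kernel's cached size differs from the expected size."""
    return kernel_size_bytes(device_name, sysfs_root) != expected_bytes
```

For the case in this report, `size_is_stale("sdg", 1 * 1024 ** 3)` would be expected to return True while the host still reports 3G.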
I'm also attaching more output with preserved formatting (outputs.txt).
Regards,
Brano Zarnovican
Changed in nova: | |
status: | Incomplete → Confirmed |
Changed in nova: | |
status: | Confirmed → In Progress |
assignee: | nobody → Jason Dillaman (jdillaman) |
Changed in nova: | |
importance: | Undecided → Medium |
tags: | added: havana-backport-potential |
tags: | added: grizzly-backport-potential |
Changed in nova: | |
milestone: | none → icehouse-2 |
Changed in nova: | |
status: | Fix Committed → Fix Released |
tags: | removed: grizzly-backport-potential |
Changed in nova: | |
milestone: | icehouse-2 → 2014.1 |
I forgot to add the software versions:
ScientificLinux 6.3
kernel-2.6.32-279.11.1.el6.x86_64
iscsi-initiator-utils-6.2.0.872-41.el6.x86_64
Openstack Essex 2012.1.3 (most likely also affects Folsom, master)
Netapp OnTAP 7.3.6P5
One option to fix this problem:
During disconnect_volume():
1) if this was the last LUN in the session, close the session (as it does now)
2) otherwise, delete that single device:
echo 1 > /sys/block/sdX/device/delete
This is easier said than done, because the 'rootwrap' module does not natively support "echo 1 > /something" :(
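One possible way around the redirection problem is to write to the sysfs file through a plain command such as `tee`, which takes the target path as an argument and the value on stdin. The sketch below only illustrates that idea. The function names are hypothetical, and whether a rootwrap filter would permit this `tee` invocation is an assumption, not something stated in this report.

```python
import os
import subprocess


def delete_command(device_name, sysfs_root="/sys/block"):
    """Build an argv that writes to <sysfs_root>/<device_name>/device/delete."""
    if not device_name or "/" in device_name:
        raise ValueError("expected a bare device name such as 'sdg'")
    path = os.path.join(sysfs_root, device_name, "device", "delete")
    # No shell redirection needed: tee receives the path as an argument.
    return ["tee", "-a", path]


def remove_scsi_device(device_name, run=subprocess.run):
    """Ask the kernel to drop a single SCSI device (needs root privileges)."""
    # Equivalent in effect to: echo 1 > /sys/block/sdX/device/delete
    return run(delete_command(device_name), input=b"1",
               stdout=subprocess.DEVNULL, check=True)
```

In disconnect_volume(), step 2 above would then call something like `remove_scsi_device("sdg")` for the detached LUN instead of doing nothing.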