Comment 16 for bug 1567602

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-04-15 16:42 EDT-------
(In reply to comment #27)
> Meanwhile, with kernel 4.4.0-18 and installer version 447, the situation is
> different:
> While the problem originally occurred in the installer, it now happens when
> the target system is being IPLed.
>
> The "switch" (to avoid the word "reason") is the kernel module scsi_dh_alua.
> When scsi_dh_alua is loaded, the auto LUN scanning upon activating a zfcp
> device with NPIV LUNs does not detect all LUNs. I assume that the kernel
> module itself is not the problem, but it seems to trigger the problem.
>
> A very easy way to reproduce this is as follows:
> have a system installed with kernel 4.4.0-18 (almost GA). Have two NPIV
> enabled zfcp adaptors (on different PCHIDs) that can access the same (few)
> LUNs each. Then:
>
> root@s8330005:~# lscss |grep 1732/03
> 0.0.1905 0.0.0012 1732/03 1731/03 80 80 ff 60000000 00000000
> 0.0.1945 0.0.0013 1732/03 1731/03 80 80 ff 61000000 00000000
> root@s8330005:~# lsmod |grep scsi_dh
> root@s8330005:~# chccwdev -e 1905
> Setting device 0.0.1905 online
> Done
> root@s8330005:~# lszfcp -D
> 0.0.1905/0x50050763070845e3/0x4082405300000000 0:0:0:1079197826
> 0.0.1905/0x50050763070845e3/0x4083405300000000 0:0:0:1079197827
> 0.0.1905/0x50050763070845e3/0x4084405300000000 0:0:0:1079197828
> 0.0.1905/0x50050763070845e3/0x4085405300000000 0:0:0:1079197829
> root@s8330005:~# lsmod |grep scsi_dh
> root@s8330005:~# modprobe scsi_dh_alua
> root@s8330005:~# chccwdev -e 1945
> Setting device 0.0.1945 online
> Done
> root@s8330005:~# lszfcp -D
> 0.0.1905/0x50050763070845e3/0x4082405300000000 0:0:0:1079197826
> 0.0.1905/0x50050763070845e3/0x4083405300000000 0:0:0:1079197827
> 0.0.1905/0x50050763070845e3/0x4084405300000000 0:0:0:1079197828
> 0.0.1905/0x50050763070845e3/0x4085405300000000 0:0:0:1079197829
> 0.0.1945/0x50050763071845e3/0x4082405300000000 1:0:0:1079197826
> /sbin/lszfcp: line 244:
> /sys/bus/ccw/drivers/zfcp/0.0.1945/host1/rport-1:0-0/target1:0:0/1:0:0:
> 1079197827//hba_id: No such file or directory
> /sbin/lszfcp: line 245:
> /sys/bus/ccw/drivers/zfcp/0.0.1945/host1/rport-1:0-0/target1:0:0/1:0:0:
> 1079197827//wwpn: No such file or directory
> /sbin/lszfcp: line 246:
> /sys/bus/ccw/drivers/zfcp/0.0.1945/host1/rport-1:0-0/target1:0:0/1:0:0:
> 1079197827//fcp_lun: No such file or directory
> 0.0.1945/0x50050763071845e3/0x4082405300000000 1:0:0:1079197827
> root@s8330005:~#
>
> When a zfcp attached SCSI LUN is used for installation, (almost?) all
> scsi_dh modules are loaded upon activation of the zfcp device. This causes
> some LUNs not to be detected.
> If I get access to that system and blacklist the scsi_dh_alua module, the IPL
> works as expected (i.e. all the LUNs are detected properly).

Hello, I just wanted to add to Thorsten's comment: we are also still working on analyzing this bug. To the best of my knowledge, the ALUA device handler is not to blame here, as Thorsten already suggested.

After looking at debug data collected during reproductions, it seems that a SCSI command sent by the ALUA device handler gets lost, which causes a timeout after one minute. This in turn causes the SCSI scan to abort. It is not yet known what causes this, but I have not found any indication that it is caused by parts of the Linux kernel (including our zFCP driver). I am still investigating, and I will update this bug when I find anything substantial.

Also, we can't disable the ALUA device handler completely, even if that sounds like a possible fix or workaround, because storage products like the IBM Storwize V7000 or the IBM SAN Volume Controller use those commands. Disabling it might serve as an individual workaround, though, depending on the storage used.
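
For anyone re-running the reproduction quoted above, a small helper may make an aborted scan easier to spot. This is only a sketch: the function name count_luns_per_adapter is made up for illustration, and it assumes the "busid/wwpn/lun h:c:t:l" line format that lszfcp -D prints in the transcript. It counts the distinct LUNs reported per adapter and ignores anything that does not look like a LUN entry.

```shell
# Hypothetical helper (not part of s390-tools): count distinct LUNs per
# zfcp adapter from `lszfcp -D` style lines read on stdin.
count_luns_per_adapter() {
    awk '
        # Accept only well-formed "<busid>/<wwpn>/<lun> <h:c:t:l>" lines and
        # count each busid/wwpn/lun path once, so repeated entries and stray
        # error messages do not inflate the result.
        NF == 2 && $1 ~ /^[0-9a-f.]+\/0x[0-9a-f]+\/0x[0-9a-f]+$/ && !seen[$1]++ {
            split($1, part, "/")
            count[part[1]]++
        }
        END { for (id in count) print id, count[id] }
    ' | sort
}
```

It could be used as `lszfcp -D 2>/dev/null | count_luns_per_adapter`. Applied to the lszfcp output in the quoted transcript, this reports 4 LUNs for 0.0.1905 and 1 for 0.0.1945, which matches the symptom described: both adapters should see the same LUNs, so differing counts point to an incomplete scan.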