FCP devices are not detected correctly nor deterministically

Bug #1567602 reported by bugproxy on 2016-04-07
This bug affects 2 people
Affects (Importance, Assigned to):
Release Notes for Ubuntu: Undecided, Unassigned
Ubuntu on IBM z Systems: High, Unassigned
linux (Ubuntu): High, Canonical Kernel
linux (Ubuntu) Xenial: High, Stefan Bader

Bug Description

Release notes:
Using SCSI LUNs on a DS8870 storage server with μCode bundles 87.51.xx.0 (LMC 7.7.51.xx) via NPIV-enabled zfcp adaptors causes detection issues (LP #1567602). In that case, do not use NPIV-enabled zfcp adaptors.

Scenario:
Using Installer 432.
No DASD devices, just two FCP CHPIDs with two LUNs each (configured for NPIV), provided via parmfile.
I expect the installer to probe and detect the LUNs automatically once I enable the FCP CHPIDs.

I repeated this five times and got different results each time: the number of detected LUNs varied between 2 and 4. Once, 3 LUNs appeared on SCSI1 and one on SCSI2, which looks especially odd to me.

Subsequent steps to create partitions and file systems in the installer fail.

Following is one of the five cases with more details:
After enabling CHPID 192b:
~ # cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 1077493890
  Vendor: IBM Model: 2107900 Rev: 1.27
  Type: Direct-Access ANSI SCSI revision: 05

dmesg showing:
[ 221.738816] qdio: 0.0.192b ZFCP on SC f using AI:1 QEBSM:1 PRI:1 TDD:1 SIGA: W A
[ 226.895139] scsi 0:0:0:1077493890: Direct-Access IBM 2107900 1.27 PQ: 0 ANSI: 5
[ 226.895816] sd 0:0:0:1077493890: alua: supports implicit TPGS
[ 226.896673] sd 0:0:0:1077493890: [sda] 41943040 512-byte logical blocks: (21.5 GB/20.0 GiB)
[ 226.897825] sd 0:0:0:1077493890: [sda] Write Protect is off
[ 226.897827] sd 0:0:0:1077493890: [sda] Mode Sense: ed 00 00 08
[ 226.898145] sd 0:0:0:1077493890: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 226.902571] sd 0:0:0:1077493890: [sda] Attached SCSI disk
[ 287.117057] sd 0:0:0:1077493890: alua: evpd inquiry failed with 30000
[ 287.117062] sd 0:0:0:1077493890: alua: Attach failed (-22)
[ 287.117064] sd 0:0:0:1077493890: failed to add device handler: -22
[ 287.147303] scsi 0:0:0:0: Unexpected response from lun 1077493890 while scanning, scan aborted

As second step, I additionally enable CHPID 196b:
~ # cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 1077493890
  Vendor: IBM Model: 2107900 Rev: 1.27
  Type: Direct-Access ANSI SCSI revision: 05
Host: scsi1 Channel: 00 Id: 00 Lun: 1077493890
  Vendor: IBM Model: 2107900 Rev: 1.27
  Type: Direct-Access ANSI SCSI revision: 05

dmesg showing:
[ 384.277394] scsi host1: zfcp
[ 384.286516] qdio: 0.0.196b ZFCP on SC 10 using AI:1 QEBSM:1 PRI:1 TDD:1 SIGA: W A
[ 385.377511] scsi 1:0:0:1077493890: Direct-Access IBM 2107900 1.27 PQ: 0 ANSI: 5
[ 385.378120] sd 1:0:0:1077493890: alua: supports implicit TPGS
[ 385.378781] sd 1:0:0:1077493890: [sdb] 41943040 512-byte logical blocks: (21.5 GB/20.0 GiB)
[ 385.380097] sd 1:0:0:1077493890: [sdb] Write Protect is off
[ 385.380099] sd 1:0:0:1077493890: [sdb] Mode Sense: ed 00 00 08
[ 385.380408] sd 1:0:0:1077493890: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 385.384969] sd 1:0:0:1077493890: [sdb] Attached SCSI disk
[ 446.117041] sd 1:0:0:1077493890: alua: evpd inquiry failed with 30000
[ 446.117046] sd 1:0:0:1077493890: alua: Attach failed (-22)
[ 446.117048] sd 1:0:0:1077493890: failed to add device handler: -22
[ 446.147158] scsi 1:0:0:0: Unexpected response from lun 1077493890 while scanning, scan aborted
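
For reference, the delay between "Attached SCSI disk" and the first alua failure in the two dmesg excerpts above can be computed from the kernel timestamps. This is only a small sketch; the helper name `seconds_between` is made up for illustration, and the input lines are copied from the excerpts.

```python
import re

# Matches the kernel timestamp prefix of a dmesg line, e.g. "[ 226.902571]".
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)\]")

def seconds_between(first_line: str, second_line: str) -> float:
    """Return the difference between the kernel timestamps of two dmesg lines."""
    t1 = float(TIMESTAMP.search(first_line).group(1))
    t2 = float(TIMESTAMP.search(second_line).group(1))
    return t2 - t1

# Lines copied from the excerpt for CHPID 192b.
attached = "[ 226.902571] sd 0:0:0:1077493890: [sda] Attached SCSI disk"
failed = "[ 287.117057] sd 0:0:0:1077493890: alua: evpd inquiry failed with 30000"

gap = seconds_between(attached, failed)
print(f"{gap:.1f} s")  # prints "60.2 s"
```

The same calculation for CHPID 196b (385.384969 to 446.117041) gives roughly 60.7 seconds, so in both cases the alua failure shows up about one minute after the disk was attached.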

Next, the installer warns that no disk drives were detected.

I checked this on a system (installer 445) with two NPIV-enabled FCP devices that have 4 LUNs each. I only get the first of the four LUNs per adaptor:

Apr 6 12:57:08 anna[2607]: DEBUG: retrieving zipl-installer 0.0.33ubuntu2
Apr 6 12:57:08 anna[2607]: 2016-04-06 12:57:08 URL:http://ports.ubuntu.com//pool/main/z/zipl-installer/zipl-installer_0.0.33ubuntu2_s390x.udeb [4994/4994] -> "/var/cache/anna/zipl-installer_0.0.33ubuntu2_s390x.udeb" [1]
Apr 6 12:57:09 main-menu[384]: INFO: Menu item 'driver-injection-disk-detect' selected
Apr 6 12:57:10 main-menu[384]: INFO: Menu item 'user-setup-udeb' selected
Apr 6 12:57:10 main-menu[384]: INFO: Menu item 'clock-setup' selected
Apr 6 12:57:10 main-menu[384]: INFO: Menu item 's390-dasd' selected
Apr 6 12:57:10 s390-dasd[6877]: INFO: s390-dasd: no channel found
Apr 6 12:57:10 main-menu[384]: INFO: Menu item 's390-zfcp' selected
Apr 6 12:57:10 s390-zfcp[6891]: DEBUG: DETECT: The zfcp device driver is not loaded
Apr 6 12:57:10 s390-zfcp[6891]: WARNING **: Could not open directory: /sys/bus/ccw/drivers/zfcp/: No such file or directory
Apr 6 12:57:11 main-menu[384]: INFO: Menu item 'disk-detect' selected
Apr 6 12:57:11 anna-install: Installing mmc-modules
Apr 6 12:57:11 disk-detect: insmod /lib/modules/4.4.0-17-generic/kernel/drivers/scsi/device_handler/scsi_dh_rdac.ko
Apr 6 12:57:11 disk-detect: insmod /lib/modules/4.4.0-17-generic/kernel/drivers/scsi/device_handler/scsi_dh_alua.ko
Apr 6 12:57:11 disk-detect: insmod /lib/modules/4.4.0-17-generic/kernel/drivers/scsi/device_handler/scsi_dh_hp_sw.ko
Apr 6 12:57:11 disk-detect: insmod /lib/modules/4.4.0-17-generic/kernel/drivers/scsi/device_handler/scsi_dh_emc.ko
Apr 6 12:57:11 kernel: [ 40.509881] rdac: device handler registered
Apr 6 12:57:11 kernel: [ 40.513219] alua: device handler registered
Apr 6 12:57:11 kernel: [ 40.515517] hp_sw: device handler registered
Apr 6 12:57:11 kernel: [ 40.517572] emc: device handler registered
Apr 6 12:57:11 net/hw-detect.hotplug: Detected hotpluggable network interface encf5f0
Apr 6 12:57:11 net/hw-detect.hotplug: Detected hotpluggable network interface lo
Apr 6 12:57:11 apt-install: Queueing package udev for later installation
Apr 6 12:57:11 apt-install: Queueing package pciutils for later installation
Apr 6 12:57:12 check-missing-firmware: looking at dmesg for the first time
Apr 6 12:57:12 check-missing-firmware: saving timestamp for a later use: [ 40.517572]
Apr 6 12:57:12 check-missing-firmware: /dev/.udev/firmware-missing does not exist, skipping
Apr 6 12:57:12 check-missing-firmware: /run/udev/firmware-missing does not exist, skipping
Apr 6 12:57:12 check-missing-firmware: no missing firmware in loaded kernel modules
Apr 6 12:58:29 main-menu[384]: (process:6908): unknown udeb mmc-modules
Apr 6 12:58:29 main-menu[384]: INFO: Menu item 'disk-detect' succeeded but requested to be left unconfigured.
Apr 6 12:58:31 main-menu[384]: INFO: Menu item 's390-zfcp' selected
Apr 6 12:58:31 s390-zfcp[7752]: DEBUG: DETECT: Added FCP device: 0.0.1904: online=0 npiv=0
Apr 6 12:58:31 s390-zfcp[7752]: DEBUG: DETECT: Added FCP device: 0.0.1944: online=0 npiv=0
Apr 6 12:58:31 s390-zfcp[7752]: DEBUG: DETECT: Automatic LUN scanning is enabled
Apr 6 12:58:31 s390-zfcp[7752]: DEBUG: PRESEED: No preseed data available
Apr 6 12:58:40 s390-zfcp[7752]: DEBUG: SELECT: Using FCP device 0.0.1904
Apr 6 12:58:40 kernel: [ 129.591146] scsi host0: scsi_eh_0: sleeping
Apr 6 12:58:40 kernel: [ 129.591163] scsi host0: zfcp
Apr 6 12:58:40 kernel: [ 129.598187] qdio: 0.0.1904 ZFCP on SC f using AI:1 QEBSM:1 PRI:1 TDD:1 SIGA: W A
Apr 6 12:58:41 s390-zfcp[7752]: DEBUG: ENABLE: Activated FCP device 0.0.1904 (npiv=1)
Apr 6 12:58:41 kernel: [ 130.873257] scsi 0:0:0:0: scsi scan: INQUIRY pass 1 length 36
Apr 6 12:58:41 kernel: [ 130.873581] scsi 0:0:0:0: scsi scan: INQUIRY successful with code 0x0
Apr 6 12:58:41 kernel: [ 130.873593] scsi 0:0:0:0: scsi scan: INQUIRY pass 2 length 164
Apr 6 12:58:41 kernel: [ 130.873802] scsi 0:0:0:0: scsi scan: INQUIRY successful with code 0x0
Apr 6 12:58:41 kernel: [ 130.873807] scsi 0:0:0:0: scsi scan: peripheral device type of 31, no device added
Apr 6 12:58:41 kernel: [ 130.874148] scsi 0:0:0:0: scsi scan: Sending REPORT LUNS to (try 0)
Apr 6 12:58:41 kernel: [ 130.875003] scsi 0:0:0:0: scsi scan: REPORT LUNS successful (try 0) result 0x0
Apr 6 12:58:41 kernel: [ 130.875005] scsi 0:0:0:0: scsi scan: REPORT LUN scan
Apr 6 12:58:41 kernel: [ 130.875176] scsi 0:0:0:1074937986: scsi scan: INQUIRY pass 1 length 36
Apr 6 12:58:41 kernel: [ 130.875395] scsi 0:0:0:1074937986: scsi scan: INQUIRY successful with code 0x0
Apr 6 12:58:41 kernel: [ 130.875399] scsi 0:0:0:1074937986: scsi scan: INQUIRY pass 2 length 164
Apr 6 12:58:41 kernel: [ 130.875608] scsi 0:0:0:1074937986: scsi scan: INQUIRY successful with code 0x0
Apr 6 12:58:41 kernel: [ 130.875613] scsi 0:0:0:1074937986: Direct-Access IBM 2107900 1.27 PQ: 0 ANSI: 5
Apr 6 12:58:41 kernel: [ 130.876181] sd 0:0:0:1074937986: alua: supports implicit TPGS
Apr 6 12:58:41 kernel: [ 130.876323] sd 0:0:0:1074937986: tag#1 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Apr 6 12:58:41 kernel: [ 130.876326] sd 0:0:0:1074937986: tag#1 CDB: Test Unit Ready 00 00 00 00 00 00
Apr 6 12:58:41 kernel: [ 130.876328] sd 0:0:0:1074937986: tag#1 Sense Key : Unit Attention [current]
Apr 6 12:58:41 kernel: [ 130.876333] sd 0:0:0:1074937986: tag#1 Add. Sense: Power on, reset, or bus device reset occurred
Apr 6 12:58:41 kernel: [ 130.876722] sd 0:0:0:1074937986: [sda] 41943040 512-byte logical blocks: (21.5 GB/20.0 GiB)
Apr 6 12:58:41 kernel: [ 130.877759] sd 0:0:0:1074937986: [sda] Write Protect is off
Apr 6 12:58:41 kernel: [ 130.877761] sd 0:0:0:1074937986: [sda] Mode Sense: ed 00 00 08
Apr 6 12:58:41 kernel: [ 130.878085] sd 0:0:0:1074937986: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 6 12:58:41 kernel: [ 130.878295] sd 0:0:0:1074937986: tag#1 Done: SUCCESS Result: hostbyte=DID_TARGET_FAILURE driverbyte=DRIVER_OK
Apr 6 12:58:41 kernel: [ 130.878297] sd 0:0:0:1074937986: tag#1 CDB: Report supported operation codes a3 0c 01 12 00 00 00 00 02 00 00 00
Apr 6 12:58:41 kernel: [ 130.878298] sd 0:0:0:1074937986: tag#1 Sense Key : Illegal Request [current]
Apr 6 12:58:41 kernel: [ 130.878300] sd 0:0:0:1074937986: tag#1 Add. Sense: Invalid field in cdb
Apr 6 12:58:41 kernel: [ 130.882710] sd 0:0:0:1074937986: [sda] Attached SCSI disk
Apr 6 12:58:43 s390-zfcp[7752]: DEBUG: POPULATE LUN TREE: 0.0.1904:0x50050763070845e3:0x4082401200000000 [0:0:0:1074937986]
Apr 6 12:58:43 s390-zfcp[7752]: DEBUG: WRITE CONFIG: /etc/sysconfig/hardware/config-ccw-0.0.1904
Apr 6 12:58:47 s390-zfcp[7752]: DEBUG: SELECT: Using FCP device 0.0.1944
Apr 6 12:58:47 kernel: [ 136.279385] scsi host1: scsi_eh_1: sleeping
Apr 6 12:58:47 kernel: [ 136.279396] scsi host1: zfcp
Apr 6 12:58:47 kernel: [ 136.290784] qdio: 0.0.1944 ZFCP on SC 10 using AI:1 QEBSM:1 PRI:1 TDD:1 SIGA: W A
Apr 6 12:58:52 s390-zfcp[7752]: DEBUG: ENABLE: Activated FCP device 0.0.1944 (npiv=1)
Apr 6 12:58:52 kernel: [ 141.278081] scsi 1:0:0:0: scsi scan: INQUIRY pass 1 length 36
Apr 6 12:58:52 kernel: [ 141.278691] scsi 1:0:0:0: scsi scan: INQUIRY successful with code 0x0
Apr 6 12:58:52 kernel: [ 141.278698] scsi 1:0:0:0: scsi scan: INQUIRY pass 2 length 164
Apr 6 12:58:52 kernel: [ 141.278910] scsi 1:0:0:0: scsi scan: INQUIRY successful with code 0x0
Apr 6 12:58:52 kernel: [ 141.278915] scsi 1:0:0:0: scsi scan: peripheral device type of 31, no device added
Apr 6 12:58:52 kernel: [ 141.279221] scsi 1:0:0:0: scsi scan: Sending REPORT LUNS to (try 0)
Apr 6 12:58:52 kernel: [ 141.280115] scsi 1:0:0:0: scsi scan: REPORT LUNS successful (try 0) result 0x0
Apr 6 12:58:52 kernel: [ 141.280119] scsi 1:0:0:0: scsi scan: REPORT LUN scan
Apr 6 12:58:52 kernel: [ 141.280296] scsi 1:0:0:1074937986: scsi scan: INQUIRY pass 1 length 36
Apr 6 12:58:52 kernel: [ 141.280507] scsi 1:0:0:1074937986: scsi scan: INQUIRY successful with code 0x0
Apr 6 12:58:52 kernel: [ 141.280511] scsi 1:0:0:1074937986: scsi scan: INQUIRY pass 2 length 164
Apr 6 12:58:52 kernel: [ 141.280693] scsi 1:0:0:1074937986: scsi scan: INQUIRY successful with code 0x0
Apr 6 12:58:52 kernel: [ 141.280698] scsi 1:0:0:1074937986: Direct-Access IBM 2107900 1.27 PQ: 0 ANSI: 5
Apr 6 12:58:52 kernel: [ 141.281519] sd 1:0:0:1074937986: alua: supports implicit TPGS
Apr 6 12:58:52 kernel: [ 141.281798] sd 1:0:0:1074937986: tag#1 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Apr 6 12:58:52 kernel: [ 141.281801] sd 1:0:0:1074937986: tag#1 CDB: Test Unit Ready 00 00 00 00 00 00
Apr 6 12:58:52 kernel: [ 141.281802] sd 1:0:0:1074937986: tag#1 Sense Key : Unit Attention [current]
Apr 6 12:58:52 kernel: [ 141.281804] sd 1:0:0:1074937986: tag#1 Add. Sense: Power on, reset, or bus device reset occurred
Apr 6 12:58:52 kernel: [ 141.282130] sd 1:0:0:1074937986: [sdb] 41943040 512-byte logical blocks: (21.5 GB/20.0 GiB)
Apr 6 12:58:52 kernel: [ 141.283353] sd 1:0:0:1074937986: [sdb] Write Protect is off
Apr 6 12:58:52 kernel: [ 141.283355] sd 1:0:0:1074937986: [sdb] Mode Sense: ed 00 00 08
Apr 6 12:58:52 kernel: [ 141.283637] sd 1:0:0:1074937986: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 6 12:58:52 kernel: [ 141.283758] sd 1:0:0:1074937986: tag#1 Done: SUCCESS Result: hostbyte=DID_TARGET_FAILURE driverbyte=DRIVER_OK
Apr 6 12:58:52 kernel: [ 141.283760] sd 1:0:0:1074937986: tag#1 CDB: Report supported operation codes a3 0c 01 12 00 00 00 00 02 00 00 00
Apr 6 12:58:52 kernel: [ 141.283761] sd 1:0:0:1074937986: tag#1 Sense Key : Illegal Request [current]
Apr 6 12:58:52 kernel: [ 141.283763] sd 1:0:0:1074937986: tag#1 Add. Sense: Invalid field in cdb
Apr 6 12:58:52 kernel: [ 141.288209] sd 1:0:0:1074937986: [sdb] Attached SCSI disk
Apr 6 12:58:53 s390-zfcp[7752]: DEBUG: POPULATE LUN TREE: 0.0.1944:0x50050763071845e3:0x4082401200000000 [1:0:0:1074937986]
Apr 6 12:58:53 s390-zfcp[7752]: DEBUG: WRITE CONFIG: /etc/sysconfig/hardware/config-ccw-0.0.1944

It looks like the LUN scan is being terminated.
Kernel parms with: scsi_mod.scsi_logging_level=4605 zfcp.dbflevel=6 zfcp.dbfsize=100
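
As a reading aid for the logs: the decimal LUN in the SCSI address (for example 0:0:0:1074937986) and the 64-bit FCP LUN shown by s390-zfcp (0x4082401200000000) are the same value in two encodings. The sketch below reproduces the mapping; it packs the 16-bit words the way I understand the kernel helper `scsilun_to_int` to do it, and the Python function name is invented for this example.

```python
def fcp_lun_to_scsi_lun(fcp_lun: str) -> int:
    """Convert a 64-bit FCP LUN (hex string) to the integer LUN Linux prints."""
    raw = int(fcp_lun, 16).to_bytes(8, "big")
    lun = 0
    # Each big-endian 16-bit word becomes the next higher 16 bits of the result.
    for i in range(0, 8, 2):
        lun |= ((raw[i] << 8) | raw[i + 1]) << (i * 8)
    return lun

print(fcp_lun_to_scsi_lun("0x4082401200000000"))  # prints 1074937986
```

This matches the POPULATE LUN TREE lines above, where 0x4082401200000000 is listed next to [0:0:0:1074937986].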

Re-testing with installer version 445 + udeb packages from 2016-04-07 s390-dasd 0.0.36ubuntu1, s390-zfcp 1.0.2ubuntu1, multipath-udeb 0.5.0+git1.656f8865-5ubuntu2, disk-detect 1.117ubuntu1

Some observations:
1. With one path only, the scanning/detection of all LUNs worked deterministically and correctly (including the final re-IPL after successful installation).
2. With two paths, but without the parmfile option 'disk-detect/multipath/enable=true', the LUNs were all detected, but once for every path. I did not want to continue installation that way.
3. With two paths and with the option 'disk-detect/multipath/enable=true', the multipath program was loaded and multipathing was set up correctly. ...
4. ... installation completed fine then...
5. .... but the final re-IPL after successful installation failed. Attaching last console messages of the IPL:
6. I could also see a difference in filesystem creation time for multipath and single path:

with multipathing (2 paths)
Apr 7 15:27:33 partman: mke2fs 1.42.13 (17-May-2015)
Apr 7 15:28:34 kernel: [ 559.624851] EXT4-fs (dm-4): mounted filesystem with ordered data mode. Opts: errors=remount-ro

without multipathing
Apr 7 15:48:17 partman: mke2fs 1.42.13 (17-May-2015)
Apr 7 15:48:50 kernel: [ 413.789054] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: errors=remount-ro
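
The difference in observation 6 can be put into numbers from the syslog timestamps above. A minimal sketch, assuming the time from the mke2fs line to the following EXT4 mount line is a fair proxy for filesystem creation time; `elapsed_seconds` is a name made up for this example.

```python
from datetime import datetime

def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

multipath = elapsed_seconds("15:27:33", "15:28:34")    # with multipathing (2 paths)
single_path = elapsed_seconds("15:48:17", "15:48:50")  # without multipathing

print(multipath, single_path)  # prints "61 33"
```

So in this one run the multipath case took roughly twice as long (61 s versus 33 s).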

bugproxy (bugproxy) wrote : syslog

Default Comment by Bridge

tags: added: architecture-s39064 bugnameltc-139096 severity-high targetmilestone-inin1604

Changed in ubuntu:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
Luciano Chavez (lnx1138) on 2016-04-07
affects: ubuntu → debian-installer (Ubuntu)

------- Comment From <email address hidden> 2016-04-15 13:18 EDT-------
Meanwhile, with kernel 4.4.0-18 and installer version 447, the situation is different:
While the problem originally occurred in the installer, it now happens when the target system is being IPLed.

The "switch" (to avoid the word "reason") is the kernel module scsi_dh_alua.
When scsi_dh_alua is loaded, auto LUN scanning upon activating a zfcp device with NPIV LUNs does not detect all LUNs. I assume that the kernel module itself is not the problem, but it seems to trigger the problem.

A very easy way to reproduce this is as follows:
Have a system installed with kernel 4.4.0-18 (almost GA) and two NPIV-enabled zfcp adaptors (on different PCHIDs) that can each access the same (few) LUNs. Then:

root@s8330005:~# lscss |grep 1732/03
0.0.1905 0.0.0012 1732/03 1731/03 80 80 ff 60000000 00000000
0.0.1945 0.0.0013 1732/03 1731/03 80 80 ff 61000000 00000000
root@s8330005:~# lsmod |grep scsi_dh
root@s8330005:~# chccwdev -e 1905
Setting device 0.0.1905 online
Done
root@s8330005:~# lszfcp -D
0.0.1905/0x50050763070845e3/0x4082405300000000 0:0:0:1079197826
0.0.1905/0x50050763070845e3/0x4083405300000000 0:0:0:1079197827
0.0.1905/0x50050763070845e3/0x4084405300000000 0:0:0:1079197828
0.0.1905/0x50050763070845e3/0x4085405300000000 0:0:0:1079197829
root@s8330005:~# lsmod |grep scsi_dh
root@s8330005:~# modprobe scsi_dh_alua
root@s8330005:~# chccwdev -e 1945
Setting device 0.0.1945 online
Done
root@s8330005:~# lszfcp -D
0.0.1905/0x50050763070845e3/0x4082405300000000 0:0:0:1079197826
0.0.1905/0x50050763070845e3/0x4083405300000000 0:0:0:1079197827
0.0.1905/0x50050763070845e3/0x4084405300000000 0:0:0:1079197828
0.0.1905/0x50050763070845e3/0x4085405300000000 0:0:0:1079197829
0.0.1945/0x50050763071845e3/0x4082405300000000 1:0:0:1079197826
/sbin/lszfcp: line 244: /sys/bus/ccw/drivers/zfcp/0.0.1945/host1/rport-1:0-0/target1:0:0/1:0:0:1079197827//hba_id: No such file or directory
/sbin/lszfcp: line 245: /sys/bus/ccw/drivers/zfcp/0.0.1945/host1/rport-1:0-0/target1:0:0/1:0:0:1079197827//wwpn: No such file or directory
/sbin/lszfcp: line 246: /sys/bus/ccw/drivers/zfcp/0.0.1945/host1/rport-1:0-0/target1:0:0/1:0:0:1079197827//fcp_lun: No such file or directory
0.0.1945/0x50050763071845e3/0x4082405300000000 1:0:0:1079197827
root@s8330005:~#
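
To make the gap in the second `lszfcp -D` listing easier to spot, the output can be grouped per adaptor and compared. This is a rough sketch under the assumption stated above that both adaptors should see the same LUNs; the function names are invented for illustration, and the sample text is shortened from the transcript.

```python
from collections import defaultdict

def luns_per_adapter(lszfcp_output: str) -> dict:
    """Group FCP LUNs by adaptor from 'lszfcp -D' style lines."""
    luns = defaultdict(set)
    for line in lszfcp_output.splitlines():
        fields = line.split()
        # Expect "<adaptor>/<wwpn>/<fcp_lun> <h:c:t:l>"; skip error messages.
        if len(fields) != 2 or fields[0].count("/") != 2:
            continue
        adapter, _wwpn, fcp_lun = fields[0].split("/")
        luns[adapter].add(fcp_lun)
    return dict(luns)

def missing_luns(luns: dict) -> dict:
    """For each adaptor, list the LUNs that some other adaptor reports."""
    all_luns = set().union(*luns.values())
    return {a: sorted(all_luns - seen) for a, seen in luns.items() if all_luns - seen}

sample = """\
0.0.1905/0x50050763070845e3/0x4082405300000000 0:0:0:1079197826
0.0.1905/0x50050763070845e3/0x4083405300000000 0:0:0:1079197827
0.0.1905/0x50050763070845e3/0x4084405300000000 0:0:0:1079197828
0.0.1905/0x50050763070845e3/0x4085405300000000 0:0:0:1079197829
0.0.1945/0x50050763071845e3/0x4082405300000000 1:0:0:1079197826
/sbin/lszfcp: line 244: (path shortened)/hba_id: No such file or directory
0.0.1945/0x50050763071845e3/0x4082405300000000 1:0:0:1079197827
"""

print(missing_luns(luns_per_adapter(sample)))
```

For the transcript above this reports the three LUNs 0x4083..., 0x4084... and 0x4085... as missing on 0.0.1945, while 0.0.1905 is complete.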

When a zfcp-attached SCSI LUN is used for installation, (almost?) all scsi_dh modules are loaded upon activation of the zfcp device. This causes some LUNs not to be detected.
If I get access to that system and blacklist the scsi_dh_alua module, the IPL works as expected (i.e. all the LUNs are detected properly).
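
Since the presence of scsi_dh_alua is the deciding factor, it helps to check which device handler modules are loaded before bringing a zfcp device online. The sketch below does the same as `lsmod |grep scsi_dh` by parsing text in the `/proc/modules` format; the function name and the sample module lines (sizes, addresses) are made up for illustration.

```python
def loaded_scsi_dh_modules(proc_modules_text: str) -> list:
    """Return the names of loaded scsi_dh_* modules, sorted alphabetically."""
    names = (line.split()[0] for line in proc_modules_text.splitlines() if line.strip())
    return sorted(name for name in names if name.startswith("scsi_dh_"))

# Hypothetical sample in /proc/modules layout: name, size, refcount, deps, state, address.
sample_modules = """\
scsi_dh_alua 20480 0 - Live 0x0000000000000000
zfcp 143360 0 - Live 0x0000000000000000
scsi_dh_rdac 16384 0 - Live 0x0000000000000000
"""

print(loaded_scsi_dh_modules(sample_modules))  # prints ['scsi_dh_alua', 'scsi_dh_rdac']
```

On a live system the input would come from reading `/proc/modules`; an empty list corresponds to the empty `lsmod |grep scsi_dh` output shown in the transcript.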

------- Comment From <email address hidden> 2016-04-15 16:42 EDT-------
(In reply to comment #27)
> Meanwhile, with kernel 4.4.0-18 and installer version 447, the situation is
> different:
> While the problem originally occurred in the installer, it now happens when the
> target system is being IPLed.
>
> The "switch" (to avoid the word "reason") is the kernel module scsi_dh_alua.
> When scsi_dh_alua is loaded, the auto LUN scanning upon activating a zfcp
> device with NPIV LUNs causes that not all LUNs are detected. I assume, that
> not the kernel module itself is the problem, but it seems to trigger the
> problem.
>
> A very easy way to reproduce this is as follows:
> have a system installed with kernel 4.4.0-18 (almost GA). Have two NPIV
> enabled zfcp adaptors (on different PCHIDs) that can access the same (few)
> LUNs each. Then:
>
> root@s8330005:~# lscss |grep 1732/03
> 0.0.1905 0.0.0012 1732/03 1731/03 80 80 ff 60000000 00000000
> 0.0.1945 0.0.0013 1732/03 1731/03 80 80 ff 61000000 00000000
> root@s8330005:~# lsmod |grep scsi_dh
> root@s8330005:~# chccwdev -e 1905
> Setting device 0.0.1905 online
> Done
> root@s8330005:~# lszfcp -D
> 0.0.1905/0x50050763070845e3/0x4082405300000000 0:0:0:1079197826
> 0.0.1905/0x50050763070845e3/0x4083405300000000 0:0:0:1079197827
> 0.0.1905/0x50050763070845e3/0x4084405300000000 0:0:0:1079197828
> 0.0.1905/0x50050763070845e3/0x4085405300000000 0:0:0:1079197829
> root@s8330005:~# lsmod |grep scsi_dh
> root@s8330005:~# modprobe scsi_dh_alua
> root@s8330005:~# chccwdev -e 1945
> Setting device 0.0.1945 online
> Done
> root@s8330005:~# lszfcp -D
> 0.0.1905/0x50050763070845e3/0x4082405300000000 0:0:0:1079197826
> 0.0.1905/0x50050763070845e3/0x4083405300000000 0:0:0:1079197827
> 0.0.1905/0x50050763070845e3/0x4084405300000000 0:0:0:1079197828
> 0.0.1905/0x50050763070845e3/0x4085405300000000 0:0:0:1079197829
> 0.0.1945/0x50050763071845e3/0x4082405300000000 1:0:0:1079197826
> /sbin/lszfcp: line 244:
> /sys/bus/ccw/drivers/zfcp/0.0.1945/host1/rport-1:0-0/target1:0:0/1:0:0:
> 1079197827//hba_id: No such file or directory
> /sbin/lszfcp: line 245:
> /sys/bus/ccw/drivers/zfcp/0.0.1945/host1/rport-1:0-0/target1:0:0/1:0:0:
> 1079197827//wwpn: No such file or directory
> /sbin/lszfcp: line 246:
> /sys/bus/ccw/drivers/zfcp/0.0.1945/host1/rport-1:0-0/target1:0:0/1:0:0:
> 1079197827//fcp_lun: No such file or directory
> 0.0.1945/0x50050763071845e3/0x4082405300000000 1:0:0:1079197827
> root@s8330005:~#
>
> When a zfcp attached SCSI LUN is used for installation, (almost?) all
> scsi_dh modules are loaded upon activation of the zfcp device. This causes
> some LUNs not to be detected.
> If I get access to that system and blacklist the scsi_dh_alua module, the IPL
> works as expected (i.e. all the LUNs are detected properly).

Hello, I just wanted to add to Thorsten's comment: we are also still working on analyzing this bug. To the best of my knowledge, the ALUA device handler is not to blame here, as Thorsten already suggested.

After looking at debug data collected during reproductions, it seems that a SCSI command sent by the ALUA device handler gets lost and thus causes a timeout after one minute. This then...

tags: removed: bugnameltc-139096 severity-high
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-04-19 13:38 EDT-------
The problem does not occur with Debian kernel 4.2.0-1-s390x #1 SMP Debian 4.2.6-3 (2015-12-06) s390x, but it does occur with Ubuntu kernel 4.3.0-7 and newer kernels.
It also occurs with Debian kernel 4.4.0-1-s390x #1 SMP Debian 4.4.6-1 (2016-03-17) (on a Debian userspace).

tags: added: bugnameltc-139096 severity-high
Dimitri John Ledkov (xnox) wrote :

So this is an upstream kernel regression with a certain hardware configuration? I'm looking into the scans as seen on our hardware; the LUNs appear to be discovered consistently on our combinations of hardware, but I will stress test that more.

affects: debian-installer (Ubuntu) → linux (Ubuntu)
Dimitri John Ledkov (xnox) wrote :

Also w.r.t. multipath, disk-detect/multipath/enable is now enabled by default on Ubuntu on s390x, thus multipath installer codepath should be the one offered by the installer when the devices are scanned correctly. Moving this bug report to kernel package, as it seems to me this is a kernel issue and it's not the installer logic that is at fault.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-04-20 06:45 EDT-------
The problem of LUNs not being properly discovered happened only on the following combination:
Storage Server DS8870 AND ( auto LUN scan when onlining an NPIV enabled device OR scripted parallelized addition of LUNs) AND recent kernels > 4.3 AND scsi_dh_alua is loaded.
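
The combination above can be written down as a simple predicate, which may help when comparing test setups. This is only a sketch: the class and field names are invented, kernel versions are given as (major, minor) tuples, and the strict "> 4.3" comparison is taken literally from the comment even though an earlier comment reports the problem with Ubuntu kernel 4.3.0-7.

```python
from dataclasses import dataclass

@dataclass
class Setup:
    storage_server: str
    kernel: tuple               # (major, minor), e.g. (4, 4)
    npiv_auto_lun_scan: bool    # auto LUN scan when onlining an NPIV-enabled device
    parallel_lun_addition: bool # scripted, parallelized addition of LUNs
    scsi_dh_alua_loaded: bool

def is_affected(s: Setup) -> bool:
    """True if the setup matches the failing combination described above."""
    return (
        s.storage_server == "DS8870"
        and (s.npiv_auto_lun_scan or s.parallel_lun_addition)
        and s.kernel > (4, 3)
        and s.scsi_dh_alua_loaded
    )

print(is_affected(Setup("DS8870", (4, 4), True, False, True)))   # prints True
print(is_affected(Setup("DS8870", (4, 4), True, False, False)))  # prints False
```

The second call corresponds to the workaround of keeping scsi_dh_alua unloaded.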

bugproxy (bugproxy) on 2016-04-20
tags: added: targetmilestone-inin16041
removed: targetmilestone-inin1604
Stefan Bader (smb) wrote :

I also tried to re-create this on one of the lpars we got. No luck either. However I was using the latest kernel which was uploaded just ... I think yesterday (4.4.0-21). Thorsten, maybe you can try that version, too. Just to be on the safe side.

The dmesg I get also shows that alua probing is done, so that appears to go through exactly the same sequence. We have NPIV enabled, too. The only difference here is that I also have DASD, which we use to IPL from. I am mainly speculating here, but it feels a bit like newer kernels might send out inquiry commands more quickly or in parallel, and something in the environment fails to cope with that and silently drops some requests...

ChristianEhrhardt (paelzer) wrote :

Just to be sure I did a cross check on my lpar:

Cleaning my system to get an empty condition at the start
sudo chzdev zfcp-lun --all --disable --force
sudo chzdev zfcp-host --all --disable --force
sudo chzdev zfcp --all --disable --force

Now it looks like the one from Thorsten:

ubuntu@s1lp5:~$ lscss |grep 1732/03
0.0.e000 0.0.0d4d 1732/03 1731/03 80 80 ff 20000000 00000000
0.0.e001 0.0.0d4e 1732/03 1731/03 80 80 ff 20000000 00000000
0.0.e002 0.0.0d4f 1732/03 1731/03 80 80 ff 20000000 00000000
0.0.e003 0.0.0d50 1732/03 1731/03 80 80 ff 20000000 00000000
0.0.e004 0.0.0d51 1732/03 1731/03 80 80 ff 20000000 00000000
0.0.e005 0.0.0d52 1732/03 1731/03 80 80 ff 20000000 00000000
0.0.e006 0.0.0d53 1732/03 1731/03 80 80 ff 20000000 00000000
0.0.e007 0.0.0d54 1732/03 1731/03 80 80 ff 20000000 00000000
0.0.e008 0.0.0d55 1732/03 1731/03 80 80 ff 20000000 00000000
0.0.e009 0.0.0d56 1732/03 1731/03 80 80 ff 20000000 00000000
0.0.e00a 0.0.0d57 1732/03 1731/03 80 80 ff 20000000 00000000
0.0.e00b 0.0.0d58 1732/03 1731/03 80 80 ff 20000000 00000000
0.0.e00c 0.0.0d59 1732/03 1731/03 80 80 ff 20000000 00000000
0.0.e00d 0.0.0d5a 1732/03 1731/03 80 80 ff 20000000 00000000
0.0.e00e 0.0.0d5b 1732/03 1731/03 80 80 ff 20000000 00000000
0.0.e00f 0.0.0d5c 1732/03 1731/03 80 80 ff 20000000 00000000
0.0.e100 0.0.0d5d 1732/03 1731/03 80 80 ff 21000000 00000000
0.0.e101 0.0.0d5e 1732/03 1731/03 80 80 ff 21000000 00000000
0.0.e102 0.0.0d5f 1732/03 1731/03 80 80 ff 21000000 00000000
0.0.e103 0.0.0d60 1732/03 1731/03 80 80 ff 21000000 00000000
0.0.e104 0.0.0d61 1732/03 1731/03 80 80 ff 21000000 00000000
0.0.e105 0.0.0d62 1732/03 1731/03 80 80 ff 21000000 00000000
0.0.e106 0.0.0d63 1732/03 1731/03 80 80 ff 21000000 00000000
0.0.e107 0.0.0d64 1732/03 1731/03 80 80 ff 21000000 00000000
0.0.e108 0.0.0d65 1732/03 1731/03 80 80 ff 21000000 00000000
0.0.e109 0.0.0d66 1732/03 1731/03 80 80 ff 21000000 00000000
0.0.e10a 0.0.0d67 1732/03 1731/03 80 80 ff 21000000 00000000
0.0.e10b 0.0.0d68 1732/03 1731/03 80 80 ff 21000000 00000000
0.0.e10c 0.0.0d69 1732/03 1731/03 80 80 ff 21000000 00000000
0.0.e10d 0.0.0d6a 1732/03 1731/03 80 80 ff 21000000 00000000
0.0.e10e 0.0.0d6b 1732/03 1731/03 80 80 ff 21000000 00000000
0.0.e10f 0.0.0d6c 1732/03 1731/03 80 80 ff 21000000 00000000
ubuntu@s1lp5:~$ lsluns -a
ubuntu@s1lp5:~$ lsmod |grep scsi_dh

Then following Thorsten's guide to trigger this on a live system
1. the one without alua being ok:
ubuntu@s1lp5:~$ sudo chccwdev -e 0.0.e000
Setting device 0.0.e000 online
Done
ubuntu@s1lp5:~$ lszfcp -D
0.0.e000/0x50050763061b16b6/0x4024400200000000 1:0:0:1073889316
0.0.e000/0x50050763061b16b6/0x4024400300000000 1:0:0:1073954852
0.0.e000/0x50050763060b16b6/0x4024400200000000 1:0:1:1073889316
0.0.e000/0x50050763060b16b6/0x4024400300000000 1:0:1:1073954852

2. the one after loading alua with issues:
ubuntu@s1lp5:~$ sud...

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-04-20 11:35 EDT-------
(In reply to comment #37)
> I also tried to re-create this on one of the lpars we got. No luck either.
> However I was using the latest kernel which was uploaded just ... I think
> yesterday (4.4.0-21). Thorsten, maybe you can try that version, too. Just to
> be on the safe side.

I interpret that as having the same problem/symptom as I have. Correct?

I also tried that today with kernel 4.4.0-21, same problem. Then we got a uCode update on our machine, uCode now brand-new from 04/18. Problem persists.

And I involved a tester in U.S. to reproduce this on their machine. Waiting for results.

ChristianEhrhardt (paelzer) wrote :

FYI - Storage Server is on older version 5.7.41.1028

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-04-20 12:03 EDT-------
Problem persists here with DS8700 uCode version 5.7.51.1041 (Code Bundle 87.51.38.0)

Stefan Bader (smb) wrote :

Sorry no, what I meant was that I do not see any problem. All LUNs get detected, no error message about alua.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-04-21 07:05 EDT-------
I saw these problems today also on another machine / switch / DS8870 combination:
DS8870 uCode was 5.7.51.1009, which is also the newer stream.
Obviously the older stream 5.7.41.xx didn't show these problems.

At least the normal installation with non NPIV enabled SCSI LUNs on our DS8870 worked.
If I use NPIV enabled LUNs AND add the kernel parameter zfcp.allow_lun_scan=0 AND add the LUNs manually, it fails, if I have more than one LUN.
If I have NPIV enabled LUNs on DS8870 storage with microcode bundle 87.51.xx.x and use the LUN detection via NPIV, the installation fails.

no longer affects: ubuntu-release-notes
description: updated
Changed in ubuntu-release-notes:
status: New → Fix Released
bugproxy (bugproxy) wrote : syslog

Default Comment by Bridge

------- Comment From <email address hidden> 2016-04-21 11:35 EDT-------
(In reply to comment #39)
>
> Thorsten please try to reproduce on a fully updated system, if not resolved
> for you we have to start to sort out the remaining differences in our
> environment.

Yes, I did that during the last two days. It occurs independently of any xenial kernel.

Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu):
importance: High → Undecided
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu):
status: New → Confirmed
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-04-25 06:07 EDT-------
@Canonical: Do you have the z13 connected to the DS8870 via a switch with a 16 Gb link, or with an 8 Gb link? We could not reproduce this on an 8 Gb link.

ChristianEhrhardt (paelzer) wrote :

16 Gb to the switch and from there 8 Gb to the DS8K.
The DS8K was only delivered with 8 Gb DAs.

Changed in ubuntu-z-systems:
status: New → Triaged
Changed in ubuntu-z-systems:
status: Triaged → Confirmed
Changed in linux (Ubuntu):
assignee: Skipper Bug Screeners (skipper-screen-team) → Canonical Kernel (canonical-kernel)
Changed in ubuntu-z-systems:
importance: Undecided → High
Stefan Bader (smb) on 2016-06-10
Changed in linux (Ubuntu Xenial):
assignee: nobody → Stefan Bader (smb)
importance: Undecided → High
status: New → In Progress
Changed in ubuntu-z-systems:
status: Confirmed → In Progress
kaniggl (kaniggl) on 2016-06-17
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Changed in linux (Ubuntu):
status: Fix Released → Confirmed
importance: Undecided → High
Stefan Bader (smb) on 2016-06-27
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed

Changed in ubuntu-z-systems:
status: In Progress → Fix Committed
Kamal Mostafa (kamalmostafa) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial

------- Comment From <email address hidden> 2016-07-04 06:42 EDT-------
Verified successfully with kernel 4.4.0-30-generic from xenial-proposed. Please promote to xenial now.
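
For anyone wanting to check whether a given xenial kernel already carries the scsi_dh_alua changes, the comparison can be sketched as below. This is only an illustration: the function names are hypothetical, the threshold is taken from the changelog entry for this bug (linux 4.4.0-30.49), and the check is only meaningful for xenial 4.4.0 kernels.

```python
import re

# ABI of the first xenial kernel whose changelog lists the fix for this bug
# (linux 4.4.0-30.49). Assumption: only xenial 4.4.0-* kernels are compared.
FIXED_RELEASE = (4, 4, 0, 30)


def parse_release(release):
    """Turn a `uname -r` style string such as '4.4.0-30-generic' into a tuple."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def has_alua_fix(release):
    """Return True if the release is at or after the first fixed ABI."""
    return parse_release(release) >= FIXED_RELEASE
```

The string to pass in is what `uname -r` prints on the running system, for example `has_alua_fix("4.4.0-30-generic")`.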

Stefan Bader (smb) on 2016-07-05
tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-31.50

---------------
linux (4.4.0-31.50) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1602449

  * nouveau: boot hangs at blank screen with unsupported graphics cards
    (LP: #1602340)
    - SAUCE: drm: check for supported chipset before booting fbdev off the hw

linux (4.4.0-30.49) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1597897

  * FCP devices are not detected correctly nor deterministically (LP: #1567602)
    - scsi_dh_alua: Disable ALUA handling for non-disk devices
    - scsi_dh_alua: Use vpd_pg83 information
    - scsi_dh_alua: improved logging
    - scsi_dh_alua: sanitze sense code handling
    - scsi_dh_alua: use standard logging functions
    - scsi_dh_alua: return standard SCSI return codes in submit_rtpg
    - scsi_dh_alua: fixup description of stpg_endio()
    - scsi_dh_alua: use flag for RTPG extended header
    - scsi_dh_alua: use unaligned access macros
    - scsi_dh_alua: rework alua_check_tpgs() to return the tpgs mode
    - scsi_dh_alua: simplify sense code handling
    - scsi: Add scsi_vpd_lun_id()
    - scsi: Add scsi_vpd_tpg_id()
    - scsi_dh_alua: use scsi_vpd_tpg_id()
    - scsi_dh_alua: Remove stale variables
    - scsi_dh_alua: Pass buffer as function argument
    - scsi_dh_alua: separate out alua_stpg()
    - scsi_dh_alua: Make stpg synchronous
    - scsi_dh_alua: call alua_rtpg() if stpg fails
    - scsi_dh_alua: switch to scsi_execute_req_flags()
    - scsi_dh_alua: allocate RTPG buffer separately
    - scsi_dh_alua: Use separate alua_port_group structure
    - scsi_dh_alua: use unique device id
    - scsi_dh_alua: simplify alua_initialize()
    - revert commit a8e5a2d593cb ("[SCSI] scsi_dh_alua: ALUA handler attach should
      succeed while TPG is transitioning")
    - scsi_dh_alua: move optimize_stpg evaluation
    - scsi_dh_alua: remove 'rel_port' from alua_dh_data structure
    - scsi_dh_alua: Use workqueue for RTPG
    - scsi_dh_alua: Allow workqueue to run synchronously
    - scsi_dh_alua: Add new blacklist flag 'BLIST_SYNC_ALUA'
    - scsi_dh_alua: Recheck state on unit attention
    - scsi_dh_alua: update all port states
    - scsi_dh_alua: Send TEST UNIT READY to poll for transitioning
    - scsi_dh_alua: do not fail for unknown VPD identification

linux (4.4.0-29.48) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1597015

  * Wireless hotkey fails on Dell XPS 15 9550 (LP: #1589886)
    - intel-hid: new hid event driver for hotkeys
    - intel-hid: fix incorrect entries in intel_hid_keymap
    - intel-hid: allocate correct amount of memory for private struct
    - intel-hid: add a workaround to ignore an event after waking up from S4.
    - [Config] CONFIG_INTEL_HID_EVENT=m

  * cgroupfs mounts can hang (LP: #1588056)
    - Revert "UBUNTU: SAUCE: (namespace) mqueue: Super blocks must be owned by the
      user ns which owns the ipc ns"
    - Revert "UBUNTU: SAUCE: kernfs: Do not match superblock in another user
      namespace when mounting"
    - Revert "UBUNTU: SAUCE: cgroup: Use a new super block when mounting in a
      cgroup namespace"
    - (name...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Changed in ubuntu-z-systems:
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-07-18 08:09 EDT-------
Verified successfully running an installation in an NPIV-multipath environment with installer 451.3 from xenial-proposed. Repeated three times.

SCSI devices are probed correctly and detected deterministically. Multipathing is detected and set up automatically.

root@s8315043:~# lszdev
TYPE       ID                                              ON   PERS  NAMES
[some stuff deleted]
zfcp-host  0.0.192b                                        yes  yes
zfcp-host  0.0.196b                                        yes  yes
zfcp-lun   0.0.192b:0x50050763070845e3:0x4082403900000000  yes  yes   sda sg0
zfcp-lun   0.0.192b:0x50050763070845e3:0x4083403900000000  yes  yes   sdb sg1
zfcp-lun   0.0.192b:0x50050763070845e3:0x4084403900000000  yes  yes   sdc sg2
zfcp-lun   0.0.192b:0x50050763070845e3:0x4085403900000000  yes  yes   sdd sg3
zfcp-lun   0.0.196b:0x50050763071845e3:0x4082403900000000  yes  yes   sde sg4
zfcp-lun   0.0.196b:0x50050763071845e3:0x4083403900000000  yes  yes   sdf sg5
zfcp-lun   0.0.196b:0x50050763071845e3:0x4084403900000000  yes  yes   sdg sg6
zfcp-lun   0.0.196b:0x50050763071845e3:0x4085403900000000  yes  yes   sdh sg7

root@s8315043:~# lsscsi
[0:0:0:1077493890]disk IBM 2107900 .470 /dev/sda
[0:0:0:1077493891]disk IBM 2107900 .470 /dev/sdb
[0:0:0:1077493892]disk IBM 2107900 .470 /dev/sdc
[0:0:0:1077493893]disk IBM 2107900 .470 /dev/sdd
[1:0:0:1077493890]disk IBM 2107900 .470 /dev/sde
[1:0:0:1077493891]disk IBM 2107900 .470 /dev/sdf
[1:0:0:1077493892]disk IBM 2107900 .470 /dev/sdg
[1:0:0:1077493893]disk IBM 2107900 .470 /dev/sdh
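
The decimal LUN numbers in the lsscsi output are the same LUNs that lszdev shows in hexadecimal. A minimal sketch of the conversion follows, assuming the usual Linux packing of the 8-byte LUN in 16-bit words (as I understand the kernel's scsilun_to_int() helper to do); the function name fcp_lun_to_scsi_lun is made up for this illustration.

```python
def fcp_lun_to_scsi_lun(fcp_lun):
    """Convert a 64-bit FCP LUN given as a hex string to the integer LUN
    that appears in lsscsi and dmesg output."""
    raw = int(fcp_lun, 16).to_bytes(8, "big")
    lun = 0
    # Each big-endian 16-bit word of the FCP LUN is placed 16 bits higher
    # than the previous one, so the first word ends up in the lowest bits.
    for i in range(0, 8, 2):
        word = (raw[i] << 8) | raw[i + 1]
        lun |= word << (i * 8)
    return lun


for fcp_lun in ("0x4082403900000000", "0x4085403900000000"):
    print(fcp_lun, "->", fcp_lun_to_scsi_lun(fcp_lun))
# 0x4082403900000000 -> 1077493890
# 0x4085403900000000 -> 1077493893
```

This is why 0x4082403900000000 on either adapter shows up as LUN 1077493890 (0x40394082) on scsi0 and scsi1.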

root@s8315043:~# multipath -l
mpathd (36005076307ffc5e30000000000008539) dm-4 IBM,2107900
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
  |- 0:0:0:1077493893 sdd 8:48  active undef running
  `- 1:0:0:1077493893 sdh 8:112 active undef running
mpathc (36005076307ffc5e30000000000008439) dm-0 IBM,2107900
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
  |- 0:0:0:1077493892 sdc 8:32  active undef running
  `- 1:0:0:1077493892 sdg 8:96  active undef running
mpathb (36005076307ffc5e30000000000008339) dm-1 IBM,2107900
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
  |- 0:0:0:1077493891 sdb 8:16  active undef running
  `- 1:0:0:1077493891 sdf 8:80  active undef running
mpatha (36005076307ffc5e30000000000008239) dm-3 IBM,2107900
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
  |- 0:0:0:1077493890 sda 8:0   active undef running
  `- 1:0:0:1077493890 sde 8:64  active undef running
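
To repeat this kind of check across several installation runs, the lsscsi output can be grouped by LUN and compared against the expected SCSI hosts. The following is a rough sketch only: the helper names are hypothetical, and the sample text is the lsscsi output pasted above.

```python
import re
from collections import defaultdict

LSSCSI_OUTPUT = """\
[0:0:0:1077493890]disk IBM 2107900 .470 /dev/sda
[0:0:0:1077493891]disk IBM 2107900 .470 /dev/sdb
[0:0:0:1077493892]disk IBM 2107900 .470 /dev/sdc
[0:0:0:1077493893]disk IBM 2107900 .470 /dev/sdd
[1:0:0:1077493890]disk IBM 2107900 .470 /dev/sde
[1:0:0:1077493891]disk IBM 2107900 .470 /dev/sdf
[1:0:0:1077493892]disk IBM 2107900 .470 /dev/sdg
[1:0:0:1077493893]disk IBM 2107900 .470 /dev/sdh
"""


def paths_per_lun(lsscsi_text):
    """Map each LUN number to the (host, device node) pairs it was seen on."""
    paths = defaultdict(list)
    for line in lsscsi_text.splitlines():
        match = re.match(r"\[(\d+):(\d+):(\d+):(\d+)\]", line)
        if match is None:
            continue  # skip anything that is not a device line
        host, lun = int(match.group(1)), int(match.group(4))
        device = line.split()[-1]  # device node is the last column
        paths[lun].append((host, device))
    return dict(paths)


def luns_missing_paths(lsscsi_text, expected_hosts):
    """Return the LUNs that are not visible on every expected SCSI host."""
    expected = set(expected_hosts)
    return sorted(
        lun
        for lun, seen in paths_per_lun(lsscsi_text).items()
        if {host for host, _ in seen} != expected
    )


print(luns_missing_paths(LSSCSI_OUTPUT, {0, 1}))
```

An empty list means every LUN was detected on both adapters, which matches the two paths per multipath device shown above. In the failing runs described at the top of this report, some LUNs would be expected to show up in the list.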
