Trying to online dasd drive results in invalid input/output from the kernel on z/VM

Bug #1845323 reported by Dimitri John Ledkov on 2019-09-25
This bug affects 2 people
Affects                   Importance   Assigned to
Ubuntu on IBM z Systems   Critical     Unassigned
linux (Ubuntu)            High         Canonical Kernel

Bug Description

SRU Justification:
==================

[Impact]

* Trying to online dasd drive results in invalid input/output from the kernel on z/VM

[Fix]

* Ignore errors from non-essential metadata reads, which may or may not be implemented by the storage server or z/VM.
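
To illustrate the idea behind the fix, here is a minimal user-space sketch in Python. It is not the kernel patch and does not mirror the dasd driver's code; `FakeDevice`, `read_configuration`, `read_volume_storage_info` and `set_online` are hypothetical names made up for this example. It only shows the pattern: a failing essential read still aborts, while a failing optional metadata read is logged and ignored.

```python
import errno
import logging

log = logging.getLogger("dasd-sketch")


class FakeDevice:
    """Hypothetical stand-in for a DASD; not a kernel structure."""

    def __init__(self, supports_volume_info):
        self.supports_volume_info = supports_volume_info
        self.online = False

    def read_configuration(self):
        # Essential in this sketch: without it the device cannot be used.
        return {"type": "3390"}

    def read_volume_storage_info(self):
        # Optional in this sketch: the backend may not implement the query.
        if not self.supports_volume_info:
            raise OSError(errno.EIO, "Input/output error")
        return {"space_configured": 0}


def set_online(dev):
    # An error raised by an essential read propagates and aborts onlining.
    config = dev.read_configuration()
    try:
        volume_info = dev.read_volume_storage_info()
    except OSError as exc:
        # Non-essential metadata: note the failure and carry on.
        log.warning("volume storage info unavailable (rc=-%d)", exc.errno)
        volume_info = None
    dev.online = True
    return config, volume_info
```

With `FakeDevice(supports_volume_info=False)` the call still ends with `dev.online` set, which is the behaviour the fix aims for on z/VM.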

[Test Case]

* Ubuntu on z/VM guest installation and selecting at least one DASD device (that's not defined as dedicated).

* Alternatively doing an Ubuntu on z/VM guest installation on zFCP/SCSI disk and manually activating a DASD device post-install with 'chzdev -e <device-number>'.

[Regression Potential]

* The kernel (aka DASD module) currently just fails on activating a (non dedicated) DASD.

* But regressions might be introduced in the DASD stack so that it fails later on LPAR installation, too - but this can easily be tested (and will be).

* The chance that zFCP/SCSI disks are harmed by accident is quite low, since this is a very different stack.

[Other Info]

* This is a regression that was introduced with the thin DASD provisioning feature that landed upstream with kernel 5.2/5.3, so it affects Eoan only.
__________

Sep 25 12:06:39 s390-dasd[4637]: ECKD DASD 0.0.0200 configure failed
Sep 25 12:06:39 s390-dasd[4637]: Error: Could not write file /sys/bus/ccw/drivers/dasd-eckd/0.0.0200/online: Input/output error
Sep 25 12:06:39 s390-dasd[4637]: Configuring devices in the active configuration only
Sep 25 12:06:39 main-menu[421]: WARNING **: Configuring 's390-dasd' failed with error code 1
Sep 25 12:06:39 main-menu[421]: WARNING **: Menu item 's390-dasd' failed.
Sep 25 12:06:39 kernel: [ 137.472853] dasd-eckd.401b68: 0.0.0200: A channel path to the device has become operational
Sep 25 12:06:39 kernel: [ 137.473914] dasd-eckd.6b7759: 0.0.0200: Reading the volume storage information failed with rc=-5
Sep 25 12:06:39 kernel: [ 137.473917] dasd.3e7d29: 0.0.0200 Setting the DASD online with discipline ECKD failed with rc=-5
Sep 25 12:06:39 kernel: [ 137.473918] ------------[ cut here ]------------
Sep 25 12:06:39 kernel: [ 137.473943] WARNING: CPU: 0 PID: 4638 at kernel/module.c:1137 module_put.part.0+0xe2/0xe8
Sep 25 12:06:39 kernel: [ 137.473944] Modules linked in: lcs ctcm fsm zfcp scsi_transport_fc dasd_fba_mod dasd_eckd_mod dasd_mod qeth_l2 pkey crc32_vx_s390 qeth qdio zcrypt_cex4 ccwgroup zcrypt
Sep 25 12:06:39 kernel: [ 137.473953] CPU: 0 PID: 4638 Comm: chzdev Not tainted 5.3.0-10-generic #11-Ubuntu
Sep 25 12:06:39 kernel: [ 137.473954] Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
Sep 25 12:06:39 kernel: [ 137.473955] Krnl PSW : 0704c00180000000 000000002b2c3372 (module_put.part.0+0xe2/0xe8)
Sep 25 12:06:39 kernel: [ 137.473958] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
Sep 25 12:06:39 kernel: [ 137.473959] Krnl GPRS: 0000000000000004 0000000000000006 0000000000000024 0000000000000007
Sep 25 12:06:39 kernel: [ 137.473960] 0000000000000007 000000007f2ce800 fffffffffffffffb 000003ff80151578
Sep 25 12:06:39 kernel: [ 137.473961] 00000000fffffffb 000003ff80074df8 000003ff80151900 0000000078105800
Sep 25 12:06:39 kernel: [ 137.473962] 000000007e17e600 0000000000000bf8 000000002b2c336e 000003e000c5fb08
Sep 25 12:06:39 kernel: [ 137.473969] Krnl Code: 000000002b2c3362: c0200048859a larl %r2,2bbd3e96
Sep 25 12:06:39 kernel: [ 137.473969] 000000002b2c3368: c0e5fffe16a4 brasl %r14,2b2860b0
Sep 25 12:06:39 kernel: [ 137.473969] #000000002b2c336e: a7f40001 brc 15,2b2c3370
Sep 25 12:06:39 kernel: [ 137.473969] >000000002b2c3372: a7f4ffb1 brc 15,2b2c32d4
Sep 25 12:06:39 kernel: [ 137.473969] 000000002b2c3376: 0707 bcr 0,%r7
Sep 25 12:06:39 kernel: [ 137.473969] 000000002b2c3378: c00400000000 brcl 0,2b2c3378
Sep 25 12:06:39 kernel: [ 137.473969] 000000002b2c337e: ec280006007c cgij %r2,0,8,2b2c338a
Sep 25 12:06:39 kernel: [ 137.473969] 000000002b2c3384: c0f4ffffff86 brcl 15,2b2c3290
Sep 25 12:06:39 kernel: [ 137.473980] Call Trace:
Sep 25 12:06:39 kernel: [ 137.473982] ([<000000002b2c336e>] module_put.part.0+0xde/0xe8)
Sep 25 12:06:39 kernel: [ 137.473994] [<000003ff80074df8>] dasd_generic_free_discipline+0x68/0x80 [dasd_mod]
Sep 25 12:06:39 kernel: [ 137.473998] [<000003ff8007f5ba>] dasd_delete_device+0x122/0x1c8 [dasd_mod]
Sep 25 12:06:39 kernel: [ 137.474001] [<000003ff8007bcfa>] dasd_generic_set_online+0x2ea/0x310 [dasd_mod]
Sep 25 12:06:39 kernel: [ 137.474006] [<000000002b84db48>] ccw_device_set_online+0x110/0x528
Sep 25 12:06:39 kernel: [ 137.474008] [<000000002b84dfb0>] online_store_recog_and_online+0x50/0x130
Sep 25 12:06:39 kernel: [ 137.474009] [<000000002b84f042>] online_store+0x1b2/0x2e0
Sep 25 12:06:39 kernel: [ 137.474013] [<000000002b50b7a0>] kernfs_fop_write+0xd8/0x1f8
Sep 25 12:06:39 kernel: [ 137.474017] [<000000002b4585f8>] vfs_write+0xb0/0x1b8
Sep 25 12:06:39 kernel: [ 137.474018] [<000000002b459eb8>] ksys_write+0x68/0xf8
Sep 25 12:06:39 kernel: [ 137.474023] [<000000002ba4c4b8>] system_call+0xdc/0x2c8
Sep 25 12:06:39 kernel: [ 137.474024] Last Breaking-Event-Address:
Sep 25 12:06:39 kernel: [ 137.474025] [<000000002b2c336e>] module_put.part.0+0xde/0xe8
Sep 25 12:06:39 kernel: [ 137.474026] ---[ end trace 94b382399e0f8876 ]---
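
For triage, the two dasd failure lines in the log above can be pulled out and their return codes decoded with a few lines of Python. This is only a sketch: the regular expression is written against the message formats seen in this particular log and is an assumption, not a stable kernel log format, and `find_dasd_failures` is a made-up helper name.

```python
import errno
import re

# Matches the two dasd failure messages seen in this log (format assumed).
FAILURE_RE = re.compile(
    r"dasd(?:-eckd)?\.[0-9a-f]+: "
    r"(?P<busid>[0-9a-f]\.[0-9a-f]\.[0-9a-f]{4}):? "
    r"(?P<message>.*?) failed with rc=(?P<rc>-?\d+)"
)


def find_dasd_failures(lines):
    """Return (bus_id, message, errno_name) for each dasd failure line."""
    failures = []
    for line in lines:
        match = FAILURE_RE.search(line)
        if match is None:
            continue
        rc = int(match.group("rc"))
        # Kernel return codes are negative errno values, so -5 maps to EIO.
        name = errno.errorcode.get(-rc, "unknown")
        failures.append((match.group("busid"), match.group("message"), name))
    return failures
```

Fed the log above, this reports both failures for 0.0.0200 as EIO, which matches the "Input/output error" that s390-dasd got when writing the online attribute.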

Cannot online a dasd drive with $ chzdev -e 0.0.0200

Possibly a regression in the new kernel, due to the thin provisioning patches to dasd? Just a wild guess.

$ uname -a
Linux hwe0006 5.3.0-10-generic #11-Ubuntu SMP Mon Sep 9 15:08:10 UTC 2019 s390x GNU/Linux
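
The failing step can also be exercised directly, without chzdev, by writing to the sysfs attribute named in the s390-dasd error message. The sketch below is a rough illustration only: `set_dasd_online` is a hypothetical helper, the default directory is simply the path from the log above, and chzdev does considerably more than this single write.

```python
import errno
from pathlib import Path


def set_dasd_online(bus_id, driver_dir="/sys/bus/ccw/drivers/dasd-eckd"):
    """Write '1' to <driver_dir>/<bus_id>/online.

    Returns None on success, or the errno name (e.g. 'EIO') on failure.
    """
    attr = Path(driver_dir) / bus_id / "online"
    try:
        attr.write_text("1")
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None
```

On an affected kernel a call such as `set_dasd_online("0.0.0200")` (as root) would be expected to return 'EIO', corresponding to the rc=-5 in the kernel messages.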

Dimitri John Ledkov (xnox) wrote :
Frank Heimes (fheimes) on 2019-09-25
tags: added: s390x

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1845323

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: eoan
Frank Heimes (fheimes) on 2019-09-26
Changed in linux (Ubuntu):
status: Incomplete → New

Today I (re-)tested the Eoan daily ISO image (from the 26th)
(http://cdimage.ubuntu.com/ubuntu-server/daily/20190926/eoan-server-s390x.iso)
using the standard (d-i) installer kernel/initrd as shipped in /boot (not proposed)
(5.3.0-10-generic)
and can confirm that the problem with setting DASD devices online
(using d-i or using chzdev in d-i shell)
still occurs on z/VM 6.4 (tried RSU levels 1601, 1701 and 1901/latest).
Interestingly this does NOT happen on LPAR installations!
I involved IBM, too, and shared this ticket with them ...


Changed in linux (Ubuntu):
status: New → Incomplete
Frank Heimes (fheimes) on 2019-09-26
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
summary: Trying to online dasd drive results in invalid input/output from the
- kernel
+ kernel on z/VM
tags: added: bot-stop-nagging
Changed in linux (Ubuntu):
assignee: nobody → Canonical Kernel (canonical-kernel)
Frank Heimes (fheimes) wrote :

Got feedback that this problem occurs not only on z/VM 6.4 but also on 7.1, and it indeed seems to be an issue with thin provisioning (LP 1830731 #3). It seems to happen with virtual devices and MDISKs, but not with dedicated DASDs. A fix is in progress at IBM.

Frank Heimes (fheimes) on 2019-09-26
description: updated
Frank Heimes (fheimes) on 2019-09-27
Changed in ubuntu-z-systems:
status: New → Confirmed
importance: Undecided → Critical

We are working on a fix. Can you give this testpatch a spin?

description: updated
Frank Heimes (fheimes) wrote :

Christian, I took the patch and applied it to what's currently in Eoan (19.10) master-next.
It applied cleanly and compiled fine, and I installed the resulting kernel in a z/VM guest that is itself installed on FCP.
Afterwards I was able to enable additional DASDs from there with chzdev:

ubuntu@hwe0006:~$ sudo chzdev -e 200
ECKD DASD 0.0.0200 configured

ubuntu@hwe0006:~$ lszdev dasd | grep yes
dasd-eckd 0.0.0200 yes yes dasda
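
If this check needs to be scripted for the regression runs, a row like the one above can be parsed as follows. This is a sketch under the assumption that the columns are type, device ID, online state, persistent state, and device names, in that order; `ZdevEntry` and `parse_lszdev_line` are names invented for the example and are not part of s390-tools.

```python
from typing import NamedTuple


class ZdevEntry(NamedTuple):
    dev_type: str
    dev_id: str
    online: bool
    persistent: bool
    names: tuple


def parse_lszdev_line(line):
    """Parse one data row such as 'dasd-eckd 0.0.0200 yes yes dasda'."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"unexpected lszdev row: {line!r}")
    # Assumed column order: type, ID, online, persistent, names...
    dev_type, dev_id, online, persistent, *names = fields
    return ZdevEntry(
        dev_type, dev_id, online == "yes", persistent == "yes", tuple(names)
    )
```

For the row above, the resulting entry has both `online` and `persistent` set and carries `dasda` as its only device name.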

I thought I'd go ahead and create a proper file system:

ubuntu@hwe0006:~$ sudo dasdfmt -b 4096 /dev/dasda
Drive Geometry: 10016 Cylinders * 15 Heads = 150240 Tracks
Device Type: Fully Provisioned

I am going to format the device /dev/dasda in the following way:
   Device number of device : 0x200
   Labelling device : yes
   Disk label : VOL1
   Disk identifier : 0X0200
   Extent start (trk no) : 0
   Extent end (trk no) : 150239
   Compatible Disk Layout : yes
   Blocksize : 4096
   Mode : Full

--->> ATTENTION! <<---
All data of that device will be lost.
Type "yes" to continue, no will leave the disk untouched: yes
Formatting the device. This may take a while (get yourself a coffee).

Finished formatting the device.
Rereading the partition table... ok
ubuntu@hwe0006:~$ sudo fdasd -a /dev/dasda
reading volume label ..: VOL1
reading vtoc ..........: ok

auto-creating one partition for the whole disk...
writing volume label...
writing VTOC...
rereading partition table...
ubuntu@hwe0006:~$ sudo mkfs.ext4 /dev/dasda
dasda dasda1
ubuntu@hwe0006:~$ sudo mkfs.ext4 /dev/dasda1
mke2fs 1.45.3 (14-Jul-2019)
Creating filesystem with 1802856 4k blocks and 451584 inodes
Filesystem UUID: 63b8ea83-a707-42a7-bec2-16ee956eea30
Superblock backups stored on blocks:
 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done

ubuntu@hwe0006:~$
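
As a sanity check, the numbers printed by dasdfmt and mkfs.ext4 above are consistent with each other. The short calculation below relies on two assumptions that are not stated in the transcript: that a track holds 12 blocks at a 4096-byte block size, and that the auto-created partition starts at track 2, leaving tracks 0 and 1 for the volume label and VTOC.

```python
CYLINDERS = 10016
HEADS = 15             # tracks per cylinder, from the dasdfmt output
BLOCKS_PER_TRACK = 12  # assumption: 4096-byte blocks per track
RESERVED_TRACKS = 2    # assumption: tracks 0-1 precede the partition

# Total tracks, as printed under "Drive Geometry".
tracks = CYLINDERS * HEADS

# 4k blocks available to the single partition dasda1.
partition_blocks = (tracks - RESERVED_TRACKS) * BLOCKS_PER_TRACK
```

Under these assumptions `tracks` comes out at 150240 and `partition_blocks` at 1802856, the same figures that dasdfmt and mkfs.ext4 report.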

So it looks pretty good.
(Now need to regression test on further environments ...)

Thanks for the quick turn-around Jan, Stefan and Christian!

Dimitri John Ledkov (xnox) wrote :
Changed in linux (Ubuntu):
importance: Undecided → High
Frank Heimes (fheimes) on 2019-09-27
Changed in linux (Ubuntu):
status: Confirmed → In Progress
Changed in ubuntu-z-systems:
status: Confirmed → In Progress
tags: added: id-5d8d330bb2bfed880c80c1da

Hi,

Here is the patch as I have just sent it to Jens Axboe for upstream integration. There are no functional changes compared with the test patch posted by Christian; I only added my sign-off after some regression testing.

Regards,
Stefan

Frank Heimes (fheimes) wrote :

I just checked kernel 5.3.0.17.19 from eoan-proposed (which should include all patches/reverts) on a z/VM guest that is installed on zFCP and has access to DASDs, and the problem no longer occurs when a DASD is activated with chzdev.
Hence changing status to Fix Committed.

Now we need to wait until this kernel gets released and finds its way into d-i...

Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Changed in ubuntu-z-systems:
status: In Progress → Fix Committed
Frank Heimes (fheimes) wrote :
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.3.0-17.18

---------------
linux (5.3.0-17.18) eoan; urgency=medium

  * eoan/linux: 5.3.0-17.18 -proposed tracker (LP: #1846641)

  * CVE-2019-17056
    - nfc: enforce CAP_NET_RAW for raw sockets

  * CVE-2019-17055
    - mISDN: enforce CAP_NET_RAW for raw sockets

  * CVE-2019-17054
    - appletalk: enforce CAP_NET_RAW for raw sockets

  * CVE-2019-17053
    - ieee802154: enforce CAP_NET_RAW for raw sockets

  * CVE-2019-17052
    - ax25: enforce CAP_NET_RAW for raw sockets

  * CVE-2019-15098
    - ath6kl: fix a NULL-ptr-deref bug in ath6kl_usb_alloc_urb_from_pipe()

  * xHCI on AMD Stoney Ridge cannot detect USB 2.0 or 1.1 devices.
    (LP: #1846470)
    - x86/PCI: Avoid AMD FCH XHCI USB PME# from D0 defect

  * Re-enable linux-libc-dev build on i386 (LP: #1846508)
    - [Packaging] Build only linux-libc-dev for i386
    - [Debian] final-checks -- ignore archtictures with no binaries

  * arm64: loop on boot after installing linux-generic-hwe-18.04-edge/bionic-
    proposed (LP: #1845820)
    - [Config] Disable CONFIG_ARM_SMMU_DISABLE_BYPASS_BY_DEFAULT

  * Revert ESE DASD discard support (LP: #1846219)
    - SAUCE: Revert "s390/dasd: Add discard support for ESE volumes"

  * Miscellaneous Ubuntu changes
    - update dkms package versions

linux (5.3.0-16.17) eoan; urgency=medium

  * eoan/linux: 5.3.0-16.17 -proposed tracker (LP: #1846204)

  * zfs fails to build on s390x with debug symbols enabled (LP: #1846143)
    - SAUCE: s390: Mark atomic const ops always inline

linux (5.3.0-15.16) eoan; urgency=medium

  * eoan/linux: 5.3.0-15.16 -proposed tracker (LP: #1845987)

  * Drop i386 build for 19.10 (LP: #1845714)
    - [Packaging] Remove x32 arch references from control files
    - [Debian] final-checks -- Get arch list from debian/control

  * ZFS kernel modules lack debug symbols (LP: #1840704)
    - [Debian] Fix conditional for setting zfs debug package path

  * Use pyhon3-sphinx instead of python-sphinx for building html docs
    (LP: #1845808)
    - [Packaging] Update sphinx build dependencies to python3 packages

  * Kernel panic with 19.10 beta image (LP: #1845454)
    - efi/tpm: Don't access event->count when it isn't mapped.
    - efi/tpm: don't traverse an event log with no events
    - efi/tpm: only set efi_tpm_final_log_size after successful event log parsing

linux (5.3.0-14.15) eoan; urgency=medium

  * eoan/linux: 5.3.0-14.15 -proposed tracker (LP: #1845728)

  * Drop i386 build for 19.10 (LP: #1845714)
    - [Debian] Remove support for producing i386 kernels
    - [Debian] Don't use CROSS_COMPILE for i386 configs

  * udevadm trigger will fail when trying to add /sys/devices/vio/
    (LP: #1845572)
    - SAUCE: powerpc/vio: drop bus_type from parent device

  * Trying to online dasd drive results in invalid input/output from the kernel
    on z/VM (LP: #1845323)
    - SAUCE: s390/dasd: Fix error handling during online processing

  * intel-lpss driver conflicts with write-combining MTRR region (LP: #1845584)
    - SAUCE: mfd: intel-lpss: add quirk for Dell XPS 13 7390 2-in-1

  * Support Hi1620 zip hw accelerator (LP: #1845355)
    - [Config] Enable HiSilicon QM/ZIP as module...


Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Frank Heimes (fheimes) on 2019-10-10
Changed in ubuntu-z-systems:
status: Fix Committed → Fix Released

All autopkgtests for the newly accepted linux-gcp-5.3 (5.3.0-1008.9~18.04.1) for bionic have finished running.
The following regressions have been reported in tests triggered by the package:

linux-gcp-5.3/unknown (amd64)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/bionic/update_excuses.html#linux-gcp-5.3

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!
