Hot-unplug of disks leaves broken block devices around in Hirsute on s390x

Bug #1925211 reported by Christian Ehrhardt 
This bug affects 1 person
Affects                   Status         Importance   Assigned to   Milestone
Ubuntu on IBM z Systems   Fix Released   High         bugproxy
linux (Ubuntu)            Fix Released   High         Unassigned
  Hirsute                 Fix Released   High         Unassigned

Bug Description

SRU Justification

[Impact]

Hot removal of disks under kvm on s390 does not result in the kernel removing the block device, which can lead to hung tasks and other issues.

[Test Plan]

See steps to reproduce the bug in the original description below. To test, execute these steps and confirm that the block device gets removed as expected.

[Where problems could occur]

The fix is a revert of the changes which introduced this regression. The original commit was a removal of supposedly unused code, but it seems a mistake was made in the logic around unregistering of disks. Reverting the changes could have potential to introduce bugs related to other virt devices, especially if it interacts badly with subsequent driver changes. However, the patch reverted cleanly, and reverting restores the code to the state which has been working well in previous kernels and seems like the lowest risk option until a proper fix is available upstream.

---

Repro:
#1 Get a guest
$ uvt-kvm create --disk 5 --password=ubuntu h release=hirsute arch=s390x label=daily
$ uvt-kvm wait h release=hirsute arch=s390x label=daily

#2 Attach and Detach disk
$ sudo qemu-img create -f qcow2 /var/lib/libvirt/images/test.qcow2 10M
$ virsh attach-disk h /var/lib/libvirt/images/test.qcow2 vdc
$ virsh detach-disk h vdc

From libvirt's POV it is gone at this point
$ virsh domblklist h
 Target Source
------------------------------------------------------------------
 vda /var/lib/uvtool/libvirt/images/hirsute-2nd-zfs.qcow
 vdb /var/lib/uvtool/libvirt/images/hirsute-2nd-zfs-ds.qcow

But the guest still thinks it is present
$ uvt-kvm ssh --insecure hirsute-2nd-zfs lsblk
  ...
  vdc 252:32 0 20M 0 disk

It stays that way even after a while (so this is not a race).
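
The mismatch between the two views can be checked mechanically. The following is a minimal sketch and not part of the original report: stale_guest_disks is a hypothetical helper that takes the text of "virsh domblklist" and of the guest's "lsblk" and prints every vd* disk that the guest still lists but libvirt no longer knows about. It assumes the default column layout of both tools.

```shell
stale_guest_disks() {
    # $1: text printed by `virsh domblklist <guest>`
    # $2: text printed by `lsblk` inside the guest
    # Targets known to libvirt: first column, minus header and separator line.
    host_targets=$(printf '%s\n' "$1" |
        awk 'NF && $1 != "Target" && $1 !~ /^-+$/ { print $1 }')

    # Whole virtio disks seen by the guest (TYPE column is "disk").
    printf '%s\n' "$2" | awk '$6 == "disk" && $1 ~ /^vd/ { print $1 }' |
    while read -r disk; do
        # Print the disk only if libvirt does not list it anymore.
        printf '%s\n' "$host_targets" | grep -qxF -- "$disk" ||
            printf '%s\n' "$disk"
    done
}
```

On the host this could be called as stale_guest_disks "$(virsh domblklist h)" "$(uvt-kvm ssh --insecure h lsblk)"; empty output means guest and libvirt agree.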

Any access to it in the guest will hang (as you'd expect of a non-existing blockdev)
4 0 1758 1739 20 0 12140 4800 - S+ pts/0 0:00 | \_ sudo mkfs.ext4 /dev/vdc
4 0 1759 1758 20 0 6924 1044 - D+ pts/0 0:00 | \_ mkfs.ext4 /dev/vdc
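
The "D+" state in the listing above is what marks the hung mkfs.ext4. As a small sketch (hung_pids is a hypothetical helper, and it assumes input in the "ps axl" column layout shown above, where STAT is the 10th field), processes stuck in uninterruptible sleep can be filtered like this:

```shell
hung_pids() {
    # stdin: process listing in `ps axl` layout
    # F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
    # Print the PID (3rd field) of every process whose STAT starts with "D".
    awk '$10 ~ /^D/ { print $3 }'
}
```

Inside the guest it could be fed with: ps axl | hung_pids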

The result above was originally found with hirsute-guest@hirsute-host on s390x

I do NOT see the same with groovy-guest@hirsute-host on s390x
I DO see the same with hirsute-guest@groovy-host on s390x
  => Guest version dependent not Host/Hypervisor dependent
I DO see the same with ZFS disks AND LVM disks being added&removed
  => not type dependent
I do NOT see the same on x86.
  => Arch dependent ??

... the evidence slowly points towards an issue in the guest, damn we are so
close to release - but non-fully detaching disks are critical in my POV :-/

Filing this as-is for awareness, but certainly this will need more debugging.
I'm unsure where this will eventually end up, so for now I'll file it against kernel/udev/systemd.
If there are any known issues/components that are related let me know please!
---
ProblemType: Bug
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access '/dev/snd/': No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu65
Architecture: s390x
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
CRDA: N/A
CasperMD5CheckResult: unknown
DistroRelease: Ubuntu 21.04
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lspci:

Lspci-vt: -[0000:00]-
Lsusb: Error: command ['lsusb'] failed with exit code 1:
Lsusb-t: Error: command ['lsusb', '-t'] failed with exit code 1: /sys/bus/usb/devices: No such file or directory
Lsusb-v: Error: command ['lsusb', '-v'] failed with exit code 1:
Package: udev
PackageArchitecture: s390x
PciMultimedia:

ProcFB:

ProcKernelCmdLine: root=LABEL=cloudimg-rootfs
ProcVersionSignature: Ubuntu 5.11.0-14.15-generic 5.11.12
RelatedPackageVersions:
 linux-restricted-modules-5.11.0-14-generic N/A
 linux-backports-modules-5.11.0-14-generic N/A
 linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
Tags: hirsute uec-images
Uname: Linux 5.11.0-14-generic s390x
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm audio cdrom dialout dip floppy lxd netdev plugdev sudo video
_MarkForUpload: True
acpidump:

Changed in udev (Ubuntu):
importance: Undecided → Critical
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1925211

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: hirsute
Christian Ehrhardt  (paelzer) wrote : Re: Hot-unplug of disks leaves broken block devices around in Hirsute

This even happens with the most common image-file backed disks which further simplifies the repro:

$ sudo qemu-img create -f qcow2 /var/lib/libvirt/images/test.qcow2 10M
$ virsh attach-disk h /var/lib/libvirt/images/test.qcow2 vdc
$ virsh detach-disk h vdc

description: updated
tags: added: apport-collected uec-images
description: updated
Christian Ehrhardt  (paelzer) attached apport information: AudioDevicesInUse.txt, CurrentDmesg.txt, Dependencies.txt, ProcCpuinfo.txt, ProcCpuinfoMinimal.txt, ProcEnviron.txt, ProcInterrupts.txt, ProcModules.txt, SystemdDelta.txt, UdevDb.txt, WifiSyslog.txt

Changed in linux (Ubuntu):
status: Incomplete → New
status: New → Confirmed
Christian Ehrhardt  (paelzer) wrote : Re: Hot-unplug of disks leaves broken block devices around in Hirsute

Hirsute - dmesg
# attach
[ 264.065866] crw_info : CRW reports slct=0, oflw=0, chn=1, rsc=3, anc=0, erc=4, rsid=5
[ 264.065906] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=3, anc=0, erc=4, rsid=0
[ 264.099347] virtio_blk virtio5: [vdc] 385 512-byte logical blocks (197 kB/193 KiB)
# detach
[ 289.702243] crw_info : CRW reports slct=0, oflw=0, chn=1, rsc=3, anc=0, erc=4, rsid=5
[ 289.702267] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=3, anc=0, erc=4, rsid=0

Groovy - dmesg
# attach
[ 719.712747] crw_info : CRW reports slct=0, oflw=0, chn=1, rsc=3, anc=0, erc=4, rsid=5
[ 719.712758] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=3, anc=0, erc=4, rsid=0
[ 719.745538] virtio_blk virtio5: [vdc] 385 512-byte logical blocks (197 kB/193 KiB)
[ 719.745542] vdc: detected capacity change from 0 to 197120
# detach
[ 780.425222] crw_info : CRW reports slct=0, oflw=0, chn=1, rsc=3, anc=0, erc=4, rsid=5
[ 780.425233] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=3, anc=0, erc=4, rsid=0

The difference is that Hirsute lacks the capacity change message. The newer kernel
might set the size immediately instead of detecting it as zero and then bumping it up.

The sizes reported in the Hirsute guest are correct (e.g. 20M for a 20M zfs case).
So this might be a red herring, unless the missing message makes you very suspicious.

Christian Ehrhardt  (paelzer) wrote :

Hirsute - udevadm

# attach
KERNEL[319.020043] add /devices/css0/0.0.0005 (css)
KERNEL[319.020088] add /devices/css0/0.0.0005/0.0.0005 (ccw)
KERNEL[319.020103] bind /devices/css0/0.0.0005/0.0.0005 (ccw)
UDEV [319.022297] add /devices/css0/0.0.0005 (css)
UDEV [319.022802] add /devices/css0/0.0.0005/0.0.0005 (ccw)
UDEV [319.023039] bind /devices/css0/0.0.0005/0.0.0005 (ccw)
KERNEL[319.025073] add /devices/css0/0.0.0005/0.0.0005/virtio5 (virtio)
UDEV [319.025527] add /devices/css0/0.0.0005/0.0.0005/virtio5 (virtio)
KERNEL[319.027524] add /devices/virtual/bdi/252:32 (bdi)
UDEV [319.028389] add /devices/virtual/bdi/252:32 (bdi)
KERNEL[319.048685] add /devices/css0/0.0.0005/0.0.0005/virtio5/block/vdc (block)
KERNEL[319.048743] bind /devices/css0/0.0.0005/0.0.0005/virtio5 (virtio)
UDEV [319.072936] add /devices/css0/0.0.0005/0.0.0005/virtio5/block/vdc (block)
UDEV [319.075862] bind /devices/css0/0.0.0005/0.0.0005/virtio5 (virtio)

# detach
<no entry>

Groovy - udevadm

# attach
KERNEL[719.986637] add /devices/css0/0.0.0005 (css)
KERNEL[719.986669] add /devices/css0/0.0.0005/0.0.0005 (ccw)
KERNEL[719.986685] bind /devices/css0/0.0.0005/0.0.0005 (ccw)
UDEV [719.988667] add /devices/css0/0.0.0005 (css)
KERNEL[719.992750] add /devices/css0/0.0.0005/0.0.0005/virtio5 (virtio)
KERNEL[719.992757] add /devices/virtual/bdi/252:32 (bdi)
UDEV [719.993298] add /devices/css0/0.0.0005/0.0.0005 (ccw)
UDEV [719.993520] bind /devices/css0/0.0.0005/0.0.0005 (ccw)
UDEV [719.994097] add /devices/virtual/bdi/252:32 (bdi)
UDEV [719.995568] add /devices/css0/0.0.0005/0.0.0005/virtio5 (virtio)
KERNEL[720.009523] add /devices/css0/0.0.0005/0.0.0005/virtio5/block/vdc (block)
KERNEL[720.009544] bind /devices/css0/0.0.0005/0.0.0005/virtio5 (virtio)
UDEV [720.058301] add /devices/css0/0.0.0005/0.0.0005/virtio5/block/vdc (block)
UDEV [720.059128] bind /devices/css0/0.0.0005/0.0.0005/virtio5 (virtio)

# detach
KERNEL[780.673663] remove /devices/virtual/bdi/252:32 (bdi)
KERNEL[780.673928] remove /devices/css0/0.0.0005/0.0.0005/virtio5/block/vdc (block)
UDEV [780.676469] remove /devices/css0/0.0.0005/0.0.0005/virtio5/block/vdc (block)
UDEV [780.678185] remove /devices/virtual/bdi/252:32 (bdi)
KERNEL[780.708055] unbind /devices/css0/0.0.0005/0.0.0005/virtio5 (virtio)
KERNEL[780.708078] remove /devices/css0/0.0.0005/0.0.0005/virtio5 (virtio)
KERNEL[780.708088] unbind /devices/css0/0.0.0005/0.0.0005 (ccw)
KERNEL[780.708101] remove /devices/css0/0.0.0005/0.0.0005 (ccw)
KERNEL[780.708109] unbind /devices/css0/0.0.0005 (css)
KERNEL[780.708117] remove /devices/css0/0.0.0005 (css)
UDEV [780.708779] unbind /devices/css0/0.0.0005/0.0.0005/virtio5 (virtio)
UDEV [780.709099] remove /devices/css0/0.0.0005/0.0.0005/virtio5 (virtio)
UDEV [780.709397] unbind /devices/css0/0.0.0005/0.0.0005 (ccw)
UDEV [780.709670] remove /devices/css0/0.0.0005/0.0.0005 (ccw)
UDEV [780.709971] unbind /devices/css0/0.0.0005 (css)
UDEV [780.710194] remove /devices/css0/0.0.0005 (css)

The events on attach are exactly the same, but in sl...
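
The comparison of the two logs boils down to one question: did the kernel emit a "remove" uevent for the block device? A minimal sketch of that check follows. saw_block_remove is a hypothetical helper name, and it assumes the captured text has the "udevadm monitor" line layout shown in the logs above.

```shell
saw_block_remove() {
    # $1: captured `udevadm monitor` text
    # $2: block device name, e.g. vdc
    # Succeeds (exit status 0) only if a KERNEL "remove" event for
    # .../block/<name> in the block subsystem is present.
    printf '%s\n' "$1" | awk -v dev="$2" '
        $1 ~ /^KERNEL/ && $2 == "remove" && $NF == "(block)" &&
        $3 ~ ("/block/" dev "$") { found = 1 }
        END { exit !found }'
}
```

With the logs above, the Groovy detach section would pass this check for vdc, while the Hirsute detach section (no events at all) would not.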


Christian Ehrhardt  (paelzer) wrote :

FYI
[15:43] <rbalint> cpaelzer, this may be related https://github.com/systemd/systemd/blob/4d484e14bb9864cef1d124885e625f33bf31e91c/NEWS#L5

And indeed it is an interesting read, but for now the question is why only with 5.11@s390x then? Maybe there is something small to get this back in line? No one is proposing to change "all rules" in the last minute :-)

Christian Ehrhardt  (paelzer) wrote :

I was trying mainline builds [1] to pinpoint the kernel change (if any).
I had a few totally failing ones which stalled me a bit, but eventually I can show
this helpful list:

v5.8 - working
v5.10 - working
v5.10.31 - working
v5.11 - failing
v5.11.15 - failing
v5.12-rc8 - failing

So it isn't a >5.10 change that is backported into 5.10 stable.
And it isn't something that was fixed in later master or 5.11 stable releases.

If you want you could bisect the kernel for this, my env is great for testing
but very bad for kernel builds. Do you want to throw kernels my way or should I
try to build my own in another place?

[1]: https://kernel.ubuntu.com/~kernel-ppa/mainline/

Christian Ehrhardt  (paelzer) wrote :

Ok build env set up and tested the starting points.
From that I also see:
v5.10 - working
v5.11 - failing

So from here I think I can try a bisect
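
A bisect like this can be partly automated with "git bisect run", which treats exit status 0 as good, 125 as "skip this commit" and other statuses from 1 to 127 as bad. The sketch below only shows the verdict logic; bisect_exit_code is a hypothetical helper, and building, booting and the attach/detach cycle are assumed to happen elsewhere. Nothing in the thread says the bisect was actually scripted this way.

```shell
bisect_exit_code() {
    # $1: "ok" if the kernel under test built and booted, anything else otherwise
    # $2: guest `lsblk` text captured after `virsh detach-disk`
    # $3: device name that should have disappeared, e.g. vdc
    if [ "$1" != "ok" ]; then
        echo 125    # untestable commit: ask git bisect to skip it
        return
    fi
    if printf '%s\n' "$2" |
        awk -v dev="$3" '$1 == dev { found = 1 } END { exit !found }'; then
        echo 1      # device still listed after detach: bad
    else
        echo 0      # device gone: good
    fi
}
```

A wrapper script given to "git bisect run" would end with: exit "$(bisect_exit_code "$status" "$lsblk_text" vdc)", where both variables are placeholders for values the wrapper collects.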

Christian Ehrhardt  (paelzer) wrote :

commit 8cc0dcfdc1c0e0be107d0288f9c0cf1f4201be62
Author: Vineeth Vijayan <email address hidden>
Date: Fri Nov 20 09:36:38 2020 +0100

    s390/cio: remove pm support from ccw bus driver

    As part of removing broken pm-support from s390 arch, remove
    the pm callbacks from ccw-bus driver.The power-management functions
    are unused since the 'commit 394216275c7d ("s390: remove broken
    hibernate / power management support")'.

    Signed-off-by: Vineeth Vijayan <email address hidden>
    Reviewed-by: Peter Oberparleiter <email address hidden>
    Signed-off-by: Heiko Carstens <email address hidden>

 arch/s390/include/asm/ccwdev.h | 10 --
 drivers/s390/cio/cmf.c | 5 -
 drivers/s390/cio/device.c | 247 +----------------------------------------
 drivers/s390/cio/device.h | 1 -
 drivers/s390/cio/device_fsm.c | 6 -
 drivers/s390/cio/io_sch.h | 1 -

It seems it wasn't as unused/broken as they thought :-/
BTW the referenced 394216275c7d ("s390: remove broken hibernate / power management support") was in v5.7

Found by:

$ git bisect log
git bisect start
# good: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10
git bisect good 2c85ebc57b3e1817b6ce1a6b703928e113a90442
# bad: [f40ddce88593482919761f74910f42f4b84c004b] Linux 5.11
git bisect bad f40ddce88593482919761f74910f42f4b84c004b
# bad: [538fcf57aaee6ad78a05f52b69a99baa22b33418] Merge branches 'acpi-scan', 'acpi-pnp' and 'acpi-sleep'
git bisect bad 538fcf57aaee6ad78a05f52b69a99baa22b33418
# bad: [15b447361794271f4d03c04d82276a841fe06328] mm/lru: revise the comments of lru_lock
git bisect bad 15b447361794271f4d03c04d82276a841fe06328
# good: [b10733527bfd864605c33ab2e9a886eec317ec39] Merge tag 'amd-drm-next-5.11-2020-12-09' of git://people.freedesktop.org/~agd5f/linux into drm-next
git bisect good b10733527bfd864605c33ab2e9a886eec317ec39
# good: [2c075f38a708c578a752b738a45e8c26923eac2e] Merge branch 'radeon-fixes' (Radeon and amdgpu fixes)
git bisect good 2c075f38a708c578a752b738a45e8c26923eac2e
# bad: [76d4acf22b4847f6c7b2f9042366fbdc3d20f578] Merge tag 'perf-kprobes-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 76d4acf22b4847f6c7b2f9042366fbdc3d20f578
# bad: [f9b4240b074730f41c1ef8e0d695d10fb5bb1e27] Merge tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
git bisect bad f9b4240b074730f41c1ef8e0d695d10fb5bb1e27
# good: [d889797530c66f699170233474eab3361471e808] Merge remote-tracking branch 'arm64/for-next/fixes' into for-next/core
git bisect good d889797530c66f699170233474eab3361471e808
# good: [2f6ea6fb88ab9d517644a098fc670b4d5dd1735e] s390/tape: remove unsupported PM functions
git bisect good 2f6ea6fb88ab9d517644a098fc670b4d5dd1735e
# bad: [586592478b1fa8bb8cd6875a9191468e9b1a8b13] Merge tag 's390-5.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect bad 586592478b1fa8bb8cd6875a9191468e9b1a8b13
# good: [0b03beface02d519693edb8020f9811c67d5c88f] Merge tag 'm68k-for-v5.11-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
git bisect good 0b03beface02d519693edb8020f9811c67d5c88f
# bad: [613775d62ec60202f98d2c5f520e6e9ba6dd4ac4] s390/...


Christian Ehrhardt  (paelzer) wrote :

But it makes sense in regard to the broken use case and explains why we see it only on s390x.

@sforshee - what should we do about it now that we know which commit it was - just revert it asap or are there bad dependencies to it?

Seth Forshee (sforshee) wrote :

The commit reverts cleanly. We need to confirm that reverting the commit does fix the issue. I put a test build here, please test.

https://people.canonical.com/~sforshee/lp1925211/

I doubt we can get a new kernel into the release. If it's extremely urgent we can consider a day 0 SRU kernel for hirsute, otherwise we can make sure it gets into the first normal SRU kernel.

summary: - Hot-unplug of disks leaves broken block devices around in Hirsute
+ Hot-unplug of disks leaves broken block devices around in Hirsute on
+ s390x
Changed in linux (Ubuntu Hirsute):
milestone: none → hirsute-updates
Christian Ehrhardt  (paelzer) wrote :

Thanks Seth.
I have verified 5.11.0-16-generic #17+lp1925211v202104201520 from the PPA and can confirm that the issue is gone.
=> The revert works as a fix \o/

For the severity/urgency at least we now know that it is s390x only (not "good" but reduces the amount of affected people).
I'll check later (after I'm actually awake) whether it also affects non-KVM disks (e.g. channel I/O detaches); then we can decide whether a 0-day SRU or the next normal round will be OK.

Kai-Heng Feng (kaihengfeng) wrote :

The check was for resuming flag, but now it's inverted. Please test this patch.

Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
assignee: nobody → bugproxy (bugproxy)
tags: added: reverse-proxy-bugzilla
Christian Ehrhardt  (paelzer) wrote :

Thanks Frank for adding the mirror request to this, because either way we sooner or later want a discussion with the s390x developers on this.

Christian Ehrhardt  (paelzer) wrote :

Hi kaihengfeng,
Thanks for your patch suggestion! Semantically I'm not sure it is the right thing. To clarify: your theory is that the old code checked !resuming, and that the !cdev check in front of it was perhaps only there to avoid a dereference error. And now you assume that instead of !cdev it should check whether a cdev is there.
I'm unsure - if !cdev was indeed just there to protect the dereference, then maybe no check at all would be better. That would then read "if the event is IO_SCH_ORPH_UNREG or IO_SCH_UNREG then do css_sch_device_unregister".

But my not being immediately convinced doesn't mean much; it is easy to test and surely worth a try, and the result will be useful to know in any case. So I ran v5.11 (bad) plus your patch. It is working fine, that much I can tell you.

But if my thought above was right (the check was only there to avoid the potential dereference error), then why check it at all? If the condition cdev==NULL is possible, the code would now skip fully removing the device - we might not need the check at all.
And since I brought up the idea of dropping the cdev check entirely, that was worth a try as well. So the third check of this morning is for:
--- a/drivers/s390/cio/device.c
+++ b/drivers/s390/cio/device.c
@@ -1525,8 +1525,7 @@ static int io_subchannel_sch_event(struct subchannel *sch, int process)
        switch (action) {
        case IO_SCH_ORPH_UNREG:
        case IO_SCH_UNREG:
-               if (!cdev)
-                       css_sch_device_unregister(sch);
+               css_sch_device_unregister(sch);
                break;
        case IO_SCH_ORPH_ATTACH:
        case IO_SCH_UNREG_ATTACH:

My patch with that change - in my test - is working as well.
Neither of the solutions has triggered other regressions in my setup - but then there are so many potential use-cases that I can't be sure without a further review by subject matter experts.
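
To make the candidate conditions easier to compare, here is a small truth-table sketch. It is a model of the discussion only, not kernel code: the "before" condition (!cdev || !resuming) is reconstructed from the description in this thread, the function names are made up for illustration, and "resuming" is fixed to 0 because the hot-unplug test does not run during any PM operation.

```shell
# Each function takes: $1 = 1 if a cdev is attached to the subchannel, else 0
#                      $2 = 1 if the (since removed) PM "resuming" flag is set
# and succeeds (exit status 0) when css_sch_device_unregister() would be called.
cond_before_pm_removal() { [ "$1" -eq 0 ] || [ "$2" -eq 0 ]; }  # !cdev || !resuming
cond_v5_11()             { [ "$1" -eq 0 ]; }                    # !cdev
cond_cdev_present()      { [ "$1" -eq 1 ]; }                    # cdev
cond_unconditional()     { return 0; }                          # no check at all

unregister_table() {
    for variant in before_pm_removal v5_11 cdev_present unconditional; do
        for cdev in 0 1; do
            # resuming is always 0 outside of suspend/resume
            if "cond_$variant" "$cdev" 0; then
                result=unregister
            else
                result=skip
            fi
            printf '%-18s cdev=%s -> %s\n' "$variant" "$cdev" "$result"
        done
    done
}

unregister_table
```

In this model only the unconditional variant matches the pre-5.11 behaviour for both cdev=0 and cdev=1, while the cdev-present variant skips the unregister when cdev is NULL.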

So a summary of the recent tests:

5.11.0-16-generic #17+lp1925211v202104201520 (Seths full revert) - working
5.11.0lp1925211-patch-kaihengfeng-dirty - working
5.11.0nocdevcheck-paelzer-dirty - working

I think we'd want an answer from the IBM devs on which solution (full revert, kaihengfeng patch, cpaelzer patch, another approach) they would prefer - then we can submit it upstream for them to include officially, and we can carry it as delta until we rebase onto a version that has it applied anyway.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8cc0dcfdc1c0e0be107d0288f9c0cf1f4201be62

Kai-Heng Feng (kaihengfeng) wrote :

Yes, !resuming always evaluates to true because obviously the hot-unplug test was not done in any system PM operations.

I am also unsure whether cdev can be NULL in that context so I left it there. Other functions have similar cdev check too. I think IBM devs will have more insights on this.

Christian Ehrhardt  (paelzer) wrote :

I was wondering if I could trigger the same issue on an lpar as it would raise the severity IMHO. I have no claim on completeness of these tests in regard to all that could happen. I tried what I considered low hanging fruits in regard to this cross check.

Pre-condition each time
- a dasd attached to the system
- not used e.g. as a FS
- no aliases enabled
=> this (more or less) matches our former KVM based test case

$ lscss | grep 1523; lsdasd 0.0.1523; ll /dev/dasdc
0.0.1523 0.0.0183 3390/0c 3990/e9 yes f0 f0 ff 10111213 00000000
Bus-ID Status Name Device Type BlkSz Size Blocks
================================================================================
0.0.1523 active dasdc 94:8 ECKD 4096 7043MB 1803060
brw-rw---- 1 root disk 94, 8 Apr 21 06:21 /dev/dasdc

I was tracking the same state after the removing action and ran udevadm monitor to see if an unbind happened.

---

#1 cio purge
$ sudo cio_ignore -a 0.0.1523; sudo cio_ignore --purge

=> can't take away online devices, and I'm not interested in initial blocking ..

---

#2 chzdev
$ sudo chzdev --disable 0.0.1523

=> properly removed

---

#3 remove the dasds on the storage server
"LSS 08 SRV_SS0_0823" is mapped to s1lp5 0.0.1523 - removing that on the storage server

By default that fails:

Error - delete of volume SRV_SS0_0823 failed.
8:28 AM
Error: CMUN02948E IBM.2107-75DXP71/0823 The Delete logical volume task cannot be initiated because the Allow Host Pre-check Control Switch is set to true and the volume that you have specified is online to a host.

In the old UI the force option is available as checkbox - trying via that.
Done.

The system does not realize that the disk is gone, I/O on it (e.g. dasdfmt) goes into a deadlock.
After a while in that hang the system realizes it is in trouble:

dmesg:
Apr 21 06:42:32 s1lp5 kernel: dasd(eckd): I/O status report for device 0.0.1523:
                              dasd(eckd): in req: 00000000e903a5ac CC:00 FC:00 AC:00 SC:00 DS:00 CS:00 RC:-11
                              dasd(eckd): device 0.0.1523: Failing CCW: 0000000000000000
                              dasd(eckd): SORRY - NO VALID SENSE AVAILABLE
Apr 21 06:42:32 s1lp5 kernel: dasd(eckd): Related CP in req: 00000000e903a5ac
                              dasd(eckd): CCW 00000000c3e100c4: 2760000C 014C5FF0 DAT: 18000000 08231c00 00000000
                              dasd(eckd): CCW 00000000335dd238: 3E20401A 00A40000 DAT: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Apr 21 06:42:32 s1lp5 kernel: dasd(eckd):......
Apr 21 06:42:32 s1lp5 kernel: dasd-eckd.adb621: 0.0.1523: ERP failed for the DASD

udevadm:
KERNEL[1313.022835] remove /devices/css0/0.0.0183/0.0.1523/block/dasdc/dasdc1 (block)
UDEV [1313.024648] remove /devices/css0/0.0.0183/0.0.1523/block/dasdc/dasdc1 (block)

Even after the above - the disk is still "present":
$ lscss | grep 1523; lsdasd 0.0.1523; ll /dev/dasdc
0.0.1523 0.0.0183 3390/0c 3990/e9 yes f0 f0 0f 10111213 00000000
Bus-ID Status Name Device Type BlkSz Size Blocks
==================================================================...


Changed in udev (Ubuntu Hirsute):
status: New → Invalid
Changed in systemd (Ubuntu Hirsute):
status: New → Invalid
Changed in linux (Ubuntu Hirsute):
status: Confirmed → Triaged
importance: Undecided → High
Changed in udev (Ubuntu Hirsute):
importance: Critical → Undecided
Changed in ubuntu-z-systems:
status: New → Triaged
importance: Undecided → High
bugproxy (bugproxy)
tags: added: architecture-s39064 bugnameltc-192463 severity-high targetmilestone-inin2104
tags: added: patch
Seth Forshee (sforshee) wrote :

The condition for css_sch_device_unregister(sch) also caught my eye, calling it unconditionally is probably closer to right because it was called in the !cdev case before, and in the attached patch it would no longer be called in this case. However I think in the short term the revert is the safest option, since the code will match what we already know was working in the groovy kernel. Once a fix is committed upstream, we can trade out the revert for that patch.

Seth Forshee (sforshee)
description: updated
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2021-04-23 01:50 EDT-------
---snip---
--- a/drivers/s390/cio/device.c
+++ b/drivers/s390/cio/device.c
@@ -1525,8 +1525,7 @@ static int io_subchannel_sch_event(struct subchannel *sch, int process)
        switch (action) {
        case IO_SCH_ORPH_UNREG:
        case IO_SCH_UNREG:
-               if (!cdev)
-                       css_sch_device_unregister(sch);
+               css_sch_device_unregister(sch);
                break;
        case IO_SCH_ORPH_ATTACH:
        case IO_SCH_UNREG_ATTACH:

I think we'd want an answer from the IBM devs which solution (full revert, kaihenfeng patch, cpaelzer patch, another approach) they would prefer - then we can submit it upstream for them to include officially and we can carry it as delta until we rebase onto a version that has it applied anyway.
---snip---

Thank you very much for reporting this. Yes. This is a leftover from the pm-remove patch and the right solution is as mentioned above here. We shall prepare the patch and share it to the external mailing-list.

Stefan Bader (smb)
Changed in linux (Ubuntu Hirsute):
status: Triaged → Fix Committed
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Triaged → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-hirsute' to 'verification-done-hirsute'. If the problem still exists, change the tag 'verification-needed-hirsute' to 'verification-failed-hirsute'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-hirsute
Christian Ehrhardt  (paelzer) wrote :

Testing former status with 5.11.0-17-generic

Journal:
Add:
May 17 05:56:26 h kernel: crw_info : CRW reports slct=0, oflw=0, chn=1, rsc=3, anc=0, erc=4, rsid=5
May 17 05:56:26 h kernel: crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=3, anc=0, erc=4, rsid=0
May 17 05:56:26 h kernel: virtio_blk virtio5: [vdc] 385 512-byte logical blocks (197 kB/193 KiB)
Remove:
May 17 05:56:35 h kernel: crw_info : CRW reports slct=0, oflw=0, chn=1, rsc=3, anc=0, erc=4, rsid=5
May 17 05:56:35 h kernel: crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=3, anc=0, erc=4, rsid=0

ubuntu@s1lp5:~$ virsh attach-disk h /var/lib/libvirt/images/test.qcow2 vdc
Disk attached successfully

ubuntu@s1lp5:~$ virsh detach-disk h vdc
Disk detached successfully

ubuntu@s1lp5:~$ virsh domblklist h
 Target Source
----------------------------------------------------
 vda /var/lib/uvtool/libvirt/images/h.qcow
 vdb /var/lib/uvtool/libvirt/images/h-ds.qcow

ubuntu@s1lp5:~$ uvt-kvm ssh --insecure hirsute-2nd-zfs lsblk
uvt-kvm: error: libvirt: Domain not found: no domain with matching name 'hirsute-2nd-zfs'
ubuntu@s1lp5:~$ uvt-kvm ssh --insecure h lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 47.9M 1 loop /snap/core18/1998
loop1 7:1 0 29.9M 1 loop /snap/snapd/11838
loop2 7:2 0 63.1M 1 loop /snap/lxd/20396
vda 252:0 0 8G 0 disk
└─vda1 252:1 0 8G 0 part /
vdb 252:16 0 372K 0 disk
vdc 252:32 0 192.5K 0 disk

The disk is still present despite not really being in the guest anymore.

Now testing the proposed kernel 5.11.0-18-generic

The upgrade works fine; it triggers the unrelated bug 1928625, but that has no bearing on the issue here.

Then journal (as expected) looks the same:
# Attach
May 17 06:09:22 h kernel: crw_info : CRW reports slct=0, oflw=0, chn=1, rsc=3, anc=0, erc=4, rsid=5
May 17 06:09:22 h kernel: crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=3, anc=0, erc=4, rsid=0
May 17 06:09:22 h kernel: virtio_blk virtio5: [vdc] 385 512-byte logical blocks (197 kB/193 KiB)
# Detach
May 17 06:09:29 h kernel: crw_info : CRW reports slct=0, oflw=0, chn=1, rsc=3, anc=0, erc=4, rsid=5
May 17 06:09:29 h kernel: crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=3, anc=0, erc=4, rsid=0

But lsblk confirms that the new kernel now works:
ubuntu@s1lp5:~$ virsh attach-disk h /var/lib/libvirt/images/test.qcow2 vdc
Disk attached successfully

ubuntu@s1lp5:~$ virsh detach-disk h vdc
Disk detached successfully

ubuntu@s1lp5:~$ virsh domblklist h
 Target Source
----------------------------------------------------
 vda /var/lib/uvtool/libvirt/images/h.qcow
 vdb /var/lib/uvtool/libvirt/images/h-ds.qcow

ubuntu@s1lp5:~$ uvt-kvm ssh --insecure h lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 63.1M 1 loop /snap/lxd/20396
loop1 7:1 0 29.9M 1 loop /snap/snapd/11838
loop2 7:2 0 47.9M 1 loop /snap/core18/1998
vda 252:0 0 8G 0 disk
└─vda1 252:1 0 8G 0 part /
vdb 252:16 0 372K 0 disk

^^ no more vdc after the detach.
Setting verified

tags: added: verification-done-hirsute
removed: verification-needed-hirsute
Stefan Bader (smb) wrote :

Just as a heads up: The upstream stable update 5.11.20 for hirsute un-reverts
- "s390/cio: remove pm support from ccw bus driver"
and adds
- "s390/cio: remove invalid condition on IO_SCH_UNREG"

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2021-05-19 07:18 EDT-------
IBM Response:
Yes, this makes sense. Previously Canonical reverted the "s390/cio: remove pm support from ccw bus driver" patch because of the issue. The right solution is to un-revert that revert and add the "s390/cio: remove invalid condition on IO_SCH_UNREG" patch, which makes both available in the kernel.

So, we agree on this.

In stable 5.13-rc1 we have both patches, which is the same as in 5.11.20 for hirsute.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.11.0-18.19

---------------
linux (5.11.0-18.19) hirsute; urgency=medium

  * hirsute/linux: 5.11.0-18.19 -proposed tracker (LP: #1927578)

  * Packaging resync (LP: #1786013)
    - update dkms package versions

  * Introduce the 465 driver series, fabric-manager, and libnvidia-nscq
    (LP: #1925522)
    - debian/dkms-versions -- add NVIDIA 465 and migrate 450 to 460

  * linux-image-5.0.0-35-generic breaks checkpointing of container
    (LP: #1857257)
    - SAUCE: overlayfs: fix incorrect mnt_id of files opened from map_files

  * Hirsute update: v5.11.17 upstream stable release (LP: #1927535)
    - vhost-vdpa: protect concurrent access to vhost device iotlb
    - Revert "UBUNTU: SAUCE: ovl: Restore vm_file value when lower fs mmap fails"
    - ovl: fix reference counting in ovl_mmap error path
    - coda: fix reference counting in coda_file_mmap error path
    - amd/display: allow non-linear multi-planar formats
    - drm/amdgpu: reserve fence slot to update page table
    - drm/amdgpu: fix GCR_GENERAL_CNTL offset for dimgrey_cavefish
    - gpio: omap: Save and restore sysconfig
    - KEYS: trusted: Fix TPM reservation for seal/unseal
    - vdpa/mlx5: Set err = -ENOMEM in case dma_map_sg_attrs fails
    - pinctrl: lewisburg: Update number of pins in community
    - block: return -EBUSY when there are open partitions in blkdev_reread_part
    - pinctrl: core: Show pin numbers for the controllers with base = 0
    - arm64: dts: allwinner: Revert SD card CD GPIO for Pine64-LTS
    - bpf: Allow variable-offset stack access
    - bpf: Refactor and streamline bounds check into helper
    - bpf: Tighten speculative pointer arithmetic mask
    - perf/x86/intel/uncore: Remove uncore extra PCI dev HSWEP_PCI_PCU_3
    - perf/x86/kvm: Fix Broadwell Xeon stepping in isolation_ucodes[]
    - perf auxtrace: Fix potential NULL pointer dereference
    - perf map: Fix error return code in maps__clone()
    - HID: google: add don USB id
    - HID: asus: Add support for 2021 ASUS N-Key keyboard
    - HID: alps: fix error return code in alps_input_configured()
    - HID cp2112: fix support for multiple gpiochips
    - HID: wacom: Assign boolean values to a bool variable
    - soc: qcom: geni: shield geni_icc_get() for ACPI boot
    - dmaengine: xilinx: dpdma: Fix descriptor issuing on video group
    - dmaengine: xilinx: dpdma: Fix race condition in done IRQ
    - ARM: dts: Fix swapped mmc order for omap3
    - m68k: fix flatmem memory model setup
    - net: geneve: check skb is large enough for IPv4/IPv6 header
    - dmaengine: tegra20: Fix runtime PM imbalance on error
    - s390/entry: save the caller of psw_idle
    - arm64: kprobes: Restore local irqflag if kprobes is cancelled
    - xen-netback: Check for hotplug-status existence before watching
    - cavium/liquidio: Fix duplicate argument
    - csky: change a Kconfig symbol name to fix e1000 build error
    - ia64: fix discontig.c section mismatches
    - ia64: tools: remove duplicate definition of ia64_mf() on ia64
    - x86/crash: Fix crash_setup_memmap_entries() out-of-bounds access
    - net: hso: fix NULL-deref on disconnect regression
    - USB: CDC-ACM...

Changed in linux (Ubuntu Hirsute):
status: Fix Committed → Fix Released
Frank Heimes (fheimes) wrote :

With that we can close the entire bug, since hirsute / 5.11 is now Fix Released (#34), groovy / 5.8 is not affected (#17) and the patches are upstream with 5.13 (#33), hence will end up sooner or later in impish.

Changed in ubuntu-z-systems:
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Frank Heimes (fheimes) wrote :

To unblock the SRU process for this bug, I'm updating the tag from verification-needed-focal to verification-done-focal. However, this ticket does not affect focal and was never marked as affecting focal (see title and bug description), so I think this tag should not have been added here.

tags: added: verification-done-focal
removed: verification-needed-focal
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2021-06-14 08:34 EDT-------
IBM bugzilla status->closed, Fix Released by all requested distros.

Mathew Hodson (mhodson)
no longer affects: systemd (Ubuntu)
no longer affects: systemd (Ubuntu Hirsute)
no longer affects: udev (Ubuntu)
no longer affects: udev (Ubuntu Hirsute)