jammy 5.15 kernel soft lockup when zfs.ko is loaded on s390x w/ gcc >= 11.2.0-10ubuntu1 / gcc-11 PLT regression on s390x

Bug #1954676 reported by Andrea Righi
This bug affects 1 person
Affects                    Status         Importance   Assigned to   Milestone
Ubuntu on IBM z Systems    Fix Released   Critical     bugproxy
gcc-11 (Ubuntu)            Fix Released   Critical     Unassigned
  Impish                   Invalid        Undecided    Unassigned
  Jammy                    Fix Released   Critical     Unassigned
linux (Ubuntu)             Fix Released   Critical     Unassigned
  Impish                   Fix Released   High         Unassigned
  Jammy                    Fix Released   Critical     Unassigned

Bug Description

[Impact]

Installing zfs-dkms seems to trigger a soft lockup as soon as zfs.ko is loaded. When the soft lockup happens, the system is no longer reachable via ssh, and explicit RCU stall warnings appear on the console.

This seems to happen only when zfs is compiled with gcc >= 11.2.0-10ubuntu1. After downgrading gcc to a previous version the problem does not occur and zfs works fine.

[Test case]

On s390x install the latest 5.15 Jammy kernel and run:
apt install zfs-dkms

[Fix]

Temporary workaround is to build zfs with gcc-10 on s390x.

[Regression potential]

Building a kernel module with a different compiler is never 100% safe: we may see crashes or panics if the ABI is not compatible. Moreover, if we build zfs with gcc-10 we lose some of the performance benefits provided by gcc-11.

However, these regressions are limited to zfs on s390x, and without this change zfs is broken on this architecture anyway.

[Further analysis]

@IBM

Issue reported to zfs upstream at https://github.com/openzfs/zfs/issues/12942

Reverting https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=2335aa8771acd06b082d3e15d9f21ae0a802afd7 appears to make everything work: the compiled kernel and zfs.ko are loadable/unloadable once again.

Please bring this issue to the attention of Ilya Leoshkevich.

Potentially this means we may have misbuilt userspace binaries in the archive for s390x.

Andrea Righi (arighi) wrote :

Moving to Debian zfs 2.1.1 and applying the attached debdiff seems to fix the problem.

description: updated
tags: added: patch
Andrea Righi (arighi) wrote :

New debdiff attached, which also includes the fix for a potential data corruption with zfs-2.1.1 (https://github.com/openzfs/zfs/issues/12762).

Andrea Righi (arighi) wrote :

New debdiff attached to move to zfs 2.1.2.

Andrea Righi (arighi) wrote :

Update: we're still getting the same failure even with zfs 2.1.2. I've opened an issue upstream:
https://github.com/openzfs/zfs/issues/12942

Andrea Righi (arighi) wrote :

Quick summary: it really looks like a compiler issue, because everything works using the same version of kernel + zfs, but on Impish. Also trying to compile zfs with gcc-10 doesn't show any problem.

Andrea Righi (arighi) wrote :

I've tried multiple versions of gcc and here's the result:

 - gcc 11.2.0-13ubuntu1 : bad
 - gcc 11.2.0-12ubuntu1 : bad
 - gcc 11.2.0-10ubuntu1 : bad
 - gcc 11.2.0-7ubuntu2 : good
 - gcc 10.3.0-13ubuntu1 : good

Apparently the last version of gcc that seems to work with zfs + kernel 5.15 is 11.2.0-7ubuntu2 (the one currently in impish).

The same behavior happens both with a vanilla 5.15 and our Ubuntu kernel 5.15 (only on s390x, all the other architectures are working just fine).

Andrea Righi (arighi)
description: updated
description: updated
Andrea Righi (arighi)
summary: - jammy 5.15 soft lockup when zfs.ko is loaded on s390x
+ jammy 5.15 kernel soft lockup when zfs.ko is loaded on s390x w/ gcc >=
+ 11.2.0-10ubuntu1
Changed in gcc-11 (Ubuntu Jammy):
importance: Undecided → Critical
Changed in zfs-linux (Ubuntu Jammy):
importance: Undecided → Critical
assignee: nobody → Andrea Righi (arighi)
Dimitri John Ledkov (xnox) wrote : Re: jammy 5.15 kernel soft lockup when zfs.ko is loaded on s390x w/ gcc >= 11.2.0-10ubuntu1

We are suspecting https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=2335aa8771acd06b082d3e15d9f21ae0a802afd7 and trying to do builds with it reverted.

Dimitri John Ledkov (xnox) wrote :
Andrea Righi (arighi) wrote :

New debdiff attached that forces gcc-10 to build the dkms part on s390x.

description: updated
Andrea Righi (arighi) wrote :

New debdiff, because v2 was missing the gcc-10 dependency for zfs-dkms in debian/control.

Dimitri John Ledkov (xnox) wrote :

The revert of the gcc-11 patch mentioned above results in working kernel builds, which load zfs.ko both when it is built into the kernel and when it is built via zfs-dkms.

description: updated
summary: jammy 5.15 kernel soft lockup when zfs.ko is loaded on s390x w/ gcc >=
- 11.2.0-10ubuntu1
+ 11.2.0-10ubuntu1 / gcc-11 PLT regression on s390x
description: updated
Frank Heimes (fheimes)
tags: added: s390x
Changed in ubuntu-z-systems:
assignee: nobody → bugproxy (bugproxy)
importance: Undecided → Critical
Ilya Leoshkevich (iii-i) wrote :

Sorry about that, I'm investigating.

Ilya Leoshkevich (iii-i) wrote :

I don't seem to be able to reproduce this. What I tried so far:

1. Installed a clean VM using an image from http://ftp.uni-kl.de/pub/linux/ubuntu-dvd/ubuntu-server/daily-live/pending/, ran apt install zfs-dkms - zfs.ko initialization successful.

2. Built zfs master myself and loaded the modules manually - zfs.ko initialization is successful.

3. Installed v5.15 kernel from https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.15/ and ran apt install zfs-dkms - zfs.ko initialization successful.

4. Built zfs master myself again (for v5.15 this time) and loaded the modules manually - zfs.ko initialization is successful.

Am I missing something?

One thing is that zfs.ko from https://kernel.ubuntu.com/~arighi/zfs/ has vermagic=5.15.0-15-generic, whereas mine has vermagic=5.15.0-051500-generic. Is there another v5.15 kernel build that I can try? The default repos seem to have only v5.13:

$ apt search linux-image-generic
...
linux-image-generic/jammy,now 5.13.0.19.30 s390x [installed,automatic]
  Generic Linux kernel image
...

Frank Heimes (fheimes) wrote :

Hi Ilya, there is a 5.15 kernel for jammy (aka 22.04) in the proposed section of the archive.
Just enable proposed with:
sudo add-apt-repository "deb http://us.ports.ubuntu.com/ubuntu-ports/ $(lsb_release -sc)-proposed main"
(for the sources use: sudo add-apt-repository "deb-src http://us.ports.ubuntu.com/ubuntu-ports/ $(lsb_release -sc)-proposed main")
Run "sudo apt update" (if not automatically triggered) and you will find the new packages listed by:
apt list --upgradable
You can install all of them (e.g. with "sudo apt full-upgrade") or selectively install an updated package, like the kernel (e.g. "sudo apt install linux-generic").
With that you will get the Ubuntu jammy version of the 5.15 kernel.

Ilya Leoshkevich (iii-i) wrote :

Thanks, Frank! I managed to install the jammy 5.15 kernel, but the zfs module still loads successfully. Here is some information from the manually built module:

$ strings module/zfs/zfs.ko|grep -e vermagic= -e GCC -e ^version=|sort -u
GCC: (Ubuntu 11.2.0-14ubuntu1) 11.2.0
vermagic=5.15.0-16-generic SMP mod_unload modversions
version=2.1.99-678_gda9c6c033

We could probably try to track down the difference between our environments, but I'm wondering if you could just share your VM image with me?

Dimitri John Ledkov (xnox) wrote :

Hi, modules built with gcc-11 14ubuntu1 toolchain have the gcc-11 stable ibmz PLT patch already reverted to make things work again.

One needs to downgrade gcc-11 to 13ubuntu1 to experience the breakage.

You can fetch older gcc-11 from https://launchpad.net/ubuntu/+source/gcc-11/11.2.0-13ubuntu1/+build/22629238 and you will most likely also need older binutils from https://launchpad.net/ubuntu/+source/binutils/2.37-10ubuntu1/+build/22423687

You can pull those debs with:

$ pull-lp-debs --arch s390x binutils jammy 2.37-10ubuntu1
$ pull-lp-debs --arch s390x gcc-11 jammy 11.2.0-13ubuntu1

Then downgrade the packages you already have installed to the versions pulled by the commands above (i.e. you do not need _all_ of the packages pulled).

Regards,

Dimitri.

Andrea Righi (arighi) wrote :

 GCC: (Ubuntu 11.2.0-14ubuntu1) 11.2.0

^ This version of gcc is reverting the patch that introduces the bug. It is possible to reproduce the bug with gcc <= 11.2.0-13ubuntu1.
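
A minimal sketch of how one might check which side of that boundary a module's compiler falls on. It assumes the bounds reported in this thread (11.2.0-7ubuntu2 last known good, 11.2.0-10ubuntu1 first known bad, 11.2.0-14ubuntu1 carrying the revert). It uses GNU `sort -V` only as an approximation of Debian version ordering (`dpkg --compare-versions` would be the precise tool), and the function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; the version bounds are the ones quoted in this report.
LAST_GOOD="11.2.0-7ubuntu2"
FIRST_BAD="11.2.0-10ubuntu1"
REVERTED="11.2.0-14ubuntu1"

# version_le A B: succeeds when A sorts before or equal to B.
# GNU `sort -V` only approximates Debian version ordering.
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# gcc_pkg_from_comment: read `strings zfs.ko` output on stdin and print the
# package version from a line like "GCC: (Ubuntu 11.2.0-13ubuntu1) 11.2.0".
gcc_pkg_from_comment() {
    sed -n 's/^GCC: (Ubuntu \([^)]*\)).*/\1/p' | head -n 1
}

# classify_gcc VERSION: print how VERSION relates to the reported range.
classify_gcc() {
    if version_le "$REVERTED" "$1"; then
        echo "plt-patch-reverted"
    elif version_le "$FIRST_BAD" "$1"; then
        echo "affected"
    elif version_le "$1" "$LAST_GOOD"; then
        echo "not-affected"
    else
        echo "untested"
    fi
}
```

For example, `classify_gcc "$(strings "$(modinfo -n zfs)" | gcc_pkg_from_comment)"` would classify the compiler recorded in the installed module. The result only describes the toolchain; whether the lockup actually occurs may depend on other factors as well.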

Ilya Leoshkevich (iii-i) wrote :

Oh, my bad - I remember checking that I had the right compiler version, and even verifying that the patch is in by compiling some samples, but I guess then I ran a full upgrade after adding -proposed.

Unfortunately, downgrading the toolchain did not help - both manually built and dkms-built modules still load just fine for me.

iii@jammy:~/zfs$ modinfo -n zfs
/lib/modules/5.15.0-16-generic/updates/dkms/zfs.ko
iii@jammy:~/zfs$ strings "$(modinfo -n zfs)"|grep -e vermagic= -e GCC -e ^version=|sort -u
GCC: (Ubuntu 11.2.0-13ubuntu1) 11.2.0
vermagic=5.15.0-16-generic SMP mod_unload modversions
version=2.0.6-1ubuntu3
...
[ 2496.048917] ZFS: Loaded module v2.0.6-1ubuntu3, ZFS pool version 5000, ZFS filesystem version 5
[ 2630.808993] ZFS: Unloaded module v2.0.6-1ubuntu3

iii@jammy:~/zfs$ strings module/zfs/zfs.ko|grep -e vermagic= -e GCC -e ^version=|sort -u
GCC: (Ubuntu 11.2.0-13ubuntu1) 11.2.0
vermagic=5.15.0-16-generic SMP mod_unload modversions
version=2.1.2-1
...
[ 2635.328149] ZFS: Loaded module v2.1.2-1, ZFS pool version 5000, ZFS filesystem version 5
[ 3083.604904] ZFS: Unloaded module v2.1.2-1

Dimitri John Ledkov (xnox) wrote :

@ilja

Now this is starting to be more fun then. Note our mainframe is z13 2964 and we have reproduced the bug in both LPAR and KVM. I'll try to package up a more concrete reproducer (to the point of tarring up a full chroot with all the deps).

Dimitri John Ledkov (xnox) wrote :

@arighi

Explicit nack on the zfs-linux patches: that is an incomplete merge that drops the Ubuntu delta and zsys support.

Ilya Leoshkevich (iii-i) wrote :

z13 appears to have been the key: I can reproduce this on a z13 LPAR.

Frank Heimes (fheimes)
Changed in gcc-11 (Ubuntu Jammy):
status: New → Confirmed
Changed in ubuntu-z-systems:
status: New → Confirmed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package gcc-11 - 11.2.0-14ubuntu1

---------------
gcc-11 (11.2.0-14ubuntu1) jammy; urgency=medium

  * Merge with Debian; remaining changes:
    - Build from upstream sources.

gcc-11 (11.2.0-14) unstable; urgency=medium

  * Update to git 20220112 from the gcc-11 branch.
    - Fix PR target/103465, PR middle-end/101751, PR c/101289, PR c/97548,
      PR target/103661 (x86), PR c++/103783, PR c++/102229, PR c++/103714,
      PR c++/103703, PR fortran/102332, PR fortran/103776, PR fortran/103778,
      PR fortran/101329, PR fortran/103418, PR fortran/103412,
      PR libfortran/103634, PR d/103604, PR libstdc++/100017,
      PR libstdc++/103877, PR libstdc++/103501, PR libstdc++/103549,
      PR libstdc++/103453, PR libstdc++/103919, PR c++/103831,
      PR target/98348 (x86), PR sanitizer/102911, PR tree-optimization/103603.

  [ Matthias Klose ]
  * Remove the gdc-driver-nophobos patch.
  * Configure instead with --with-libphobos-druntime-only=yes.
  * Don't run the testsuite on alpha for now.

  [ Dimitri John Ledkov ]
  * debian/patches/gcc-ibmz-plt-revert.diff: Revert PLT changes from the gcc-11
    branch, as causing kernel dkms missbuilts on s390x. LP: #1954676.

 -- Matthias Klose <email address hidden> Wed, 12 Jan 2022 20:47:06 +0100

Changed in gcc-11 (Ubuntu Jammy):
status: Confirmed → Fix Released
Frank Heimes (fheimes) wrote :

Just as a side note, Ilya:
The previous message saying that this bug is fixed is only because we added the following revert to become unblocked:
"debian/patches/gcc-ibmz-plt-revert.diff: Revert PLT changes from the gcc-11 branch, as causing kernel dkms missbuilts on s390x. LP: #1954676"
We still need to figure out what is happening.

Ilya Leoshkevich (iii-i) wrote :

A quick update: zfs.ko gets quite a lot of relocations with the gcc patch, and the total size of PLT entries generated by kernel's apply_rela() exceeds 64k. Each PLT entry contains a short jump to an expoline thunk located at the end of the PLT array, which is out of range for the first PLT entries. When expolines are off (nobp=1), this problem does not occur. I'm currently looking into a kernel fix.
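
The range problem can be illustrated with some arithmetic. The sketch below assumes that the short jump is a relative branch with a signed 16-bit halfword offset (so it reaches at most 32767 halfwords, i.e. 65534 bytes, forward), and it simplifies by measuring from the start of a PLT entry to the end of the PLT array. The function names, and the entry count and entry size used in the usage note, are assumptions for illustration, not values taken from this report.

```shell
#!/bin/sh
# Farthest forward target, in bytes, of a branch with a signed 16-bit
# halfword offset (assumed encoding of the short jump).
SHORT_JUMP_MAX=$(( 32767 * 2 ))

# plt_entry_in_range INDEX COUNT ENTRY_SIZE
# Succeeds when entry INDEX (0-based) of a PLT array with COUNT entries of
# ENTRY_SIZE bytes can reach a thunk placed right after the array.
# Simplified model: the distance is taken from the start of the entry.
plt_entry_in_range() {
    distance=$(( ($2 - $1) * $3 ))
    [ "$distance" -le "$SHORT_JUMP_MAX" ]
}

# first_reachable_entry COUNT ENTRY_SIZE
# Print the index of the first entry whose jump still reaches the thunk.
first_reachable_entry() {
    i=0
    while ! plt_entry_in_range "$i" "$1" "$2"; do
        i=$(( i + 1 ))
    done
    echo "$i"
}
```

With a hypothetical PLT of 4000 entries of 20 bytes each (80000 bytes in total), `first_reachable_entry 4000 20` prints 724: in this model the first 724 entries are too far from the thunk while the later ones are fine, which matches the observation that only the first PLT entries are out of range.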

Ilya Leoshkevich (iii-i) wrote :

The attached patch resolved the issue on my test VM, could you please give it a try?

Ilya Leoshkevich (iii-i) wrote :
Frank Heimes (fheimes) wrote :

Thank you, Ilya.
Do you plan to tag this for any of the upstream stable release kernels (like 5.15)?

Ilya Leoshkevich (iii-i) wrote :

Yes, the commit has cc:stable and fixes: tags, so it should be picked up for v4.19 and later.

Ilya Leoshkevich (iii-i) wrote :
Frank Heimes (fheimes) wrote :

Thx for the heads-up, Ilya!

Changed in ubuntu-z-systems:
status: Confirmed → In Progress
Frank Heimes (fheimes) wrote :

Just had a quick chat with the kernel team,
and this is expected to be included in jammy's 5.15.0-20.20 at the latest.

Changed in ubuntu-z-systems:
status: In Progress → Fix Committed
Dimitri John Ledkov (xnox) wrote :

@fheimes @iii-i

Thank you for this, we will include that patch in our kernels. Do you expect that the toolchain change may have affected any other binaries (i.e. userspace binaries)? Or should we re-apply the reverted patch in our gcc-11 (given the improvements that it brings)?

Ilya Leoshkevich (iii-i) wrote :

I would reapply the gcc patch. It should not affect most userspace binaries: for example, it did not produce any changes whatsoever in the SPEC suite. Even if there are some binaries that are built with kernel-like options and get extra relocations as a result, ld.so processes such relocations correctly, so nothing bad should happen.

Andrea Righi (arighi) wrote :

JFYI, I've sent an SRU email to apply the kernel fix also to Impish 5.13, just to be safe in case the gcc patch lands in Impish:

https://lists.ubuntu.com/archives/kernel-team/2022-February/127652.html

Changed in linux (Ubuntu Jammy):
status: New → Fix Committed
no longer affects: zfs-linux (Ubuntu)
no longer affects: zfs-linux (Ubuntu Impish)
no longer affects: zfs-linux (Ubuntu Jammy)
Changed in linux (Ubuntu Jammy):
importance: Undecided → Critical
Changed in linux (Ubuntu Impish):
importance: Undecided → High
Stefan Bader (smb)
Changed in linux (Ubuntu Impish):
status: New → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.13.0-32.35 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-impish' to 'verification-done-impish'. If the problem still exists, change the tag 'verification-needed-impish' to 'verification-failed-impish'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-impish
Ilya Leoshkevich (iii-i) wrote :

I've installed the impish kernel 5.13.0-32 on top of jammy; zfs module loads fine:

$ uname -r
5.13.0-32-generic

$ modinfo zfs | grep filename:
filename: /lib/modules/5.13.0-32-generic/updates/dkms/zfs.ko

$ strings /lib/modules/5.13.0-32-generic/updates/dkms/zfs.ko|grep -e vermagic= -e GCC -e ^version=|sort -u
GCC: (Ubuntu 11.2.0-13ubuntu1) 11.2.0
vermagic=5.13.0-32-generic SMP mod_unload modversions
version=2.0.6-1ubuntu3

tags: added: verification-done-impish
removed: verification-needed-impish
Frank Heimes (fheimes) wrote :

Many thanks Ilya!

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure-5.13/5.13.0-1019.21~20.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.13.0-37.42

---------------
linux (5.13.0-37.42) impish; urgency=medium

  * impish/linux: 5.13.0-37.42 -proposed tracker (LP: #1964959)

  * CVE-2022-0742
    - ipv6: fix skb drops in igmp6_event_query() and igmp6_event_report()

linux (5.13.0-36.41) impish; urgency=medium

  * Packaging resync (LP: #1786013)
    - [Packaging] resync getabis
    - debian/dkms-versions -- update from kernel-versions (main/2022.02.21)

  * Broken network on some AWS instances with focal/impish kernels
    (LP: #1961968)
    - SAUCE: Revert "PCI/MSI: Mask MSI-X vectors only on success"

  * [SRU]PCI: vmd: Do not disable MSI-X remapping if interrupt remapping is
    enabled by IOMMU (LP: #1937295)
    - PCI: vmd: Do not disable MSI-X remapping if interrupt remapping is enabled
      by IOMMU

  * [UBUNTU 20.04] kernel: Add support for CPU-MF counter second version 7
    (LP: #1960182)
    - s390/cpumf: Support for CPU Measurement Facility CSVN 7
    - s390/cpumf: Support for CPU Measurement Sampling Facility LS bit

  * [UBUNTU 21.10] s390/cio: verify the driver availability for path_event call
    (LP: #1960875)
    - s390/cio: verify the driver availability for path_event call

  * Impish update: upstream stable patchset 2022-02-14 (LP: #1960861)
    - devtmpfs regression fix: reconfigure on each mount
    - orangefs: Fix the size of a memory allocation in orangefs_bufmap_alloc()
    - remoteproc: qcom: pil_info: Don't memcpy_toio more than is provided
    - perf: Protect perf_guest_cbs with RCU
    - KVM: x86: Register Processor Trace interrupt hook iff PT enabled in guest
    - KVM: s390: Clarify SIGP orders versus STOP/RESTART
    - 9p: only copy valid iattrs in 9P2000.L setattr implementation
    - video: vga16fb: Only probe for EGA and VGA 16 color graphic cards
    - media: uvcvideo: fix division by zero at stream start
    - rtlwifi: rtl8192cu: Fix WARNING when calling local_irq_restore() with
      interrupts enabled
    - firmware: qemu_fw_cfg: fix sysfs information leak
    - firmware: qemu_fw_cfg: fix NULL-pointer deref on duplicate entries
    - firmware: qemu_fw_cfg: fix kobject leak in probe error path
    - KVM: x86: remove PMU FIXED_CTR3 from msrs_to_save_all
    - ALSA: hda/realtek: Add speaker fixup for some Yoga 15ITL5 devices
    - ALSA: hda/realtek - Fix silent output on Gigabyte X570 Aorus Master after
      reboot from Windows
    - ALSA: hda: ALC287: Add Lenovo IdeaPad Slim 9i 14ITL5 speaker quirk
    - ALSA: hda/realtek: Add quirk for Legion Y9000X 2020
    - ALSA: hda/realtek: Re-order quirk entries for Lenovo
    - powerpc/pseries: Get entry and uaccess flush required bits from
      H_GET_CPU_CHARACTERISTICS
    - mtd: fixup CFI on ixp4xx
    - KVM: x86: don't print when fail to read/write pv eoi memory
    - remoteproc: qcom: pas: Add missing power-domain "mxc" for CDSP
    - perf annotate: Avoid TUI crash when navigating in the annotation of
      recursive functions
    - ALSA: hda/realtek: Use ALC285_FIXUP_HP_GPIO_LED on another HP laptop
    - ALSA: hda/tegra: Fix Tegra194 HDA reset failure

  * CVE-2022-0516
    - KVM: s390: Return error on SIDA memop on normal guest

  * CVE-2022-04...

Changed in linux (Ubuntu Impish):
status: Fix Committed → Fix Released
Frank Heimes (fheimes) wrote :

Updating the 'affects jammy' kernel entry as well.
Since commit 'f3b7e73b2c6619884351a3a0a7468642f852b8a2' has been included in jammy's 5.15 since Ubuntu-5.15.0-20.20, I'm updating the status to Fix Released, and with this the project's entry, too.

Changed in linux (Ubuntu Jammy):
status: Fix Committed → Fix Released
Changed in ubuntu-z-systems:
status: Fix Committed → Fix Released
Changed in gcc-11 (Ubuntu Impish):
status: New → Invalid