random oopses on s390 systems using NVMe devices

Bug #1790480 reported by bugproxy
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Ubuntu on IBM z Systems | Fix Released | High | Canonical Kernel Team |
linux (Ubuntu) | Fix Released | Medium | Seth Forshee |
Xenial | Fix Released | High | Kleber Sacilotto de Souza |
Bionic | Fix Released | High | Kleber Sacilotto de Souza |

Bug Description

== SRU Justification ==
IBM is requesting a fix for the following issue found with NVMe devices on s390x:

The trigger is a PCI function whose driver requests more interrupts than the architectural maximum. Currently this is only possible on a machine that supports 64 or more CPUs and has an NVMe function attached. Note that the LPAR does not have to use >=64 CPUs, since the NVMe driver uses num_possible_cpus(), which resolves to the machine maximum on s390 (all CPUs are hot-pluggable). The oops happens after the driver calls pci_alloc_irq_vectors() during device probing, so the system will most likely panic during boot.

The fix has been cc'ed to stable@, but hasn't been picked up for Bionic yet.

== Fix ==
866f3576a72b s390/pci: fix out of bounds access during irq setup
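
The commit subject names an out-of-bounds access during IRQ setup. The following is a toy Python model of that class of bug, not the kernel code: it assumes (this report does not show the patch) that the setup loop walks every descriptor the driver requested while only as many slots exist as the platform supports. The function name, parameters, and the numbers in the usage note are hypothetical.

```python
def setup_msi_irqs(num_descriptors: int, requested: int, max_msi: int,
                   fixed: bool = True) -> list:
    """Toy model of an MSI setup loop; NOT the s390 kernel implementation.

    Returns the table of slots that were filled. In the unfixed variant an
    IndexError stands in for the out-of-bounds access.
    """
    # Restrict the vector count to what the platform supports (assumption).
    msi_vecs = min(requested, max_msi)
    table = [None] * msi_vecs

    # Walk the descriptors the driver asked for.
    for hwirq in range(num_descriptors):
        if fixed and hwirq >= msi_vecs:
            # Stop once every supported vector has been handled.
            break
        table[hwirq] = f"irq-for-desc-{hwirq}"
    return table
```

With illustrative values, `setup_msi_irqs(256, 256, 64)` fills 64 slots and returns, while `setup_msi_irqs(256, 256, 64, fixed=False)` raises `IndexError` on the 65th descriptor.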

== Regression Potential ==
Low. Affects only s390x systems that support 64 or more CPUs and have an NVMe function enabled.

== Test case ==
Boot the kernel in an affected environment.
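
Whether an environment counts as affected follows from the trigger condition in the SRU justification: the possible-CPU count matters, not the number of CPUs in use. A minimal sketch for checking that count, assuming the standard sysfs file /sys/devices/system/cpu/possible is readable; the helper names and the `has_nvme` argument (supplied by the caller) are hypothetical, and the threshold of 64 is taken from this report.

```python
from pathlib import Path

# Threshold from the bug description: machines supporting 64 or more CPUs.
AFFECTED_CPU_THRESHOLD = 64


def count_cpus(cpulist: str) -> int:
    """Count CPUs in a kernel cpulist string such as "0-63" or "0-3,8-11"."""
    total = 0
    for part in cpulist.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            total += int(last) - int(first) + 1
        else:
            total += 1
    return total


def possibly_affected(cpulist: str, has_nvme: bool) -> bool:
    """Rough check of the trigger condition described in this report."""
    return has_nvme and count_cpus(cpulist) >= AFFECTED_CPU_THRESHOLD


if __name__ == "__main__":
    possible = Path("/sys/devices/system/cpu/possible")
    if possible.exists():
        print("possible CPUs:", count_cpus(possible.read_text()))
```

This only narrows down candidate machines; the actual test remains booting the kernel on such a system.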

=== Original bug description ===
Random oopses on s390 systems using NVMe and running the Ubuntu 18.04.1 kernel have been reported.
Bisect of the upstream kernel points to:
16ccfff28976 nvme: pci: pass max vectors as num_possible_cpus() to pci_alloc_irq_vectors

This commit is correct but reveals a bug in s390's IRQ setup routine. A fix is available via:

Commit-ID : 866f3576a72b2233a76dffb80290f8086dc49e17

The fix also needs to be applied to Ubuntu 18.10.

bugproxy (bugproxy)
tags: added: architecture-s39064 bugnameltc-170595 severity-high targetmilestone-inin1804
Changed in ubuntu:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
affects: ubuntu → linux (Ubuntu)
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Seth Forshee (sforshee)
Changed in linux (Ubuntu):
assignee: Skipper Bug Screeners (skipper-screen-team) → Seth Forshee (sforshee)
importance: Undecided → Medium
status: New → Fix Committed
Frank Heimes (fheimes) wrote :

@IBM: We do not have NVMe devices in our Z machine, so we cannot test this on s390x ourselves. It would be helpful if you could share a description or the steps of a potential test case.
This would help in judging the regression risk of an SRU to 18.04 (and is needed for an SRU anyway).

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-09-05 05:06 EDT-------
(In reply to comment #8)
> @IBM: Even if we do not have NVMe devices in our Z machine (hence we cannot
> test this on s390x by ourselves) it would be good and helpful if you can
> share a description / or some steps of a potential test case.
> This would help judging the regression risk in case of an SRU to 18.04 (and
> is needed for a SRU anyway).

The trigger is a PCI function whose driver requests more interrupts than the architectural maximum. Currently this is only possible on a machine that supports 64 or more CPUs and has an NVMe function attached. Note that the LPAR does not have to use >=64 CPUs, since the NVMe driver uses num_possible_cpus(), which resolves to the machine maximum on s390 (all CPUs are hot-pluggable). The oops happens after the driver calls pci_alloc_irq_vectors() during device probing, so the system will most likely panic during boot.

Changed in linux (Ubuntu Bionic):
assignee: nobody → Kleber Sacilotto de Souza (kleber-souza)
status: New → Triaged
importance: Undecided → Medium
importance: Medium → High
description: updated
Kleber Sacilotto de Souza (kleber-souza) wrote :

I built a test kernel with the commit 866f3576a72b ("s390/pci: fix out of bounds access during irq setup").

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~ksouza/lp1790480/

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
* For test kernels that are 4.15 (Bionic) or newer, you need to install the linux-modules and linux-modules-extra .deb packages.

Thank you.

Changed in linux (Ubuntu Bionic):
status: Triaged → In Progress
Kleber Sacilotto de Souza (kleber-souza) wrote :
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Triaged → In Progress
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-09-05 07:45 EDT-------
(In reply to comment #10)
> The test kernel can be downloaded from:
> http://kernel.ubuntu.com/~ksouza/lp1790480/
>
> Can you test this kernel and see if it resolves this bug?

uname -v
#37~lp1790480 SMP Wed Sep 5 09:47:51 UTC 2018

lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
dasda 94:0 0 22.5G 0 disk
`-dasda1 94:1 0 22.5G 0 part /
nvme0n1 259:0 0 931.5G 0 disk
`-nvme0n1p1 259:1 0 931.5G 0 part

This kernel resolves the bug. Thanks!
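
Before trusting such a result, it helps to confirm that the booted kernel really is the test build. A small sketch, assuming the test kernel carries a `~lp<bug number>` suffix in its version string as in the `uname -v` output above; that naming is taken from this report only, and the helper name is hypothetical.

```python
import platform
import re


def is_test_build(version: str, bug_id: int) -> bool:
    """Return True if a `uname -v` style string carries the ~lp<bug> marker."""
    return re.search(rf"~lp{bug_id}\b", version) is not None


if __name__ == "__main__":
    # platform.version() reports the same field as `uname -v` on Linux.
    print(is_test_build(platform.version(), 1790480))
```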

Frank Heimes (fheimes) wrote :

According to https://lkml.org/lkml/2018/9/3/1125, this fix needs to be included in Xenial (kernel 4.4) as well.

Changed in linux (Ubuntu Xenial):
status: New → In Progress
Changed in linux (Ubuntu Xenial):
importance: Undecided → High
assignee: nobody → Kleber Sacilotto de Souza (kleber-souza)
status: In Progress → Fix Committed
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!
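
The tag change requested above is a plain rename per series. A minimal sketch of that rule on a local tag list; it does not talk to Launchpad, and the function name is hypothetical.

```python
def mark_verified(tags: list, series: str, passed: bool = True) -> list:
    """Swap verification-needed-<series> for the done or failed tag."""
    needed = f"verification-needed-{series}"
    result = "done" if passed else "failed"
    new_tag = f"verification-{result}-{series}"
    # Leave every other tag untouched and keep the original order.
    return [new_tag if tag == needed else tag for tag in tags]
```

For example, `mark_verified(["severity-high", "verification-needed-xenial"], "xenial")` yields a list in which only the verification tag has changed.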

tags: added: verification-needed-xenial
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-09-13 05:46 EDT-------
Already verified upfront

Frank Heimes (fheimes) wrote :

adjusting tags according to comment #8

tags: added: verification-done-xenial
removed: verification-needed-xenial
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-09-18 03:38 EDT-------
Verified upfront by IBM

Frank Heimes (fheimes) wrote :

adjusting tags according to comment #11

tags: added: verification-done-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-137.163

---------------
linux (4.4.0-137.163) xenial; urgency=medium

  * CVE-2018-14633
    - iscsi target: Use hex2bin instead of a re-implementation

  * CVE-2018-17182
    - mm: get rid of vmacache_flush_all() entirely

linux (4.4.0-136.162) xenial; urgency=medium

  * linux: 4.4.0-136.162 -proposed tracker (LP: #1791745)

  * CVE-2017-5753
    - bpf: properly enforce index mask to prevent out-of-bounds speculation
    - Revert "UBUNTU: SAUCE: bpf: Use barrier_nospec() instead of osb()"
    - Revert "bpf: prevent speculative execution in eBPF interpreter"

  * L1TF mitigation not effective in some CPU and RAM combinations
    (LP: #1788563) // CVE-2018-3620 // CVE-2018-3646
    - x86/speculation/l1tf: Fix overflow in l1tf_pfn_limit() on 32bit
    - x86/speculation/l1tf: Fix off-by-one error when warning that system has too
      much RAM
    - x86/speculation/l1tf: Increase l1tf memory limit for Nehalem+

  * CVE-2018-15594
    - x86/paravirt: Fix spectre-v2 mitigations for paravirt guests

  * Xenial update to 4.4.144 stable release (LP: #1791080)
    - KVM/Eventfd: Avoid crash when assign and deassign specific eventfd in
      parallel.
    - x86/MCE: Remove min interval polling limitation
    - fat: fix memory allocation failure handling of match_strdup()
    - ALSA: rawmidi: Change resized buffers atomically
    - ARC: Fix CONFIG_SWAP
    - ARC: mm: allow mprotect to make stack mappings executable
    - mm: memcg: fix use after free in mem_cgroup_iter()
    - ipv4: Return EINVAL when ping_group_range sysctl doesn't map to user ns
    - ipv6: fix useless rol32 call on hash
    - lib/rhashtable: consider param->min_size when setting initial table size
    - net/ipv4: Set oif in fib_compute_spec_dst
    - net: phy: fix flag masking in __set_phy_supported
    - ptp: fix missing break in switch
    - tg3: Add higher cpu clock for 5762.
    - net: Don't copy pfmemalloc flag in __copy_skb_header()
    - skbuff: Unconditionally copy pfmemalloc in __skb_clone()
    - xhci: Fix perceived dead host due to runtime suspend race with event handler
    - x86/paravirt: Make native_save_fl() extern inline
    - SAUCE: Add missing CPUID_7_EDX defines
    - SAUCE: x86/speculation: Expose indirect_branch_prediction_barrier()
    - x86/pti: Mark constant arrays as __initconst
    - x86/asm/entry/32: Simplify pushes of zeroed pt_regs->REGs
    - x86/entry/64/compat: Clear registers for compat syscalls, to reduce
      speculation attack surface
    - x86/speculation: Clean up various Spectre related details
    - x86/speculation: Fix up array_index_nospec_mask() asm constraint
    - x86/xen: Zero MSR_IA32_SPEC_CTRL before suspend
    - x86/mm: Factor out LDT init from context init
    - x86/mm: Give each mm TLB flush generation a unique ID
    - SAUCE: x86/speculation: Use Indirect Branch Prediction Barrier in context
      switch
    - x86/speculation: Use IBRS if available before calling into firmware
    - x86/speculation: Move firmware_restrict_branch_speculation_*() from C to CPP
    - selftest/seccomp: Fix the seccomp(2) signature
    - xen: set cpu capabilities from xen_start_kernel()
    - x86/amd: d...


Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-36.39

---------------
linux (4.15.0-36.39) bionic; urgency=medium

  * CVE-2018-14633
    - iscsi target: Use hex2bin instead of a re-implementation

  * CVE-2018-17182
    - mm: get rid of vmacache_flush_all() entirely

linux (4.15.0-35.38) bionic; urgency=medium

  * linux: 4.15.0-35.38 -proposed tracker (LP: #1791719)

  * device hotplug of vfio devices can lead to deadlock in vfio_pci_release
    (LP: #1792099)
    - SAUCE: vfio -- release device lock before userspace requests

  * L1TF mitigation not effective in some CPU and RAM combinations
    (LP: #1788563)
    - x86/speculation/l1tf: Fix overflow in l1tf_pfn_limit() on 32bit
    - x86/speculation/l1tf: Fix off-by-one error when warning that system has too
      much RAM
    - x86/speculation/l1tf: Increase l1tf memory limit for Nehalem+

  * CVE-2018-15594
    - x86/paravirt: Fix spectre-v2 mitigations for paravirt guests

  * CVE-2017-5715 (Spectre v2 s390x)
    - KVM: s390: implement CPU model only facilities
    - s390: detect etoken facility
    - KVM: s390: add etoken support for guests
    - s390/lib: use expoline for all bcr instructions
    - s390: fix br_r1_trampoline for machines without exrl
    - SAUCE: s390: use expoline thunks for all branches generated by the BPF JIT

  * Ubuntu18.04.1: cpuidle: powernv: Fix promotion from snooze if next state
    disabled (performance) (LP: #1790602)
    - cpuidle: powernv: Fix promotion from snooze if next state disabled

  * Watchdog CPU:19 Hard LOCKUP when kernel crash was triggered (LP: #1790636)
    - powerpc: hard disable irqs in smp_send_stop loop
    - powerpc: Fix deadlock with multiple calls to smp_send_stop
    - powerpc: smp_send_stop do not offline stopped CPUs
    - powerpc/powernv: Fix opal_event_shutdown() called with interrupts disabled

  * Security fix: check if IOMMU page is contained in the pinned physical page
    (LP: #1785675)
    - vfio/spapr: Use IOMMU pageshift rather than pagesize
    - KVM: PPC: Check if IOMMU page is contained in the pinned physical page

  * Missing Intel GPU pci-id's (LP: #1789924)
    - drm/i915/kbl: Add KBL GT2 sku
    - drm/i915/whl: Introducing Whiskey Lake platform
    - drm/i915/aml: Introducing Amber Lake platform
    - drm/i915/cfl: Add a new CFL PCI ID.

  * CVE-2018-15572
    - x86/speculation: Protect against userspace-userspace spectreRSB

  * Support Power Management for Thunderbolt Controller (LP: #1789358)
    - thunderbolt: Handle NULL boot ACL entries properly
    - thunderbolt: Notify userspace when boot_acl is changed
    - thunderbolt: Use 64-bit DMA mask if supported by the platform
    - thunderbolt: Do not unnecessarily call ICM get route
    - thunderbolt: No need to take tb->lock in domain suspend/complete
    - thunderbolt: Use correct ICM commands in system suspend
    - thunderbolt: Add support for runtime PM

  * random oopses on s390 systems using NVMe devices (LP: #1790480)
    - s390/pci: fix out of bounds access during irq setup

  * [Bionic] Spectre v4 mitigation (Speculative Store Bypass Disable) support
    for arm64 using SMC firmware call to set a hardware chicken bit
    (LP: #1787993) // CVE-2018...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
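
To locate a given bug in changelogs like the ones quoted in this report, the bullet layout can be parsed mechanically. A rough sketch, assuming the layout seen here: non-indented version headers, `  * ` bullets with an optional `(LP: #N)` reference, and `    - ` commit lines, either of which may wrap. The function name and the returned dictionary shape are hypothetical.

```python
import re

LP_REF = re.compile(r"LP: #(\d+)")

SAMPLE = """linux (4.15.0-35.38) bionic; urgency=medium

  * device hotplug of vfio devices can lead to deadlock in vfio_pci_release
    (LP: #1792099)
    - SAUCE: vfio -- release device lock before userspace requests

  * random oopses on s390 systems using NVMe devices (LP: #1790480)
    - s390/pci: fix out of bounds access during irq setup
"""


def parse_changelog(text: str) -> list:
    """Split changelog bullets into {"title", "bugs", "commits"} dicts."""
    entries = []
    current = None
    last = None  # which field a wrapped line continues: "title" or "commit"
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not raw.startswith(" "):
            current = None  # version header or separator, not part of a bullet
            continue
        if raw.startswith("  * "):
            current = {"title": line[2:], "bugs": [], "commits": []}
            entries.append(current)
            last = "title"
        elif current is None:
            continue
        elif line.startswith("- "):
            current["commits"].append(line[2:])
            last = "commit"
        elif last == "title":
            current["title"] += " " + line
        else:
            current["commits"][-1] += " " + line
    for entry in entries:
        entry["bugs"] = [int(n) for n in LP_REF.findall(entry["title"])]
    return entries


def commits_for_bug(text: str, bug_id: int) -> list:
    """Return the commit subjects listed under a given LP bug number."""
    return [c for e in parse_changelog(text) if bug_id in e["bugs"]
            for c in e["commits"]]
```

Calling `commits_for_bug(SAMPLE, 1790480)` on the excerpt returns the single s390/pci commit subject listed under this bug.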
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.18.0-8.9

---------------
linux (4.18.0-8.9) cosmic; urgency=medium

  * linux: 4.18.0-8.9 -proposed tracker (LP: #1791663)

  * Cosmic update to v4.18.7 stable release (LP: #1791660)
    - rcu: Make expedited GPs handle CPU 0 being offline
    - net: 6lowpan: fix reserved space for single frames
    - net: mac802154: tx: expand tailroom if necessary
    - 9p/net: Fix zero-copy path in the 9p virtio transport
    - spi: davinci: fix a NULL pointer dereference
    - spi: pxa2xx: Add support for Intel Ice Lake
    - spi: spi-fsl-dspi: Fix imprecise abort on VF500 during probe
    - spi: cadence: Change usleep_range() to udelay(), for atomic context
    - mmc: block: Fix unsupported parallel dispatch of requests
    - mmc: renesas_sdhi_internal_dmac: mask DMAC interrupts
    - mmc: renesas_sdhi_internal_dmac: fix #define RST_RESERVED_BITS
    - readahead: stricter check for bdi io_pages
    - block: fix infinite loop if the device loses discard capability
    - block: blk_init_allocated_queue() set q->fq as NULL in the fail case
    - block: really disable runtime-pm for blk-mq
    - blkcg: Introduce blkg_root_lookup()
    - block: Introduce blk_exit_queue()
    - block: Ensure that a request queue is dissociated from the cgroup controller
    - apparmor: fix bad debug check in apparmor_secid_to_secctx()
    - dma-buf: Move BUG_ON from _add_shared_fence to _add_shared_inplace
    - libertas: fix suspend and resume for SDIO connected cards
    - media: Revert "[media] tvp5150: fix pad format frame height"
    - mailbox: xgene-slimpro: Fix potential NULL pointer dereference
    - Replace magic for trusting the secondary keyring with #define
    - Fix kexec forbidding kernels signed with keys in the secondary keyring to
      boot
    - powerpc/fadump: handle crash memory ranges array index overflow
    - powerpc/64s: Fix page table fragment refcount race vs speculative references
    - powerpc/pseries: Fix endianness while restoring of r3 in MCE handler.
    - powerpc/pkeys: Give all threads control of their key permissions
    - powerpc/pkeys: Deny read/write/execute by default
    - powerpc/pkeys: key allocation/deallocation must not change pkey registers
    - powerpc/pkeys: Save the pkey registers before fork
    - powerpc/pkeys: Fix calculation of total pkeys.
    - powerpc/pkeys: Preallocate execute-only key
    - powerpc/nohash: fix pte_access_permitted()
    - powerpc64/ftrace: Include ftrace.h needed for enable/disable calls
    - powerpc/powernv/pci: Work around races in PCI bridge enabling
    - cxl: Fix wrong comparison in cxl_adapter_context_get()
    - IB/mlx5: Honor cnt_set_id_valid flag instead of set_id
    - IB/mlx5: Fix leaking stack memory to userspace
    - IB/srpt: Fix srpt_cm_req_recv() error path (1/2)
    - IB/srpt: Fix srpt_cm_req_recv() error path (2/2)
    - IB/srpt: Support HCAs with more than two ports
    - overflow.h: Add arithmetic shift helper
    - RDMA/mlx5: Fix shift overflow in mlx5_ib_create_wq
    - ib_srpt: Fix a use-after-free in srpt_close_ch()
    - ib_srpt: Fix a use-after-free in __srpt_close_all_ch()
    - RDMA/rxe: Set wqe->status correctly if an unexpected...

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-10-17 05:53 EDT-------
IBM Bugzilla status -> closed; Fix Released for Xenial, Bionic, Cosmic

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-10-18 05:10 EDT-------
*** Bug 171073 has been marked as a duplicate of this bug. ***

Brad Figg (brad-figg)
tags: added: cscc