tlbie master timeout checkstop (using NVidia/GPU)

Bug #1789772 reported by bugproxy
Affects                            Status         Importance  Assigned to             Milestone
The Ubuntu-power-systems project   Fix Released   Critical    Canonical Kernel Team
linux (Ubuntu)                     Fix Released   Critical    Khaled El Mously
  Bionic                           Fix Released   Critical    Khaled El Mously

Bug Description

A hung state machine in the chip's NMU (nest MMU) logic can trigger a fatal condition that the hardware flags with a checkstop. As a result, customers with a POWER9 Witherspoon system (equipped with GPUs) will see their server crash when using NVIDIA's toolkit.

The server will crash with the following hardware failing message:
Unrecoverable Hardware Failure, (Critical) A system checkstop occurred (AffectedSubsystem: Canister/Appliance, PID: 19703), Resolved: 0

In this case, an `NCUFIR[10] tlbie master timeout` was observed merely by starting the NVIDIA ATS driver. The issue is triggered because the NMU logic gets stuck when a page is upgraded from read-only (RO) to read-write (RW) without a subsequent tlbie.

This is addressed with the following patches:
bd5050e38aec3055ff4257ade987d808ac93b582 powerpc/mm/radix: Change pte relax sequence to handle nest MMU hang
e4c1112c3fc503fc78379fa61450bfda3f0717fe powerpc/mm: Change function prototype
044003b52a78bcbda7103633c351da16505096cf powerpc/mm/radix: Move function from radix.h to pgtable-radix.c
f069ff396d657ac7bdb5de866c3ec28b8d08d953 powerpc/mm/hugetlb: Update huge_ptep_set_access_flags to call __ptep_set_access_flags directly

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-170972 severity-critical targetmilestone-inin1804
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: triage-g
Manoj Iyer (manjo)
Changed in linux (Ubuntu):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → Critical
Changed in linux (Ubuntu):
status: New → Triaged
Changed in linux (Ubuntu Bionic):
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu):
assignee: Canonical Kernel Team (canonical-kernel-team) → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu):
assignee: Joseph Salisbury (jsalisbury) → Khaled El Mously (kmously)
Changed in linux (Ubuntu Bionic):
assignee: Joseph Salisbury (jsalisbury) → Khaled El Mously (kmously)
Changed in linux (Ubuntu Bionic):
status: Triaged → Fix Committed
Manoj Iyer (manjo)
Changed in linux (Ubuntu):
status: Triaged → Fix Committed
Changed in ubuntu-power-systems:
status: Triaged → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Mike Ranweiler (mranweil) wrote :

I tried out the -proposed kernel and ran with the GPU. The original problem would hit under stress during some longer runs. It did not reproduce on the new kernel, and I didn't see any new regressions.

tags: added: verification-done-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-36.39

---------------
linux (4.15.0-36.39) bionic; urgency=medium

  * CVE-2018-14633
    - iscsi target: Use hex2bin instead of a re-implementation

  * CVE-2018-17182
    - mm: get rid of vmacache_flush_all() entirely

linux (4.15.0-35.38) bionic; urgency=medium

  * linux: 4.15.0-35.38 -proposed tracker (LP: #1791719)

  * device hotplug of vfio devices can lead to deadlock in vfio_pci_release
    (LP: #1792099)
    - SAUCE: vfio -- release device lock before userspace requests

  * L1TF mitigation not effective in some CPU and RAM combinations
    (LP: #1788563)
    - x86/speculation/l1tf: Fix overflow in l1tf_pfn_limit() on 32bit
    - x86/speculation/l1tf: Fix off-by-one error when warning that system has too
      much RAM
    - x86/speculation/l1tf: Increase l1tf memory limit for Nehalem+

  * CVE-2018-15594
    - x86/paravirt: Fix spectre-v2 mitigations for paravirt guests

  * CVE-2017-5715 (Spectre v2 s390x)
    - KVM: s390: implement CPU model only facilities
    - s390: detect etoken facility
    - KVM: s390: add etoken support for guests
    - s390/lib: use expoline for all bcr instructions
    - s390: fix br_r1_trampoline for machines without exrl
    - SAUCE: s390: use expoline thunks for all branches generated by the BPF JIT

  * Ubuntu18.04.1: cpuidle: powernv: Fix promotion from snooze if next state
    disabled (performance) (LP: #1790602)
    - cpuidle: powernv: Fix promotion from snooze if next state disabled

  * Watchdog CPU:19 Hard LOCKUP when kernel crash was triggered (LP: #1790636)
    - powerpc: hard disable irqs in smp_send_stop loop
    - powerpc: Fix deadlock with multiple calls to smp_send_stop
    - powerpc: smp_send_stop do not offline stopped CPUs
    - powerpc/powernv: Fix opal_event_shutdown() called with interrupts disabled

  * Security fix: check if IOMMU page is contained in the pinned physical page
    (LP: #1785675)
    - vfio/spapr: Use IOMMU pageshift rather than pagesize
    - KVM: PPC: Check if IOMMU page is contained in the pinned physical page

  * Missing Intel GPU pci-id's (LP: #1789924)
    - drm/i915/kbl: Add KBL GT2 sku
    - drm/i915/whl: Introducing Whiskey Lake platform
    - drm/i915/aml: Introducing Amber Lake platform
    - drm/i915/cfl: Add a new CFL PCI ID.

  * CVE-2018-15572
    - x86/speculation: Protect against userspace-userspace spectreRSB

  * Support Power Management for Thunderbolt Controller (LP: #1789358)
    - thunderbolt: Handle NULL boot ACL entries properly
    - thunderbolt: Notify userspace when boot_acl is changed
    - thunderbolt: Use 64-bit DMA mask if supported by the platform
    - thunderbolt: Do not unnecessarily call ICM get route
    - thunderbolt: No need to take tb->lock in domain suspend/complete
    - thunderbolt: Use correct ICM commands in system suspend
    - thunderbolt: Add support for runtime PM

  * random oopses on s390 systems using NVMe devices (LP: #1790480)
    - s390/pci: fix out of bounds access during irq setup

  * [Bionic] Spectre v4 mitigation (Speculative Store Bypass Disable) support
    for arm64 using SMC firmware call to set a hardware chicken bit
    (LP: #1787993) // CVE-2018...
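
To check whether a given Bionic system already runs a kernel at or above the fixed package version (4.15.0-36.39, as stated above), a comparison along the following lines may help. This is only a sketch: `version_at_least` is a hypothetical helper name, it relies on the version ordering of GNU `sort -V` rather than on dpkg's own comparison rules, and the `dpkg-query` call in the usage comment assumes the running kernel was installed from a package named `linux-image-$(uname -r)`.

```shell
#!/bin/sh
# version_at_least (hypothetical helper): succeeds when version $1 is greater
# than or equal to version $2, using GNU sort's version ordering (sort -V).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Package version named in this bug as carrying the fix.
FIXED_VERSION='4.15.0-36.39'

# Usage on a Bionic system (assumption: the running kernel's package is
# linux-image-$(uname -r)):
#   installed=$(dpkg-query -W -f='${Version}' "linux-image-$(uname -r)")
#   if version_at_least "$installed" "$FIXED_VERSION"; then
#       echo "kernel $installed includes the fix"
#   else
#       echo "kernel $installed predates the fix"
#   fi
```

On Debian-based systems, `dpkg --compare-versions "$installed" ge "$FIXED_VERSION"` is the more authoritative comparison; the `sort -V` form is used here only so the helper does not depend on dpkg.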

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
Frank Heimes (fheimes) wrote :

Cosmic is already on kernel 4.18.0.8.9, which includes the mentioned patches:
$ git log --oneline | grep "Check if IOMMU page is contained in the pinned physical page"
76fa497 KVM: PPC: Check if IOMMU page is contained in the pinned physical page
fheimes@T570:~/ubuntu-cosmic$ git tag --contains 76fa497
Ubuntu-4.18.0-7.8
Ubuntu-4.18.0-8.9
Ubuntu-4.18.0-9.10
v4.18
$ git log --oneline | grep "powerpc/mm/radix: Change pte relax sequence to handle nest MMU hang"
bd5050e powerpc/mm/radix: Change pte relax sequence to handle nest MMU hang
$ git tag --contains bd5050e
Ubuntu-4.18.0-7.8
Ubuntu-4.18.0-8.9
Ubuntu-4.18.0-9.10
v4.18
$ git log --oneline | grep "powerpc/mm: Change function prototype"
e4c1112 powerpc/mm: Change function prototype
$ git tag --contains e4c1112
Ubuntu-4.18.0-7.8
Ubuntu-4.18.0-8.9
Ubuntu-4.18.0-9.10
v4.18
$ git log --oneline | grep "powerpc/mm/hugetlb: Update huge_ptep_set_access_flags to call __ptep_set_access_flags directly"
f069ff3 powerpc/mm/hugetlb: Update huge_ptep_set_access_flags to call __ptep_set_access_flags directly
$ git tag --contains f069ff3
Ubuntu-4.18.0-7.8
Ubuntu-4.18.0-8.9
Ubuntu-4.18.0-9.10
v4.18
This can therefore be set to Fix Released for cosmic, too.
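
The per-commit `git log --oneline | grep` checks above can be folded into one pass over the four patch subjects listed in the bug description. The following is a minimal sketch, not part of the original report: `missing_fixes` and `FIX_SUBJECTS` are hypothetical names, and the match is a plain fixed-string search on the subject line, so a backport with a reworded subject would be reported as missing.

```shell
#!/bin/sh
# Subjects of the four patches named in the bug description.
FIX_SUBJECTS='powerpc/mm/radix: Change pte relax sequence to handle nest MMU hang
powerpc/mm: Change function prototype
powerpc/mm/radix: Move function from radix.h to pgtable-radix.c
powerpc/mm/hugetlb: Update huge_ptep_set_access_flags to call __ptep_set_access_flags directly'

# missing_fixes (hypothetical helper): reads `git log --oneline` output on
# stdin and prints each subject from FIX_SUBJECTS that does not appear in it.
# Prints nothing when all four subjects are present.
missing_fixes() {
    tmp=$(mktemp) || return 1
    cat >"$tmp"                      # buffer the log so it can be searched repeatedly
    while IFS= read -r subject; do
        grep -qF -- "$subject" "$tmp" || printf '%s\n' "$subject"
    done <<EOF
$FIX_SUBJECTS
EOF
    rm -f "$tmp"
}

# Usage, from inside a kernel tree:
#   git log --oneline | missing_fixes
```

Empty output means every listed subject was found in the log. `git tag --contains <hash>` can then be run on the matching commits, as in the comment above, to see which release tags carry them.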

Brad Figg (brad-figg)
tags: added: cscc