tlbie master timeout checkstop (using NVidia/GPU)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
The Ubuntu-power-systems project | Fix Released | Critical | Canonical Kernel Team |
linux (Ubuntu) | Fix Released | Critical | Khaled El Mously |
Bionic | Fix Released | Critical | Khaled El Mously |
Bug Description
A hung state machine in the chip's NMU logic can trigger a fatal condition that the hardware flags with a checkstop. As a result, customers with a POWER9 Witherspoon system (equipped with GPUs) will experience a server crash when using NVIDIA's toolkit.
The server will crash with the following hardware failure message:
Unrecoverable Hardware Failure, (Critical) A system checkstop occurred (AffectedSubsystem: Canister/Appliance, PID: 19703), Resolved: 0
In this case, an `NCUFIR[10] tlbie master timeout` has been observed simply by starting the NVIDIA ATS driver. The issue is triggered because the NMU logic gets stuck when a page is upgraded from RO -> RW without a subsequent tlbie.
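To make the trigger condition concrete, here is a toy C model that scans a trace of events for a single page and reports whether an RO -> RW upgrade is left without a later tlbie. The event names and `has_unflushed_upgrade` are hypothetical; this is an illustrative sketch under that assumption, not kernel code, and it does not model the NMU hardware or the actual changes in the patches listed below.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event kinds for a toy trace of one page's mapping history. */
typedef enum {
    EV_MAP_RO,     /* page mapped read-only */
    EV_UPGRADE_RW, /* PTE permissions relaxed from RO to RW */
    EV_TLBIE       /* tlbie issued for the page */
} event_kind;

/*
 * Returns true if the trace contains an RO -> RW upgrade with no tlbie
 * anywhere after it, i.e. the trigger condition described in the bug.
 * Tracking only the most recent upgrade is sufficient: if any upgrade
 * lacks a later tlbie, the last upgrade in the trace lacks one too.
 */
bool has_unflushed_upgrade(const event_kind *trace, size_t n)
{
    bool pending = false; /* an upgrade has been seen with no tlbie since */
    for (size_t i = 0; i < n; i++) {
        if (trace[i] == EV_UPGRADE_RW)
            pending = true;
        else if (trace[i] == EV_TLBIE)
            pending = false;
    }
    return pending;
}
```

For example, the trace `{EV_MAP_RO, EV_UPGRADE_RW}` is flagged, while `{EV_MAP_RO, EV_UPGRADE_RW, EV_TLBIE}` is not, because the upgrade is followed by a tlbie.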
This is addressed with the following patches:
bd5050e38aec305
e4c1112c3fc503f
044003b52a78bcb
f069ff396d657ac
tags: | added: architecture-ppc64le bugnameltc-170972 severity-critical targetmilestone-inin1804 |
Changed in ubuntu: | |
assignee: | nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) |
affects: | ubuntu → linux (Ubuntu) |
Changed in ubuntu-power-systems: | |
status: | New → Triaged |
importance: | Undecided → Critical |
assignee: | nobody → Canonical Kernel Team (canonical-kernel-team) |
tags: | added: triage-g |
Changed in linux (Ubuntu): | |
assignee: | Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Kernel Team (canonical-kernel-team) |
importance: | Undecided → Critical |
Changed in linux (Ubuntu): | |
status: | New → Triaged |
Changed in linux (Ubuntu Bionic): | |
status: | New → Triaged |
importance: | Undecided → Critical |
assignee: | nobody → Joseph Salisbury (jsalisbury) |
Changed in linux (Ubuntu): | |
assignee: | Canonical Kernel Team (canonical-kernel-team) → Joseph Salisbury (jsalisbury) |
Changed in linux (Ubuntu): | |
assignee: | Joseph Salisbury (jsalisbury) → Khaled El Mously (kmously) |
Changed in linux (Ubuntu Bionic): | |
assignee: | Joseph Salisbury (jsalisbury) → Khaled El Mously (kmously) |
Changed in linux (Ubuntu Bionic): | |
status: | Triaged → Fix Committed |
Changed in linux (Ubuntu): | |
status: | Triaged → Fix Committed |
Changed in ubuntu-power-systems: | |
status: | Triaged → Fix Committed |
tags: | added: verification-done-bionic removed: verification-needed-bionic |
Changed in linux (Ubuntu): | |
status: | Fix Committed → Fix Released |
Changed in ubuntu-power-systems: | |
status: | Fix Committed → Fix Released |
tags: | added: cscc |
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.
If verification is not done within 5 working days from today, this fix will be dropped from the source code and this bug will be closed.
See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!