amdgpu: GPU Recovery fails, frequent hangs

Bug #2043640 reported by nachokb
This bug affects 1 person
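| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Mesa | Fix Released | Unknown | | |
| linux (Ubuntu) | Confirmed | Undecided | Unassigned | |
| linux (Ubuntu Jammy) | Confirmed | Undecided | Unassigned | |
| linux (Ubuntu Lunar) | Confirmed | Undecided | Unassigned | |
| mesa (Ubuntu) | Fix Released | Undecided | Unassigned | |
| mesa (Ubuntu Jammy) | New | Undecided | Unassigned | |
| mesa (Ubuntu Lunar) | New | Undecided | Unassigned | |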

Bug Description

I've been using 23.04 for a few months, and experienced a total system hang occasionally when sharing my screen over Zoom or Google Meet (running on Google Chrome).

At first it hangs and then it periodically flashes like it's trying (unsuccessfully) to recover; I've got 3 screens (including the laptop's internal one) and each attempt shows something different (at first it tries to recover the contents of all 3 screens, then it shows only one of them, and then it shows the same content on all 3, but it never gets responsive).

I've recently upgraded to 23.10, hoping a new kernel would help the situation. It's only gotten considerably worse now; it hangs sometimes just when opening Zoom; it's somehow easier to reproduce with Google Chrome. Interestingly, it fails quickly and reliably now when enabling my webcam (with special effects). It started hanging badly when using Google Maps as well.

For all these behaviors, I suspect amdgpu is to blame (I'm running on Renoir, a Ryzen 7 PRO 4750U); `dmesg` and `journalctl` didn't seem to show anything interesting.

Any tips about debugging this further?

ProblemType: Bug
DistroRelease: Ubuntu 23.10
Package: linux-generic 6.5.0.10.12
ProcVersionSignature: Ubuntu 6.5.0-10.10-generic 6.5.3
Uname: Linux 6.5.0-10-generic x86_64
ApportVersion: 2.27.0-0ubuntu5
Architecture: amd64
CRDA: N/A
CasperMD5CheckResult: pass
CurrentDesktop: GNOME
Date: Thu Nov 16 02:27:45 2023
InstallationDate: Installed on 2023-07-02 (137 days ago)
InstallationMedia: Ubuntu 23.04 "Lunar Lobster" - Release amd64 (20230418)
MachineType: LENOVO 20UD000GUS
ProcEnviron:
 LANG=en_US.UTF-8
 PATH=(custom, no user)
 SHELL=/bin/bash
 TERM=xterm-256color
 XDG_RUNTIME_DIR=<set>
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-6.5.0-10-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash vt.handoff=7
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-6.5.0-10-generic N/A
 linux-backports-modules-6.5.0-10-generic N/A
 linux-firmware 20230919.git3672ccab-0ubuntu2.1
SourcePackage: linux
UpgradeStatus: Upgraded to mantic on 2023-11-14 (2 days ago)
dmi.bios.date: 06/13/2023
dmi.bios.release: 1.44
dmi.bios.vendor: LENOVO
dmi.bios.version: R1BET75W(1.44 )
dmi.board.asset.tag: Not Available
dmi.board.name: 20UD000GUS
dmi.board.vendor: LENOVO
dmi.board.version: SDK0J40697 WIN
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: None
dmi.ec.firmware.release: 1.44
dmi.modalias: dmi:bvnLENOVO:bvrR1BET75W(1.44):bd06/13/2023:br1.44:efr1.44:svnLENOVO:pn20UD000GUS:pvrThinkPadT14Gen1:rvnLENOVO:rn20UD000GUS:rvrSDK0J40697WIN:cvnLENOVO:ct10:cvrNone:skuLENOVO_MT_20UD_BU_Think_FM_ThinkPadT14Gen1:
dmi.product.family: ThinkPad T14 Gen 1
dmi.product.name: 20UD000GUS
dmi.product.sku: LENOVO_MT_20UD_BU_Think_FM_ThinkPad T14 Gen 1
dmi.product.version: ThinkPad T14 Gen 1
dmi.sys.vendor: LENOVO

X-HWE-Bug: Bug #2047389

nachokb (nachokb) wrote :

This is the only output after it tries a few times to recover.

However, the same output was already there before the crash. In fact, this is `dmesg` from my current session (obviously not hung... yet):

```
[ 5.655902] amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 5.668958] amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 5.676129] [drm] psp gfx command LOAD_TA(0x1) failed and response status is (0x7)
[ 5.676323] [drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x4)
[ 5.676330] amdgpu 0000:07:00.0: amdgpu: Secure display: Generic Failure.
[ 5.676351] amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0
[ 5.676963] amdgpu 0000:07:00.0: amdgpu: SMU is initialized successfully!
[ 5.678001] [drm] Display Core v3.2.241 initialized on DCN 2.1
[ 5.678007] [drm] DP-HDMI FRL PCON supported
[ 5.678789] [drm] DMUB hardware initialized: version=0x01010027
[ 5.888351] usb 4-1.2: new high-speed USB device number 9 using xhci_hcd
```

It's followed by ~1600 more lines. The timestamp also shows this is from the very beginning of the log (it looks like it hung while trying to show dmesg itself).

nachokb (nachokb) wrote :

This is a video showing the behavior once it hangs: https://youtu.be/cTQtYIKzo8E

nachokb (nachokb) wrote :

I found relevant log entries.
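
The attachment itself is not reproduced here. As a rough sketch of how such entries can be pulled out of a kernel log, the hypothetical helper below (`filter_gpu_lines` is an illustrative name, not an existing tool) keeps only the lines that mention amdgpu or DRM. It assumes the journal is persistent, so that the boot in which the hang happened is still available after a forced reboot.

```sh
#!/bin/sh
# Hypothetical helper: keep only kernel log lines that mention the amdgpu
# driver or carry a "[drm" prefix. Reads from stdin, writes to stdout.
filter_gpu_lines() {
    grep -iE 'amdgpu|\[drm'
}

# Typical use (assumes systemd-journald with persistent storage):
#   journalctl -k -b -1 --no-pager | filter_gpu_lines
# "-k" limits output to kernel messages and "-b -1" selects the previous
# boot, i.e. the one that ended in the hang.
```

Filtering the previous boot rather than the current `dmesg` matters here, because the messages logged around the hang are lost from the in-memory ring buffer once the machine is power-cycled.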

nachokb (nachokb) wrote :

This is all the logs for that boot.

summary: - amdgpu hangs the computer frequently
+ amdgpu: GPU Recovery fails, frequent hangs
Mario Limonciello (superm1) wrote :

Try amdgpu.mcbp=0 on your kernel command line.
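
For anyone unsure how to do that on Ubuntu, here is a minimal sketch, assuming the default GRUB setup where kernel parameters live in the `GRUB_CMDLINE_LINUX_DEFAULT` line of `/etc/default/grub`. The helper name `add_kernel_param` and the working copy `grub.new` are made up for illustration; editing the file by hand works just as well.

```sh
#!/bin/sh
# Hypothetical helper: append a kernel parameter to the
# GRUB_CMDLINE_LINUX_DEFAULT="..." line of a GRUB defaults file,
# unless that exact parameter is already present on the line.
# Uses "sed -i", which assumes GNU sed (the default on Ubuntu).
add_kernel_param() {
    param="$1"
    file="$2"
    # Already there? Then leave the file untouched.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 ${param}\"|" "$file"
}

# Suggested use, working on a copy first so the change can be reviewed:
#   cp /etc/default/grub ./grub.new
#   add_kernel_param amdgpu.mcbp=0 ./grub.new
#   diff -u /etc/default/grub ./grub.new
#   sudo install -m 644 ./grub.new /etc/default/grub
#   sudo update-grub
#   sudo reboot
```

After the reboot, `cat /proc/cmdline` should include `amdgpu.mcbp=0`. The value the driver actually picked up should also be readable from `/sys/module/amdgpu/parameters/mcbp`, assuming the parameter is exposed in sysfs on your kernel.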

nachokb (nachokb) wrote :

Thanks a lot, Mario, for pointing me in the right direction. It's definitely MCBP that causes it. I've been testing with the parameter for a few hours; before that, the hang was _very_ easy to trigger.

It seems to be triggered only (or at least _more frequently_) when using multiple outputs and USB-C DP => HDMI through docks. That would explain why it slipped through testing while being so pervasive in my particular setup.

This seems to be an active topic atm on both the GNOME Mutter GitLab and the Ubuntu side (and elsewhere as well). Some references for the record:

1. https://unix.stackexchange.com/questions/756281/kernel-6-5-2-seems-to-have-amdgpu-crash-on-no-retry-page-fault
2. https://www.reddit.com/r/Fedora/comments/16wzpup/how_to_mitigate_amdgpu_crash_caused_by_bug_in/?rdt=47451
3. https://gitlab.freedesktop.org/drm/amd/-/issues/2830
4. https://gitlab.freedesktop.org/drm/amd/-/issues/2971
5. https://gitlab.gnome.org/GNOME/mutter/-/issues/3151

I don't know how to close this bug.

Mario Limonciello (superm1) wrote :

I don't think we should close it; if you can reproduce a hang in Ubuntu, it should be fixed in Ubuntu.
The patch should come in mesa. I already know what patch should fix it in mesa (it's mentioned in https://gitlab.freedesktop.org/drm/amd/-/issues/2971)

If I build you a PPA to test, can you see if it helps without the parameter?

affects: mutter (Ubuntu) → mesa (Ubuntu)
Changed in mesa (Ubuntu):
status: New → Fix Released
Changed in linux (Ubuntu):
status: New → Won't Fix
Changed in linux (Ubuntu Jammy):
status: New → Won't Fix
Changed in linux (Ubuntu Lunar):
status: New → Won't Fix
Mario Limonciello (superm1) wrote :

I've published a PPA here: https://launchpad.net/~superm1/+archive/ubuntu/gitlab2971/+packages

This has builds both for 22.04 (Jammy) and 23.04 (Lunar). Please upgrade to that, drop the module parameter and see if things improve.

# sudo add-apt-repository ppa:superm1/gitlab2971
# sudo apt upgrade
# sudo reboot

If they don't, you can remove the PPA using ppa-purge like this:

# sudo ppa-purge ppa:superm1/gitlab2971
# sudo reboot
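
To make sure a test really runs without the workaround, it helps to confirm that the parameter is gone from the running kernel's command line. A minimal sketch follows; the function name `has_kernel_param` is hypothetical, and the optional second argument exists only so the check can be tried against a file other than `/proc/cmdline`.

```sh
#!/bin/sh
# Hypothetical helper: succeed if the named parameter appears on the kernel
# command line, either bare ("quiet") or with a value ("amdgpu.mcbp=0").
# Reads /proc/cmdline unless another file is given as the second argument.
has_kernel_param() {
    name="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # One token per line, keep the part before the first "=",
    # then look for an exact, literal match of the name.
    tr -s '[:space:]' '\n' < "$cmdline_file" | cut -d= -f1 | grep -qxF -- "$name"
}

if has_kernel_param amdgpu.mcbp; then
    echo "amdgpu.mcbp is still set on the kernel command line"
else
    echo "amdgpu.mcbp is not set; the driver default applies"
fi
```

If the parameter is still reported after removing it from `/etc/default/grub`, the usual cause is a missing `sudo update-grub` before the reboot.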

Changed in mesa:
status: Unknown → New
Changed in mesa:
status: New → Fix Released
Mario Limonciello (superm1) wrote :

This is the same issue as https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2045573

Here is a commit that fixes the issue by changing the default pre-emption policy, since the kernel can't know about your mesa version.

https://github.com/torvalds/linux/commit/d6a57588666301acd9d42d3b00d74240964f07f6

Changed in linux (Ubuntu):
status: Won't Fix → Confirmed
Changed in linux (Ubuntu Jammy):
status: Won't Fix → Confirmed
Changed in linux (Ubuntu Lunar):
status: Won't Fix → Confirmed
AaronMa (mapengyu)
description: updated