amdgpu: GPU Recovery fails, frequent hangs
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Mesa | Fix Released | Unknown | |
linux (Ubuntu) | Confirmed | Undecided | Unassigned |
Jammy | Confirmed | Undecided | Unassigned |
Lunar | Confirmed | Undecided | Unassigned |
mesa (Ubuntu) | Fix Released | Undecided | Unassigned |
Jammy | New | Undecided | Unassigned |
Lunar | New | Undecided | Unassigned |
Bug Description
I've been using 23.04 for a few months and occasionally experienced a total system hang when sharing my screen over Zoom or Google Meet (running in Google Chrome).
At first it hangs, and then it periodically flashes as if it's trying (unsuccessfully) to recover. I've got 3 screens (including the laptop's internal one), and each attempt shows something different: at first it tries to restore the contents of all 3 screens, then it shows only one of them, and then it shows the same content on all 3, but it never becomes responsive.
I recently upgraded to 23.10, hoping a new kernel would help. Instead, it has gotten considerably worse: it sometimes hangs just when opening Zoom, and it's somehow easier to reproduce with Google Chrome. Interestingly, it now fails quickly and reliably when I enable my webcam (with special effects). It has also started hanging badly when using Google Maps.
For all these behaviors, I suspect amdgpu is to blame (I'm running on Renoir, a Ryzen 7 PRO 4750U); `dmesg` and `journalctl` didn't seem to show anything interesting.
Any tips about debugging this further?
ProblemType: Bug
DistroRelease: Ubuntu 23.10
Package: linux-generic 6.5.0.10.12
ProcVersionSign
Uname: Linux 6.5.0-10-generic x86_64
ApportVersion: 2.27.0-0ubuntu5
Architecture: amd64
CRDA: N/A
CasperMD5CheckR
CurrentDesktop: GNOME
Date: Thu Nov 16 02:27:45 2023
InstallationDate: Installed on 2023-07-02 (137 days ago)
InstallationMedia: Ubuntu 23.04 "Lunar Lobster" - Release amd64 (20230418)
MachineType: {report[
ProcEnviron:
LANG=en_US.UTF-8
PATH=(custom, no user)
SHELL=/bin/bash
TERM=xterm-
XDG_RUNTIME_
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageV
linux-
linux-
linux-firmware 20230919.
SourcePackage: linux
UpgradeStatus: Upgraded to mantic on 2023-11-14 (2 days ago)
dmi.bios.date: 06/13/2023
dmi.bios.release: 1.44
dmi.bios.vendor: LENOVO
dmi.bios.version: R1BET75W(1.44 )
dmi.board.
dmi.board.name: 20UD000GUS
dmi.board.vendor: LENOVO
dmi.board.version: SDK0J40697 WIN
dmi.chassis.
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.
dmi.ec.
dmi.modalias: dmi:bvnLENOVO:
dmi.product.family: ThinkPad T14 Gen 1
dmi.product.name: 20UD000GUS
dmi.product.sku: LENOVO_
dmi.product.
dmi.sys.vendor: LENOVO
X-HWE-Bug: Bug #2047389
affects: mutter (Ubuntu) → mesa (Ubuntu)
Changed in mesa (Ubuntu):
status: New → Fix Released
Changed in linux (Ubuntu):
status: New → Won't Fix
Changed in linux (Ubuntu Jammy):
status: New → Won't Fix
Changed in linux (Ubuntu Lunar):
status: New → Won't Fix
Changed in mesa:
status: Unknown → New
Changed in mesa:
status: New → Fix Released
description: updated
This is the only output after it tries a few times to recover.
But this output was already there before the crash. In fact, this is `dmesg` on my current session (obviously not hung... yet):
```
[ 5.655902] amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 5.668958] amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 5.676129] [drm] psp gfx command LOAD_TA(0x1) failed and response status is (0x7)
[ 5.676323] [drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x4)
[ 5.676330] amdgpu 0000:07:00.0: amdgpu: Secure display: Generic Failure.
[ 5.676351] amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0
[ 5.676963] amdgpu 0000:07:00.0: amdgpu: SMU is initialized successfully!
[ 5.678001] [drm] Display Core v3.2.241 initialized on DCN 2.1
[ 5.678007] [drm] DP-HDMI FRL PCON supported
[ 5.678789] [drm] DMUB hardware initialized: version=0x01010027
[ 5.888351] usb 4-1.2: new high-speed USB device number 9 using xhci_hcd
```
It's followed by ~1600 more lines. Also, the timestamps show this is from the very beginning of boot (it looks like it hung while trying to show dmesg itself).
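Since the full log runs to roughly 1600 lines, a small filter can help narrow it down. The following is a minimal sketch rather than a diagnostic tool: it keeps only amdgpu/drm lines whose message contains a failure-like keyword. The keyword list and the function name `suspicious_lines` are assumptions made for illustration, not an official set of amdgpu messages, and the sample text is taken from the excerpt above.

```python
import re

# Assumed keyword list, chosen for illustration only; adjust as needed.
FAILURE_PATTERN = re.compile(r"fail|timeout|reset|hang", re.IGNORECASE)
# Lines from the GPU driver stack: "amdgpu ..." or "[drm] ...".
DRIVER_PATTERN = re.compile(r"amdgpu|\[drm\]")
# dmesg-style prefix: "[ 5.676129] message".
TIMESTAMP_PATTERN = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def suspicious_lines(log_text):
    """Return (timestamp, message) pairs for amdgpu/drm lines that look like failures."""
    hits = []
    for line in log_text.splitlines():
        match = TIMESTAMP_PATTERN.match(line.strip())
        if not match:
            continue  # skip lines without a dmesg-style timestamp
        timestamp, message = float(match.group(1)), match.group(2)
        if DRIVER_PATTERN.search(message) and FAILURE_PATTERN.search(message):
            hits.append((timestamp, message))
    return hits


# Sample lines copied from the dmesg excerpt in this report.
SAMPLE = """\
[ 5.676129] [drm] psp gfx command LOAD_TA(0x1) failed and response status is (0x7)
[ 5.676330] amdgpu 0000:07:00.0: amdgpu: Secure display: Generic Failure.
[ 5.676963] amdgpu 0000:07:00.0: amdgpu: SMU is initialized successfully!
[ 5.888351] usb 4-1.2: new high-speed USB device number 9 using xhci_hcd
"""

for timestamp, message in suspicious_lines(SAMPLE):
    print(f"{timestamp:.6f}  {message}")
```

On the sample, this keeps the `LOAD_TA` failure and the "Secure display: Generic Failure" line and drops the other two. The same function could be applied to the saved output of `dmesg`; lines that match are only candidates to look at, not proof of what causes the hang.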