amdgpu hangs for 90 seconds at a time in 5.13.0-23, but 5.13.0-22 works
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Undecided | Unassigned |
Impish | Fix Released | High | Unassigned |
Jammy | Fix Released | Undecided | Unassigned |

Bug Description
SRU Justification
Impact:
This does not occur with linux-image-
On startup, I get about a 60 second hang, with the following in the kernel dmesg:
Jan 4 15:26:36 inspiron-3505 kernel: [ 34.160572] amdgpu 0000:04:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
Jan 4 15:26:56 inspiron-3505 kernel: [ 54.189055] amdgpu 0000:04:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
Jan 4 15:27:16 inspiron-3505 kernel: [ 74.329264] amdgpu 0000:04:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
Jan 4 15:27:36 inspiron-3505 kernel: [ 94.337904] amdgpu 0000:04:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
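For anyone triaging a similar log, here is a minimal sketch (not part of the original report) that pulls the "failed to write reg" lines out of kernel log text and measures the gap between them. The function names `parse_failures` and `gaps` are made up for illustration, and the regular expression is an assumption based only on the four lines quoted above; it matches on the PCI address and the message text and does not depend on the prefix in between.

```python
import re

# Pattern inferred from the log lines quoted in this report (an assumption,
# not an official format): kernel timestamp, "amdgpu <pci-address>:", then
# the "failed to write reg X wait reg Y" message.
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+amdgpu\s+(?P<dev>\S+):.*?"
    r"failed to write reg (?P<reg>[0-9a-f]+) wait reg (?P<wait>[0-9a-f]+)"
)


def parse_failures(log_text):
    """Return (timestamp, device, reg, wait_reg) for each matching line."""
    events = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            events.append((float(m["ts"]), m["dev"], m["reg"], m["wait"]))
    return events


def gaps(events):
    """Seconds between consecutive failures, rounded to 0.1 s."""
    times = [e[0] for e in events]
    return [round(later - earlier, 1) for earlier, later in zip(times, times[1:])]


# The four lines from this report.
SAMPLE = """\
Jan 4 15:26:36 inspiron-3505 kernel: [ 34.160572] amdgpu 0000:04:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
Jan 4 15:26:56 inspiron-3505 kernel: [ 54.189055] amdgpu 0000:04:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
Jan 4 15:27:16 inspiron-3505 kernel: [ 74.329264] amdgpu 0000:04:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
Jan 4 15:27:36 inspiron-3505 kernel: [ 94.337904] amdgpu 0000:04:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
"""

if __name__ == "__main__":
    events = parse_failures(SAMPLE)
    print(len(events), "failures, gaps (s):", gaps(events))
```

On the sample above the failures come roughly 20 seconds apart, spanning about 60 seconds from first to last, which is consistent with the startup hang described.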
I have the following GPU:
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso (rev c2) (prog-if 00 [VGA controller])
04:00.0 0300: 1002:15d8 (rev c2)
(This is a Ryzen 5 3450U CPU with Radeon Vega Mobile.)
I get a similar hang if I start firefox (when it's probing OpenGL contexts), and even with glxgears and glxinfo. It seems like anything that creates an OpenGL context triggers it. I also had a freeze when I tried running firefox and glxgears together, along with odd "BUG:" messages in the log (I have some in the attached log).
I was running with "iommu=pt", but I tried with this removed and still got the errors (though I think the amdgpu driver uses the IOMMU even when it's set to iommu=pt). See the attached log for some very odd "[Hardware Error]" messages that were logged on one test run. I think this was when I tried to run firestorm (second life viewer) -- that had a large pause, then opened to a black window.
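When comparing test runs with and without "iommu=pt", it helps to confirm what the running kernel was actually booted with. A minimal sketch, assuming a Linux system where the boot parameters are readable from /proc/cmdline; the helper name `iommu_setting` and the example command line in the comment are hypothetical, not taken from the report.

```python
def iommu_setting(cmdline):
    """Return the value of the iommu= boot parameter, or None if it is absent.

    Simplification: if the parameter appears more than once, the last
    occurrence is reported.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "iommu" and sep:
            value = val
    return value


if __name__ == "__main__":
    # /proc/cmdline holds the parameters the running kernel was booted with,
    # e.g. "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet iommu=pt" (made-up example).
    with open("/proc/cmdline") as f:
        print("iommu =", iommu_setting(f.read()))
```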
Per Google, I see there was a bug like this that turned up in kernel 5.14.15 and was fixed in 5.14.17. See https:/
Thanks!
--Henry
Fix:
upstream commit afd18180c070 ("drm/amdkfd: fix boot failure when iommu is disabled in Picasso.")
The patch was included in the Impish kernel in -proposed (5.13.0.24.24) as part of an upstream patch set. Multiple users have confirmed that the problem is resolved with the kernel in -proposed.
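As a rough aid for checking a machine against this report, the sketch below classifies a kernel release string by its Ubuntu ABI number. It encodes only what is stated here: 5.13.0-22 works and 5.13.0-23 hangs. Treating ABI 24 and later as fixed is an assumption drawn from the -proposed version 5.13.0.24.24 mentioned above. The function names are hypothetical.

```python
import platform
import re


def ubuntu_abi(release):
    """Split a release such as '5.13.0-23-generic' into ('5.13.0', 23)."""
    m = re.match(r"(\d+\.\d+\.\d+)-(\d+)", release)
    if not m:
        return None
    return m.group(1), int(m.group(2))


def status_per_report(release):
    """Classify a kernel release using only the versions named in this report."""
    parsed = ubuntu_abi(release)
    if parsed is None or parsed[0] != "5.13.0":
        return "not covered by this report"
    abi = parsed[1]
    if abi == 22:
        return "reported working"
    if abi == 23:
        return "reported affected"
    if abi >= 24:
        # Assumption: the fix in -proposed (5.13.0.24.24) corresponds to ABI 24.
        return "expected to include the fix"
    return "not covered by this report"


if __name__ == "__main__":
    # platform.release() returns the same string as `uname -r`.
    release = platform.release()
    print(release, "->", status_per_report(release))
```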
affects: | linux-hwe-5.13 (Ubuntu Impish) → linux (Ubuntu Impish) |
Changed in linux (Ubuntu Impish): | |
status: | New → Confirmed |
no longer affects: | linux (Ubuntu) |
description: | updated |
description: | updated |
no longer affects: | linux (Ubuntu Impish) |
affects: | linux-hwe-5.13 (Ubuntu) → linux (Ubuntu) |
Changed in linux (Ubuntu): | |
status: | New → Invalid |
Changed in linux (Ubuntu Impish): | |
importance: | Undecided → High |
tags: | added: amdgpu |
tags: | added: verification-failed-focal removed: verification-needed-focal |
Additional note: I did notice one "un-regression". I have a build of rocm where I've tried enabling "GFX902" support for my card. This is an unsupported configuration, so I don't know if I have it 100% functional, but I can check it with rocminfo (which, as the name suggests, dumps info about the rocm install and any video or compute cards it detects that it can use). With the 5.4.0-91-generic kernel I can run rocminfo and it dumps some info about the card. On 5.13.0-22 it prints:
hsa api call failure at: /home/hwertz/ROCm/rocminfo/rocminfo.cc:1143
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
On 5.13.0-23, although OpenGL is hosed, rocminfo didn't pause and printed the rocm-related information.