AMD GPUs fail with null pointer dereference when IOMMU enabled, leading to black screen
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Undecided | Unassigned |
Jammy | Fix Committed | High | Matthew Ruffell |
Bug Description
BugLink: https:/
[Impact]
On systems with AMD Picasso/Raven 2 GPU devices, when the IOMMU is enabled, the system fails to boot correctly, and all users see is a black screen.
This is caused by a null pointer dereference when enabling the IOMMU after the device has been initialised. It should happen the other way around.
AMD-Vi: AMD IOMMUv2 loaded and initialized
...
amdgpu: Topology: Add APU node [0x15d8:0x1002]
kfd kfd: amdgpu: added device 1002:15d8
kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15d8
...
amdgpu 0000:06:00.0: amdgpu: amdgpu_
amdgpu 0000:06:00.0: amdgpu: Fatal error during GPU init
amdgpu 0000:06:00.0: amdgpu: amdgpu: finishing device.
...
BUG: kernel NULL pointer dereference, address: 000000000000013c
...
CPU: 1 PID: 223 Comm: systemd-udevd Not tainted 5.15.0-112-generic #122-Ubuntu
...
RIP: 0010:amdgpu_
...
Call Trace:
<TASK>
? srso_return_
? show_trace_
? show_trace_
? dm_hw_fini+
? show_regs.
? __die_body.
? __die+0x2b/0x37
? page_fault_
? srso_return_
? do_user_
? srso_return_
? __free_
? exc_page_
? asm_exc_
? amdgpu_
dm_hw_
amdgpu_
amdgpu_
amdgpu_
amdgpu_
amdgpu_
local_
? srso_return_
pci_device_
really_
__driver_
driver_
__driver_
? __device_
bus_for_
driver_
bus_add_
? srso_return_
driver_
__pci_
amdgpu_
? 0xffffffffc0e0b000
do_one_
? srso_return_
? kmem_cache_
do_init_
load_module+
__do_sys_
__x64_
x64_sys_
do_syscall_
...
entry_
A workaround does exist. Users can add "nomodeset" or "amd_iommu=off" to GRUB_CMDLINE_
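To confirm that one of these parameters actually took effect after a reboot, the running kernel's command line can be inspected. The following is a minimal sketch, not part of the original report; the function name `active_workarounds` is hypothetical, and it only checks for the two parameters named above.

```python
# Parameters named in the workaround above.
WORKAROUND_PARAMS = ("nomodeset", "amd_iommu=off")


def active_workarounds(cmdline):
    """Return the workaround parameters present in a kernel command line.

    `cmdline` is the whitespace-separated parameter string, for example
    the contents of /proc/cmdline on a running system.
    """
    tokens = cmdline.split()
    # Match whole tokens only, so "amd_iommu=on" does not count.
    return [param for param in WORKAROUND_PARAMS if param in tokens]
```

On an affected machine this could be called as `active_workarounds(open("/proc/cmdline").read())`; an empty list means neither workaround is in effect for the current boot.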
[Fix]
The regression was caused by the following commit that landed in 5.15.0-112-generic, and 5.15.150 upstream:
commit 3c7e53c0d4b43ff
Author: Yifan Zhang <email address hidden>
Date: Tue Sep 28 15:42:35 2021 +0800
Subject: drm/amdgpu: init iommu after amdkfd device init
Link: https:/
The fix is to revert this patch, as it was not supposed to be backported to 5.15 stable.
The mailing list discussion with AMD developers is:
https://<email address hidden>/
The fix hasn't been acknowledged by Greg KH or Sasha Levin yet, so we are sending it as an Ubuntu SAUCE patch. If the upstream status changes, we can NAK and resend.
[Testcase]
You need a system with an AMD Picasso/Raven 2 device. It will likely be an APU, and not a discrete graphics card, but any AMD Picasso/Raven 2 device is affected.
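One way to check whether a machine has a matching device is to compare PCI vendor and device IDs from sysfs against the ID that appears in the log excerpt in this report (`1002:15d8`). This is a rough sketch under that assumption; the function name `find_affected_devices` is hypothetical, and the ID set may not cover every affected part.

```python
from pathlib import Path

# Vendor/device pair taken from the log line "added device 1002:15d8".
# Assumption: other affected IDs, if any exist, are not listed here.
AFFECTED_IDS = {("0x1002", "0x15d8")}


def find_affected_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return PCI addresses whose vendor/device IDs are in AFFECTED_IDS."""
    matches = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return matches
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip().lower()
            device = (dev / "device").read_text().strip().lower()
        except OSError:
            continue  # entry without readable ID files; skip it
        if (vendor, device) in AFFECTED_IDS:
            matches.append(dev.name)
    return matches
```

Calling `find_affected_devices()` with no arguments reads the live sysfs tree; a non-empty result such as `['0000:06:00.0']` suggests the machine is a candidate for this test case.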
Install the kernel and boot. Make sure full modesetting is enabled.
There is a test kernel available in the ppa below:
https:/
If you install the test kernel, your system should boot successfully.
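A successful boot can be double-checked by scanning the kernel log for the failure messages quoted in the [Impact] section. The sketch below is illustrative only; the function name `find_failure_signatures` and the signature labels are hypothetical, while the matched strings are copied from the log excerpt in this report.

```python
import re

# Message fragments copied from the log excerpt in this report.
FAILURE_PATTERNS = {
    "iommu_resume_failed": re.compile(r"Failed to resume IOMMU for device"),
    "gpu_init_fatal": re.compile(r"Fatal error during GPU init"),
    "null_deref": re.compile(r"BUG: kernel NULL pointer dereference"),
}


def find_failure_signatures(log_text):
    """Return the labels of all failure signatures found in a kernel log."""
    return [
        label
        for label, pattern in FAILURE_PATTERNS.items()
        if pattern.search(log_text)
    ]
```

The log text could come from `dmesg` on the current boot, or from `journalctl -k -b -1` for the previous boot if persistent journaling is enabled. An empty result on the test kernel, with modesetting and the IOMMU both enabled, is consistent with the fix working.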
[Where problems could occur]
We are reverting a problematic patch and going back to how it was before 5.15.0-112-generic. This should not cause any issues for users.
If a regression were to occur, users can add "nomodeset" or "amd_iommu=off" to GRUB_CMDLINE_
The impact of a regression would be high, as users' displays could be blank.
[Other Info]
User reports:
https:/
https:/
https:/
https:/
https:/
https:/
https:/
As bizarre as it is, this commit was actually originally included in 5.15-rc5:
commit 714d9e4574d5459
Author: Yifan Zhang <email address hidden>
Date: Tue Sep 28 15:42:35 2021 +0800
Subject: drm/amdgpu: init iommu after amdkfd device init
Link: https:/
It seems to have caused issues back then too, and was removed in the following fixups, in 5.16-rc1:
commit 93cec184788b0cf
Author: James Zhu <email address hidden>
Date: Tue Nov 2 21:33:50 2021 -0400
Subject: drm/amdgpu: remove duplicated kfd_resume_iommu
Link: https:/
commit 9f4f2c1a35248f5
Author: shaoyunl <email address hidden>
Date: Fri Nov 5 12:34:14 2021 -0400
Subject: drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov
Link: https:/
I'm not exactly in favor of rewriting history twice, so I think we should just revert the upstream stable patch and move on.
Changed in linux (Ubuntu Jammy):
importance: Undecided → High
summary:
- Kernel update 5.15.0-112 might cause severe problems with specific AMD GPUs
+ AMD GPUs fail with null pointer dereference when IOMMU enabled, leading to black screen
description: updated
Changed in linux (Ubuntu Jammy):
status: In Progress → Fix Committed
Status changed to 'Confirmed' because the bug affects multiple users.