AMD GPUs fail with null pointer dereference when IOMMU enabled, leading to black screen

Bug #2068738 reported by berni123
This bug affects 50 people
Affects         Status         Importance  Assigned to      Milestone
linux (Ubuntu)  Fix Released   Undecided   Unassigned
Jammy           Fix Committed  High        Matthew Ruffell

Bug Description

BugLink: https://bugs.launchpad.net/bugs/2068738

[Impact]

On systems with AMD Picasso/Raven 2 GPU devices, when the IOMMU is enabled, the system fails to boot correctly, and all users see is a black screen.

This is caused by a null pointer dereference when enabling the IOMMU after the device has been initialised. It should happen the other way around.

AMD-Vi: AMD IOMMUv2 loaded and initialized
...
amdgpu: Topology: Add APU node [0x15d8:0x1002]
kfd kfd: amdgpu: added device 1002:15d8
kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15d8
...
amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_init failed
amdgpu 0000:06:00.0: amdgpu: Fatal error during GPU init
amdgpu 0000:06:00.0: amdgpu: amdgpu: finishing device.
...
BUG: kernel NULL pointer dereference, address: 000000000000013c
...
CPU: 1 PID: 223 Comm: systemd-udevd Not tainted 5.15.0-112-generic #122-Ubuntu
...
RIP: 0010:amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
...
Call Trace:
 <TASK>
 ? srso_return_thunk+0x5/0x10
 ? show_trace_log_lvl+0x28e/0x2ea
 ? show_trace_log_lvl+0x28e/0x2ea
 ? dm_hw_fini+0x23/0x30 [amdgpu]
 ? show_regs.part.0+0x23/0x29
 ? __die_body.cold+0x8/0xd
 ? __die+0x2b/0x37
 ? page_fault_oops+0x13b/0x170
 ? srso_return_thunk+0x5/0x10
 ? do_user_addr_fault+0x321/0x670
 ? srso_return_thunk+0x5/0x10
 ? __free_pages_ok+0x34a/0x4f0
 ? exc_page_fault+0x77/0x170
 ? asm_exc_page_fault+0x27/0x30
 ? amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
 dm_hw_fini+0x23/0x30 [amdgpu]
 amdgpu_device_ip_fini_early.isra.0+0x278/0x312 [amdgpu]
 amdgpu_device_fini_hw+0x156/0x208 [amdgpu]
 amdgpu_driver_unload_kms+0x69/0x90 [amdgpu]
 amdgpu_driver_load_kms.cold+0x81/0x107 [amdgpu]
 amdgpu_pci_probe+0x1d1/0x290 [amdgpu]
 local_pci_probe+0x4b/0x90
 ? srso_return_thunk+0x5/0x10
 pci_device_probe+0x119/0x200
 really_probe+0x222/0x420
 __driver_probe_device+0xe8/0x140
 driver_probe_device+0x23/0xc0
 __driver_attach+0xf7/0x1f0
 ? __device_attach_driver+0x140/0x140
 bus_for_each_dev+0x7f/0xd0
 driver_attach+0x1e/0x30
 bus_add_driver+0x148/0x220
 ? srso_return_thunk+0x5/0x10
 driver_register+0x95/0x100
 __pci_register_driver+0x68/0x70
 amdgpu_init+0x7c/0x1000 [amdgpu]
 ? 0xffffffffc0e0b000
 do_one_initcall+0x49/0x1e0
 ? srso_return_thunk+0x5/0x10
 ? kmem_cache_alloc_trace+0x19e/0x2e0
 do_init_module+0x52/0x260
 load_module+0xb45/0xbe0
 __do_sys_finit_module+0xbf/0x120
 __x64_sys_finit_module+0x18/0x20
 x64_sys_call+0x1ac3/0x1fa0
 do_syscall_64+0x56/0xb0
...
 entry_SYSCALL_64_after_hwframe+0x67/0xd1

A workaround does exist. Users can add "nomodeset" or "amd_iommu=off" to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, run update-grub, and reboot.
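
The permanent form of this workaround can be sketched in shell. This is a minimal sketch, not a supported tool: add_kernel_param is a hypothetical helper name, and it assumes GNU sed (as shipped on Ubuntu), a single double-quoted GRUB_CMDLINE_LINUX_DEFAULT line, and a parameter that contains no sed-special characters such as "/" or "&". Try it on a copy of /etc/default/grub first.

```shell
#!/bin/sh
# Hypothetical helper: add PARAM to GRUB_CMDLINE_LINUX_DEFAULT in FILE.
# Does nothing if PARAM is already present or the line is missing.
add_kernel_param() {
    file=$1
    param=$2
    # Read the current value between the double quotes.
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")
    case " $current " in
        *" $param "*) return 0 ;;   # already set, leave the file alone
    esac
    # Insert the parameter right after the opening quote.
    sed -i "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"/&$param /" "$file"
}
```

After applying the same change to the real file as root, run "sudo update-grub" and reboot, as described above.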

[Fix]

The regression was caused by the following commit that landed in 5.15.0-112-generic, and 5.15.150 upstream:

commit 3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd ubuntu-jammy
Author: Yifan Zhang <email address hidden>
Date: Tue Sep 28 15:42:35 2021 +0800
Subject: drm/amdgpu: init iommu after amdkfd device init
Link: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/commit/?id=3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd

The fix is to revert this patch, as it was not supposed to be backported to 5.15 stable.

The mailing list discussion with AMD developers is:

https://<email address hidden>/

The fix hasn't been acknowledged by Greg KH or Sasha Levin yet, so I am sending it as an Ubuntu SAUCE patch. If the upstream status changes, we can NAK and resend.

[Testcase]

You need a system with an AMD Picasso/Raven 2 device. It will likely be an APU, and not a discrete graphics card, but any AMD Picasso/Raven 2 device is affected.

Install the kernel and boot. Make sure full modesetting is enabled.

There is a test kernel available in the ppa below:

https://launchpad.net/~mruffell/+archive/ubuntu/lp2068738-test

If you install the test kernel, your system should boot successfully.

[Where problems could occur]

We are reverting a problematic patch and going back to how it was before 5.15.0-112-generic. This should not cause any issues for users.

If a regression were to occur, users can add "nomodeset" or "amd_iommu=off" to GRUB_CMDLINE_LINUX_DEFAULT, run update-grub and reboot, or pin their kernel to a working one.

The impact of a regression would be high, as users' displays could be blank.

[Other Info]

User reports:
https://forums.linuxmint.com/viewtopic.php?t=421484
https://forums.linuxmint.com/viewtopic.php?t=421441
https://www.reddit.com/r/Ubuntu/comments/1d9uviz/had_to_purge_kernel_5150112_could_not_boot/
https://www.reddit.com/r/linuxmint/comments/1d9w6c9/kernel_5150112_boot_failure/
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068735
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068793
https://bugs.launchpad.net/bugs/2068812

As bizarre as it is, this commit was actually originally included in 5.15-rc5:

commit 714d9e4574d54596973ee3b0624ee4a16264d700
Author: Yifan Zhang <email address hidden>
Date: Tue Sep 28 15:42:35 2021 +0800
Subject: drm/amdgpu: init iommu after amdkfd device init
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=714d9e4574d54596973ee3b0624ee4a16264d700

It seems to have caused issues back then too, and was removed in the following fixups, in 5.16-rc1:

commit 93cec184788b0cf3926bc1f7b47fed74ba87990c
Author: James Zhu <email address hidden>
Date: Tue Nov 2 21:33:50 2021 -0400
Subject: drm/amdgpu: remove duplicated kfd_resume_iommu
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=93cec184788b0cf3926bc1f7b47fed74ba87990c

commit 9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d
Author: shaoyunl <email address hidden>
Date: Fri Nov 5 12:34:14 2021 -0400
Subject: drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d

I'm not exactly in favor of rewriting history twice, so I think we should just revert the upstream stable patch and move on.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
rfc (rfc0) wrote :

I have a similar symptom: after upgrading to linux-image-5.15.0-112-generic, the screen goes black on boot and does not display the usual prompts to unlock the disk or log in to the Ubuntu user account.

The issue is recurring: repeated attempts to boot using linux-image-5.15.0-112-generic give the same result (black screen).

Restarting and using grub to boot using linux-image-5.15.0-107-generic works around the issue.

processor: AMD® Ryzen 3 2200g with radeon vega graphics × 4
graphics: AMD® Radeon vega 8 graphics
os name: Ubuntu 22.04.4 LTS
gnome version: 42.9
windowing system: wayland

Additional reports:

https://www.reddit.com/r/Ubuntu/comments/1d9uviz/had_to_purge_kernel_5150112_could_not_boot/

One of the folks impacted in the above reddit thread reported that although their display was black, the machine booted and they were able to interact with it normally over SSH.

Revision history for this message
Fred Jupiter (fredjupiter) wrote :

Try this ("Temporary" first to gain access to the computer again), if you want to use the new kernel:

It worked for me.

Workaround:
Temporary:
Add "nomodeset" before "quiet splash" by editing the boot entry in the GRUB menu.

See: https://askubuntu.com/questions/162075/my-computer-boots-to-a-black-screen-what-options-do-i-have-to-fix-it

Permanent:
Edit /etc/default/grub: in the line "GRUB_CMDLINE_LINUX_DEFAULT=", add the word "nomodeset" before "quiet splash".
Then run "sudo update-grub".

Revision history for this message
berni123 (bernd-bernd--zimmermann) wrote :

@Fred: setting nomodeset is only a limited workaround: it disables full graphics support, so no multi-display usage, no high resolution ...

Revision history for this message
berni123 (bernd-bernd--zimmermann) wrote :

Quick-Fix: GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=off"

Revision history for this message
Lin Manfu (linmanfu) wrote :

I am also affected by this bug. Booting to kernel 5.15.0-112 produces a black screen. Kernel 5.15.0-107 works correctly. Here is information about my system:

Operating System: Kubuntu 22.04
KDE Plasma Version: 5.24.7
KDE Frameworks Version: 5.92.0
Qt Version: 5.15.3
Kernel Version: 5.15.0-107-generic (64-bit)
Graphics Platform: X11
Processors: 8 × AMD Ryzen 5 3400G with Radeon Vega Graphics
Memory: 13.6 GiB of RAM
Graphics Processor: AMD Radeon Vega 11 Graphics

Revision history for this message
rfc (rfc0) wrote :
Revision history for this message
Lin Manfu (linmanfu) wrote (last edit ):

Attached is an annotated copy of my system log with the relevant update from 107 to 112, the failed boots with 112, a successful boot with the last good kernel 107, and a boot with 112 using the workaround. Search for BOOT in capital letters to find the start of each boot. I hope this might be helpful.

I am no expert, but perhaps these lines might be relevant, as they only seem to occur on the failed boots with 112:

Jun 8 19:17:56 paul kernel: [ 3.190269] amdgpu: Topology: Add APU node [0x15d8:0x1002]
Jun 8 19:17:56 paul kernel: [ 3.190273] kfd kfd: amdgpu: added device 1002:15d8
Jun 8 19:17:56 paul kernel: [ 3.190280] kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15d8
Jun 8 19:17:56 paul kernel: [ 3.190287] amdgpu 0000:27:00.0: amdgpu: amdgpu_device_ip_init failed
Jun 8 19:17:56 paul kernel: [ 3.190292] amdgpu 0000:27:00.0: amdgpu: Fatal error during GPU init
Jun 8 19:17:56 paul kernel: [ 3.190296] amdgpu 0000:27:00.0: amdgpu: amdgpu: finishing device.
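
To check another log for the same signature, a small shell sketch follows. has_iommu_resume_failure is a hypothetical helper name; the two strings it searches for are copied from the lines above, and the log file is whichever file holds the kernel messages of the failed boot.

```shell
#!/bin/sh
# Hypothetical helper: succeed if LOGFILE contains both messages that
# appear only on the failed boots quoted above.
has_iommu_resume_failure() {
    logfile=$1
    grep -q 'Failed to resume IOMMU for device' "$logfile" &&
        grep -q 'amdgpu_device_ip_init failed' "$logfile"
}
```

For example, assuming the kernel messages are in /var/log/syslog: has_iommu_resume_failure /var/log/syslog && echo "log matches this bug".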

Revision history for this message
berni123 (bernd-bernd--zimmermann) wrote :

Some more info on booting -112 related to amdgpu, which is crashing:

Jun 7 09:29:22 enterprise kernel: [ 2.809882] AMD-Vi: AMD IOMMUv2 loaded and initialized
Jun 7 09:29:22 enterprise kernel: [ 2.979171] amdgpu: Topology: Add APU node [0x0:0x0]
Jun 7 09:29:22 enterprise kernel: [ 2.979234] checking generic (d0000000 300000) vs hw (d0000000 10000000)
Jun 7 09:29:22 enterprise kernel: [ 2.979341] amdgpu 0000:06:00.0: amdgpu: Trusted Memory Zone (TMZ) feature enabled
Jun 7 09:29:22 enterprise kernel: [ 3.002476] amdgpu 0000:06:00.0: amdgpu: Fetched VBIOS from ROM BAR
Jun 7 09:29:22 enterprise kernel: [ 3.002478] amdgpu: ATOM BIOS: 113-PICASSO-117
Jun 7 09:29:22 enterprise kernel: [ 3.002608] amdgpu 0000:06:00.0: vgaarb: deactivate vga console
Jun 7 09:29:22 enterprise kernel: [ 3.002653] amdgpu 0000:06:00.0: amdgpu: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
Jun 7 09:29:22 enterprise kernel: [ 3.002656] amdgpu 0000:06:00.0: amdgpu: GART: 1024M 0x0000000000000000 - 0x000000003FFFFFFF
Jun 7 09:29:22 enterprise kernel: [ 3.002659] amdgpu 0000:06:00.0: amdgpu: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
Jun 7 09:29:22 enterprise kernel: [ 3.003052] amdgpu 0000:06:00.0: amdgpu: PSP runtime database doesn't exist
Jun 7 09:29:22 enterprise kernel: [ 3.004269] amdgpu: hwmgr_sw_init smu backed is smu10_smu
Jun 7 09:29:22 enterprise kernel: [ 3.004404] amdgpu 0000:06:00.0: amdgpu: Will use PSP to load VCN firmware
Jun 7 09:29:22 enterprise kernel: [ 3.088846] amdgpu 0000:06:00.0: amdgpu: RAS: optional ras ta ucode is not available
Jun 7 09:29:22 enterprise kernel: [ 3.093879] amdgpu 0000:06:00.0: amdgpu: RAP: optional rap ta ucode is not available
Jun 7 09:29:22 enterprise kernel: [ 3.186429] amdgpu: HMM registered 2048MB device memory
Jun 7 09:29:22 enterprise kernel: [ 3.186503] amdgpu: Topology: Add APU node [0x15d8:0x1002]
Jun 7 09:29:22 enterprise kernel: [ 3.186508] kfd kfd: amdgpu: added device 1002:15d8
Jun 7 09:29:22 enterprise kernel: [ 3.186516] kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15d8
Jun 7 09:29:22 enterprise kernel: [ 3.186530] amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_init failed
Jun 7 09:29:22 enterprise kernel: [ 3.186535] amdgpu 0000:06:00.0: amdgpu: Fatal error during GPU init
Jun 7 09:29:22 enterprise kernel: [ 3.186545] amdgpu 0000:06:00.0: amdgpu: amdgpu: finishing device.
Jun 7 09:29:22 enterprise kernel: [ 3.194967] BUG: kernel NULL pointer dereference, address: 000000000000013c
Jun 7 09:29:22 enterprise kernel: [ 3.194971] #PF: supervisor read access in kernel mode
Jun 7 09:29:22 enterprise kernel: [ 3.194975] #PF: error_code(0x0000) - not-present page
Jun 7 09:29:22 enterprise kernel: [ 3.194978] PGD 0 P4D 0
Jun 7 09:29:22 enterprise kernel: [ 3.194983] Oops: 0000 [#1] SMP NOPTI
Jun 7 09:29:22 enterprise kernel: [ 3.194987] CPU: 1 PID: 223 Comm: systemd-udevd Not tainted 5.15.0-112-generic #122-Ubuntu
Jun 7 09:29:22 enterprise kernel: [ 3.194993] Hardware name: Gigabyte Technology Co., Ltd. B450 AORUS ELITE/B450 AORUS ELITE, BIOS F60c 10/29/2020
Jun 7 09:29...

Revision history for this message
Carrington Institute (symbolize) wrote :

I had the same issue on similar hardware and had to revert to the previous kernel version, 5.15.0-107. There were some errors in the log that pointed at the AMD GPU.

If you need to boot from the previous kernel, tap Shift when the device boots, go into advanced options and select

kernel .107 generic

I then changed my grub menu

sudo nano /etc/default/grub

changed

GRUB_DEFAULT=0

to

GRUB_DEFAULT="1>2" (this tells GRUB to open the advanced options submenu, which is top-level entry 1 counting from 0, and boot its entry 2, which was the second Linux kernel image, .107, at least for me). Save the file.

sudo update-grub

sudo reboot

The device then booted successfully into the working .107

Then I removed the .112 kernel

sudo apt remove linux-image-5.15.0-112-generic

sudo apt remove linux-headers-5.15.0-112

sudo update-grub

I also put a hold on .112

sudo apt-mark hold linux-image-5.15.0-112-generic
apt-mark showhold

I then changed grub back to

GRUB_DEFAULT=0

I have attached a full log for the developers' perusal.
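
The index used in GRUB_DEFAULT="1>2" depends on the menu order on each machine, and GRUB counts entries from 0. A rough shell sketch for listing the indices follows. list_grub_entries is a hypothetical helper name, and it assumes the layout that update-grub normally generates: single-quoted titles, one level of submenu, and indented menuentry lines inside it.

```shell
#!/bin/sh
# Hypothetical helper: print each GRUB menu entry in CFG with the index
# that GRUB_DEFAULT would use ("N" at top level, "N>M" inside a submenu).
list_grub_entries() {
    awk -F"'" '
        # Top-level entry: number it and move on.
        /^menuentry / { printf "%d: %s\n", top++, $2; next }
        # Submenu: it takes a top-level slot; restart the inner counter.
        /^submenu /   { printf "%d: %s\n", top, $2; parent = top++; sub_idx = 0; next }
        # Indented entry: belongs to the most recent submenu.
        /^[ \t]+menuentry / { printf "%d>%d: %s\n", parent, sub_idx++, $2 }
    ' "$1"
}
```

A typical call would be list_grub_entries /boot/grub/grub.cfg (reading that file may need sudo). Pick the index printed next to the kernel you want to boot.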

Revision history for this message
Fenestrae, et vastata annis. (exfenestrae) wrote :

Ditto @linmanfu:

Operating System: Kubuntu 22.04
KDE Plasma Version: 5.24.7
KDE Frameworks Version: 5.92.0
Qt Version: 5.15.3
Kernel Version: 5.15.0-107-generic (64-bit)
Graphics Platform: X11
Processors: 8 × AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx
Memory: 13.6 GiB of RAM
Graphics Processor: AMD Radeon Vega 8 Graphics

Revision history for this message
Mario Limonciello (superm1) wrote :

AFAIK it's a bad backport to 5.15 stable. If it's what I think, here's the fix (IIRC).

https://<email address hidden>/

Try applying that to your 5.15 kernel.

Changed in linux (Ubuntu Jammy):
importance: Undecided → High
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Hi everyone,

Mario, that does indeed look like the relevant fix on the stable mailing list.
It reverts this commit:

commit 3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd ubuntu-jammy
Author: Yifan Zhang <email address hidden>
Date: Tue Sep 28 15:42:35 2021 +0800
Subject: drm/amdgpu: init iommu after amdkfd device init
Link: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/commit/?id=3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd

It arrived in 5.15.150 stable, applied in 5.15.0-112-generic.

Interestingly enough, this was once applied during the 5.15 development cycle
in 2021, in 5.15-rc5:

commit 714d9e4574d54596973ee3b0624ee4a16264d700
Author: Yifan Zhang <email address hidden>
Date: Tue Sep 28 15:42:35 2021 +0800
Subject: drm/amdgpu: init iommu after amdkfd device init
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=714d9e4574d54596973ee3b0624ee4a16264d700

Bizarre. It seems to have been removed back then as well.

Anyway, it seems it is on track to be once again reverted in the stable mailing list thread that Mario linked.

https://<email address hidden>/

To make sure this is the commit that is causing your issue, I have built test kernels with the above revert commit applied.

There are builds for 22.04 and 20.04 HWE.

I only just uploaded them, so give them 3 hours from this message to build before trying to install. You can also check build status here:
https://launchpad.net/~mruffell/+archive/ubuntu/lp2068738-test

Please note this package is NOT SUPPORTED by Canonical, and is for TESTING
PURPOSES ONLY. ONLY Install in a dedicated test environment.

Instructions to Install (On a focal or jammy system):
1) sudo add-apt-repository ppa:mruffell/lp2068738-test
2) sudo apt update
3) sudo apt install linux-image-unsigned-5.15.0-112-generic linux-modules-5.15.0-112-generic linux-modules-extra-5.15.0-112-generic linux-headers-5.15.0-112-generic
4) sudo reboot
5) uname -rv
Look for "5.15.0-112.122+TEST2068738v20240610b1".
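
Step 5 can also be checked in a script. The sketch below is only illustrative: is_test_kernel is a hypothetical helper name, and it matches on the "+TEST2068738" marker from the version string above rather than on the full string, since uname prints the version in a slightly different layout.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given "uname -rv" output carries
# the marker of the lp2068738 test build.
is_test_kernel() {
    case "$1" in
        *"+TEST2068738"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Usage: is_test_kernel "$(uname -rv)" && echo "test kernel is running".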

Can you boot into this kernel and let me know if it fixes the problem? If it does, we will chime in upstream, and get this included, and we will land it in the next Ubuntu kernel release.

Just give the kernels about 3 hours to build first.

Thanks,
Matthew

Changed in linux (Ubuntu Jammy):
status: New → In Progress
assignee: nobody → Matthew Ruffell (mruffell)
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Matthew Ruffell (mruffell) wrote :

PS. Please remove the workaround "nomodeset" if you applied it before you try the test kernel. I really just want to know if the reverted patch fixes the issue.
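
For anyone who added the workaround to /etc/default/grub, removing it again can be sketched as follows. remove_kernel_param is a hypothetical helper name; it assumes GNU sed, a single double-quoted GRUB_CMDLINE_LINUX_DEFAULT line, and remaining parameters that contain no "|", "&" or shell glob characters. Try it on a copy of the file first.

```shell
#!/bin/sh
# Hypothetical helper: drop PARAM from GRUB_CMDLINE_LINUX_DEFAULT in FILE,
# keeping every other parameter in its original order.
remove_kernel_param() {
    file=$1
    param=$2
    # Read the current value between the double quotes.
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")
    kept=""
    # Unquoted on purpose: split the value into words.
    for word in $current; do
        [ "$word" = "$param" ] || kept="${kept:+$kept }$word"
    done
    # Write the filtered value back.
    sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=\".*\"\$|GRUB_CMDLINE_LINUX_DEFAULT=\"$kept\"|" "$file"
}
```

After making the same change to the real file as root, run "sudo update-grub" before rebooting into the test kernel.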

Revision history for this message
Tino (tinoh82) wrote :

Same issue here.

My GRUB file looks like this. Any help?

# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
# info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
GRUB_CMDLINE_LINUX=""

# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Uncomment to disable graphical terminal (grub-pc only)
#GRUB_TERMINAL=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480

# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true

# Uncomment to disable generation of recovery mode menu entries
#GRUB_DISABLE_RECOVERY="true"

# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"

Revision history for this message
Tino (tinoh82) wrote :

It says Fix Released for linux (Ubuntu). Where can I get it?

thanks

Revision history for this message
Fred Jupiter (fredjupiter) wrote :

I had this problem, too, with a PC (I'm sorry, I can't access the PC; this is from my written/saved records):
CPU: AMD Ryzen 3 2200G, 4x 3.50GHz, boxed (YD2200C5FBBOX)
Graphics: AMD Vega 8
Kubuntu 22.04, fully updated today

Expected:
PC starts

Actual:
PC monitor stays black after logo of mainboard manufacturer.

Attempts at fixing this:
Because I had problems with startup where the addition of "nomodeset" in GRUB helped, I tried this. This time it did not help. The computer actually started - more than before - but got stuck at various stages of the boot process.

The PC would not fully start with kernel 5.15.0-112-generic.

I chose 5.15.0-107-generic in GRUB. 107 worked without a hitch.

At the moment the workaround is simply using 5.15.0-107-generic.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Hi Tino,

It says Fix Released for Ubuntu, that is, the latest Ubuntu release, which is 24.10 Oracular, because it isn't affected. Maybe I should have marked it Invalid in the first place.

Anyway, once someone tries the test kernel I built yesterday to see if it fixes the issue, I can go about getting the patch into Jammy's 5.15 kernel.

Thanks,
Matthew

Revision history for this message
Simon (simonphillips762) wrote :

This bug stopped my AMD 2200G from booting, and it also stopped my HP thin client with an AMD Embedded G-Series GX-420GI Radeon R7E CPU. Both run on the integrated graphics.

I'm trying to boot Lubuntu. I have reverted to kernel 107 and all seems OK so far. I found a Reddit post on apt-mark hold and am now holding linux-image, linux-headers, linux-modules and linux-modules-extra, after having removed all these packages via apt.

Hope the extra info helps.

Revision history for this message
Xie Dongping (dongping-xie) wrote :

Hi Matthew,

thanks for the quick fix. It works for me.

```
# uname -rv
5.15.0-112-generic #122+TEST2068738v20240610b1-Ubuntu SMP Mon Jun 10 05:02:07 UTC 2
# inxi -G
Graphics:
  Device-1: AMD Picasso/Raven 2 [Radeon Vega Series / Radeon Mobile Series]
    driver: amdgpu v: kernel
  Device-2: Lite-On HP Webcam type: USB driver: uvcvideo
  Display: server: X.Org v: 1.21.1.4 driver: X: loaded: amdgpu,ati
    unloaded: fbdev,modesetting,vesa gpu: amdgpu resolution: 1920x1080~60Hz
  OpenGL: renderer: AMD Radeon Vega 3 Graphics (raven2 LLVM 15.0.7 DRM 3.42
  5.15.0-112-generic)
    v: 4.6 Mesa 23.2.1-1ubuntu3.1~22.04.2
```

Also tried `vkcube` and it also works. So I guess that this is fixed.

Thanks!

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Hi Xie,

Thanks for trying the test kernel, and it's great news that we have identified the fix. I'll write up an SRU template, and get the patch sent off to the kernel team today.

Thanks,
Matthew

summary: - Kernel update 5.15.0-112 might cause severe problems with specific AMD
- GPUs
+ AMD GPUs fail with null pointer dereference when IOMMU enabled, leading
+ to black screen
description: updated
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Hi everyone,

Patches are on the Ubuntu Kernel Team mailing list:

Cover letter:
https://lists.ubuntu.com/archives/kernel-team/2024-June/151409.html
Patch:
https://lists.ubuntu.com/archives/kernel-team/2024-June/151410.html

I will go and talk to the kernel team now. I can't promise anything, but I will try to get this prioritised.

I will write back with what they say.

Thanks,
Matthew

Revision history for this message
Acru (aoellerer) wrote :

Hey,
As of now, the update to 5.15.0-112 is still being offered in the update manager, meaning that with automatic updates your computer might fail from one boot to the next. Would it be possible to pull it from there?
Best and thanks for all the work
Anton

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Hi everyone,

An update:

Greg KH has picked up the patch and added it to upstream stable now:

https://lore.kernel.org/amd-gfx/2024061223-suitable-handler-b6f2@gregkh/
https://lore.kernel.org/amd-gfx/2024061239-rehydrate-flyable-343e@gregkh/

I suppose we can drop the UBUNTU: SAUCE tags.

I talked to Stefan Bader on the Kernel Team. His current feeling is that they might respin the -generic kernels before the release of the current cycle (2024.06.10 as per https://kernel.ubuntu.com/) but they are still unsure. They might see what else comes up this cycle before they decide.

I'll follow up with the Kernel Team in a couple days.

Thanks,
Matthew

Revision history for this message
Pete Orlando (peteola) wrote : Re: [Bug 2068738] Re: AMD GPUs fail with null pointer dereference when IOMMU enabled, leading to black screen

Hi Matthew, I have an Acer Aspire 5 running Linux Mint 21.2 Victoria (base: Ubuntu 22.04 jammy) with an AMD Ryzen 7 3700U with Radeon Vega. I get the black screen on boot after installing updates about a week ago. Will there be an update that comes through the normal update manager? I don't have programming skills like many users of Linux, so that would be much appreciated. Thank you!


Revision history for this message
Frazer Ross (frazer-r) wrote :

Today Update Manager updated my kernel from 5.15.0-107 to 5.15.0-112. Unfortunately 5.15.0-112 is not compatible with the AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx x 4 graphics processor in my HP model 14-dk0011na laptop, resulting in a black screen.
Via the GNU GRUB menu I booted into Linux 5.15.0-107-generic from the list and, once in the OS, used Update Manager to remove 5.15.0-112.
The laptop now boots correctly using kernel 5.15.0-107.
I am looking forward to a fixed version of 5.15.0-112.
How will novices like myself know when the 5.15.0-112 kernel is corrected without having to go through the above cycle every few days, please?

Changed in linux (Ubuntu Jammy):
status: In Progress → Fix Committed
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Hi everyone,

I spoke with Stefan Bader of the Kernel Team again. They are planning to include
the patch in a respin of the current 2024.06.10 SRU cycle. https://kernel.ubuntu.com/

I will let you know once the kernel has been built and placed into -proposed for
verification.

But at this stage, as long as the respin occurs, this should be released the
week of 8th July, give or take a few days, if anything else pops up.

Thanks,
Matthew

Revision history for this message
Roger Ramjet (rogerramjet123) wrote :

Thank you Matthew.


Revision history for this message
Acru (aoellerer) wrote :

Hey,
After I disabled the kernel update via the update manager, it has now been installed again, again resulting in a non-working system.
Isn't it possible to just completely remove this one version from the repo?

Revision history for this message
Pete Orlando (peteola) wrote :

How do we fix this?

Revision history for this message
Willem Dreyer (willemdreyer) wrote :

Hi all,

Same problem: 5.15.0-107.117 works fine, while 5.15.0-112.122 shows a black screen after GRUB.

Same hardware: AMD Ryzen 5 PRO 2500U w/ Radeon Vega Mobile Gfx

It's both funny and disappointing that a clinfo fix caused my first crash in 5 years on my daily driver.

In my opinion 5.15.0-112.122 should be blocked since it prevents multiple generations of Laptop and Desktop APUs from booting.

Special thanks to Barry K. for bisecting!

Thanks Armin W. for driving the fix and everyone else for your inputs.

Take care,
Willem

Revision history for this message
Roger Ramjet (rogerramjet123) wrote :

I just installed Linux Mint kernel 5.15.0-113 and it is not recognized at all. I'm told to load a kernel first and press any key, and then I have to go back to the list of Linux kernels and choose 5.15.0-107 to power up.

Thanks, Ralph

On Tuesday, June 18th, 2024 at 7:12 AM, Roger Ramjet <email address hidden> wrote:

> Thank you Matthew.
>
> On Tuesday, June 18th, 2024 at 2:41 AM, Matthew Ruffell <email address hidden> wrote:
>
> > Hi everyone,
> >
> > I spoke with Stefan Bader of the Kernel Team again. They are planning to include
> > the patch in a respin of the current 2024.06.10 SRU cycle. https://kernel.ubuntu.com/
> >
> > I will let you know once the kernel has been built and placed into -proposed for
> > verification.
> >
> > But at this stage, as long as the respin occurs, this should be released the
> > week of 8th July, give or take a few days, if anything else pops up.
> >
> > Thanks,
> > Matthew

Revision history for this message
Filip Średnicki (fsrednicki) wrote :

I have noticed one more consequence of this bug, which can be very painful for those running their own servers. Although it is possible to connect over SSH and operate the server normally, when I tried to reboot it the power-down procedure ran properly up to the closing of the journal, but then the server froze and did not restart. Anyone far away from their infrastructure can end up in trouble. The nomodeset workaround also solves this restart issue.

My server is on Ryzen 3 + Radeon Vega integrated GPU.

Revision history for this message
Fred Jupiter (fredjupiter) wrote :

Yes, I agree with https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068738/comments/32 .

I updated to -113 today on Kubuntu 22.04. There is no improvement over -112.

The only workarounds for me are "nomodeset" in GRUB or using the -107 kernel.

Updating to the -113 kernel leaves a black, backlit screen on restart.
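
For anyone unsure how to apply that workaround, here is a minimal sketch, assuming a stock Ubuntu setup where boot parameters live in the GRUB_CMDLINE_LINUX_DEFAULT line of /etc/default/grub and are activated with update-grub. The helper name add_kernel_param is made up for illustration, and it relies on GNU sed (`sed -i`). Keep a copy of the file before editing it.

```shell
#!/bin/sh
# add_kernel_param FILE PARAM  (hypothetical helper, not part of any package)
# Appends PARAM to the GRUB_CMDLINE_LINUX_DEFAULT="..." line in FILE,
# unless PARAM is already listed there.
add_kernel_param() {
    file=$1
    param=$2

    # Take the first matching line; give up if there is none.
    line=$(grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | head -n 1)
    [ -n "$line" ] || return 1

    # Strip everything up to the opening quote and the closing quote.
    current=$(printf '%s' "$line" | sed 's/^[^"]*"//; s/"$//')

    # Already present as a whole word: nothing to do.
    case " $current " in
        *" $param "*) return 0 ;;
    esac

    # Append the parameter inside the quotes.
    sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $param\"|" "$file"
}

# Example (not run here, needs root):
#   add_kernel_param /etc/default/grub nomodeset
#   sudo update-grub
```

The same helper could be used with amd_iommu=off instead of nomodeset. Either parameter should be removed again once a fixed kernel is installed.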

Revision history for this message
Tino (tinoh82) wrote :

Same on 5.15.0-113, no improvement. I still select 5.15.0-107 manually on the boot screen. I hope this one won't be removed, because 5.15.0-112 is now gone from GRUB.

I have an AMD Ryzen as well.

Revision history for this message
Thomas Løcke (thomasmonadic) wrote :

I can confirm the same problem with linux-image-5.15.0-113-generic on my AMD Ryzen 3 3200G with Radeon Vega Graphics boxes. I had to revert to linux-image-5.15.0-107-generic, as 113 fails to boot (ends in black screen) and while SSH login works, the machines cannot be restarted from CLI.

Revision history for this message
Filip Średnicki (fsrednicki) wrote :

I can confirm this as well: 5.15.0-113 does not solve the problem.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Hi everyone,

The patch was not present in 5.15.0-113-generic, as this kernel only contained
CVE fixes.

It should be in the next kernel update. Still waiting on the Kernel team to
respin.

I will write back as soon as we have a kernel tagged and in -proposed to test.

Thanks,
Matthew

Revision history for this message
Richard Durso (reefland) wrote :

I tried adding "nomodeset" or "amd_iommu=off" to GRUB_CMDLINE_LINUX_DEFAULT, which allowed the screen to work when booting the -113 kernel. However, the PCIe-based I225-V network card was not detected, and I was left with only a loopback device.

PC: HP T740 Thin PC / AMD Ryzen V1756B

00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 IOMMU
        Subsystem: Hewlett-Packard Company Raven/Raven2 IOMMU
        Flags: bus master, fast devsel, latency 0, IRQ 25
        Capabilities: [40] Secure device <?>
        Capabilities: [64] MSI: Enable+ Count=1/4 Maskable- 64bit+
        Capabilities: [74] HyperTransport: MSI Mapping Enable+ Fixed+

01:00.0 Ethernet controller: Intel Corporation Ethernet Controller I225-V (rev 03)
        Subsystem: Intel Corporation Ethernet Controller I225-V
        Flags: bus master, fast devsel, latency 0, IRQ 30, IOMMU group 7
        Memory at fe700000 (32-bit, non-prefetchable) [size=1M]
        Memory at fe800000 (32-bit, non-prefetchable) [size=16K]
        Expansion ROM at fe600000 [disabled] [size=1M]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 88-c9-b3-ff-ff-bf-72-fc
        Capabilities: [1c0] Latency Tolerance Reporting
        Capabilities: [1f0] Precision Time Measurement
        Capabilities: [1e0] L1 PM Substates
        Kernel driver in use: igc
        Kernel modules: igc

$ journalctl -b | grep amdgpu
Jun 28 18:00:20 k3s03 kernel: [drm] amdgpu kernel modesetting enabled.
Jun 28 18:00:20 k3s03 kernel: amdgpu: Topology: Add APU node [0x0:0x0]
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: enabling device (0106 -> 0107)
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature enabled
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from VFCT
Jun 28 18:00:20 k3s03 kernel: amdgpu: ATOM BIOS: 113-RAVEN-111
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: vgaarb: deactivate vga console
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: VRAM: 256M 0x000000F400000000 - 0x000000F40FFFFFFF (256M used)
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: GART: 1024M 0x0000000000000000 - 0x000000003FFFFFFF
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
Jun 28 18:00:20 k3s03 kernel: [drm] amdgpu: 256M of VRAM memory ready
Jun 28 18:00:20 k3s03 kernel: [drm] amdgpu: 3072M of GTT memory ready.
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
Jun 28 18:00:20 k3s03 kernel: amdgpu: hwmgr_sw_init smu backed is smu10_smu
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: Will use PSP to load VCN firmware
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
Jun 28 18:00:20 k3s03 kernel: amdgpu 000...
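
A small sketch for checking a kernel log for the failure line quoted in the bug description ("Failed to resume IOMMU for device"). The function name is made up for illustration; it simply reads log text on standard input, so it can be fed from journalctl over SSH when the screen is black.

```shell
#!/bin/sh
# has_iommu_resume_failure  (hypothetical helper)
# Reads kernel log text on stdin and succeeds (exit status 0) if it
# contains the failure message from the bug description.
has_iommu_resume_failure() {
    grep -q 'amdgpu: Failed to resume IOMMU for device'
}

# Example (not run here): check the kernel messages of the current boot.
#   if journalctl -k -b | has_iommu_resume_failure; then
#       echo "this boot hit the IOMMU resume failure"
#   fi
```

Using `journalctl -k -b -1` instead would inspect the previous boot, which helps after falling back to a working kernel.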

Revision history for this message
Daniel (treki) wrote :

Did I understand correctly that this bug will be fixed after kernel 5.15.0-113?

Revision history for this message
Pete Orlando (peteola) wrote :

Still no fix as of July 1, 2024. Have to boot manually with the 107 to get
my Acer laptop up and running. Hope someone can fix this problem soon.
Thanks for all the hard work on this issue.

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.15.0-116.126 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux' to 'verification-done-jammy-linux'. If the problem still exists, change the tag 'verification-needed-jammy-linux' to 'verification-failed-jammy-linux'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-v2 verification-needed-jammy-linux
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Hi everyone,

The Kernel Team have respun the latest 5.15 kernel with the fix, and have placed
it into -proposed for verification.

Could someone more technically minded help test it and let me know if it fixes
the problem?

Instructions to Install (On a jammy system):
1) cat << EOF | sudo tee /etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
# Enable Ubuntu proposed archive
deb http://archive.ubuntu.com/ubuntu/ $(lsb_release -cs)-proposed main universe
EOF
2) sudo apt update
3) sudo apt install linux-image-5.15.0-116-generic linux-modules-5.15.0-116-generic linux-modules-extra-5.15.0-116-generic linux-headers-5.15.0-116-generic
4) sudo rm /etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
5) sudo apt update
6) sudo reboot
7) uname -rv
5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:14:24 UTC 2024

Can you let me know if you can boot to your desktop and have a working screen?

We are still on track for a release to -updates next week sometime.

Thanks,
Matthew
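
As a rough aid for step 7, the sketch below classifies a `uname -r` string using only what has been reported in this thread: -107 boots, -112 and -113 show the black screen, and -116 carries the fix. The function name is hypothetical, it only looks at the jammy 5.15.0 generic flavour, and treating every ABI number from 116 upwards as fixed is an assumption rather than something confirmed here.

```shell
#!/bin/sh
# kernel_status RELEASE  (hypothetical helper)
# Prints one of: fixed, affected, unaffected, unknown.
# Based solely on reports in this bug thread for 5.15.0-*-generic kernels.
kernel_status() {
    release=$1

    # Only the jammy 5.15.0 generic kernels are covered.
    case $release in
        5.15.0-*-generic) ;;
        *) echo unknown; return 0 ;;
    esac

    # Extract the ABI number, e.g. "116" from "5.15.0-116-generic".
    abi=${release#5.15.0-}
    abi=${abi%-generic}
    case $abi in
        ''|*[!0-9]*) echo unknown; return 0 ;;
    esac

    if [ "$abi" -ge 116 ]; then
        echo fixed          # assumption: later kernels keep the patch
    elif [ "$abi" -eq 112 ] || [ "$abi" -eq 113 ]; then
        echo affected       # reported broken in this thread
    elif [ "$abi" -eq 107 ]; then
        echo unaffected     # reported working in this thread
    else
        echo unknown
    fi
}

# Example: kernel_status "$(uname -r)"
```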

Revision history for this message
Mathias Weyland (launchpad-weyland) wrote :

Hello. I ran these instructions and was able to boot the -116 kernel on a laptop that was subject to this bug, i.e. where the -113 kernel booted into a black screen (HP EliteBook 735 G5).

Thank you for your efforts and regards!
Matt

Revision history for this message
H A Killenbeck (akillenb) wrote :

I followed your instructions, and the kernel update seems to work fine on
my system.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Thanks for testing Matt, H.A.

I marked the bug as verified. We should be all good for a release to -updates
early next week. I'll write a new message as soon as the kernel has been
released for everyone.

Thanks,
Matthew

tags: added: verification-done-jammy-linux
removed: verification-needed-jammy-linux
Revision history for this message
Pete Orlando (peteola) wrote :

The problem still exists after following your instructions. I am still only able
to boot by selecting the 107 kernel.

Revision history for this message
Arno Mayrhofer (azrael3000) wrote :

Works for me as well Matthew. Thanks for the fix.

Revision history for this message
berni123 (bernd-bernd--zimmermann) wrote :

I can confirm, proposed image -116 works without disabling amd_iommu:

berni@enterprise:~$ uname -rv
5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:14:24 UTC 2024

Revision history for this message
rfc (rfc0) wrote :

Thank you Matthew & the kernel team,

Proposed image 5.15.0-116-generic resolves the issue on my machine without the need for any workarounds (such as disabling amd_iommu).

$ uname -r -v
5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:14:24 UTC 2024
