Bug #2068738 “AMD GPUs fail with null pointer dereference when I...” : Bugs : linux package : Ubuntu

Launchpad Janitor (janitor) wrote on 2024-06-08:

#1

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu):
status:	New → Confirmed

rfc (rfc0) wrote on 2024-06-08:

#2

I have a similar symptom: after upgrading to linux-image-5.15.0-112-generic, the screen goes black on boot and does not display the usual prompts to unlock the disk or log in to the Ubuntu user account.

The issue is reproducible: repeated attempts to boot using linux-image-5.15.0-112-generic give the same result (black screen).

Restarting and using GRUB to boot linux-image-5.15.0-107-generic works around the issue.

processor: AMD® Ryzen 3 2200g with radeon vega graphics × 4
graphics: AMD® Radeon vega 8 graphics
os name: Ubuntu 22.04.4 LTS
gnome version: 42.9
windowing system: wayland

Additional reports:

https://www.reddit.com/r/Ubuntu/comments/1d9uviz/had_to_purge_kernel_5150112_could_not_boot/

One of the people affected in the above Reddit thread reported that, although their display was black, the machine booted and they were able to interact with it normally over SSH.

Fred Jupiter (fredjupiter) wrote on 2024-06-08:

#3

If you want to keep using the new kernel, try this workaround (do the "Temporary" step first to regain access to the computer). It worked for me.

Workaround:
Temporary:
- In the GRUB menu, edit the boot entry and add "nomodeset" before "quiet splash".

See: https://askubuntu.com/questions/162075/my-computer-boots-to-a-black-screen-what-options-do-i-have-to-fix-it

Permanent:
- Edit /etc/default/grub: in the line "GRUB_CMDLINE_LINUX_DEFAULT=", add the word "nomodeset" before "quiet splash".
- Run "sudo update-grub"
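
The permanent edit can also be scripted. What follows is only a sketch: the helper name `add_kernel_param` is made up for this example, it assumes the stock single-line form GRUB_CMDLINE_LINUX_DEFAULT="..." and a GNU-style `sed -i`, and it edits whatever file you pass it, so try it on a copy first.

```shell
#!/bin/sh
# add_kernel_param FILE PARAM
# Hypothetical helper (not part of GRUB or Ubuntu): prepends PARAM to the
# value of GRUB_CMDLINE_LINUX_DEFAULT in FILE unless it is already there.
# Assumes the stock form GRUB_CMDLINE_LINUX_DEFAULT="..." on a single line
# and a PARAM without regex metacharacters such as "." or "|".
add_kernel_param() {
    file=$1
    param=$2
    # Already present as a whole word? Then leave the file alone.
    if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=\"(.* )?${param}( .*)?\"" "$file"; then
        return 0
    fi
    # Insert PARAM right after the opening quote, keeping a .bak backup.
    sed -i.bak "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"|&${param} |" "$file"
}
```

Run as root, this would be called as `add_kernel_param /etc/default/grub nomodeset`, followed by `update-grub` as described above.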

berni123 (bernd-bernd--zimmermann) wrote on 2024-06-08:

#4

@Fred: setting nomodeset is only a limited workaround: it disables full graphics support, so there is no multi-display usage, no high resolution, and so on.

berni123 (bernd-bernd--zimmermann) wrote on 2024-06-08:

#5

Quick-Fix: GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=off"
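
To confirm after a reboot that a parameter such as amd_iommu=off actually reached the running kernel, it can be compared against /proc/cmdline. A small sketch; the function name `cmdline_has_param` is an assumption made for this example.

```shell
#!/bin/sh
# cmdline_has_param PARAM [CMDLINE]
# Illustrative helper: succeeds if PARAM appears as a whole word in CMDLINE.
# CMDLINE defaults to the running kernel's command line from /proc/cmdline.
cmdline_has_param() {
    param=$1
    cmdline=${2-$(cat /proc/cmdline)}
    # Pad with spaces so that only whole, space-separated words match.
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: `cmdline_has_param amd_iommu=off && echo "parameter is active"`.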

Lin Manfu (linmanfu) wrote on 2024-06-08:

#6

I am also affected by this bug. Booting to kernel 5.15.0-112 produces a black screen. Kernel 5.15.0-107 works correctly. Here is information about my system:

Operating System: Kubuntu 22.04
KDE Plasma Version: 5.24.7
KDE Frameworks Version: 5.92.0
Qt Version: 5.15.3
Kernel Version: 5.15.0-107-generic (64-bit)
Graphics Platform: X11
Processors: 8 × AMD Ryzen 5 3400G with Radeon Vega Graphics
Memory: 13.6 GiB of RAM
Graphics Processor: AMD Radeon Vega 11 Graphics

rfc (rfc0) wrote on 2024-06-08:

#7

Additional reports of this issue: https://www.reddit.com/r/linuxmint/comments/1d9w6c9/kernel_5150112_boot_failure/

Lin Manfu (linmanfu) wrote on 2024-06-08 (last edit on 2024-06-08):

#8

syslog-for-broken-kernel-112 (2.0 MiB, text/html)

Attached is an annotated copy of my system log with the relevant update from 107 to 112, the failed boots with 112, a successful boot with the last good kernel 107, and a boot with 112 using the workaround. Search for BOOT in capital letters to find the start of each boot. I hope this might be helpful.

I am no expert, but perhaps these lines might be relevant, as they only seem to occur on the failed boots with 112:

Jun 8 19:17:56 paul kernel: [ 3.190269] amdgpu: Topology: Add APU node [0x15d8:0x1002]
Jun 8 19:17:56 paul kernel: [ 3.190273] kfd kfd: amdgpu: added device 1002:15d8
Jun 8 19:17:56 paul kernel: [ 3.190280] kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15d8
Jun 8 19:17:56 paul kernel: [ 3.190287] amdgpu 0000:27:00.0: amdgpu: amdgpu_device_ip_init failed
Jun 8 19:17:56 paul kernel: [ 3.190292] amdgpu 0000:27:00.0: amdgpu: Fatal error during GPU init
Jun 8 19:17:56 paul kernel: [ 3.190296] amdgpu 0000:27:00.0: amdgpu: amdgpu: finishing device.
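
Others who want to check their own logs for the same signature could filter for these messages. A minimal sketch, assuming the log is a plain text file (for example /var/log/syslog on a default Ubuntu install); the function name is invented for this example.

```shell
#!/bin/sh
# find_gpu_init_failures LOGFILE
# Illustrative helper: prints the amdgpu/kfd lines that, in the logs quoted
# in this thread, only seem to occur on failed boots with 5.15.0-112.
find_gpu_init_failures() {
    grep -E 'Failed to resume IOMMU for device|amdgpu_device_ip_init failed|Fatal error during GPU init' "$1"
}
```

If `find_gpu_init_failures /var/log/syslog` prints nothing, this particular failure was not logged in that file.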

berni123 (bernd-bernd--zimmermann) wrote on 2024-06-09:

#9

Some more info on booting -112 related to amdgpu, which is crashing:

Jun  7 09:29:22 enterprise kernel: [    2.809882] AMD-Vi: AMD IOMMUv2 loaded and initialized
Jun  7 09:29:22 enterprise kernel: [    2.979171] amdgpu: Topology: Add APU node [0x0:0x0]
Jun  7 09:29:22 enterprise kernel: [    2.979234] checking generic (d0000000 300000) vs hw (d0000000 10000000)
Jun  7 09:29:22 enterprise kernel: [    2.979341] amdgpu 0000:06:00.0: amdgpu: Trusted Memory Zone (TMZ) feature enabled
Jun  7 09:29:22 enterprise kernel: [    3.002476] amdgpu 0000:06:00.0: amdgpu: Fetched VBIOS from ROM BAR
Jun  7 09:29:22 enterprise kernel: [    3.002478] amdgpu: ATOM BIOS: 113-PICASSO-117
Jun  7 09:29:22 enterprise kernel: [    3.002608] amdgpu 0000:06:00.0: vgaarb: deactivate vga console
Jun  7 09:29:22 enterprise kernel: [    3.002653] amdgpu 0000:06:00.0: amdgpu: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
Jun  7 09:29:22 enterprise kernel: [    3.002656] amdgpu 0000:06:00.0: amdgpu: GART: 1024M 0x0000000000000000 - 0x000000003FFFFFFF
Jun  7 09:29:22 enterprise kernel: [    3.002659] amdgpu 0000:06:00.0: amdgpu: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
Jun  7 09:29:22 enterprise kernel: [    3.003052] amdgpu 0000:06:00.0: amdgpu: PSP runtime database doesn't exist
Jun  7 09:29:22 enterprise kernel: [    3.004269] amdgpu: hwmgr_sw_init smu backed is smu10_smu
Jun  7 09:29:22 enterprise kernel: [    3.004404] amdgpu 0000:06:00.0: amdgpu: Will use PSP to load VCN firmware
Jun  7 09:29:22 enterprise kernel: [    3.088846] amdgpu 0000:06:00.0: amdgpu: RAS: optional ras ta ucode is not available
Jun  7 09:29:22 enterprise kernel: [    3.093879] amdgpu 0000:06:00.0: amdgpu: RAP: optional rap ta ucode is not available
Jun  7 09:29:22 enterprise kernel: [    3.186429] amdgpu: HMM registered 2048MB device memory
Jun  7 09:29:22 enterprise kernel: [    3.186503] amdgpu: Topology: Add APU node [0x15d8:0x1002]
Jun  7 09:29:22 enterprise kernel: [    3.186508] kfd kfd: amdgpu: added device 1002:15d8
Jun  7 09:29:22 enterprise kernel: [    3.186516] kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15d8
Jun  7 09:29:22 enterprise kernel: [    3.186530] amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_init failed
Jun  7 09:29:22 enterprise kernel: [    3.186535] amdgpu 0000:06:00.0: amdgpu: Fatal error during GPU init
Jun  7 09:29:22 enterprise kernel: [    3.186545] amdgpu 0000:06:00.0: amdgpu: amdgpu: finishing device.
Jun  7 09:29:22 enterprise kernel: [    3.194967] BUG: kernel NULL pointer dereference, address: 000000000000013c
Jun  7 09:29:22 enterprise kernel: [    3.194971] #PF: supervisor read access in kernel mode
Jun  7 09:29:22 enterprise kernel: [    3.194975] #PF: error_code(0x0000) - not-present page
Jun  7 09:29:22 enterprise kernel: [    3.194978] PGD 0 P4D 0 
Jun  7 09:29:22 enterprise kernel: [    3.194983] Oops: 0000 [#1] SMP NOPTI
Jun  7 09:29:22 enterprise kernel: [    3.194987] CPU: 1 PID: 223 Comm: systemd-udevd Not tainted 5.15.0-112-generic #122-Ubuntu
Jun  7 09:29:22 enterprise kernel: [    3.194993] Hardware name: Gigabyte Technology Co., Ltd. B450 AORUS ELITE/B450 AORUS ELITE, BIOS F60c 10/29/2020
Jun  7 09:29:22 enterprise kernel: [    3.194997] RIP: 0010:amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
Jun  7 09:29:22 enterprise kernel: [    3.195211] Code: 01 00 48 85 ff 74 10 e8 05 0e 27 00 48 c7 83 e0 4e 01 00 00 00 00 00 4c 8b 83 48 4f 01 00 4d 85 c0 74 66 48 8b 93 40 3e 01 00 <8b> 82 3c 01 00 00 85 c0 74 42 45 31 e4 49 63 c4 4c 8d 2c 40 4b 8b
Jun  7 09:29:22 enterprise kernel: [    3.195218] RSP: 0018:ffffc2d6013ab8d0 EFLAGS: 00010286
Jun  7 09:29:22 enterprise kernel: [    3.195223] RAX: 0000000000000000 RBX: ffffa0835cea0000 RCX: 00000000810000f8
Jun  7 09:29:22 enterprise kernel: [    3.195227] RDX: 0000000000000000 RSI: 00000000810000f8 RDI: ffffa08340042300
Jun  7 09:29:22 enterprise kernel: [    3.195231] RBP: ffffc2d6013ab8e8 R08: ffffa0835e6d9ac0 R09: 0000000000000000
Jun  7 09:29:22 enterprise kernel: [    3.195234] R10: 0000000000000001 R11: dead000000000100 R12: ffffa0835cea0000
Jun  7 09:29:22 enterprise kernel: [    3.195238] R13: ffffa0835cea0070 R14: 0000000000000007 R15: 0000000000000001
Jun  7 09:29:22 enterprise kernel: [    3.195242] FS:  00007fa26be9a8c0(0000) GS:ffffa08a60a40000(0000) knlGS:0000000000000000
Jun  7 09:29:22 enterprise kernel: [    3.195246] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun  7 09:29:22 enterprise kernel: [    3.195251] CR2: 000000000000013c CR3: 000000011b89c000 CR4: 00000000003506e0
Jun  7 09:29:22 enterprise kernel: [    3.195255] Call Trace:
Jun  7 09:29:22 enterprise kernel: [    3.195258]  <TASK>
Jun  7 09:29:22 enterprise kernel: [    3.195261]  ? srso_return_thunk+0x5/0x10
Jun  7 09:29:22 enterprise kernel: [    3.195267]  ? show_trace_log_lvl+0x28e/0x2ea
Jun  7 09:29:22 enterprise kernel: [    3.195273]  ? show_trace_log_lvl+0x28e/0x2ea
Jun  7 09:29:22 enterprise kernel: [    3.195279]  ? dm_hw_fini+0x23/0x30 [amdgpu]
Jun  7 09:29:22 enterprise kernel: [    3.195454]  ? show_regs.part.0+0x23/0x29
Jun  7 09:29:22 enterprise kernel: [    3.195458]  ? __die_body.cold+0x8/0xd
Jun  7 09:29:22 enterprise kernel: [    3.195462]  ? __die+0x2b/0x37
Jun  7 09:29:22 enterprise kernel: [    3.195466]  ? page_fault_oops+0x13b/0x170
Jun  7 09:29:22 enterprise kernel: [    3.195472]  ? srso_return_thunk+0x5/0x10
Jun  7 09:29:22 enterprise kernel: [    3.195476]  ? do_user_addr_fault+0x321/0x670
Jun  7 09:29:22 enterprise kernel: [    3.195481]  ? srso_return_thunk+0x5/0x10
Jun  7 09:29:22 enterprise kernel: [    3.195485]  ? __free_pages_ok+0x34a/0x4f0
Jun  7 09:29:22 enterprise kernel: [    3.195491]  ? exc_page_fault+0x77/0x170
Jun  7 09:29:22 enterprise kernel: [    3.195496]  ? asm_exc_page_fault+0x27/0x30
Jun  7 09:29:22 enterprise kernel: [    3.195503]  ? amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
Jun  7 09:29:22 enterprise kernel: [    3.195675]  dm_hw_fini+0x23/0x30 [amdgpu]
Jun  7 09:29:22 enterprise kernel: [    3.195846]  amdgpu_device_ip_fini_early.isra.0+0x278/0x312 [amdgpu]
Jun  7 09:29:22 enterprise kernel: [    3.196022]  amdgpu_device_fini_hw+0x156/0x208 [amdgpu]
Jun  7 09:29:22 enterprise kernel: [    3.196193]  amdgpu_driver_unload_kms+0x69/0x90 [amdgpu]
Jun  7 09:29:22 enterprise kernel: [    3.196325]  amdgpu_driver_load_kms.cold+0x81/0x107 [amdgpu]
Jun  7 09:29:22 enterprise kernel: [    3.196496]  amdgpu_pci_probe+0x1d1/0x290 [amdgpu]
Jun  7 09:29:22 enterprise kernel: [    3.196627]  local_pci_probe+0x4b/0x90
Jun  7 09:29:22 enterprise kernel: [    3.196632]  ? srso_return_thunk+0x5/0x10
Jun  7 09:29:22 enterprise kernel: [    3.196637]  pci_device_probe+0x119/0x200
Jun  7 09:29:22 enterprise kernel: [    3.196642]  really_probe+0x222/0x420
Jun  7 09:29:22 enterprise kernel: [    3.196647]  __driver_probe_device+0xe8/0x140
Jun  7 09:29:22 enterprise kernel: [    3.196652]  driver_probe_device+0x23/0xc0
Jun  7 09:29:22 enterprise kernel: [    3.196657]  __driver_attach+0xf7/0x1f0
Jun  7 09:29:22 enterprise kernel: [    3.196661]  ? __device_attach_driver+0x140/0x140
Jun  7 09:29:22 enterprise kernel: [    3.196665]  bus_for_each_dev+0x7f/0xd0
Jun  7 09:29:22 enterprise kernel: [    3.196671]  driver_attach+0x1e/0x30
Jun  7 09:29:22 enterprise kernel: [    3.196675]  bus_add_driver+0x148/0x220
Jun  7 09:29:22 enterprise kernel: [    3.196679]  ? srso_return_thunk+0x5/0x10
Jun  7 09:29:22 enterprise kernel: [    3.196683]  driver_register+0x95/0x100
Jun  7 09:29:22 enterprise kernel: [    3.196688]  __pci_register_driver+0x68/0x70
Jun  7 09:29:22 enterprise kernel: [    3.196692]  amdgpu_init+0x7c/0x1000 [amdgpu]
Jun  7 09:29:22 enterprise kernel: [    3.196810]  ? 0xffffffffc0e0b000
Jun  7 09:29:22 enterprise kernel: [    3.196814]  do_one_initcall+0x49/0x1e0
Jun  7 09:29:22 enterprise kernel: [    3.196820]  ? srso_return_thunk+0x5/0x10
Jun  7 09:29:22 enterprise kernel: [    3.196824]  ? kmem_cache_alloc_trace+0x19e/0x2e0
Jun  7 09:29:22 enterprise kernel: [    3.196830]  do_init_module+0x52/0x260
Jun  7 09:29:22 enterprise kernel: [    3.196835]  load_module+0xb45/0xbe0
Jun  7 09:29:22 enterprise kernel: [    3.196841]  __do_sys_finit_module+0xbf/0x120
Jun  7 09:29:22 enterprise kernel: [    3.196848]  __x64_sys_finit_module+0x18/0x20
Jun  7 09:29:22 enterprise kernel: [    3.196852]  x64_sys_call+0x1ac3/0x1fa0
Jun  7 09:29:22 enterprise kernel: [    3.196857]  do_syscall_64+0x56/0xb0
Jun  7 09:29:22 enterprise kernel: [    3.196861]  ? srso_return_thunk+0x5/0x10
Jun  7 09:29:22 enterprise kernel: [    3.196865]  ? fput+0x13/0x20
Jun  7 09:29:22 enterprise kernel: [    3.196869]  ? srso_return_thunk+0x5/0x10
Jun  7 09:29:22 enterprise kernel: [    3.196873]  ? ksys_mmap_pgoff+0x154/0x2c0
Jun  7 09:29:22 enterprise kernel: [    3.196878]  ? exit_to_user_mode_prepare+0x37/0xb0
Jun  7 09:29:22 enterprise kernel: [    3.196883]  ? syscall_exit_to_user_mode+0x35/0x50
Jun  7 09:29:22 enterprise kernel: [    3.196888]  ? srso_return_thunk+0x5/0x10
Jun  7 09:29:22 enterprise kernel: [    3.196892]  ? exit_to_user_mode_prepare+0x37/0xb0
Jun  7 09:29:22 enterprise kernel: [    3.196896]  ? srso_return_thunk+0x5/0x10
Jun  7 09:29:22 enterprise kernel: [    3.196900]  ? syscall_exit_to_user_mode+0x35/0x50
Jun  7 09:29:22 enterprise kernel: [    3.196904]  ? x64_sys_call+0x3b9/0x1fa0
Jun  7 09:29:22 enterprise kernel: [    3.196908]  ? srso_return_thunk+0x5/0x10
Jun  7 09:29:22 enterprise kernel: [    3.196912]  ? do_syscall_64+0x63/0xb0
Jun  7 09:29:22 enterprise kernel: [    3.196916]  ? srso_return_thunk+0x5/0x10
Jun  7 09:29:22 enterprise kernel: [    3.196920]  ? do_syscall_64+0x63/0xb0
Jun  7 09:29:22 enterprise kernel: [    3.196924]  ? srso_return_thunk+0x5/0x10
Jun  7 09:29:22 enterprise kernel: [    3.196927]  ? do_syscall_64+0x63/0xb0
Jun  7 09:29:22 enterprise kernel: [    3.196931]  ? do_syscall_64+0x63/0xb0
Jun  7 09:29:22 enterprise kernel: [    3.196935]  ? srso_return_thunk+0x5/0x10
Jun  7 09:29:22 enterprise kernel: [    3.196939]  ? do_syscall_64+0x63/0xb0
Jun  7 09:29:22 enterprise kernel: [    3.196944]  entry_SYSCALL_64_after_hwframe+0x67/0xd1
Jun  7 09:29:22 enterprise kernel: [    3.196948] RIP: 0033:0x7fa26c59488d
Jun  7 09:29:22 enterprise kernel: [    3.196952] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 b5 0f 00 f7 d8 64 89 01 48
Jun  7 09:29:22 enterprise kernel: [    3.196960] RSP: 002b:00007ffee7f9e528 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
Jun  7 09:29:22 enterprise kernel: [    3.196965] RAX: ffffffffffffffda RBX: 000055d753a10cc0 RCX: 00007fa26c59488d
Jun  7 09:29:22 enterprise kernel: [    3.196969] RDX: 0000000000000000 RSI: 00007fa26c72c441 RDI: 0000000000000018
Jun  7 09:29:22 enterprise kernel: [    3.196973] RBP: 0000000000020000 R08: 0000000000000000 R09: 0000000000000002
Jun  7 09:29:22 enterprise kernel: [    3.196977] R10: 0000000000000018 R11: 0000000000000246 R12: 00007fa26c72c441
Jun  7 09:29:22 enterprise kernel: [    3.196981] R13: 000055d753a15290 R14: 000055d753a024e0 R15: 000055d753a18440
Jun  7 09:29:22 enterprise kernel: [    3.196987]  </TASK>
Jun  7 09:29:22 enterprise kernel: [    3.196990] Modules linked in: raid1 amdgpu(+) hid_generic iommu_v2 gpu_sched usbhid i2c_algo_bit uas hid drm_ttm_helper usb_storage ttm drm_kms_helper crct10dif_pclmul crc32_pclmul syscopyarea ghash_clmulni_intel sysfillrect sha256_ssse3 sysimgblt fb_sys_fops sha1_ssse3 cec aesni_intel rc_core crypto_simd cryptd drm i2c_piix4 r8169 ahci xhci_pci gpio_amdpt realtek libahci xhci_pci_renesas wmi video gpio_generic
Jun  7 09:29:22 enterprise kernel: [    3.197039] CR2: 000000000000013c
Jun  7 09:29:22 enterprise kernel: [    3.197042] ---[ end trace 2fddbe30abca59f9 ]---
Jun  7 09:29:22 enterprise kernel: [    3.197045] RIP: 0010:amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
Jun  7 09:29:22 enterprise kernel: [    3.197220] Code: 01 00 48 85 ff 74 10 e8 05 0e 27 00 48 c7 83 e0 4e 01 00 00 00 00 00 4c 8b 83 48 4f 01 00 4d 85 c0 74 66 48 8b 93 40 3e 01 00 <8b> 82 3c 01 00 00 85 c0 74 42 45 31 e4 49 63 c4 4c 8d 2c 40 4b 8b
Jun  7 09:29:22 enterprise kernel: [    3.197227] RSP: 0018:ffffc2d6013ab8d0 EFLAGS: 00010286
Jun  7 09:29:22 enterprise kernel: [    3.197231] RAX: 0000000000000000 RBX: ffffa0835cea0000 RCX: 00000000810000f8
Jun  7 09:29:22 enterprise kernel: [    3.197235] RDX: 0000000000000000 RSI: 00000000810000f8 RDI: ffffa08340042300
Jun  7 09:29:22 enterprise kernel: [    3.197239] RBP: ffffc2d6013ab8e8 R08: ffffa0835e6d9ac0 R09: 0000000000000000
Jun  7 09:29:22 enterprise kernel: [    3.197243] R10: 0000000000000001 R11: dead000000000100 R12: ffffa0835cea0000
Jun  7 09:29:22 enterprise kernel: [    3.197247] R13: ffffa0835cea0070 R14: 0000000000000007 R15: 0000000000000001
Jun  7 09:29:22 enterprise kernel: [    3.197251] FS:  00007fa26be9a8c0(0000) GS:ffffa08a60a40000(0000) knlGS:0000000000000000
Jun  7 09:29:22 enterprise kernel: [    3.197256] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun  7 09:29:22 enterprise kernel: [    3.197259] CR2: 000000000000013c CR3: 000000011b89c000 CR4: 00000000003506e0
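
In a trace this long, the key line is the kernel-mode "RIP:" entry, which names the function that faulted (here amdgpu_dm_fini). The following sketch pulls those symbols out of a saved log; the function name is hypothetical, and it relies on 0010 being the kernel code segment on x86-64, as in the trace above.

```shell
#!/bin/sh
# oops_fault_symbols LOGFILE
# Illustrative helper: lists the kernel-mode "RIP:" symbols (code segment
# 0010) found in LOGFILE, de-duplicated, e.g. amdgpu_dm_fini+0x149/0x1f0.
# User-mode RIP lines (segment 0033) are ignored.
oops_fault_symbols() {
    sed -n 's/.*RIP: 0010:\([^ ]*\).*/\1/p' "$1" | sort -u
}
```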

Carrington Institute (symbolize) wrote on 2024-06-09:

#10

log.txt (351.3 KiB, text/plain)

I had the same issue on similar hardware. I had to revert to the previous kernel version, 5.15.0-107. There were some errors in the log that pointed at the AMD GPU.

If you need to boot from the previous kernel, tap Shift while the device boots, go into "Advanced options" and select

kernel .107 generic

I then changed my GRUB configuration:

sudo nano /etc/default/grub

changed

GRUB_DEFAULT=0

to

GRUB_DEFAULT="1>2" (this tells GRUB to open the second top-level entry, "Advanced options", and boot the third entry inside it, since indices start at 0; for me that was the second kernel image, .107), then saved the file.

sudo update-grub

sudo reboot

The device then booted successfully into the working .107 kernel.

Then I removed the .112 kernel:

sudo apt remove linux-image-5.15.0-112-generic

sudo apt remove linux-headers-5.15.0-112

sudo update-grub

I also put a hold on .112:

sudo apt-mark hold linux-image-5.15.0-112-generic
apt-mark showhold

I then changed GRUB back to

GRUB_DEFAULT=0

I have attached a full log for the developers' perusal.
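
The image is not the only per-version package; other comments in this thread also mention the matching headers, modules and modules-extra packages. As a sketch, the names for one kernel version can be generated like this. The helper name `kernel_packages` is invented for this example, and it only prints names without checking that the packages exist.

```shell
#!/bin/sh
# kernel_packages ABI [FLAVOUR]
# Illustrative helper: prints the per-version package names mentioned in
# this thread for one kernel ABI, e.g. "5.15.0-112". FLAVOUR defaults to
# "generic". It only prints names; it does not check they are installed.
kernel_packages() {
    abi=$1
    flavour=${2:-generic}
    for prefix in linux-image linux-headers linux-modules linux-modules-extra; do
        printf '%s-%s-%s\n' "$prefix" "$abi" "$flavour"
    done
}
```

The output could then be passed on, for example `sudo apt-mark hold $(kernel_packages 5.15.0-112)`.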

Fenestrae, et vastata annis. (exfenestrae) wrote on 2024-06-09:

#11

log.txt (17.1 KiB, text/plain)

Ditto @linmanfu:

Operating System: Kubuntu 22.04
KDE Plasma Version: 5.24.7
KDE Frameworks Version: 5.92.0
Qt Version: 5.15.3
Kernel Version: 5.15.0-107-generic (64-bit)
Graphics Platform: X11
Processors: 8 × AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx
Memory: 13.6 GiB of RAM
Graphics Processor: AMD Radeon Vega 8 Graphics

Mario Limonciello (superm1) wrote on 2024-06-10:

#12

AFAIK it's a bad backport to 5.15 stable. If it's what I think, here's the fix (IIRC).

https://<email address hidden>/

Try applying that to your 5.15 kernel.

Matthew Ruffell (mruffell) on 2024-06-10

Changed in linux (Ubuntu Jammy):
importance:	Undecided → High

Matthew Ruffell (mruffell) wrote on 2024-06-10:

#13

Hi everyone,

Mario, that does indeed look like the relevant fix on the stable mailing list.
It reverts this commit:

commit 3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd ubuntu-jammy
Author: Yifan Zhang <yifan1.zhang@amd.com>
Date:   Tue Sep 28 15:42:35 2021 +0800
Subject: drm/amdgpu: init iommu after amdkfd device init
Link: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/commit/?id=3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd

It arrived in 5.15.150 stable, applied in 5.15.0-112-generic.

Interestingly enough, this was once applied during the 5.15 development cycle
in 2021, in 5.15-rc5:

commit 714d9e4574d54596973ee3b0624ee4a16264d700
Author: Yifan Zhang <yifan1.zhang@amd.com>
Date:   Tue Sep 28 15:42:35 2021 +0800
Subject: drm/amdgpu: init iommu after amdkfd device init
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=714d9e4574d54596973ee3b0624ee4a16264d700

Bizarre. It seems to have been removed back then as well.

Anyway, it seems it is on track to be once again reverted in the stable mailing list thread that Mario linked.

https://lore.kernel.org/amd-gfx/20240523173031.4212-1-W_Armin@gmx.de/

To make sure this is the commit that is causing your issue, I have built test kernels with the above revert commit applied.

There are builds for 22.04 and 20.04 HWE.

I only just uploaded them, so give them 3 hours from this message to build before trying to install. You can also check build status here:
https://launchpad.net/~mruffell/+archive/ubuntu/lp2068738-test

Please note this package is NOT SUPPORTED by Canonical, and is for TESTING
PURPOSES ONLY. ONLY Install in a dedicated test environment.

Instructions to Install (On a focal or jammy system):
1) sudo add-apt-repository ppa:mruffell/lp2068738-test
2) sudo apt update
3) sudo apt install linux-image-unsigned-5.15.0-112-generic linux-modules-5.15.0-112-generic linux-modules-extra-5.15.0-112-generic linux-headers-5.15.0-112-generic
4) sudo reboot
5) uname -rv
Look for "5.15.0-112.122+TEST2068738v20240610b1".
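
Because the test build keeps the same 5.15.0-112-generic release name as the broken kernel, the version string is the only way to tell them apart. A sketch of that check follows; the function name is hypothetical, and it matches on the "TEST2068738" tag from the string above.

```shell
#!/bin/sh
# is_test_kernel [VERSION_STRING]
# Illustrative helper: succeeds if the kernel version string carries the
# test build tag from this bug. Defaults to the output of `uname -rv`.
is_test_kernel() {
    version=${1-$(uname -rv)}
    case $version in
        *TEST2068738*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: `is_test_kernel && echo "running the test kernel"`.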

Can you boot into this kernel and let me know if it fixes the problem? If it does, we will chime in upstream, and get this included, and we will land it in the next Ubuntu kernel release.

Just give the kernels about 3 hours to build first.

Thanks,
Matthew

Changed in linux (Ubuntu Jammy):
status:	New → In Progress
assignee:	nobody → Matthew Ruffell (mruffell)
Changed in linux (Ubuntu):
status:	Confirmed → Fix Released

Matthew Ruffell (mruffell) wrote on 2024-06-10:

#14

PS. Please remove the workaround "nomodeset" if you applied it before you try the test kernel. I really just want to know if the reverted patch fixes the issue.
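
For anyone who made the workaround permanent in /etc/default/grub, undoing it means deleting the word from the GRUB_CMDLINE_LINUX_DEFAULT line and running update-grub again. A sketch of the edit, with an invented helper name and assuming GNU sed and the stock double-quoted single-line form; try it on a copy first.

```shell
#!/bin/sh
# remove_kernel_param FILE PARAM
# Illustrative helper: deletes PARAM (as a whole word) from the value of
# GRUB_CMDLINE_LINUX_DEFAULT in FILE, keeping a .bak backup.
# Assumes PARAM contains no regex metacharacters or "/" characters.
remove_kernel_param() {
    file=$1
    param=$2
    # 1) drop the word, 2) collapse doubled spaces,
    # 3) and 4) trim a space left next to either quote.
    sed -i.bak -E \
        -e "/^GRUB_CMDLINE_LINUX_DEFAULT=/s/([\" ])${param}([\" ])/\1\2/" \
        -e "/^GRUB_CMDLINE_LINUX_DEFAULT=/s/  +/ /g" \
        -e "/^GRUB_CMDLINE_LINUX_DEFAULT=/s/=\" /=\"/" \
        -e "/^GRUB_CMDLINE_LINUX_DEFAULT=/s/ \"\$/\"/" \
        "$file"
}
```

Run as root, this would be `remove_kernel_param /etc/default/grub nomodeset`, then `update-grub` and a reboot.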

Tino (tinoh82) wrote on 2024-06-10:

#15

Same issue here.

My GRUB file looks like this. Any help?

# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
# info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
GRUB_CMDLINE_LINUX=""

# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Uncomment to disable graphical terminal (grub-pc only)
#GRUB_TERMINAL=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480

# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true

# Uncomment to disable generation of recovery mode menu entries
#GRUB_DISABLE_RECOVERY="true"

# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"

Tino (tinoh82) wrote on 2024-06-10:

#16

It says Fix Released for linux (Ubuntu). Where can I get it?

thanks

Fred Jupiter (fredjupiter) wrote on 2024-06-10:

#17

I had this problem, too, with a PC (I'm sorry, I can't access the PC; this is from my written/saved records):
CPU: AMD Ryzen 3 2200G, 4x 3.50GHz, boxed (YD2200C5FBBOX)
Graphics: AMD Vega 8
Kubuntu 22.04, fully updated today

Expected:
PC starts

Actual:
PC monitor stays black after logo of mainboard manufacturer.

Attempts at fixing this:
Because I had problems with startup where the addition of "nomodeset" in GRUB helped, I tried this. This time it did not help. The computer actually started - more than before - but got stuck at various stages of the boot process.

The PC would not fully start with kernel 5.15.0-112-generic.

I chose 5.15.0-107-generic in GRUB. 107 worked without a hitch.

At the moment the workaround is simply using 5.15.0-107-generic.

Matthew Ruffell (mruffell) wrote on 2024-06-10:

#18

Hi Tino,

It says Fix Released for Ubuntu, aka, the latest Ubuntu release, which is 24.10 Oracular, as it isn't affected. Maybe I should have marked it invalid in the first place.

Anyway, once someone tries the test kernel I built yesterday to see if it fixes the issue, I can go about getting the patch into Jammy's 5.15 kernel.

Thanks,
Matthew

Simon (simonphillips762) wrote on 2024-06-11:

#19

This bug stopped my AMD 2200G from booting, and it also stopped my HP thin client with an AMD Embedded G-Series GX-420GI Radeon R7E CPU. Both run on integrated graphics.

I'm trying to boot Lubuntu. I have reverted to kernel 107 and all seems OK so far. I also found a Reddit post on using apt-mark hold to hold linux-image, linux-headers, linux-modules and linux-modules-extra, which I did after removing all these packages via apt.

Hope the extra info helps.

Xie Dongping (dongping-xie) wrote on 2024-06-11:

#20

Hi Matthew,

thanks for the quick fix. It works for me.

```
# uname -rv
5.15.0-112-generic #122+TEST2068738v20240610b1-Ubuntu SMP Mon Jun 10 05:02:07 UTC 2
# inxi -G
Graphics:
  Device-1: AMD Picasso/Raven 2 [Radeon Vega Series / Radeon Mobile Series]
    driver: amdgpu v: kernel
  Device-2: Lite-On HP Webcam type: USB driver: uvcvideo
  Display: server: X.Org v: 1.21.1.4 driver: X: loaded: amdgpu,ati
    unloaded: fbdev,modesetting,vesa gpu: amdgpu resolution: 1920x1080~60Hz
  OpenGL: renderer: AMD Radeon Vega 3 Graphics (raven2 LLVM 15.0.7 DRM 3.42
  5.15.0-112-generic)
    v: 4.6 Mesa 23.2.1-1ubuntu3.1~22.04.2
```

Also tried `vkcube` and it also works. So I guess that this is fixed.

Thanks!

Matthew Ruffell (mruffell) wrote on 2024-06-11:

#21

Hi Xie,

Thanks for trying the test kernel, and it's great news that we have identified the fix. I'll write up an SRU template, and get the patch sent off to the kernel team today.

Thanks,
Matthew

Matthew Ruffell (mruffell) on 2024-06-12

summary:	- Kernel update 5.15.0-112 might cause severe problems with specific AMD GPUs
	+ AMD GPUs fail with null pointer dereference when IOMMU enabled, leading to black screen
description:	updated

Revision history for this message

Matthew Ruffell (mruffell) wrote on 2024-06-12:

#22

Hi everyone,

Patches are on the Ubuntu Kernel Team mailing list:

Cover letter:
https://lists.ubuntu.com/archives/kernel-team/2024-June/151409.html
Patch:
https://lists.ubuntu.com/archives/kernel-team/2024-June/151410.html

I will go talk to the kernel team now. I can't promise anything, but I will try to get this prioritised.

I will write back with what they say.

Thanks,
Matthew

Revision history for this message

Acru (aoellerer) wrote on 2024-06-13:

#23

Hey,
As of now, the update to 5.15.0-112 is still being offered in the update manager, which means that with automatic updates your computer might fail from one boot to the next. Would it be possible to pull it from there?
Best and thanks for all the work
Anton

Revision history for this message

Matthew Ruffell (mruffell) wrote on 2024-06-14:

#24

Hi everyone,

An update:

Greg KH has picked up the patch and added it to upstream stable now:

https://lore.kernel.org/amd-gfx/2024061223-suitable-handler-b6f2@gregkh/
https://lore.kernel.org/amd-gfx/2024061239-rehydrate-flyable-343e@gregkh/

I suppose we can drop the UBUNTU: SAUCE tags.

I talked to Stefan Bader on the Kernel Team. His current feeling is that they might respin the -generic kernels before the release of the current cycle (2024.06.10 as per https://kernel.ubuntu.com/) but they are still unsure. They might see what else comes up this cycle before they decide.

I'll follow up with the Kernel Team in a couple days.

Thanks,
Matthew

Revision history for this message

Pete Orlando (peteola) wrote on 2024-06-16: Re: [Bug 2068738] Re: AMD GPUs fail with null pointer dereference when IOMMU enabled, leading to black screen

#25

Hi Matthew, I have an Acer Aspire 5 running Linux Mint 21.2 Victoria (base: Ubuntu 22.04 jammy) with an AMD Ryzen 7 3700U with Radeon Vega.
I get the black screen on boot after installing updates about a week ago.
Will there be an update that comes through the normal update manager? I don't have programming skills like many Linux users, so that would be much appreciated. Thank you!

On Thu, Jun 13, 2024 at 8:25 PM Matthew Ruffell <2068738@bugs.launchpad.net>
wrote:

> Hi everyone,
>
> An update:
>
> Greg KH has picked up the patch and added it to upstream stable now:
>
> https://lore.kernel.org/amd-gfx/2024061223-suitable-handler-b6f2@gregkh/
> https://lore.kernel.org/amd-gfx/2024061239-rehydrate-flyable-343e@gregkh/
>
> I suppose we can drop the UBUNTU: SAUCE tags.
>
> I talked to Stefan Bader on the Kernel Team. His current feeling is that
> they might respin the -generic kernels before the release of the current
> cycle (2024.06.10 as per https://kernel.ubuntu.com/) but they are still
> unsure. They might see what else comes up this cycle before they decide.
>
> I'll follow up with the Kernel Team in a couple days.
>
> Thanks,
> Matthew
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/2068738
>
> Title:
>   AMD GPUs fail with null pointer dereference when IOMMU enabled,
>   leading to black screen
>
> Status in linux package in Ubuntu:
>   Fix Released
> Status in linux source package in Jammy:
>   In Progress
>
> Bug description:
>   BugLink: https://bugs.launchpad.net/bugs/2068738
>
>   [Impact]
>
>   On systems with AMD Picasso/Raven 2 GPU devices, when the IOMMU is
>   enabled, the system fails to boot correctly, and all users see is a
>   black screen.
>
>   This is caused by a null pointer dereference when enabling the IOMMU
>   after the device has been initialised. It should happen the other way
>   around.
>
>   AMD-Vi: AMD IOMMUv2 loaded and initialized
>   ...
>   amdgpu: Topology: Add APU node [0x15d8:0x1002]
>   kfd kfd: amdgpu: added device 1002:15d8
>   kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15d8
>   ...
>   amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_init failed
>   amdgpu 0000:06:00.0: amdgpu: Fatal error during GPU init
>   amdgpu 0000:06:00.0: amdgpu: amdgpu: finishing device.
>   ...
>   BUG: kernel NULL pointer dereference, address: 000000000000013c
>   ...
>   CPU: 1 PID: 223 Comm: systemd-udevd Not tainted 5.15.0-112-generic #122-Ubuntu
>   ...
>   RIP: 0010:amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
>   ...
>   Call Trace:
>    <TASK>
>    ? srso_return_thunk+0x5/0x10
>    ? show_trace_log_lvl+0x28e/0x2ea
>    ? show_trace_log_lvl+0x28e/0x2ea
>    ? dm_hw_fini+0x23/0x30 [amdgpu]
>    ? show_regs.part.0+0x23/0x29
>    ? __die_body.cold+0x8/0xd
>    ? __die+0x2b/0x37
>    ? page_fault_oops+0x13b/0x170
>    ? srso_return_thunk+0x5/0x10
>    ? do_user_addr_fault+0x321/0x670
>    ? srso_return_thunk+0x5/0x10
>    ? __free_pages_ok+0x34a/0x4f0
>    ? exc_page_fault+0x77/0x170
>    ? asm_exc_page_fault+0x27/0x30
>    ? amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
>    dm_hw_fini+0x23/0x30 [amdgpu]
>    amdgpu_device_ip_fini_early.isra.0+0x278/0x312 [amdgpu]
>    amdgpu_device_fini_hw+0x156/0x208 [amdgpu]
>    amdgpu_driver_unload_kms+0x69/0x90 [amdgpu]
>    amdgpu_driver_load_kms.cold+0x81/0x107 [amdgpu]
>    amdgpu_pci_probe+0x1d1/0x290 [amdgpu]
>    local_pci_probe+0x4b/0x90
>    ? srso_return_thunk+0x5/0x10
>    pci_device_probe+0x119/0x200
>    really_probe+0x222/0x420
>    __driver_probe_device+0xe8/0x140
>    driver_probe_device+0x23/0xc0
>    __driver_attach+0xf7/0x1f0
>    ? __device_attach_driver+0x140/0x140
>    bus_for_each_dev+0x7f/0xd0
>    driver_attach+0x1e/0x30
>    bus_add_driver+0x148/0x220
>    ? srso_return_thunk+0x5/0x10
>    driver_register+0x95/0x100
>    __pci_register_driver+0x68/0x70
>    amdgpu_init+0x7c/0x1000 [amdgpu]
>    ? 0xffffffffc0e0b000
>    do_one_initcall+0x49/0x1e0
>    ? srso_return_thunk+0x5/0x10
>    ? kmem_cache_alloc_trace+0x19e/0x2e0
>    do_init_module+0x52/0x260
>    load_module+0xb45/0xbe0
>    __do_sys_finit_module+0xbf/0x120
>    __x64_sys_finit_module+0x18/0x20
>    x64_sys_call+0x1ac3/0x1fa0
>    do_syscall_64+0x56/0xb0
>   ...
>    entry_SYSCALL_64_after_hwframe+0x67/0xd1
>
>   A workaround does exist. Users can add "nomodeset" or "amd_iommu=off"
>   to GRUB_CMDLINE_LINUX_DEFAULT, run update-grub and reboot.
>
>   [Fix]
>
>   The regression was caused by the following commit that landed in
>   5.15.0-112-generic, and 5.15.150 upstream:
>
>   commit 3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd ubuntu-jammy
>   Author: Yifan Zhang <yifan1.zhang@amd.com>
>   Date: Tue Sep 28 15:42:35 2021 +0800
>   Subject: drm/amdgpu: init iommu after amdkfd device init
>   Link:
> https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/commit/?id=3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd
>
>   The fix is to revert this patch, as it was not supposed to be
>   backported to 5.15 stable.
>
>   The mailing list discussion with AMD developers is:
>
>   https://lore.kernel.org/amd-gfx/20240523173031.4212-1-W_Armin@gmx.de/
>
>   The fix hasn't been acknowledged by Greg KH or Sasha Levin yet, so
>   sending as a Ubuntu SAUCE patch. If the upstream status changes, we
>   can NAK and resend.
>
>   [Testcase]
>
>   You need a system with an AMD Picasso/Raven 2 device. It will likely
>   be an APU, and not a discrete graphics card, but any AMD Picasso/Raven
>   2 device is affected.
>
>   Install the kernel and boot. Make sure full modesetting is enabled.
>
>   There is a test kernel available in the ppa below:
>
>   https://launchpad.net/~mruffell/+archive/ubuntu/lp2068738-test
>
>   If you install the test kernel, your system should boot successfully.
>
>   [Where problems could occur]
>
>   We are reverting a problematic patch and going back to how it was
>   before 5.15.0-112-generic. This should not cause any issues for users.
>
>   If a regression were to occur, users can add "nomodeset" or
>   "amd_iommu=off" to GRUB_CMDLINE_LINUX_DEFAULT and reboot, or pin their
>   kernel to a working one.
>
>   The impact of a regression would be high, as users' displays could be
>   blank.
>
>   [Other Info]
>
>   User reports:
>   https://forums.linuxmint.com/viewtopic.php?t=421484
>   https://forums.linuxmint.com/viewtopic.php?t=421441
>
> https://www.reddit.com/r/Ubuntu/comments/1d9uviz/had_to_purge_kernel_5150112_could_not_boot/
>
> https://www.reddit.com/r/linuxmint/comments/1d9w6c9/kernel_5150112_boot_failure/
>   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068735
>   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068793
>   https://bugs.launchpad.net/bugs/2068812
>
>   As bizarre as it is, this commit was actually originally included in
>   5.15-rc5:
>
>   commit 714d9e4574d54596973ee3b0624ee4a16264d700
>   Author: Yifan Zhang <yifan1.zhang@amd.com>
>   Date: Tue Sep 28 15:42:35 2021 +0800
>   Subject: drm/amdgpu: init iommu after amdkfd device init
>   Link:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=714d9e4574d54596973ee3b0624ee4a16264d700
>
>   It seems to have caused issues back then too, and was removed in the
>   following fixups, in 5.16-rc1:
>
>   commit 93cec184788b0cf3926bc1f7b47fed74ba87990c
>   Author: James Zhu <James.Zhu@amd.com>
>   Date:   Tue Nov 2 21:33:50 2021 -0400
>   Subject: drm/amdgpu: remove duplicated kfd_resume_iommu
>   Link:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=93cec184788b0cf3926bc1f7b47fed74ba87990c
>
>   commit 9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d
>   Author: shaoyunl <shaoyun.liu@amd.com>
>   Date:   Fri Nov 5 12:34:14 2021 -0400
>   Subject: drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov
>   Link:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d
>
>   I'm not exactly in favor of rewriting history twice, so I think we
>   should just revert the upstream stable patch and move on.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068738/+subscriptions
>
>
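
The workaround quoted above (adding "amd_iommu=off" to GRUB_CMDLINE_LINUX_DEFAULT, then running update-grub and rebooting) can be sketched roughly as follows. This is an illustration under assumptions, not a tested procedure: the function name is made up, it relies on GNU sed's `-i` option, it expects a parameter without `/` or `&` characters, and the demonstration runs against a temporary copy rather than the real /etc/default/grub.

```shell
#!/bin/sh
# Sketch only: the demo below edits a temporary file, not /etc/default/grub.
# add_kernel_param FILE PARAM prepends PARAM to GRUB_CMDLINE_LINUX_DEFAULT
# unless it is already present. Assumes GNU sed and a simple PARAM.
add_kernel_param() {
    file=$1
    param=$2
    # Skip the edit if the parameter is already on the line.
    if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=\"([^\"]* )?${param}( [^\"]*)?\"" "$file"; then
        return 0
    fi
    # "&" re-inserts the matched prefix, then the parameter is added after it.
    sed -i "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"/&${param} /" "$file"
}

tmp=$(mktemp)
printf '%s\n' 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"' > "$tmp"
add_kernel_param "$tmp" amd_iommu=off
cat "$tmp"
rm -f "$tmp"
```

On a real system, back up /etc/default/grub first, make the same change there with root privileges, then run `sudo update-grub` and reboot. As noted earlier in this report, "nomodeset" also works but disables full graphics support.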

Revision history for this message

Frazer Ross (frazer-r) wrote on 2024-06-16:

#26

Today Update Manager updated my kernel from 5.15.0-107 to 5.15.0-112. Unfortunately, 5.15.0-112 is not compatible with the AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx x 4 graphics processor in my HP model 14-dk0011na laptop, resulting in a black screen.
Via the GNU GRUB menu I booted into Linux 5.15.0-107-generic from the list and, once in the OS, removed 5.15.0-112 using Update Manager.
The laptop now boots correctly using kernel 5.15.0-107.
Looking forward to a fixed version of 5.15.0-112.
How will novices like myself know when the 5.15.0-112 kernel is corrected, without having to go through the above cycle every few days, please?

Roxana Nicolescu (roxanan) on 2024-06-17

Changed in linux (Ubuntu Jammy):
status:	In Progress → Fix Committed

Revision history for this message

Matthew Ruffell (mruffell) wrote on 2024-06-18:

#27

Hi everyone,

I spoke with Stefan Bader of the Kernel Team again. They are planning to include
the patch in a respin of the current 2024.06.10 SRU cycle. https://kernel.ubuntu.com/

I will let you know once the kernel has been built and placed into -proposed for
verification.

But at this stage, as long as the respin occurs, this should be released the
week of 8th July, give or take a few days, if anything else pops up.

Thanks,
Matthew

Revision history for this message

Roger Ramjet (rogerramjet123) wrote on 2024-06-18:

#28

Thank you Matthew.

Sent with Proton Mail secure email.

On Tuesday, June 18th, 2024 at 2:41 AM, Matthew Ruffell <2068738@bugs.launchpad.net> wrote:

> Hi everyone,
> 
> I spoke with Stefan Bader of the Kernel Team again. They are planning to include
> the patch in a respin of the current 2024.06.10 SRU cycle. https://kernel.ubuntu.com/
> 
> I will let you know once the kernel has been built and placed into -proposed for
> verification.
> 
> But at this stage, as long as the respin occurs, this should be released the
> week of 8th July, give or take a few days, if anything else pops up.
> 
> Thanks,
> Matthew
> 
> --
> You received this bug notification because you are subscribed to a
> duplicate bug report (2069485).
> https://bugs.launchpad.net/bugs/2068738
> 
> Title:
> AMD GPUs fail with null pointer dereference when IOMMU enabled,
> leading to black screen
> 
> Status in linux package in Ubuntu:
> Fix Released
> Status in linux source package in Jammy:
> Fix Committed
> 
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068738/+subscriptions

Revision history for this message

Acru (aoellerer) wrote on 2024-06-18:

#29

Hey,
After I disabled the kernel update via the update manager, it got installed again, once more resulting in a non-working system.
Isn't it possible to just completely remove this one version from the repo?

Revision history for this message

Pete Orlando (peteola) wrote on 2024-06-24:

#30

How do we fix this?

On Tue, Jun 18, 2024 at 9:40 AM Acru <2068738@bugs.launchpad.net> wrote:

> Hey,
> After disabling the kernel update via the update manager, it now got
> installed again, again resulting in a not working system.
> Isn't it possible to just completely remove this one version from the repo?
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/2068738
>
> Title:
>   AMD GPUs fail with null pointer dereference when IOMMU enabled,
>   leading to black screen
>
> Status in linux package in Ubuntu:
>   Fix Released
> Status in linux source package in Jammy:
>   Fix Committed
>
> Bug description:
>   BugLink: https://bugs.launchpad.net/bugs/2068738
>
>   [Impact]
>
>   On systems with AMD Picasso/Raven 2 GPU devices, when the IOMMU is
>   enabled, the system fails to boot correctly, and all users see is a
>   black screen.
>
>   This is caused by a null pointer dereference when enabling the IOMMU
>   after the device has been initialised. It should happen the other way
>   around.
>
>   AMD-Vi: AMD IOMMUv2 loaded and initialized
>   ...
>   amdgpu: Topology: Add APU node [0x15d8:0x1002]
>   kfd kfd: amdgpu: added device 1002:15d8
>   kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15d8
>   ...
>   amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_init failed
>   amdgpu 0000:06:00.0: amdgpu: Fatal error during GPU init
>   amdgpu 0000:06:00.0: amdgpu: amdgpu: finishing device.
>   ...
>   BUG: kernel NULL pointer dereference, address: 000000000000013c
>   ...
>   CPU: 1 PID: 223 Comm: systemd-udevd Not tainted 5.15.0-112-generic #122-Ubuntu
>   ...
>   RIP: 0010:amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
>   ...
>   Call Trace:
>    <TASK>
>    ? srso_return_thunk+0x5/0x10
>    ? show_trace_log_lvl+0x28e/0x2ea
>    ? show_trace_log_lvl+0x28e/0x2ea
>    ? dm_hw_fini+0x23/0x30 [amdgpu]
>    ? show_regs.part.0+0x23/0x29
>    ? __die_body.cold+0x8/0xd
>    ? __die+0x2b/0x37
>    ? page_fault_oops+0x13b/0x170
>    ? srso_return_thunk+0x5/0x10
>    ? do_user_addr_fault+0x321/0x670
>    ? srso_return_thunk+0x5/0x10
>    ? __free_pages_ok+0x34a/0x4f0
>    ? exc_page_fault+0x77/0x170
>    ? asm_exc_page_fault+0x27/0x30
>    ? amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
>    dm_hw_fini+0x23/0x30 [amdgpu]
>    amdgpu_device_ip_fini_early.isra.0+0x278/0x312 [amdgpu]
>    amdgpu_device_fini_hw+0x156/0x208 [amdgpu]
>    amdgpu_driver_unload_kms+0x69/0x90 [amdgpu]
>    amdgpu_driver_load_kms.cold+0x81/0x107 [amdgpu]
>    amdgpu_pci_probe+0x1d1/0x290 [amdgpu]
>    local_pci_probe+0x4b/0x90
>    ? srso_return_thunk+0x5/0x10
>    pci_device_probe+0x119/0x200
>    really_probe+0x222/0x420
>    __driver_probe_device+0xe8/0x140
>    driver_probe_device+0x23/0xc0
>    __driver_attach+0xf7/0x1f0
>    ? __device_attach_driver+0x140/0x140
>    bus_for_each_dev+0x7f/0xd0
>    driver_attach+0x1e/0x30
>    bus_add_driver+0x148/0x220
>    ? srso_return_thunk+0x5/0x10
>    driver_register+0x95/0x100
>    __pci_register_driver+0x68/0x70
>    amdgpu_init+0x7c/0x1000 [amdgpu]
>    ? 0xffffffffc0e0b000
>    do_one_initcall+0x49/0x1e0
>    ? srso_return_thunk+0x5/0x10
>    ? kmem_cache_alloc_trace+0x19e/0x2e0
>    do_init_module+0x52/0x260
>    load_module+0xb45/0xbe0
>    __do_sys_finit_module+0xbf/0x120
>    __x64_sys_finit_module+0x18/0x20
>    x64_sys_call+0x1ac3/0x1fa0
>    do_syscall_64+0x56/0xb0
>   ...
>    entry_SYSCALL_64_after_hwframe+0x67/0xd1
>
>   A workaround does exist. Users can add "nomodeset" or "amd_iommu=off"
>   to GRUB_CMDLINE_LINUX_DEFAULT, run update-grub and reboot.
>
>   [Fix]
>
>   The regression was caused by the following commit that landed in
>   5.15.0-112-generic, and 5.15.150 upstream:
>
>   commit 3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd ubuntu-jammy
>   Author: Yifan Zhang <yifan1.zhang@amd.com>
>   Date: Tue Sep 28 15:42:35 2021 +0800
>   Subject: drm/amdgpu: init iommu after amdkfd device init
>   Link: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/commit/?id=3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd
>
>   The fix is to revert this patch, as it was not supposed to be
>   backported to 5.15 stable.
>
>   The mailing list discussion with AMD developers is:
>
>   https://lore.kernel.org/amd-gfx/20240523173031.4212-1-W_Armin@gmx.de/
>
>   The fix hasn't been acknowledged by Greg KH or Sasha Levin yet, so
>   I am sending it as an Ubuntu SAUCE patch. If the upstream status changes, we
>   can NAK and resend.
>
>   [Testcase]
>
>   You need a system with an AMD Picasso/Raven 2 device. It will likely
>   be an APU, and not a discrete graphics card, but any AMD Picasso/Raven
>   2 device is affected.
>
>   Install the kernel and boot. Make sure full modesetting is enabled.
>
>   There is a test kernel available in the ppa below:
>
>   https://launchpad.net/~mruffell/+archive/ubuntu/lp2068738-test
>
>   If you install the test kernel, your system should boot successfully.
>
>   [Where problems could occur]
>
>   We are reverting a problematic patch and going back to how it was
>   before 5.15.0-112-generic. This should not cause any issues for users.
>
>   If a regression were to occur, users can add "nomodeset" or
>   "amd_iommu=off" to GRUB_CMDLINE_LINUX_DEFAULT and reboot, or pin their
>   kernel to a working one.
>
>   The impact of a regression would be high, as users' displays could be
>   blank.
>
>   [Other Info]
>
>   User reports:
>   https://forums.linuxmint.com/viewtopic.php?t=421484
>   https://forums.linuxmint.com/viewtopic.php?t=421441
>   https://www.reddit.com/r/Ubuntu/comments/1d9uviz/had_to_purge_kernel_5150112_could_not_boot/
>   https://www.reddit.com/r/linuxmint/comments/1d9w6c9/kernel_5150112_boot_failure/
>   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068735
>   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068793
>   https://bugs.launchpad.net/bugs/2068812
>
>   As bizarre as it is, this commit was actually originally included in
>   5.15-rc5:
>
>   commit 714d9e4574d54596973ee3b0624ee4a16264d700
>   Author: Yifan Zhang <yifan1.zhang@amd.com>
>   Date: Tue Sep 28 15:42:35 2021 +0800
>   Subject: drm/amdgpu: init iommu after amdkfd device init
>   Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=714d9e4574d54596973ee3b0624ee4a16264d700
>
>   It seems to have caused issues back then too, and was removed in the
>   following fixups, in 5.16-rc1:
>
>   commit 93cec184788b0cf3926bc1f7b47fed74ba87990c
>   Author: James Zhu <James.Zhu@amd.com>
>   Date:   Tue Nov 2 21:33:50 2021 -0400
>   Subject: drm/amdgpu: remove duplicated kfd_resume_iommu
>   Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=93cec184788b0cf3926bc1f7b47fed74ba87990c
>
>   commit 9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d
>   Author: shaoyunl <shaoyun.liu@amd.com>
>   Date:   Fri Nov 5 12:34:14 2021 -0400
>   Subject: drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov
>   Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d
>
>   I'm not exactly in favor of rewriting history twice, so I think we
>   should just revert the upstream stable patch and move on.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068738/+subscriptions
>
>

Revision history for this message

Willem Dreyer (willemdreyer) wrote on 2024-06-25:

#31

Hi all,

Same problem: 5.15.0-107.117 works fine, but 5.15.0-112.122 shows a black screen after GRUB.

Same hardware: AMD Ryzen 5 PRO 2500U w/ Radeon Vega Mobile Gfx

It's both funny and disappointing that a clinfo fix caused my first crash in 5 years on my daily driver.

In my opinion, 5.15.0-112.122 should be blocked, since it prevents multiple generations of laptop and desktop APUs from booting.

Special thanks to Barry K. for bisecting!

Thanks Armin W. for driving the fix and everyone else for your inputs.

Take care,
Willem

Revision history for this message

Roger Ramjet (rogerramjet123) wrote on 2024-06-26:

#32

I just installed kernel 5.15.0-113 on Linux Mint and it is not recognized at all. I'm told to load a kernel first and press any key, and then I have to go back to the list of Linux kernels and choose 5.15.0-107 to boot.

Thanks, Ralph

Sent with Proton Mail secure email.

On Tuesday, June 18th, 2024 at 7:12 AM, Roger Ramjet <eog98@proton.me> wrote:

> Thank you Matthew.
>
> Sent with Proton Mail secure email.
>
> On Tuesday, June 18th, 2024 at 2:41 AM, Matthew Ruffell <2068738@bugs.launchpad.net> wrote:
> 
> > Hi everyone,
> > 
> > I spoke with Stefan Bader of the Kernel Team again. They are planning to include
> > the patch in a respin of the current 2024.06.10 SRU cycle. https://kernel.ubuntu.com/
> > 
> > I will let you know once the kernel has been built and placed into -proposed for
> > verification.
> > 
> > But at this stage, as long as the respin occurs, this should be released the
> > week of 8th July, give or take a few days, if anything else pops up.
> > 
> > Thanks,
> > Matthew

Revision history for this message

Filip Średnicki (fsrednicki) wrote on 2024-06-26:

#33

I have noticed one more consequence of this bug, which can be very painful for those who are running their own servers. Although it is possible to connect over SSH and operate the server normally, in my case, when I tried to reboot it, the whole power-down procedure was carried out properly up to the closing of the journal, but the server then froze at the end and did not restart. So anyone who is far away from their infrastructure can be in trouble. The nomodeset fix solves this restart issue.

My server is on Ryzen 3 + Radeon Vega integrated GPU.
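
For machines that are only reachable remotely, the nomodeset workaround discussed in this thread comes down to changing one line in /etc/default/grub. The following is a minimal sketch of that text edit, not an official tool: the function name add_kernel_param is made up for illustration, it only transforms a string, and it does not write the file or run update-grub, which is still required afterwards.

```python
import re

# Matches an uncommented GRUB_CMDLINE_LINUX_DEFAULT="..." line.
_CMDLINE = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE)


def add_kernel_param(grub_text, param):
    """Return grub_text with param added to GRUB_CMDLINE_LINUX_DEFAULT.

    Hypothetical helper: the parameter is placed first, so that it ends up
    before "quiet splash". If it is already present, nothing changes.
    """

    def _insert(match):
        params = match.group(2).split()
        if param not in params:
            params.insert(0, param)
        return match.group(1) + " ".join(params) + match.group(3)

    return _CMDLINE.sub(_insert, grub_text)


if __name__ == "__main__":
    sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    print(add_kernel_param(sample, "nomodeset"))
```

Keeping the edit idempotent means the helper can be run more than once without stacking duplicate parameters. Note that nomodeset disables full graphics support, so it is a stopgap rather than a fix.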

Revision history for this message

Fred Jupiter (fredjupiter) wrote on 2024-06-26:

#34

Yes, I agree with https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068738/comments/32 .

I updated to -113 today on Kubuntu 22.04. There is no improvement over -112.

The only workarounds for me are "nomodeset" in GRUB or using the -107 kernel.

Updating to the -113 kernel leaves a black, backlit screen on restart.

Revision history for this message

Tino (tinoh82) wrote on 2024-06-26:

#35

Same on 5.15.0-113, no improvement. I still select 5.15.0-107 manually on the boot screen. I hope it won't be removed, because 5.15.0-112 is now gone from GRUB.

I have an AMD Ryzen as well.

Revision history for this message

Thomas Løcke (thomasmonadic) wrote on 2024-06-27:

#36

I can confirm the same problem with linux-image-5.15.0-113-generic on my AMD Ryzen 3 3200G with Radeon Vega Graphics boxes. I had to revert to linux-image-5.15.0-107-generic, as -113 fails to boot (it ends in a black screen), and while SSH login works, the machines cannot be restarted from the CLI.
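
Since affected machines reportedly stay reachable over SSH, one way to tell this bug apart from other black-screen causes is to look for the messages quoted in the bug description in the kernel log. The sketch below assumes that the three strings in SIGNATURES are a sufficient fingerprint; that choice, and the function names, are illustrative and not something the kernel team specified.

```python
# Substrings taken from the kernel log quoted in the bug description.
SIGNATURES = (
    "Failed to resume IOMMU for device",
    "amdgpu_device_ip_init failed",
    "BUG: kernel NULL pointer dereference",
)

# Sample lines copied from the bug description, used for the demo below.
SAMPLE_LOG = """\
kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15d8
amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_init failed
amdgpu 0000:06:00.0: amdgpu: Fatal error during GPU init
BUG: kernel NULL pointer dereference, address: 000000000000013c
"""


def find_signatures(log_text):
    """Return the signature strings that occur somewhere in log_text."""
    return [sig for sig in SIGNATURES if sig in log_text]


def looks_like_reported_failure(log_text):
    """True only if every signature string is present in log_text."""
    return len(find_signatures(log_text)) == len(SIGNATURES)


if __name__ == "__main__":
    print(find_signatures(SAMPLE_LOG))
    print(looks_like_reported_failure(SAMPLE_LOG))
```

In practice the text passed in would be the kernel log of the failing boot, for example the output of journalctl -k -b saved to a file. A match is only a hint that the same failure path was hit, not proof of the root cause.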

Revision history for this message

Filip Średnicki (fsrednicki) wrote on 2024-06-27:

#37

I can confirm this as well: 5.15.0-113 does not solve the problem.

Revision history for this message

Matthew Ruffell (mruffell) wrote on 2024-06-27:

#38

Hi everyone,

The patch was not present in 5.15.0-113-generic, as this kernel only contained
CVE fixes.

It should be in the next kernel update. Still waiting on the Kernel team to
respin.

I will write back as soon as we have a kernel tagged and in -proposed to test.

Thanks,
Matthew
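
To summarise the state of the thread at this point: 5.15.0-112 introduced the regression, 5.15.0-113 does not contain the revert, and 5.15.0-107 is the last kernel reported to work. A small sketch of a check based only on those reports follows. The set AFFECTED_ABIS and the function names are assumptions for illustration; the version that carries the fix is not known here, so it is deliberately left out.

```python
import platform
import re

# ABI numbers of 5.15.0 kernels reported as failing in this thread.
# Assumption: this list is not complete or authoritative.
AFFECTED_ABIS = {112, 113}


def abi_number(release):
    """Extract N from a release string like '5.15.0-N-generic', else None."""
    match = re.match(r"^5\.15\.0-(\d+)", release)
    return int(match.group(1)) if match else None


def reported_affected(release):
    """True if release is a 5.15.0 kernel whose ABI number is in AFFECTED_ABIS."""
    return abi_number(release) in AFFECTED_ABIS


if __name__ == "__main__":
    running = platform.release()
    print(running, "reported affected:", reported_affected(running))
```

The result only matters on the hardware in scope of this bug (AMD Picasso/Raven 2 devices with the IOMMU enabled); on other systems these kernels were not reported to fail.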

Revision history for this message

Richard Durso (reefland) wrote on 2024-06-29:

#39

I tried adding "nomodeset" or "amd_iommu=off" to GRUB_CMDLINE_LINUX_DEFAULT, which allowed the screen to work when booting the -113 kernel. However, the PCIe-based I225-V network card was not detected, and I was left with only a loopback device.

PC: HP T740 Thin PC / AMD Ryzen V1756B

00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 IOMMU
        Subsystem: Hewlett-Packard Company Raven/Raven2 IOMMU
        Flags: bus master, fast devsel, latency 0, IRQ 25
        Capabilities: [40] Secure device <?>
        Capabilities: [64] MSI: Enable+ Count=1/4 Maskable- 64bit+
        Capabilities: [74] HyperTransport: MSI Mapping Enable+ Fixed+

01:00.0 Ethernet controller: Intel Corporation Ethernet Controller I225-V (rev 03)
        Subsystem: Intel Corporation Ethernet Controller I225-V
        Flags: bus master, fast devsel, latency 0, IRQ 30, IOMMU group 7
        Memory at fe700000 (32-bit, non-prefetchable) [size=1M]
        Memory at fe800000 (32-bit, non-prefetchable) [size=16K]
        Expansion ROM at fe600000 [disabled] [size=1M]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 88-c9-b3-ff-ff-bf-72-fc
        Capabilities: [1c0] Latency Tolerance Reporting
        Capabilities: [1f0] Precision Time Measurement
        Capabilities: [1e0] L1 PM Substates
        Kernel driver in use: igc
        Kernel modules: igc

$ journalctl -b  | grep amdgpu
Jun 28 18:00:20 k3s03 kernel: [drm] amdgpu kernel modesetting enabled.
Jun 28 18:00:20 k3s03 kernel: amdgpu: Topology: Add APU node [0x0:0x0]
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: enabling device (0106 -> 0107)
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature enabled
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from VFCT
Jun 28 18:00:20 k3s03 kernel: amdgpu: ATOM BIOS: 113-RAVEN-111
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: vgaarb: deactivate vga console
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: VRAM: 256M 0x000000F400000000 - 0x000000F40FFFFFFF (256M used)
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: GART: 1024M 0x0000000000000000 - 0x000000003FFFFFFF
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
Jun 28 18:00:20 k3s03 kernel: [drm] amdgpu: 256M of VRAM memory ready
Jun 28 18:00:20 k3s03 kernel: [drm] amdgpu: 3072M of GTT memory ready.
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
Jun 28 18:00:20 k3s03 kernel: amdgpu: hwmgr_sw_init smu backed is smu10_smu
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: Will use PSP to load VCN firmware
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Jun 28 18:00:20 k3s03 kernel: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
Jun 28 18:00:20 k3s03 kernel: amdgpu: HMM registered 256MB device memory
Jun 28 18:00:20 k3s03 kernel: amdgpu: Topology: Add APU node [0x15dd:0x1002]
Jun 28 18:00:20 k3s03 kernel: kfd kfd: amdgpu: added device 1002:15dd
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: SE 1, SH per SE 1, CU per SH 11, active_cu_number 8
Jun 28 18:00:20 k3s03 kernel: fbcon: amdgpudrmfb (fb0) is primary device
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: [drm] fb0: amdgpudrmfb frame buffer device
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
Jun 28 18:00:20 k3s03 kernel: amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
Jun 28 18:00:20 k3s03 kernel: [drm] Initialized amdgpu 3.42.0 20150101 for 0000:03:00.0 on minor 0
Jun 28 18:00:22 k3s03 sensors[2700]: amdgpu-pci-0300

Revision history for this message

Daniel (treki) wrote on 2024-07-01:

#40

Did I understand correctly that this bug will be fixed after kernel 5.15.0-113?

Revision history for this message

Pete Orlando (peteola) wrote on 2024-07-01:

#41

Still no fix as of July 1, 2024. I have to boot manually with the -107 kernel to get
my Acer laptop up and running. I hope someone can fix this problem soon.
Thanks for all the hard work on this issue.

On Mon, Jul 1, 2024 at 8:10 AM Daniel <2068738@bugs.launchpad.net> wrote:

> Did I understand correctly that this bug will be fixed after kernel
> 5.15.0-113?
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/2068738
>
> Title:
>   AMD GPUs fail with null pointer dereference when IOMMU enabled,
>   leading to black screen
>
> Status in linux package in Ubuntu:
>   Fix Released
> Status in linux source package in Jammy:
>   Fix Committed
>
> Bug description:
>   BugLink: https://bugs.launchpad.net/bugs/2068738
>
>   [Impact]
>
>   On systems with AMD Picasso/Raven 2 GPU devices, when the IOMMU is
>   enabled, the system fails to boot correctly, and all users see is a
>   black screen.
>
>   This is caused by a null pointer dereference when enabling the IOMMU
>   after the device has been initialised. It should happen the other way
>   around.
>
>   AMD-Vi: AMD IOMMUv2 loaded and initialized
>   ...
>   amdgpu: Topology: Add APU node [0x15d8:0x1002]
>   kfd kfd: amdgpu: added device 1002:15d8
>   kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15d8
>   ...
>   amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_init failed
>   amdgpu 0000:06:00.0: amdgpu: Fatal error during GPU init
>   amdgpu 0000:06:00.0: amdgpu: amdgpu: finishing device.
>   ...
>   BUG: kernel NULL pointer dereference, address: 000000000000013c
>   ...
>   CPU: 1 PID: 223 Comm: systemd-udevd Not tainted 5.15.0-112-generic #122-Ubuntu
>   ...
>   RIP: 0010:amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
>   ...
>   Call Trace:
>    <TASK>
>    ? srso_return_thunk+0x5/0x10
>    ? show_trace_log_lvl+0x28e/0x2ea
>    ? show_trace_log_lvl+0x28e/0x2ea
>    ? dm_hw_fini+0x23/0x30 [amdgpu]
>    ? show_regs.part.0+0x23/0x29
>    ? __die_body.cold+0x8/0xd
>    ? __die+0x2b/0x37
>    ? page_fault_oops+0x13b/0x170
>    ? srso_return_thunk+0x5/0x10
>    ? do_user_addr_fault+0x321/0x670
>    ? srso_return_thunk+0x5/0x10
>    ? __free_pages_ok+0x34a/0x4f0
>    ? exc_page_fault+0x77/0x170
>    ? asm_exc_page_fault+0x27/0x30
>    ? amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
>    dm_hw_fini+0x23/0x30 [amdgpu]
>    amdgpu_device_ip_fini_early.isra.0+0x278/0x312 [amdgpu]
>    amdgpu_device_fini_hw+0x156/0x208 [amdgpu]
>    amdgpu_driver_unload_kms+0x69/0x90 [amdgpu]
>    amdgpu_driver_load_kms.cold+0x81/0x107 [amdgpu]
>    amdgpu_pci_probe+0x1d1/0x290 [amdgpu]
>    local_pci_probe+0x4b/0x90
>    ? srso_return_thunk+0x5/0x10
>    pci_device_probe+0x119/0x200
>    really_probe+0x222/0x420
>    __driver_probe_device+0xe8/0x140
>    driver_probe_device+0x23/0xc0
>    __driver_attach+0xf7/0x1f0
>    ? __device_attach_driver+0x140/0x140
>    bus_for_each_dev+0x7f/0xd0
>    driver_attach+0x1e/0x30
>    bus_add_driver+0x148/0x220
>    ? srso_return_thunk+0x5/0x10
>    driver_register+0x95/0x100
>    __pci_register_driver+0x68/0x70
>    amdgpu_init+0x7c/0x1000 [amdgpu]
>    ? 0xffffffffc0e0b000
>    do_one_initcall+0x49/0x1e0
>    ? srso_return_thunk+0x5/0x10
>    ? kmem_cache_alloc_trace+0x19e/0x2e0
>    do_init_module+0x52/0x260
>    load_module+0xb45/0xbe0
>    __do_sys_finit_module+0xbf/0x120
>    __x64_sys_finit_module+0x18/0x20
>    x64_sys_call+0x1ac3/0x1fa0
>    do_syscall_64+0x56/0xb0
>   ...
>    entry_SYSCALL_64_after_hwframe+0x67/0xd1
>
>   A workaround does exist. Users can add "nomodeset" or "amd_iommu=off"
>   to GRUB_CMDLINE_LINUX_DEFAULT, run update-grub, and reboot.
>
>   [Fix]
>
>   The regression was caused by the following commit that landed in
>   5.15.0-112-generic, and 5.15.150 upstream:
>
>   commit 3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd ubuntu-jammy
>   Author: Yifan Zhang <yifan1.zhang@amd.com>
>   Date: Tue Sep 28 15:42:35 2021 +0800
>   Subject: drm/amdgpu: init iommu after amdkfd device init
>   Link: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/commit/?id=3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd
>
>   The fix is to revert this patch, as it was not supposed to be
>   backported to 5.15 stable.
>
>   The mailing list discussion with AMD developers is:
>
>   https://lore.kernel.org/amd-gfx/20240523173031.4212-1-W_Armin@gmx.de/
>
>   The fix hasn't been acknowledged by Greg KH or Sasha Levin yet, so
>   it is being sent as an Ubuntu SAUCE patch. If the upstream status
>   changes, we can NAK and resend.
>
>   [Testcase]
>
>   You need a system with an AMD Picasso/Raven 2 device. It will likely
>   be an APU, and not a discrete graphics card, but any AMD Picasso/Raven
>   2 device is affected.
>
>   Install the kernel and boot. Make sure full modesetting is enabled.
>
>   There is a test kernel available in the ppa below:
>
>   https://launchpad.net/~mruffell/+archive/ubuntu/lp2068738-test
>
>   If you install the test kernel, your system should boot successfully.
>
>   [Where problems could occur]
>
>   We are reverting a problematic patch and going back to how it was
>   before 5.15.0-112-generic. This should not cause any issues for users.
>
>   If a regression were to occur, users can add "nomodeset" or
>   "amd_iommu=off" to GRUB_CMDLINE_LINUX_DEFAULT, run update-grub, and
>   reboot, or pin their kernel to a working one.
>
>   The impact of a regression would be high, as users' displays could be
>   blank.
>
>   [Other Info]
>
>   User reports:
>   https://forums.linuxmint.com/viewtopic.php?t=421484
>   https://forums.linuxmint.com/viewtopic.php?t=421441
>   https://www.reddit.com/r/Ubuntu/comments/1d9uviz/had_to_purge_kernel_5150112_could_not_boot/
>   https://www.reddit.com/r/linuxmint/comments/1d9w6c9/kernel_5150112_boot_failure/
>   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068735
>   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068793
>   https://bugs.launchpad.net/bugs/2068812
>
>   As bizarre as it is, this commit was actually originally included in
>   5.15-rc5:
>
>   commit 714d9e4574d54596973ee3b0624ee4a16264d700
>   Author: Yifan Zhang <yifan1.zhang@amd.com>
>   Date: Tue Sep 28 15:42:35 2021 +0800
>   Subject: drm/amdgpu: init iommu after amdkfd device init
>   Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=714d9e4574d54596973ee3b0624ee4a16264d700
>
>   It seems to have caused issues back then too, and was removed in the
>   following fixups, in 5.16-rc1:
>
>   commit 93cec184788b0cf3926bc1f7b47fed74ba87990c
>   Author: James Zhu <James.Zhu@amd.com>
>   Date:   Tue Nov 2 21:33:50 2021 -0400
>   Subject: drm/amdgpu: remove duplicated kfd_resume_iommu
>   Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=93cec184788b0cf3926bc1f7b47fed74ba87990c
>
>   commit 9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d
>   Author: shaoyunl <shaoyun.liu@amd.com>
>   Date:   Fri Nov 5 12:34:14 2021 -0400
>   Subject: drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov
>   Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d
>
>   I'm not exactly in favor of rewriting history twice, so I think we
>   should just revert the upstream stable patch and move on.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068738/+subscriptions
>
>
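
The workaround quoted in the bug description above (adding "nomodeset" or "amd_iommu=off" to GRUB_CMDLINE_LINUX_DEFAULT) could be scripted roughly as follows. This is only a sketch: the function name add_kernel_param is hypothetical, it assumes the variable is written on one line with double quotes, and it is meant to be tried on a copy of /etc/default/grub first.

```shell
#!/bin/sh
# add_kernel_param FILE PARAM   (hypothetical helper, not an Ubuntu tool)
# Put PARAM at the front of the GRUB_CMDLINE_LINUX_DEFAULT="..." line in FILE.
# Does nothing if PARAM is already present on that line.
add_kernel_param() {
    file=$1
    param=$2

    # Already present as a whole word inside the quotes? Then stop here.
    if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=\"([^\"]* )?${param}( [^\"]*)?\"" "$file"; then
        return 0
    fi

    # Rewrite through a temporary file so the original is only
    # overwritten when sed succeeded.
    tmp=$(mktemp) || return 1
    sed "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"/&${param} /" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

For example, after "cp /etc/default/grub /tmp/grub.test", running add_kernel_param /tmp/grub.test amd_iommu=off turns "quiet splash" into "amd_iommu=off quiet splash". The real file still needs "sudo update-grub" and a reboot afterwards, as the description says.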

Revision history for this message

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2024-07-04:

#42

This bug is awaiting verification that the linux/5.15.0-116.126 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux' to 'verification-done-jammy-linux'. If the problem still exists, change the tag 'verification-needed-jammy-linux' to 'verification-failed-jammy-linux'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags:

added: kernel-spammed-jammy-linux-v2 verification-needed-jammy-linux

Revision history for this message

Matthew Ruffell (mruffell) wrote on 2024-07-04:

#44

Hi everyone,

The Kernel Team have respun the latest 5.15 kernel with the fix, and have placed
it into -proposed for verification.

Could someone more technically minded help test it and let me know if it fixes
the problem?

Instructions to Install (On a jammy system):
1) cat << EOF | sudo tee /etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
# Enable Ubuntu proposed archive
deb http://archive.ubuntu.com/ubuntu/ $(lsb_release -cs)-proposed main universe
EOF
2) sudo apt update
3) sudo apt install linux-image-5.15.0-116-generic linux-modules-5.15.0-116-generic linux-modules-extra-5.15.0-116-generic linux-headers-5.15.0-116-generic
4) sudo rm /etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
5) sudo apt update
6) sudo reboot
7) uname -rv
5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:14:24 UTC 2024
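
The check in step 7 can also be scripted. The sketch below is not an official tool; the function names are hypothetical and the thresholds are taken only from reports in this thread (-107 boots, -112 and -113 show the black screen, -116 is the respin with the fix). Builds between -113 and -116 are assumed to be affected, which nobody here has confirmed.

```shell
#!/bin/sh
# Hypothetical helpers; thresholds come from reports in this bug only.

# kernel_abi RELEASE
# Print the ABI number of an Ubuntu kernel release string,
# e.g. "5.15.0-116-generic" -> "116".
kernel_abi() {
    rest=${1#*-}                 # drop everything up to the first "-"
    printf '%s\n' "${rest%%-*}"  # drop the flavour suffix such as "-generic"
}

# classify_kernel RELEASE
# Print one of: not-5.15.0, before-regression, possibly-affected, fix-included.
classify_kernel() {
    case $1 in
        5.15.0-*) ;;
        *) echo "not-5.15.0"; return 0 ;;
    esac
    abi=$(kernel_abi "$1")
    if [ "$abi" -lt 112 ]; then
        echo "before-regression"   # e.g. -107, reported to boot normally
    elif [ "$abi" -lt 116 ]; then
        echo "possibly-affected"   # -112 and -113 were reported broken
    else
        echo "fix-included"        # -116 is the respin with the fix
    fi
}
```

Usage: classify_kernel "$(uname -r)" after rebooting into the new kernel.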

Can you let me know if you can boot to your desktop and have a working screen?

We are still on track for a release to -updates next week sometime.

Thanks,
Matthew

Revision history for this message

Mathias Weyland (launchpad-weyland) wrote on 2024-07-04:

#45

Hello. I ran these instructions and was able to boot the -116 kernel on a laptop that was subject to this bug, i.e. where the -113 kernel booted into a black screen (HP EliteBook 735 G5).

Thank you for your efforts and regards!
Matt

Revision history for this message

H A Killenbeck (akillenb) wrote on 2024-07-04:

#46

I followed your instructions, and the kernel update seems to work fine on
my system.

On 7/3/24 11:50 PM, Matthew Ruffell wrote:
> Hi everyone,
>
> The Kernel Team have respun the latest 5.15 kernel with the fix, and have placed
> it into -proposed for verification.
>
> Could someone more technically minded help test it and let me know if it fixes
> the problem?
>
> Instructions to Install (On a jammy system):
> 1) cat << EOF | sudo tee /etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
> # Enable Ubuntu proposed archive
> deb http://archive.ubuntu.com/ubuntu/ $(lsb_release -cs)-proposed main universe
> EOF
> 2) sudo apt update
> 3) sudo apt install linux-image-5.15.0-116-generic linux-modules-5.15.0-116-generic linux-modules-extra-5.15.0-116-generic linux-headers-5.15.0-116-generic
> 4) sudo rm /etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
> 5) sudo apt update
> 6) sudo reboot
> 7) uname -rv
> 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:14:24 UTC 2024
>
> Can you let me know if you can boot to your desktop and have a working
> screen?
>
> We are still on track for a release to -updates next week sometime.
>
> Thanks,
> Matthew
>

Revision history for this message

Matthew Ruffell (mruffell) wrote on 2024-07-04:

#48

Thanks for testing Matt, H.A.

I marked the bug as verified. We should be all good for a release to -updates
early next week. I'll write a new message as soon as the kernel has been
released for everyone.

Thanks,
Matthew

tags:

added: verification-done-jammy-linux
removed: verification-needed-jammy-linux

Revision history for this message

Pete Orlando (peteola) wrote on 2024-07-05:

#49

The problem still exists after following your instructions. I am still only
able to boot by selecting the 5.15.0-107 kernel.
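
When a machine still shows a black screen, one way to tell whether it is this bug or a different failure is to look for the messages from the bug description in a saved kernel log. A minimal sketch, assuming the log of the failed boot has been written to a file; the function name has_iommu_regression is hypothetical, and the three strings are the ones quoted in the description.

```shell
#!/bin/sh
# has_iommu_regression LOGFILE   (hypothetical helper)
# Exit status 0 if LOGFILE contains all three markers quoted in the bug
# description, non-zero otherwise.
has_iommu_regression() {
    log=$1
    grep -q 'amdgpu: Failed to resume IOMMU for device' "$log" &&
        grep -q 'BUG: kernel NULL pointer dereference' "$log" &&
        grep -q 'amdgpu_dm_fini' "$log"
}
```

On a system with a persistent journal, the previous boot's kernel messages can be saved with "journalctl -k -b -1 > /tmp/prev-boot.log" and then checked with has_iommu_regression /tmp/prev-boot.log. If the markers are missing, the black screen probably has another cause.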

On Thu, Jul 4, 2024 at 3:20 PM Matthew Ruffell <2068738@bugs.launchpad.net>
wrote:

> Thanks for testing Matt, H.A.
>
> I marked the bug as verified. We should be all good for a release to
> -updates
> early next week. I'll write a new message as soon as the kernel has been
> released for everyone.
>
> Thanks,
> Matthew
>
> ** Tags removed: verification-needed-jammy-linux
> ** Tags added: verification-done-jammy-linux

Revision history for this message

Arno Mayrhofer (azrael3000) wrote on 2024-07-05:

#50

Works for me as well, Matthew. Thanks for the fix.

Revision history for this message

berni123 (bernd-bernd--zimmermann) wrote on 2024-07-05:

#51

I can confirm, proposed image -116 works without disabling amd_iommu:

berni@enterprise:~$ uname -rv
5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:14:24 UTC 2024

Revision history for this message

rfc (rfc0) wrote on 2024-07-05:

#52

Thank you Matthew & the kernel team,

Proposed image 5.15.0-116-generic resolves the issue on my machine without the need for any workarounds (such as disabling amd_iommu).

$ uname -r -v
5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:14:24 UTC 2024

AMD GPUs fail with null pointer dereference when IOMMU enabled, leading to black screen

Affects          Status          Importance   Assigned to       Milestone
linux (Ubuntu)   Fix Released    Undecided    Unassigned
Jammy            Fix Committed   High         Matthew Ruffell
