Activity log for bug #2068738

Date Who What changed Old value New value Message
2024-06-07 14:19:11 berni123 bug added bug
2024-06-08 02:37:20 Launchpad Janitor linux (Ubuntu): status New Confirmed
2024-06-08 03:13:29 Moshe Caspi bug added subscriber Moshe Caspi
2024-06-08 17:02:42 Niels Kunstmann bug added subscriber Niels Kunstmann
2024-06-08 23:01:48 Lin Manfu attachment added syslog-for-broken-kernel-112 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068738/+attachment/5787506/+files/syslog-for-broken-kernel-112
2024-06-09 10:41:27 Raphael Camus bug added subscriber Raphael Camus
2024-06-09 11:39:18 Carrington Institute attachment added log.txt https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068738/+attachment/5787652/+files/log.txt
2024-06-09 13:50:19 Frédéric RENARD bug added subscriber Frédéric RENARD
2024-06-09 18:43:51 Jon Vandenburgh bug added subscriber Jon Vandenburgh
2024-06-09 19:35:33 Fenestrae, et vastata annis. attachment added log.txt https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068738/+attachment/5787785/+files/log.txt
2024-06-10 04:35:34 Matthew Ruffell bug added subscriber Matthew Ruffell
2024-06-10 04:40:08 Matthew Ruffell nominated for series Ubuntu Jammy
2024-06-10 04:40:08 Matthew Ruffell bug task added linux (Ubuntu Jammy)
2024-06-10 04:40:17 Matthew Ruffell linux (Ubuntu Jammy): importance Undecided High
2024-06-10 05:17:54 Matthew Ruffell linux (Ubuntu Jammy): status New In Progress
2024-06-10 05:17:58 Matthew Ruffell linux (Ubuntu Jammy): assignee Matthew Ruffell (mruffell)
2024-06-10 05:18:17 Matthew Ruffell linux (Ubuntu): status Confirmed Fix Released
2024-06-11 02:44:39 Pete Orlando bug added subscriber Pete Orlando
2024-06-12 01:31:53 Matthew Ruffell summary Kernel update 5.15.0-112 might cause severe problems with specific AMD GPUs AMD GPUs fail with null pointer dereference when IOMMU enabled, leading to black screen
2024-06-12 01:32:18 Matthew Ruffell description

Old value:

System is not booting while using AMD Picasso/Raven 2 graphics after update to 5.15.0-112

Graphics:
  Device-1: AMD Picasso/Raven 2 [Radeon Vega Series / Radeon Mobile Series] vendor: Hewlett-Packard driver: amdgpu v: kernel bus-ID: 04:00.0
  Display: x11 server: X.Org v: 1.21.1.4 driver: X: loaded: amdgpu,ati unloaded: fbdev,modesetting,vesa gpu: amdgpu resolution: 1600x900~60Hz
  OpenGL: renderer: AMD Radeon Vega 3 Graphics (raven2 LLVM 15.0.7 DRM 3.42 5.15.0-107-generic) v: 4.6 Mesa 23.2.1-1ubuntu3.1~22.04.2 direct render: Yes

Several reports:
https://forums.linuxmint.com/viewtopic.php?t=421484
https://forums.linuxmint.com/viewtopic.php?t=421441

New value:

BugLink: https://bugs.launchpad.net/bugs/2068738

[Impact]

On systems with AMD Picasso/Raven 2 GPU devices, when the IOMMU is enabled, the system fails to boot correctly, and all users see is a black screen.

This is caused by a null pointer dereference when enabling the IOMMU after the device has been initialised. It should happen the other way around.

  AMD-Vi: AMD IOMMUv2 loaded and initialized
  ...
  amdgpu: Topology: Add APU node [0x15d8:0x1002]
  kfd kfd: amdgpu: added device 1002:15d8
  kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15d8
  ...
  amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_init failed
  amdgpu 0000:06:00.0: amdgpu: Fatal error during GPU init
  amdgpu 0000:06:00.0: amdgpu: amdgpu: finishing device.
  ...
  BUG: kernel NULL pointer dereference, address: 000000000000013c
  ...
  CPU: 1 PID: 223 Comm: systemd-udevd Not tainted 5.15.0-112-generic #122-Ubuntu
  ...
  RIP: 0010:amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
  ...
  Call Trace:
  <TASK>
  ? srso_return_thunk+0x5/0x10
  ? show_trace_log_lvl+0x28e/0x2ea
  ? show_trace_log_lvl+0x28e/0x2ea
  ? dm_hw_fini+0x23/0x30 [amdgpu]
  ? show_regs.part.0+0x23/0x29
  ? __die_body.cold+0x8/0xd
  ? __die+0x2b/0x37
  ? page_fault_oops+0x13b/0x170
  ? srso_return_thunk+0x5/0x10
  ? do_user_addr_fault+0x321/0x670
  ? srso_return_thunk+0x5/0x10
  ? __free_pages_ok+0x34a/0x4f0
  ? exc_page_fault+0x77/0x170
  ? asm_exc_page_fault+0x27/0x30
  ? amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
  dm_hw_fini+0x23/0x30 [amdgpu]
  amdgpu_device_ip_fini_early.isra.0+0x278/0x312 [amdgpu]
  amdgpu_device_fini_hw+0x156/0x208 [amdgpu]
  amdgpu_driver_unload_kms+0x69/0x90 [amdgpu]
  amdgpu_driver_load_kms.cold+0x81/0x107 [amdgpu]
  amdgpu_pci_probe+0x1d1/0x290 [amdgpu]
  local_pci_probe+0x4b/0x90
  ? srso_return_thunk+0x5/0x10
  pci_device_probe+0x119/0x200
  really_probe+0x222/0x420
  __driver_probe_device+0xe8/0x140
  driver_probe_device+0x23/0xc0
  __driver_attach+0xf7/0x1f0
  ? __device_attach_driver+0x140/0x140
  bus_for_each_dev+0x7f/0xd0
  driver_attach+0x1e/0x30
  bus_add_driver+0x148/0x220
  ? srso_return_thunk+0x5/0x10
  driver_register+0x95/0x100
  __pci_register_driver+0x68/0x70
  amdgpu_init+0x7c/0x1000 [amdgpu]
  ? 0xffffffffc0e0b000
  do_one_initcall+0x49/0x1e0
  ? srso_return_thunk+0x5/0x10
  ? kmem_cache_alloc_trace+0x19e/0x2e0
  do_init_module+0x52/0x260
  load_module+0xb45/0xbe0
  __do_sys_finit_module+0xbf/0x120
  __x64_sys_finit_module+0x18/0x20
  x64_sys_call+0x1ac3/0x1fa0
  do_syscall_64+0x56/0xb0
  ...
  entry_SYSCALL_64_after_hwframe+0x67/0xd1

A workaround does exist. Users can add "nomodeset" or "amd_iommu=off" to GRUB_CMDLINE_LINUX_DEFAULT, run update-grub and reboot.

[Fix]

The regression was caused by the following commit that landed in 5.15.0-112-generic, and 5.15.150 upstream:

commit 3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd ubuntu-jammy
Author: Yifan Zhang <yifan1.zhang@amd.com>
Date: Tue Sep 28 15:42:35 2021 +0800
Subject: drm/amdgpu: init iommu after amdkfd device init
Link: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/commit/?id=3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd

The fix is to revert this patch, as it was not supposed to be backported to 5.15 stable. The mailing list discussion with AMD developers is:

https://lore.kernel.org/amd-gfx/20240523173031.4212-1-W_Armin@gmx.de/

The fix hasn't been acknowledged by Greg KH or Sasha Levin yet, so sending as an Ubuntu SAUCE patch. If the upstream status changes, we can NAK and resend.

[Testcase]

You need a system with an AMD Picasso/Raven 2 device. It will likely be an APU, and not a discrete graphics card, but any AMD Picasso/Raven 2 device is affected.

Install the kernel and boot. Make sure full modesetting is enabled.

There is a test kernel available in the ppa below:

https://launchpad.net/~mruffell/+archive/ubuntu/lp2068738-test

If you install the test kernel, your system should boot successfully.

[Where problems could occur]

We are reverting a problematic patch and going back to how it was before 5.15.0-112-generic. This should not cause any issues for users.

If a regression were to occur, users can add "nomodeset" or "amd_iommu=off" to GRUB_CMDLINE_LINUX_DEFAULT and reboot, or pin their kernel to a working one.

The impact of a regression would be high, as users' displays could be blank.

[Other Info]

User reports:
https://forums.linuxmint.com/viewtopic.php?t=421484
https://forums.linuxmint.com/viewtopic.php?t=421441
https://www.reddit.com/r/Ubuntu/comments/1d9uviz/had_to_purge_kernel_5150112_could_not_boot/
https://www.reddit.com/r/linuxmint/comments/1d9w6c9/kernel_5150112_boot_failure/
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068735
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068793
https://bugs.launchpad.net/bugs/2068812

As bizarre as it is, this commit was actually originally included in 5.15-rc5:

commit 714d9e4574d54596973ee3b0624ee4a16264d700
Author: Yifan Zhang <yifan1.zhang@amd.com>
Date: Tue Sep 28 15:42:35 2021 +0800
Subject: drm/amdgpu: init iommu after amdkfd device init
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=714d9e4574d54596973ee3b0624ee4a16264d700

It seems to have caused issues back then too, and was removed in the following fixups, in 5.16-rc1:

commit 93cec184788b0cf3926bc1f7b47fed74ba87990c
Author: James Zhu <James.Zhu@amd.com>
Date: Tue Nov 2 21:33:50 2021 -0400
Subject: drm/amdgpu: remove duplicated kfd_resume_iommu
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=93cec184788b0cf3926bc1f7b47fed74ba87990c

commit 9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d
Author: shaoyunl <shaoyun.liu@amd.com>
Date: Fri Nov 5 12:34:14 2021 -0400
Subject: drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d

I'm not exactly in favor of rewriting history twice, so I think we should just revert the upstream stable patch and move on.
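
The workaround named in the description above (putting "nomodeset" or "amd_iommu=off" into GRUB_CMDLINE_LINUX_DEFAULT) could be sketched roughly as follows. This is only an illustration and not part of the bug report: the helper name add_kernel_param is hypothetical, it assumes the variable is written with double quotes as in a stock /etc/default/grub, and it only transforms text in memory. Writing the result back, running update-grub and rebooting are left to the user.

```python
import re

# Matches a double-quoted GRUB_CMDLINE_LINUX_DEFAULT assignment on its own line.
_CMDLINE_RE = re.compile(
    r'^(GRUB_CMDLINE_LINUX_DEFAULT=)"([^"]*)"[ \t]*$', re.MULTILINE
)


def add_kernel_param(grub_text: str, param: str) -> str:
    """Return grub_text with param present in GRUB_CMDLINE_LINUX_DEFAULT.

    Hypothetical helper for illustration; it never touches the disk.
    """

    def _append(match: re.Match) -> str:
        params = match.group(2).split()
        if param not in params:  # keep the edit idempotent
            params.append(param)
        return f'{match.group(1)}"{" ".join(params)}"'

    new_text, count = _CMDLINE_RE.subn(_append, grub_text)
    if count == 0:
        # Assumption: if no double-quoted assignment exists, append a new one.
        sep = "" if not grub_text or grub_text.endswith("\n") else "\n"
        new_text = f'{grub_text}{sep}GRUB_CMDLINE_LINUX_DEFAULT="{param}"\n'
    return new_text


if __name__ == "__main__":
    sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    print(add_kernel_param(sample, "amd_iommu=off"), end="")
```

Keeping the function free of file I/O means the change can be reviewed before anything is written to /etc/default/grub; the new setting only takes effect after update-grub and a reboot, as the description states.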
2024-06-13 12:48:48 Acru bug added subscriber Acru
2024-06-17 09:08:36 Roxana Nicolescu linux (Ubuntu Jammy): status In Progress Fix Committed
2024-06-17 22:31:59 Erv Bendiks bug added subscriber Erv Bendiks
2024-06-25 16:34:21 Filip Średnicki bug added subscriber Filip Średnicki
2024-06-26 19:04:23 Ruslan bug added subscriber Ruslan
2024-06-27 13:15:48 Jean-Max Reymond bug added subscriber Jean-Max Reymond
2024-06-27 13:18:07 Tom Smith bug added subscriber Tom Smith
2024-06-30 21:38:50 Mathias Weyland bug added subscriber Mathias Weyland
2024-07-01 16:29:03 jun wako bug added subscriber jun wako
2024-07-01 19:45:53 Pieter Paul bug added subscriber Pieter Paul
2024-07-04 02:35:36 Ubuntu Kernel Bot tags kernel-spammed-jammy-linux-v2 verification-needed-jammy-linux
2024-07-04 21:09:45 Matthew Ruffell tags kernel-spammed-jammy-linux-v2 verification-needed-jammy-linux kernel-spammed-jammy-linux-v2 verification-done-jammy-linux
2024-07-05 15:27:59 Blake Johal bug added subscriber Blake Johal