Still no fix as of July 1, 2024. I have to boot manually with the 107 kernel
to get my Acer laptop up and running. I hope someone can fix this problem
soon. Thanks for all the hard work on this issue.
On Mon, Jul 1, 2024 at 8:10 AM Daniel <email address hidden> wrote:
> Did I understand correctly that this bug will be fixed after kernel
> 5.15.0-113?
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/2068738
>
> Title:
> AMD GPUs fail with null pointer dereference when IOMMU enabled,
> leading to black screen
>
> Status in linux package in Ubuntu:
> Fix Released
> Status in linux source package in Jammy:
> Fix Committed
>
> Bug description:
> BugLink: https://bugs.launchpad.net/bugs/2068738
>
> [Impact]
>
> On systems with AMD Picasso/Raven 2 GPU devices, when the IOMMU is
> enabled, the system fails to boot correctly, and all users see is a
> black screen.
>
> This is caused by a null pointer dereference when enabling the IOMMU
> after the device has been initialised. It should happen the other way
> around.
>
> AMD-Vi: AMD IOMMUv2 loaded and initialized
> ...
> amdgpu: Topology: Add APU node [0x15d8:0x1002]
> kfd kfd: amdgpu: added device 1002:15d8
> kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15d8
> ...
> amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_init failed
> amdgpu 0000:06:00.0: amdgpu: Fatal error during GPU init
> amdgpu 0000:06:00.0: amdgpu: amdgpu: finishing device.
> ...
> BUG: kernel NULL pointer dereference, address: 000000000000013c
> ...
> CPU: 1 PID: 223 Comm: systemd-udevd Not tainted 5.15.0-112-generic
> #122-Ubuntu
> ...
> RIP: 0010:amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
> ...
> Call Trace:
> <TASK>
> ? srso_return_thunk+0x5/0x10
> ? show_trace_log_lvl+0x28e/0x2ea
> ? show_trace_log_lvl+0x28e/0x2ea
> ? dm_hw_fini+0x23/0x30 [amdgpu]
> ? show_regs.part.0+0x23/0x29
> ? __die_body.cold+0x8/0xd
> ? __die+0x2b/0x37
> ? page_fault_oops+0x13b/0x170
> ? srso_return_thunk+0x5/0x10
> ? do_user_addr_fault+0x321/0x670
> ? srso_return_thunk+0x5/0x10
> ? __free_pages_ok+0x34a/0x4f0
> ? exc_page_fault+0x77/0x170
> ? asm_exc_page_fault+0x27/0x30
> ? amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
> dm_hw_fini+0x23/0x30 [amdgpu]
> amdgpu_device_ip_fini_early.isra.0+0x278/0x312 [amdgpu]
> amdgpu_device_fini_hw+0x156/0x208 [amdgpu]
> amdgpu_driver_unload_kms+0x69/0x90 [amdgpu]
> amdgpu_driver_load_kms.cold+0x81/0x107 [amdgpu]
> amdgpu_pci_probe+0x1d1/0x290 [amdgpu]
> local_pci_probe+0x4b/0x90
> ? srso_return_thunk+0x5/0x10
> pci_device_probe+0x119/0x200
> really_probe+0x222/0x420
> __driver_probe_device+0xe8/0x140
> driver_probe_device+0x23/0xc0
> __driver_attach+0xf7/0x1f0
> ? __device_attach_driver+0x140/0x140
> bus_for_each_dev+0x7f/0xd0
> driver_attach+0x1e/0x30
> bus_add_driver+0x148/0x220
> ? srso_return_thunk+0x5/0x10
> driver_register+0x95/0x100
> __pci_register_driver+0x68/0x70
> amdgpu_init+0x7c/0x1000 [amdgpu]
> ? 0xffffffffc0e0b000
> do_one_initcall+0x49/0x1e0
> ? srso_return_thunk+0x5/0x10
> ? kmem_cache_alloc_trace+0x19e/0x2e0
> do_init_module+0x52/0x260
> load_module+0xb45/0xbe0
> __do_sys_finit_module+0xbf/0x120
> __x64_sys_finit_module+0x18/0x20
> x64_sys_call+0x1ac3/0x1fa0
> do_syscall_64+0x56/0xb0
> ...
> entry_SYSCALL_64_after_hwframe+0x67/0xd1
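
For anyone unsure whether a black-screen boot is this bug, here is a minimal sketch that checks saved kernel log text for the messages quoted above. The message fragments are copied from this report; the function names and the rule "IOMMU resume failure plus NULL pointer dereference" are my own assumptions, not an official diagnostic.

```python
# Message fragments copied from the log excerpt in this bug report.
SIGNATURES = (
    "Failed to resume IOMMU for device",
    "amdgpu_device_ip_init failed",
    "Fatal error during GPU init",
    "kernel NULL pointer dereference",
)


def matching_signatures(log_text: str) -> list[str]:
    """Return the known fragments that occur in log_text, in the order listed above."""
    return [sig for sig in SIGNATURES if sig in log_text]


def looks_like_this_bug(log_text: str) -> bool:
    """Heuristic (an assumption): the IOMMU resume failure and the NULL
    pointer dereference must both be present in the same log."""
    found = matching_signatures(log_text)
    return (
        "Failed to resume IOMMU for device" in found
        and "kernel NULL pointer dereference" in found
    )
```

The input could be the kernel log of the failed boot saved to a file, for example from `journalctl -k -b -1` if the journal is persistent on your system. A match only suggests the same failure pattern; it does not prove the cause.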
>
> A workaround does exist. Users can add "nomodeset" or "amd_iommu=off"
> to GRUB_CMDLINE_LINUX_DEFAULT, run update-grub, and reboot.
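
The workaround above can be sketched as a small text edit. This is a hedged example, not a supported tool: the helper name `add_kernel_param` is hypothetical, and it assumes the usual double-quoted `GRUB_CMDLINE_LINUX_DEFAULT="..."` line found in /etc/default/grub. It only transforms text; it does not touch any file.

```python
import re


def add_kernel_param(grub_text: str, param: str) -> str:
    """Return grub_text with param appended to GRUB_CMDLINE_LINUX_DEFAULT.

    Assumes the value is double-quoted on one line. All other lines are
    left untouched, and a parameter that is already present is not added twice.
    """
    pattern = re.compile(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
    )

    def _append(match: re.Match) -> str:
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    return pattern.sub(_append, grub_text)
```

In practice you would read /etc/default/grub, pass its contents through the function with "amd_iommu=off" or "nomodeset", write the result back as root, and then run update-grub and reboot as the bug description says. Keep a backup of the original file.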
>
> [Fix]
>
> The regression was caused by the following commit that landed in
> 5.15.0-112-generic, and 5.15.150 upstream:
>
> commit 3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd ubuntu-jammy
> Author: Yifan Zhang <email address hidden>
> Date: Tue Sep 28 15:42:35 2021 +0800
> Subject: drm/amdgpu: init iommu after amdkfd device init
> Link:
> https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/commit/?id=3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd
>
> The fix is to revert this patch, as it was not supposed to be
> backported to 5.15 stable.
>
> The mailing list discussion with AMD developers is:
>
> https://<email address hidden>/
>
> The fix hasn't been acknowledged by Greg KH or Sasha Levin yet, so
> sending as an Ubuntu SAUCE patch. If the upstream status changes, we
> can NAK and resend.
>
> [Testcase]
>
> You need a system with an AMD Picasso/Raven 2 device. It will likely
> be an APU, and not a discrete graphics card, but any AMD Picasso/Raven
> 2 device is affected.
>
> Install the kernel and boot. Make sure full modesetting is enabled.
>
> There is a test kernel available in the ppa below:
>
> https://launchpad.net/~mruffell/+archive/ubuntu/lp2068738-test
>
> If you install the test kernel, your system should boot successfully.
>
> [Where problems could occur]
>
> We are reverting a problematic patch and going back to how it was
> before 5.15.0-112-generic. This should not cause any issues for users.
>
> If a regression were to occur, users can add "nomodeset" or
> "amd_iommu=off" to GRUB_CMDLINE_LINUX_DEFAULT and reboot, or pin their
> kernel to a working one.
>
> The impact of a regression would be high, as users' displays could be
> blank.
>
> [Other Info]
>
> User reports:
> https://forums.linuxmint.com/viewtopic.php?t=421484
> https://forums.linuxmint.com/viewtopic.php?t=421441
>
> https://www.reddit.com/r/Ubuntu/comments/1d9uviz/had_to_purge_kernel_5150112_could_not_boot/
>
> https://www.reddit.com/r/linuxmint/comments/1d9w6c9/kernel_5150112_boot_failure/
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068735
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068793
> https://bugs.launchpad.net/bugs/2068812
>
> As bizarre as it is, this commit was actually originally included in
> 5.15-rc5:
>
> commit 714d9e4574d54596973ee3b0624ee4a16264d700
> Author: Yifan Zhang <email address hidden>
> Date: Tue Sep 28 15:42:35 2021 +0800
> Subject: drm/amdgpu: init iommu after amdkfd device init
> Link:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=714d9e4574d54596973ee3b0624ee4a16264d700
>
> It seems to have caused issues back then too, and was removed in the
> following fixups, in 5.16-rc1:
>
> commit 93cec184788b0cf3926bc1f7b47fed74ba87990c
> Author: James Zhu <email address hidden>
> Date: Tue Nov 2 21:33:50 2021 -0400
> Subject: drm/amdgpu: remove duplicated kfd_resume_iommu
> Link:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=93cec184788b0cf3926bc1f7b47fed74ba87990c
>
> commit 9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d
> Author: shaoyunl <email address hidden>
> Date: Fri Nov 5 12:34:14 2021 -0400
> Subject: drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov
> Link:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d
>
> I'm not exactly in favor of rewriting history twice, so I think we
> should just revert the upstream stable patch and move on.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068738/+subscriptions
>
>