Comment 99 for bug 2068738

Roger Ramjet (rogerramjet123) wrote : Re: [Bug 2068738] verification-done-jammy-

Yes, this kernel update did the trick after I ran the "sudo update-grub" command from my Ubuntu installation and then went back into my Mint installation.

Thanks Linux team.

Ralph Goe


On Wednesday, July 17th, 2024 at 11:29 PM, Ubuntu Kernel Bot <email address hidden> wrote:

> This bug is awaiting verification that the linux-aws/5.15.0-1066.72
> kernel in -proposed solves the problem. Please test the kernel and
> update this bug with the results. If the problem is solved, change the
> tag 'verification-needed-jammy-linux-aws' to
> 'verification-done-jammy-linux-aws'. If the problem still exists,
> change the tag 'verification-needed-jammy-linux-aws' to
> 'verification-failed-jammy-linux-aws'.
>
>
> If verification is not done by 5 working days from today, this fix will
> be dropped from the source code, and this bug will be closed.
>
>
> See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
> how to enable and use -proposed. Thank you!
>
>
> ** Tags added: kernel-spammed-jammy-linux-aws-v2 verification-needed-jammy-linux-aws
>
> --
> You received this bug notification because you are subscribed to a
> duplicate bug report (2069485).
> https://bugs.launchpad.net/bugs/2068738
>
> Title:
> AMD GPUs fail with null pointer dereference when IOMMU enabled,
> leading to black screen
>
> Status in linux package in Ubuntu:
> Fix Released
> Status in linux source package in Jammy:
> Fix Released
>
> Bug description:
> BugLink: https://bugs.launchpad.net/bugs/2068738
>
> [Impact]
>
> On systems with AMD Picasso/Raven 2 GPU devices, when the IOMMU is
> enabled, the system fails to boot correctly, and all users see is a
> black screen.
>
> This is caused by a null pointer dereference when enabling the IOMMU
> after the device has been initialised. It should happen the other way
> around.
>
> AMD-Vi: AMD IOMMUv2 loaded and initialized
> ...
> amdgpu: Topology: Add APU node [0x15d8:0x1002]
> kfd kfd: amdgpu: added device 1002:15d8
> kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15d8
> ...
> amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_init failed
> amdgpu 0000:06:00.0: amdgpu: Fatal error during GPU init
> amdgpu 0000:06:00.0: amdgpu: amdgpu: finishing device.
> ...
> BUG: kernel NULL pointer dereference, address: 000000000000013c
> ...
> CPU: 1 PID: 223 Comm: systemd-udevd Not tainted 5.15.0-112-generic #122-Ubuntu
> ...
> RIP: 0010:amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
> ...
> Call Trace:
> <TASK>
>
> ? srso_return_thunk+0x5/0x10
> ? show_trace_log_lvl+0x28e/0x2ea
> ? show_trace_log_lvl+0x28e/0x2ea
> ? dm_hw_fini+0x23/0x30 [amdgpu]
> ? show_regs.part.0+0x23/0x29
> ? __die_body.cold+0x8/0xd
> ? __die+0x2b/0x37
> ? page_fault_oops+0x13b/0x170
> ? srso_return_thunk+0x5/0x10
> ? do_user_addr_fault+0x321/0x670
> ? srso_return_thunk+0x5/0x10
> ? __free_pages_ok+0x34a/0x4f0
> ? exc_page_fault+0x77/0x170
> ? asm_exc_page_fault+0x27/0x30
> ? amdgpu_dm_fini+0x149/0x1f0 [amdgpu]
> dm_hw_fini+0x23/0x30 [amdgpu]
> amdgpu_device_ip_fini_early.isra.0+0x278/0x312 [amdgpu]
> amdgpu_device_fini_hw+0x156/0x208 [amdgpu]
> amdgpu_driver_unload_kms+0x69/0x90 [amdgpu]
> amdgpu_driver_load_kms.cold+0x81/0x107 [amdgpu]
> amdgpu_pci_probe+0x1d1/0x290 [amdgpu]
> local_pci_probe+0x4b/0x90
> ? srso_return_thunk+0x5/0x10
> pci_device_probe+0x119/0x200
> really_probe+0x222/0x420
> __driver_probe_device+0xe8/0x140
> driver_probe_device+0x23/0xc0
> __driver_attach+0xf7/0x1f0
> ? __device_attach_driver+0x140/0x140
> bus_for_each_dev+0x7f/0xd0
> driver_attach+0x1e/0x30
> bus_add_driver+0x148/0x220
> ? srso_return_thunk+0x5/0x10
> driver_register+0x95/0x100
> __pci_register_driver+0x68/0x70
> amdgpu_init+0x7c/0x1000 [amdgpu]
> ? 0xffffffffc0e0b000
> do_one_initcall+0x49/0x1e0
> ? srso_return_thunk+0x5/0x10
> ? kmem_cache_alloc_trace+0x19e/0x2e0
> do_init_module+0x52/0x260
> load_module+0xb45/0xbe0
> __do_sys_finit_module+0xbf/0x120
> __x64_sys_finit_module+0x18/0x20
> x64_sys_call+0x1ac3/0x1fa0
> do_syscall_64+0x56/0xb0
> ...
> entry_SYSCALL_64_after_hwframe+0x67/0xd1
>
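
For anyone checking whether their own boot log shows the same failure, here is a minimal Python sketch that looks for three strings taken from the log excerpt above. The function name and the choice of markers are my own assumptions, not part of any Ubuntu tooling, and a match only suggests this bug rather than proving it.

```python
# Strings copied from the kernel log excerpt in the bug description.
FAILURE_MARKERS = (
    "Failed to resume IOMMU for device",
    "kernel NULL pointer dereference",
    "amdgpu_dm_fini",
)


def matching_markers(log_text):
    """Return the failure markers that occur anywhere in log_text."""
    return [marker for marker in FAILURE_MARKERS if marker in log_text]


if __name__ == "__main__":
    sample = (
        "kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15d8\n"
        "BUG: kernel NULL pointer dereference, address: 000000000000013c\n"
        "RIP: 0010:amdgpu_dm_fini+0x149/0x1f0 [amdgpu]\n"
    )
    # Print each marker found in the sample, one per line.
    for marker in matching_markers(sample):
        print(marker)
```

The text to scan could come from a saved copy of the kernel log for the failed boot; how you capture that log is up to you.
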
> A workaround does exist. Users can add "nomodeset" or "amd_iommu=off"
> to GRUB_CMDLINE_LINUX_DEFAULT, run update-grub, and reboot.
>
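
The workaround above amounts to a one-line change in the GRUB defaults file, which on Ubuntu and Mint is normally /etc/default/grub. Below is a minimal Python sketch of that text edit. The function name is hypothetical, and it assumes the usual double-quoted form of the GRUB_CMDLINE_LINUX_DEFAULT line; anything else is left untouched.

```python
import re

# Matches the common form: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
_CMDLINE_RE = re.compile(
    r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")[ \t]*$', re.MULTILINE
)


def add_kernel_param(grub_text, param):
    """Return grub_text with param added to GRUB_CMDLINE_LINUX_DEFAULT.

    The parameter is appended only if it is not already present, so
    calling this twice gives the same result as calling it once.
    """

    def _append(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    return _CMDLINE_RE.sub(_append, grub_text)


if __name__ == "__main__":
    example = 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    print(add_kernel_param(example, "amd_iommu=off"), end="")
```

The sketch only transforms text. Writing the result back to the file needs root, and the change does not take effect until "sudo update-grub" has been run and the machine rebooted, as the workaround says.
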
> [Fix]
>
> The regression was caused by the following commit that landed in
> 5.15.0-112-generic, and 5.15.150 upstream:
>
> commit 3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd ubuntu-jammy
> Author: Yifan Zhang <email address hidden>
>
> Date: Tue Sep 28 15:42:35 2021 +0800
> Subject: drm/amdgpu: init iommu after amdkfd device init
> Link: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/commit/?id=3c7e53c0d4b43ffe6e7715414b5f2b3177881ecd
>
> The fix is to revert this patch, as it was not supposed to be
> backported to 5.15 stable.
>
> The mailing list discussion with AMD developers is:
>
> https://<email address hidden>/
>
> The fix hasn't been acknowledged by Greg KH or Sasha Levin yet, so
> sending as an Ubuntu SAUCE patch. If the upstream status changes, we
> can NAK and resend.
>
> [Testcase]
>
> You need a system with an AMD Picasso/Raven 2 device. It will likely
> be an APU, and not a discrete graphics card, but any AMD Picasso/Raven
> 2 device is affected.
>
> Install the kernel and boot. Make sure full modesetting is enabled.
>
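
Since the test case only means something with full modesetting and the IOMMU enabled, it is worth confirming that neither workaround parameter is still on the kernel command line before declaring the fix verified. A minimal Python sketch, with a hypothetical function name, assuming you pass it the contents of /proc/cmdline:

```python
# The two workaround parameters named in the bug description.
WORKAROUND_PARAMS = ("nomodeset", "amd_iommu=off")


def active_workarounds(cmdline):
    """Return the workaround parameters present in a kernel command line."""
    tokens = cmdline.split()
    return [param for param in WORKAROUND_PARAMS if param in tokens]


if __name__ == "__main__":
    example = "root=/dev/sda1 ro quiet splash amd_iommu=off"
    found = active_workarounds(example)
    if found:
        print("workaround still active:", " ".join(found))
    else:
        print("no workaround parameters set")
```

An empty list means the boot was a real test of the new kernel. A non-empty list means the workaround may be hiding the problem, so remove the parameter, run update-grub, and reboot before testing.
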
> There is a test kernel available in the ppa below:
>
> https://launchpad.net/~mruffell/+archive/ubuntu/lp2068738-test
>
> If you install the test kernel, your system should boot successfully.
>
> [Where problems could occur]
>
> We are reverting a problematic patch and going back to how it was
> before 5.15.0-112-generic. This should not cause any issues for users.
>
> If a regression were to occur, users can add "nomodeset" or
> "amd_iommu=off" to GRUB_CMDLINE_LINUX_DEFAULT, run update-grub, and
> reboot, or pin their kernel to a working one.
>
> The impact of a regression would be high, as users' displays could be
> blank.
>
> [Other Info]
>
> User reports:
> https://forums.linuxmint.com/viewtopic.php?t=421484
> https://forums.linuxmint.com/viewtopic.php?t=421441
> https://www.reddit.com/r/Ubuntu/comments/1d9uviz/had_to_purge_kernel_5150112_could_not_boot/
> https://www.reddit.com/r/linuxmint/comments/1d9w6c9/kernel_5150112_boot_failure/
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068735
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068793
> https://bugs.launchpad.net/bugs/2068812
>
> As bizarre as it is, this commit was actually originally included in
> 5.15-rc5:
>
> commit 714d9e4574d54596973ee3b0624ee4a16264d700
> Author: Yifan Zhang <email address hidden>
>
> Date: Tue Sep 28 15:42:35 2021 +0800
> Subject: drm/amdgpu: init iommu after amdkfd device init
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=714d9e4574d54596973ee3b0624ee4a16264d700
>
> It seems to have caused issues back then too, and was removed in the
> following fixups, in 5.16-rc1:
>
> commit 93cec184788b0cf3926bc1f7b47fed74ba87990c
> Author: James Zhu <email address hidden>
>
> Date: Tue Nov 2 21:33:50 2021 -0400
> Subject: drm/amdgpu: remove duplicated kfd_resume_iommu
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=93cec184788b0cf3926bc1f7b47fed74ba87990c
>
> commit 9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d
> Author: shaoyunl <email address hidden>
>
> Date: Fri Nov 5 12:34:14 2021 -0400
> Subject: drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f4f2c1a35248f56b2a9c1c004e0aaff3609b15d
>
> I'm not exactly in favor of rewriting history twice, so I think we
> should just revert the upstream stable patch and move on.
>