amdgpu no-retry page fault resulting in black screen and unresponsive system

Bug #1995956 reported by Kuba Pawlak
This bug affects 33 people
Affects                  Status        Importance  Assigned to  Milestone
Linux                    Fix Released  Unknown
linux (Ubuntu)           Fix Released  High        Unassigned
  Kinetic                Won't Fix     High        Unassigned
linux-hwe-5.19 (Ubuntu)  Won't Fix     High        Unassigned
linux-oem-6.1 (Ubuntu)   Fix Released  High        Unassigned

Bug Description

While using the Skype snap, amdgpu crashed, resulting in a black screen and an unresponsive system.

This happened on Kinetic Kudu with kernel 5.19.0-23-generic, with or without the latest amdgpu firmware.
The affected laptop is a T14 with a Ryzen 5850U.

Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:40 vmid:5 pasid:0, for process pid 0 thread pid 0)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000080010142c000 from IH client 0x12 (VMC)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00540051
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x1
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x5
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: RW: 0x1
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:40 vmid:5 pasid:0, for process pid 0 thread pid 0)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000080010142d000 from IH client 0x12 (VMC)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: RW: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:40 vmid:5 pasid:0, for process pid 0 thread pid 0)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000080010142c000 from IH client 0x12 (VMC)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00540051
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x1
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x5
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: RW: 0x1
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:40 vmid:5 pasid:0, for process pid 0 thread pid 0)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000080010142d000 from IH client 0x12 (VMC)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: RW: 0x0
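
The decoded lines under each VM_L2_PROTECTION_FAULT_STATUS value can be reproduced from the raw register value. The following is a minimal sketch, not the driver's code: the bit positions in `FIELDS` are an assumption inferred from the log lines above (they reproduce MORE_FAULTS, PERMISSION_FAULTS, RW and the vmid for 0x00540051), and should be checked against the amdgpu register headers for the ASIC in question.

```python
# Bit layout ASSUMED from the decoded lines in the log above; verify it
# against the amdgpu register headers before relying on it.
FIELDS = {
    "MORE_FAULTS": (0, 1),        # (shift, width in bits)
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),                # UTCL2 client ID
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status: int) -> dict:
    """Split a VM_L2_PROTECTION_FAULT_STATUS value into its assumed fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Value taken from the first fault in the log above.
    for name, value in decode_fault_status(0x00540051).items():
        print(f"{name}: {value:#x}")
```

Under that assumed layout, a status of 0x00000000 decodes to all-zero fields, which matches the second fault in each pair above.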

This happens in a loop and eventually leads to a GPU reset, which fails.

Nov 03 16:35:55 laptop kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=211509, emitted seq=211512
Nov 03 16:35:55 laptop kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process skypeforlinux pid 154554 thread skypeforli:cs0 pid 154558
Nov 03 16:35:55 laptop kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
Nov 03 16:35:55 laptop kernel: [drm] free PSP TMR buffer
Nov 03 16:35:55 laptop kernel: CPU: 15 PID: 141579 Comm: kworker/u32:1 Tainted: G W 5.19.0-23-generic #24-Ubuntu
Nov 03 16:35:55 laptop kernel: Hardware name: LENOVO 20XK002HPB/20XK002HPB, BIOS R1MET49W (1.19 ) 06/27/2022
Nov 03 16:35:55 laptop kernel: Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
Nov 03 16:35:55 laptop kernel: Call Trace:
Nov 03 16:35:55 laptop kernel: <TASK>
Nov 03 16:35:55 laptop kernel: show_stack+0x4e/0x61
Nov 03 16:35:55 laptop kernel: dump_stack_lvl+0x4a/0x6d
Nov 03 16:35:55 laptop kernel: dump_stack+0x10/0x18
Nov 03 16:35:55 laptop kernel: amdgpu_do_asic_reset+0x2b/0x45e [amdgpu]
Nov 03 16:35:55 laptop kernel: amdgpu_device_gpu_recover_imp.cold+0x748/0x7f0 [amdgpu]
Nov 03 16:35:55 laptop kernel: amdgpu_job_timedout+0x196/0x1d0 [amdgpu]
Nov 03 16:35:55 laptop kernel: ? finish_task_switch.isra.0+0x85/0x290
Nov 03 16:35:55 laptop kernel: drm_sched_job_timedout+0x70/0x120 [gpu_sched]
Nov 03 16:35:55 laptop kernel: process_one_work+0x225/0x400
Nov 03 16:35:55 laptop kernel: worker_thread+0x50/0x3e0
Nov 03 16:35:55 laptop kernel: ? rescuer_thread+0x3c0/0x3c0
Nov 03 16:35:55 laptop kernel: kthread+0xe9/0x110
Nov 03 16:35:55 laptop kernel: ? kthread_complete_and_exit+0x20/0x20
Nov 03 16:35:55 laptop kernel: ret_from_fork+0x22/0x30
Nov 03 16:35:55 laptop kernel: </TASK>
Nov 03 16:35:55 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MODE2 reset
Nov 03 16:35:55 laptop kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
Nov 03 16:35:55 laptop kernel: [drm] PCIE GART of 1024M enabled.
Nov 03 16:35:55 laptop kernel: [drm] PTB located at 0x000000F400900000
Nov 03 16:35:55 laptop kernel: [drm] VRAM is lost due to GPU reset!
Nov 03 16:35:55 laptop kernel: [drm] PSP is resuming...
Nov 03 16:35:55 laptop kernel: [drm] reserve 0x400000 from 0xf43f800000 for PSP TMR
Nov 03 16:35:55 laptop kernel: amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
Nov 03 16:35:55 laptop kernel: amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
Nov 03 16:35:55 laptop kernel: amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Nov 03 16:35:55 laptop kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resuming...
Nov 03 16:35:55 laptop kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resumed successfully!
Nov 03 16:35:55 laptop kernel: [drm] DMUB hardware initialized: version=0x0101001F
Nov 03 16:35:56 laptop kernel: [drm] kiq ring mec 2 pipe 1 q 0
Nov 03 16:35:56 laptop kernel: amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma0 test failed (-110)
Nov 03 16:35:56 laptop kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <sdma_v4_0> failed -110
Nov 03 16:35:56 laptop kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset(1) failed
Nov 03 16:35:56 laptop kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset end with ret = -110
Nov 03 16:35:56 laptop kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -110
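
The -110 in the lines above is the Linux errno ETIMEDOUT, negated in the usual kernel style. As a rough sketch for scanning a saved journal for the outcome of a reset attempt, something like the following could be used; the helper name `last_reset_result` and the small `KERNEL_ERRNO` table are illustrative, and treating `ret = 0` as success is an assumption rather than something shown in this log.

```python
import re

# Linux errno value seen in the log above; extend the table as needed.
KERNEL_ERRNO = {110: "ETIMEDOUT"}

RESET_END_RE = re.compile(r"GPU reset end with ret = (-?\d+)")


def last_reset_result(log_lines):
    """Return (ret, errno_name) from the last 'GPU reset end' line, else None.

    errno_name is None when ret is 0 (treated here as success, an assumption).
    """
    result = None
    for line in log_lines:
        match = RESET_END_RE.search(line)
        if match:
            ret = int(match.group(1))
            name = KERNEL_ERRNO.get(-ret, "unknown") if ret < 0 else None
            result = (ret, name)
    return result


if __name__ == "__main__":
    sample = [
        "Nov 03 16:35:56 laptop kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset(1) failed",
        "Nov 03 16:35:56 laptop kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset end with ret = -110",
    ]
    print(last_reset_result(sample))
```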

After the failed reset, the page faults continue:
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:40 vmid:5 pasid:0, for process pid 0 thread pid 0)
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000080010142c000 from IH client 0x12 (VMC)
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00540051
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x1
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x5
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: RW: 0x1
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:40 vmid:5 pasid:0, for process pid 0 thread pid 0)
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000080010142d000 from IH client 0x12 (VMC)
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00540051

Timo Aaltonen (tjaalton)
Changed in linux (Ubuntu):
status: New → Triaged
Changed in linux (Ubuntu Kinetic):
status: New → Triaged
David R. Hedges (p14nd4) wrote :

Similarly, I was in a video call (via the Chromium snap) and tried to switch to Slack to check messages that had just come in, and the GUI locked up (but audio continued for several minutes).

journald had a ton of these messages:
Nov 18 10:41:11 tippin kernel: amdgpu 0000:05:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:40 vmid:1 pasid:32778, for process slack pid 11822 thread slack:cs0 pid 11863)
Nov 18 10:41:11 tippin kernel: amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x0000800103a30000 from IH client 0x12 (VMC)
Nov 18 10:41:11 tippin kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00140050
Nov 18 10:41:11 tippin kernel: amdgpu 0000:05:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
Nov 18 10:41:11 tippin kernel: amdgpu 0000:05:00.0: amdgpu: MORE_FAULTS: 0x0
Nov 18 10:41:11 tippin kernel: amdgpu 0000:05:00.0: amdgpu: WALKER_ERROR: 0x0
Nov 18 10:41:11 tippin kernel: amdgpu 0000:05:00.0: amdgpu: PERMISSION_FAULTS: 0x5
Nov 18 10:41:11 tippin kernel: amdgpu 0000:05:00.0: amdgpu: MAPPING_ERROR: 0x0
Nov 18 10:41:11 tippin kernel: amdgpu 0000:05:00.0: amdgpu: RW: 0x1
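
When the journal contains many of these messages, it can help to pull out which process each fault was attributed to. The sketch below parses the first line of a fault record with a regular expression written against the lines quoted in this report; the pattern and the `parse_fault` helper are illustrative and may need adjusting for other kernel versions. The process and thread names may be empty, as in the original report.

```python
import re

# Pattern written against the log lines quoted in this report (an assumption
# about the message format, not taken from the driver source).
FAULT_RE = re.compile(
    r"\[(?P<hub>\w+)\] no-retry page fault \(src_id:(?P<src_id>\d+) "
    r"ring:(?P<ring>\d+) vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+), "
    r"for process\s*(?P<process>.*?)\s*pid (?P<pid>\d+)\s+"
    r"thread\s*(?P<thread>.*?)\s*pid (?P<tid>\d+)\)"
)

NUMERIC = ("src_id", "ring", "vmid", "pasid", "pid", "tid")


def parse_fault(line):
    """Return the fields of a 'no-retry page fault' line, or None if no match."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in NUMERIC:
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    line = (
        "Nov 18 10:41:11 tippin kernel: amdgpu 0000:05:00.0: amdgpu: [mmhub0] "
        "no-retry page fault (src_id:0 ring:40 vmid:1 pasid:32778, "
        "for process slack pid 11822 thread slack:cs0 pid 11863)"
    )
    print(parse_fault(line))
```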

I believe this is tracked upstream in https://gitlab.freedesktop.org/drm/amd/-/issues/2113 .

Changed in linux:
status: Unknown → Fix Released
tags: added: amdgpu fixed-upstream kinetic
tags: added: fixed-in-linux-6.1
Reupen Shah (reupen) wrote :

This is also affecting Ubuntu 22.04 now that the 5.19 kernel is rolling out as the HWE kernel. It's quite a severe bug, and I ended up switching to the linux-oem-22.04c kernel instead to get around it.

Changed in linux (Ubuntu):
status: Triaged → Fix Committed
tags: added: jammy
tags: added: regression-release
Allard Pruim (allardp) wrote :

Same issue here. After I got the update to 5.19, my laptop froze whenever I opened or used Spotify. I also had to install the linux-oem-22.04c kernel.

summary: - amdgpu no-retry page fault in Kinetic Kudu
+ amdgpu no-retry page fault resulting in black screen and unresponsive
+ system
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Kinetic):
importance: Undecided → High
Changed in linux-hwe-5.19 (Ubuntu):
status: New → Fix Released
no longer affects: linux-hwe-5.19 (Ubuntu Kinetic)
Daniel van Vugt (vanvugt) wrote :

The fix is in jammy linux-hwe-5.19 and also in kinetic 5.19.0-35.36 ... but still not in lunar until 6.1 is released.

Changed in linux (Ubuntu Kinetic):
status: Triaged → Fix Released
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Jaromir Obr (jaromir-obr) wrote :

I just tried kernel 5.19.0-35-generic #36 on Ubuntu 22.10 (kinetic), which should contain the fix.
The issue still occurred with Slack (a snap app) a few times when I suspended my notebook (Yoga Slim 7 14are05):

/var/log/syslog
---------------
Mar 5 20:33:27 rzbox ModemManager[1095]: <info> [sleep-monitor-systemd] system is about to suspend
Mar 5 20:33:27 rzbox NetworkManager[1078]: <info> [1678044807.6532] manager: sleep: sleep requested (sleeping: no enabled: yes)
Mar 5 20:33:27 rzbox NetworkManager[1078]: <info> [1678044807.6533] device (eno1): state change: unavailable -> unmanaged (reason 'sleeping', sys-iface-state: 'managed')
Mar 5 20:33:27 rzbox kernel: [ 2079.870212] audit: type=1107 audit(1678044807.651:222): pid=998 uid=102 auid=4294967295 ses=4294967295 subj=unconfined msg='apparmor="DENIED" operation="dbus_signal" bus="system" path="/org/freedesktop/login1" interface="org.freedesktop.login1.Manager" member="PrepareForSleep" name=":1.4" mask="receive" pid=6112 label="snap.slack.slack" peer_pid=1033 peer_label="unconfined"
Mar 5 20:33:27 rzbox kernel: [ 2079.870212] exe="/usr/bin/dbus-daemon" sauid=102 hostname=? addr=? terminal=?'
Mar 5 20:33:27 rzbox google-chrome.desktop[4369]: [4363:4392:0305/203327.655811:ERROR:connection_factory_impl.cc(472)] ConnectionHandler failed with net error: -2
Mar 5 20:33:27 rzbox NetworkManager[1078]: <info> [1678044807.6618] device (enp1s0): state change: unavailable -> unmanaged (reason 'sleeping', sys-iface-state: 'managed')
Mar 5 20:33:27 rzbox NetworkManager[1078]: <info> [1678044807.6721] manager: NetworkManager state is now ASLEEP
Mar 5 20:33:27 rzbox NetworkManager[1078]: <info> [1678044807.6724] device (wlp3s0): state change: activated -> deactivating (reason 'sleeping', sys-iface-state: 'managed')
Mar 5 20:33:27 rzbox dbus-daemon[998]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.17' (uid=0 pid=1078 comm="/usr/sbin/NetworkManager --no-daemon" label="unconfined")
Mar 5 20:33:27 rzbox systemd[1]: Starting Network Manager Script Dispatcher Service...
Mar 5 20:33:27 rzbox dbus-daemon[998]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Mar 5 20:33:27 rzbox systemd[1]: Started Network Manager Script Dispatcher Service.
Mar 5 20:33:27 rzbox kernel: [ 2079.933073] wlp3s0: deauthenticating from d4:6e:0e:d9:d5:e7 by local choice (Reason: 3=DEAUTH_LEAVING)
Mar 5 20:33:27 rzbox google-chrome.desktop[4369]: [4363:4392:0305/203327.718074:ERROR:connection_factory_impl.cc(427)] Failed to connect to MCS endpoint with error -105
Mar 5 20:33:27 rzbox gsd-media-keys[3663]: Unable to get default sink
Mar 5 20:33:27 rzbox update-notifier[5690]: gtk_widget_get_scale_factor: assertion 'GTK_IS_WIDGET (widget)' failed
Mar 5 20:33:27 rzbox wpa_supplicant[1079]: wlp3s0: CTRL-EVENT-DISCONNECTED bssid=d4:6e:0e:d9:d5:e7 reason=3 locally_generated=1
Mar 5 20:33:27 rzbox wpa_supplicant[1079]: wlp3s0: CTRL-EVENT-DSCP-POLICY clear_all
Mar 5 20:33:27 rzbox wpa_supplicant[1079]: wlp3s0: CTRL-EVENT-REGDOM-CHANGE init=CORE type=WORLD
Mar 5 20:33:27 rzbox NetworkManager[1078]: <inf...

Alejandro LC (alcales) wrote :

I get this error every time I try to open Spotify.

Laptop: Lenovo ThinkPad L14 Gen 2a | AMD® Ryzen 5 5600u with radeon graphics × 12

Ubuntu: Ubuntu 22.04.2 LTS | 5.19.0-35

Daniel van Vugt (vanvugt) wrote :
Changed in linux-hwe-5.19 (Ubuntu):
status: Fix Released → Triaged
importance: Undecided → High
Changed in linux (Ubuntu Kinetic):
status: Fix Released → Triaged
Daniel van Vugt (vanvugt) wrote :

Ubuntu 22.04 users can install this kernel to get the fix:

  sudo apt install linux-oem-22.04c

Changed in linux-oem-6.1 (Ubuntu):
status: New → Fix Released
importance: Undecided → High
Attila Glück (attila-gluck) wrote :

Same problem on 5.19.0-35-generic with Slack. Unfortunately, the v6 kernel has other critical bugs (SMB/CIFS).

Changed in linux (Ubuntu Kinetic):
status: Triaged → Confirmed
Changed in linux-hwe-5.19 (Ubuntu):
status: Triaged → Confirmed
Changed in linux (Ubuntu Kinetic):
status: Confirmed → Triaged
Changed in linux-hwe-5.19 (Ubuntu):
status: Confirmed → Triaged
Jaromir Obr (jaromir-obr) wrote :

Is there any chance of getting a fix in Ubuntu 22.10 (with kernel 5.19)?

tags: added: rls-kk-incoming
Pablo (pyabo) wrote :

I am facing this issue. Is there a workaround to avoid it? I have 4 computers crashing every day.

Jaromir Obr (jaromir-obr) wrote :

@pyabo I solved it by installing kernel 6.2 with https://github.com/pimlie/ubuntu-mainline-kernel.sh (I'm using Ubuntu 22.10). So far no other issues have appeared.

sir phobos (sir-phobos) wrote :

Very much waiting for the fix to be ported to 5.19 for Ubuntu 22.04.
Keeping 5.15 for now.

Timon Z. (timonz) wrote :

Please fix this issue as soon as possible. I cannot work like this. My system freezes during meetings.

cat /proc/version
Linux version 5.19.0-50-generic (buildd@lcy02-amd64-030) (x86_64-linux-gnu-gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #50-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 10 18:24:29 UTC 2023
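
Since the reports in this thread are all against 5.19 kernels, a quick way to see which series a machine is running is to parse `/proc/version`. The sketch below is only illustrative: the helper names are made up for this example, and the kernel series alone does not say whether a particular build carries the fix.

```python
import re


def kernel_release(proc_version: str):
    """Extract the release string (e.g. '5.19.0-50-generic') or return None."""
    match = re.match(r"Linux version (\S+)", proc_version)
    return match.group(1) if match else None


def kernel_series(release: str):
    """Return (major, minor) from a release string such as '5.19.0-50-generic'."""
    major, minor = release.split(".")[:2]
    return int(major), int(minor)


if __name__ == "__main__":
    with open("/proc/version") as handle:
        release = kernel_release(handle.read())
    if release is None:
        print("could not parse /proc/version")
    else:
        series = kernel_series(release)
        note = "series reported in this bug" if series == (5, 19) else "other series"
        print(f"{release}: {note}")
```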

Timo Aaltonen (tjaalton) wrote :

hwe-5.19 will be replaced by hwe-6.2 soon, and kinetic (and the 5.19 kernel with it) is EOL.

Changed in linux-hwe-5.19 (Ubuntu):
status: Triaged → Won't Fix
Changed in linux (Ubuntu Kinetic):
status: Triaged → Won't Fix