amdgpu no-retry page fault resulting in black screen and unresponsive system
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Linux | Fix Released | Unknown | |
linux (Ubuntu) | Fix Released | High | Unassigned |
Kinetic | Won't Fix | High | Unassigned |
linux-hwe-5.19 (Ubuntu) | Won't Fix | High | Unassigned |
linux-oem-6.1 (Ubuntu) | Fix Released | High | Unassigned |
Bug Description
While using Skype (installed as a snap), amdgpu crashed, leaving a black screen and an unresponsive system.
This happened on Kinetic Kudu with kernel 5.19.0-23-generic, both with and without the latest amdgpu firmware.
The affected laptop is a T14 with a Ryzen 5850U.
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:40 vmid:5 pasid:0, for process pid 0 thread pid 0)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000080010142c000 from IH client 0x12 (VMC)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTIO
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x1
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x5
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: RW: 0x1
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:40 vmid:5 pasid:0, for process pid 0 thread pid 0)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000080010142d000 from IH client 0x12 (VMC)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTIO
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: RW: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:40 vmid:5 pasid:0, for process pid 0 thread pid 0)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000080010142c000 from IH client 0x12 (VMC)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTIO
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x1
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x5
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: RW: 0x1
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:40 vmid:5 pasid:0, for process pid 0 thread pid 0)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000080010142d000 from IH client 0x12 (VMC)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTIO
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: RW: 0x0
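To see which pages keep faulting, the journal output can be summarized per address. The following is a minimal sketch, not part of the original report: the function name `count_faults_by_page` and the `SAMPLE` text are made up for illustration, and the regular expression assumes the "in page starting at address" wording shown in the log above.

```python
import re
from collections import Counter

# Matches the "in page starting at address 0x..." line of each fault record.
ADDRESS_RE = re.compile(r"in page starting at address (0x[0-9a-fA-F]+)")


def count_faults_by_page(log_text: str) -> Counter:
    """Count how often each faulting page address appears in the log text."""
    return Counter(match.group(1) for match in ADDRESS_RE.finditer(log_text))


# Four address lines copied from the excerpt above (hypothetical test input).
SAMPLE = """\
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000080010142c000 from IH client 0x12 (VMC)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000080010142d000 from IH client 0x12 (VMC)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000080010142c000 from IH client 0x12 (VMC)
Nov 03 16:35:44 laptop kernel: amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000080010142d000 from IH client 0x12 (VMC)
"""

if __name__ == "__main__":
    for address, hits in count_faults_by_page(SAMPLE).most_common():
        print(address, hits)
```

Run against the full journal, a summary like this would show whether the faults stay confined to the same two adjacent pages, as they do in the excerpt.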
This repeats in a loop and eventually triggers a GPU reset, which fails:
Nov 03 16:35:55 laptop kernel: [drm:amdgpu_
Nov 03 16:35:55 laptop kernel: [drm:amdgpu_
Nov 03 16:35:55 laptop kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
Nov 03 16:35:55 laptop kernel: [drm] free PSP TMR buffer
Nov 03 16:35:55 laptop kernel: CPU: 15 PID: 141579 Comm: kworker/u32:1 Tainted: G W 5.19.0-23-generic #24-Ubuntu
Nov 03 16:35:55 laptop kernel: Hardware name: LENOVO 20XK002HPB/
Nov 03 16:35:55 laptop kernel: Workqueue: amdgpu-reset-dev drm_sched_
Nov 03 16:35:55 laptop kernel: Call Trace:
Nov 03 16:35:55 laptop kernel: <TASK>
Nov 03 16:35:55 laptop kernel: show_stack+
Nov 03 16:35:55 laptop kernel: dump_stack_
Nov 03 16:35:55 laptop kernel: dump_stack+
Nov 03 16:35:55 laptop kernel: amdgpu_
Nov 03 16:35:55 laptop kernel: amdgpu_
Nov 03 16:35:55 laptop kernel: amdgpu_
Nov 03 16:35:55 laptop kernel: ? finish_
Nov 03 16:35:55 laptop kernel: drm_sched_
Nov 03 16:35:55 laptop kernel: process_
Nov 03 16:35:55 laptop kernel: worker_
Nov 03 16:35:55 laptop kernel: ? rescuer_
Nov 03 16:35:55 laptop kernel: kthread+0xe9/0x110
Nov 03 16:35:55 laptop kernel: ? kthread_
Nov 03 16:35:55 laptop kernel: ret_from_
Nov 03 16:35:55 laptop kernel: </TASK>
Nov 03 16:35:55 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MODE2 reset
Nov 03 16:35:55 laptop kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
Nov 03 16:35:55 laptop kernel: [drm] PCIE GART of 1024M enabled.
Nov 03 16:35:55 laptop kernel: [drm] PTB located at 0x000000F400900000
Nov 03 16:35:55 laptop kernel: [drm] VRAM is lost due to GPU reset!
Nov 03 16:35:55 laptop kernel: [drm] PSP is resuming...
Nov 03 16:35:55 laptop kernel: [drm] reserve 0x400000 from 0xf43f800000 for PSP TMR
Nov 03 16:35:55 laptop kernel: amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
Nov 03 16:35:55 laptop kernel: amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
Nov 03 16:35:55 laptop kernel: amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Nov 03 16:35:55 laptop kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resuming...
Nov 03 16:35:55 laptop kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resumed successfully!
Nov 03 16:35:55 laptop kernel: [drm] DMUB hardware initialized: version=0x0101001F
Nov 03 16:35:56 laptop kernel: [drm] kiq ring mec 2 pipe 1 q 0
Nov 03 16:35:56 laptop kernel: amdgpu 0000:07:00.0: [drm:amdgpu_
Nov 03 16:35:56 laptop kernel: [drm:amdgpu_
Nov 03 16:35:56 laptop kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset(1) failed
Nov 03 16:35:56 laptop kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset end with ret = -110
Nov 03 16:35:56 laptop kernel: [drm:amdgpu_
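The reset result in the log is a negative errno value; on Linux, 110 is ETIMEDOUT, so "ret = -110" means the resume timed out. A small sketch for pulling that out of a journal line follows. The helper name `reset_result` and the one-entry errno table are illustrative assumptions, not something taken from the driver.

```python
import re

# Matches the final status line printed at the end of a GPU reset attempt.
RESET_END_RE = re.compile(r"GPU reset end with ret = (-?\d+)")

# Linux errno numbers (generic numbering); only the value seen in this
# report is listed, extend as needed.
LINUX_ERRNO_NAMES = {110: "ETIMEDOUT"}


def reset_result(line: str):
    """Return (ret, description) for a reset-end line, or None if absent."""
    match = RESET_END_RE.search(line)
    if match is None:
        return None
    ret = int(match.group(1))
    if ret == 0:
        return (0, "success")
    # The kernel reports errors as negated errno values.
    return (ret, LINUX_ERRNO_NAMES.get(-ret, "unknown"))


if __name__ == "__main__":
    line = ("Nov 03 16:35:56 laptop kernel: amdgpu 0000:07:00.0: "
            "amdgpu: GPU reset end with ret = -110")
    print(reset_result(line))
```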
After the failed reset, the page faults continue:
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:40 vmid:5 pasid:0, for process pid 0 thread pid 0)
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000080010142c000 from IH client 0x12 (VMC)
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTIO
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MORE_FAULTS: 0x1
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: WALKER_ERROR: 0x0
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: PERMISSION_FAULTS: 0x5
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: MAPPING_ERROR: 0x0
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: RW: 0x1
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:40 vmid:5 pasid:0, for process pid 0 thread pid 0)
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: in page starting at address 0x000080010142d000 from IH client 0x12 (VMC)
Nov 03 16:35:59 laptop kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTIO
Changed in linux (Ubuntu):
status: New → Triaged
Changed in linux (Ubuntu Kinetic):
status: New → Triaged
Changed in linux:
status: Unknown → Fix Released
tags: added: amdgpu fixed-upstream kinetic
tags: added: fixed-in-linux-6.1
Changed in linux (Ubuntu):
status: Triaged → Fix Committed
tags: added: jammy
tags: added: regression-release
summary:
- amdgpu no-retry page fault in Kinetic Kudu
+ amdgpu no-retry page fault resulting in black screen and unresponsive system
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Kinetic):
importance: Undecided → High
Changed in linux-hwe-5.19 (Ubuntu):
status: New → Fix Released
no longer affects: linux-hwe-5.19 (Ubuntu Kinetic)
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu Kinetic):
status: Triaged → Confirmed
Changed in linux-hwe-5.19 (Ubuntu):
status: Triaged → Confirmed
Changed in linux (Ubuntu Kinetic):
status: Confirmed → Triaged
Changed in linux-hwe-5.19 (Ubuntu):
status: Confirmed → Triaged
tags: added: rls-kk-incoming
Similarly, I was in a video call (via the Chromium snap), tried to switch to Slack to check messages that had just come in, and the GUI locked up (but audio continued for several minutes).
journald had a ton of these messages:
Nov 18 10:41:11 tippin kernel: amdgpu 0000:05:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:40 vmid:1 pasid:32778, for process slack pid 11822 thread slack:cs0 pid 11863)
Nov 18 10:41:11 tippin kernel: amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x0000800103a30000 from IH client 0x12 (VMC)
Nov 18 10:41:11 tippin kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00140050
Nov 18 10:41:11 tippin kernel: amdgpu 0000:05:00.0: amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
Nov 18 10:41:11 tippin kernel: amdgpu 0000:05:00.0: amdgpu: MORE_FAULTS: 0x0
Nov 18 10:41:11 tippin kernel: amdgpu 0000:05:00.0: amdgpu: WALKER_ERROR: 0x0
Nov 18 10:41:11 tippin kernel: amdgpu 0000:05:00.0: amdgpu: PERMISSION_FAULTS: 0x5
Nov 18 10:41:11 tippin kernel: amdgpu 0000:05:00.0: amdgpu: MAPPING_ERROR: 0x0
Nov 18 10:41:11 tippin kernel: amdgpu 0000:05:00.0: amdgpu: RW: 0x1
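The fault status value 0x00140050 quoted in this comment can be decoded by hand to cross-check the fields the driver prints. The sketch below assumes a bit layout for VM_L2_PROTECTION_FAULT_STATUS recalled from the amdgpu register headers for this GPU generation. Treat the `FIELDS` table as an assumption that the report itself does not confirm, although the decoded values do agree with the MORE_FAULTS, PERMISSION_FAULTS, RW, client ID and vmid values in the log lines above.

```python
# Assumed layout of VM_L2_PROTECTION_FAULT_STATUS as (name, shift, width).
# This table is an assumption for illustration, not taken from the report.
FIELDS = [
    ("MORE_FAULTS", 0, 1),
    ("WALKER_ERROR", 1, 3),
    ("PERMISSION_FAULTS", 4, 4),
    ("MAPPING_ERROR", 8, 1),
    ("CID", 9, 9),
    ("RW", 18, 1),
    ("VMID", 20, 4),
]


def decode_fault_status(status: int) -> dict:
    """Split a raw status word into named bitfields using FIELDS."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, shift, width in FIELDS
    }


if __name__ == "__main__":
    for name, value in decode_fault_status(0x00140050).items():
        print(f"{name}: {value:#x}")
```

With this layout, RW = 1 would indicate a write access and PERMISSION_FAULTS = 0x5 a permission problem on an otherwise mapped page (MAPPING_ERROR = 0), which matches what the driver prints.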
I believe this is tracked upstream in https://gitlab.freedesktop.org/drm/amd/-/issues/2113.