Bug #2039868 “amdgpu reset during usage of firefox” : Bugs : mesa package : Ubuntu

Revision history for this message

In Linux Kernel Bug Tracker #201957, felix.adrianto (felix.adrianto-linux-kernel-bugs) wrote on 2018-12-11:

#5

Error message:
[Dec 5 22:08] amdgpu 0000:23:00.0: GPU fault detected: 146 0x0000480c for process yuzu pid 2920 thread yuzu:cs0 pid 2935
[ +0.000005] amdgpu 0000:23:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ +0.000002] amdgpu 0000:23:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0604800C
[ +0.000003] amdgpu 0000:23:00.0: VM fault (0x0c, vmid 3, pasid 32770) at page 0, read from 'TC4' (0x54433400) (72)
[ +10.053011] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=37241, emitted seq=37244
[ +0.000007] [drm] GPU recovery disabled.
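
As a reading aid for the VM fault line above: the value in parentheses after 'TC4' appears to be the same tag packed as four ASCII bytes. A minimal sketch, assuming the 32-bit value is big-endian ASCII padded with NUL bytes (the helper name `decode_client_tag` is hypothetical and not part of any driver tooling):

```python
def decode_client_tag(raw: int) -> str:
    """Interpret a 32-bit client ID as big-endian ASCII, dropping NUL padding.

    Assumption: the value printed in the fault line is a four-character
    ASCII tag; this is inferred from the log line itself.
    """
    return raw.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")


# Value copied from the "VM fault" line in the log above.
print(decode_client_tag(0x54433400))  # TC4
```

The decoded string matches the quoted name in the log, so the hex value adds no information beyond the tag itself.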

How to reproduce the issue:
1. Playing with yuzu-emulator
2. Load Super Mario Odyssey
3. Start new game
4. When Mario is about to jump for the first time after being woken up by Cappy, the bug always occurs.

During the issue, the following occurred:
1. Graphics locked up.
2. The system can still be accessed through SSH.

System specification:
Debian Sid
Radeon RX 580

I have tried the following combinations:
1. Kernel 4.17, 4.18, 4.19, 4.20, drm-next-4.21.wip
2. Mesa 18.2, 18.3, 19.0-development branch

But none of the above combinations fixes the issue. Let me know if you need more information or more testing from me.

In Linux Kernel Bug Tracker #201957, alexdeucher (alexdeucher-linux-kernel-bugs) wrote on 2018-12-11:

#6

This is more likely a mesa issue than a kernel issue.

In Linux Kernel Bug Tracker #201957, felix.adrianto (felix.adrianto-linux-kernel-bugs) wrote on 2018-12-11:

#7

I will try to test with amdgpu-pro sometime this week, using the kernels I mentioned above. If the application works as expected, it could be a Mesa OpenGL bug.

In Linux Kernel Bug Tracker #201957, anode.dev (anode.dev-linux-kernel-bugs) wrote on 2019-03-07:

#8

(In reply to Alex Deucher from comment #1)
> This is more likely a mesa issue than a kernel issue.

No, the 4.14 kernel with the latest Mesa libraries works very well without any freezes.
But from 4.20.4 onward, in all later kernels (including 5.0), the OS freezes every 30 s to 1 min, for about 30 s each time, when browsing YouTube with hardware acceleration enabled (UVD) or playing a game. RX 550, Arch, vanilla kernel.

[  365.021164] amdgpu: [powerplay] 
                last message was failed ret is 0
[  365.045198] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[  365.570667] amdgpu: [powerplay] 
                failed to send message 133 ret is 0 
[  366.115228] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=9365, emitted seq=9365
[  366.115377] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[  366.115388] [drm] Timeout, but no hardware hang detected.
[  366.689407] amdgpu: [powerplay] 
                last message was failed ret is 0
[  367.232287] amdgpu: [powerplay] 
                failed to send message 306 ret is 0 
[  367.787043] amdgpu: [powerplay] 
                last message was failed ret is 0
[  368.320138] amdgpu: [powerplay] 
                failed to send message 5e ret is 0 
[  369.367739] amdgpu: [powerplay] 
                last message was failed ret is 0
[  369.907559] amdgpu: [powerplay] 
                failed to send message 145 ret is 0 
[  370.994478] amdgpu: [powerplay] 
                last message was failed ret is 0
[  371.538753] amdgpu: [powerplay] 
                failed to send message 146 ret is 0 
[  372.075079] amdgpu: [powerplay] 
                last message was failed ret is 0
[  372.598565] amdgpu: [powerplay] 
                failed to send message 148 ret is 0 
[  373.657188] amdgpu: [powerplay] 
                last message was failed ret is 0
[  374.198637] amdgpu: [powerplay] 
                failed to send message 145 ret is 0 
[  375.075076] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[  375.284948] amdgpu: [powerplay] 
                last message was failed ret is 0
[  375.830347] amdgpu: [powerplay] 
                failed to send message 146 ret is 0 
[  376.138428] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=10113, emitted seq=10113
[  376.138783] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[  376.138797] [drm] IP block:sdma_v3_0 is hung!
[  376.138809] [drm] GPU recovery disabled.
[  376.394657] amdgpu: [powerplay] 
                last message was failed ret is 0
[  376.934375] amdgpu: [powerplay] 
                failed to send message 16a ret is 0 
[  377.463230] amdgpu: [powerplay] 
                last message was failed ret is 0
[  377.977725] amdgpu: [powerplay] 
                failed to send message 186 ret is 0 
[  378.518406] amdgpu: [powerplay] 
                last message was failed ret is 0
[  379.060098] amdgpu: [powerplay] 
                failed to send message 54 ret is 0 
[  379.556880] amdgpu: [powerplay] 
                last message was failed ret is 0
[  380.075217] amdgpu: [powerplay] 
                failed to send message 26b ret is 0 
[  380.605976] amdgpu: [powerplay] 
                last message was failed ret is 0
[  381.134301] amdgpu: [powerplay] 
                failed to send message 13d ret is 0 
[  381.657486] amdgpu: [powerplay] 
                last message was failed ret is 0
[  382.204551] amdgpu: [powerplay] 
                failed to send message 14f ret is 0 
[  382.741827] amdgpu: [powerplay] 
                last message was failed ret is 0
[  383.281165] amdgpu: [powerplay] 
                failed to send message 151 ret is 0 
[  383.824923] amdgpu: [powerplay] 
                last message was failed ret is 0
[  384.362266] amdgpu: [powerplay] 
                failed to send message 135 ret is 0 
[  384.903686] amdgpu: [powerplay] 
                last message was failed ret is 0
[  385.101515] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[  385.461515] amdgpu: [powerplay] 
                failed to send message 190 ret is 0 
[  386.014015] amdgpu: [powerplay] 
                last message was failed ret is 0
[  386.164818] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=10761, emitted seq=10761
[  386.164970] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[  386.164985] [drm] Timeout, but no hardware hang detected.

In Linux Kernel Bug Tracker #201957, alexdeucher (alexdeucher-linux-kernel-bugs) wrote on 2019-03-07:

#9

Can you bisect?

In Linux Kernel Bug Tracker #201957, kernel (kernel-linux-kernel-bugs) wrote on 2019-03-12:

#10

I'm having a very similar issue, running Linux Mint 19.1. The issue has persisted since at least 4.15; I'm currently running 5.0.1 and the issue remains.

Here is the latest syslog of the error:

[37258.615599] gmc_v9_0_process_interrupt: 10 callbacks suppressed
[37258.615608] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615615] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107805000 from 27
[37258.615619] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
[37258.615629] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615633] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107807000 from 27
[37258.615636] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615645] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615648] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107801000 from 27
[37258.615651] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615660] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615663] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107803000 from 27
[37258.615666] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615675] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615678] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107809000 from 27
[37258.615681] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615689] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615692] amdgpu 0000:06:00.0:   in page starting at address 0x000080010780b000 from 27
[37258.615695] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615704] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615707] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107805000 from 27
[37258.615710] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615740] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615743] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107807000 from 27
[37258.615746] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615756] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615759] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107801000 from 27
[37258.615762] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615771] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615774] amdgpu 0000:06:00.0:   in page starting at address 0x0000800107803000 from 27
[37258.615777] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37268.712339] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37268.712387] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37268.712389] [drm] GPU recovery disabled.
[37278.952537] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37278.952624] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37278.952628] [drm] GPU recovery disabled.
[37289.192390] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37289.192478] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37289.192481] [drm] GPU recovery disabled.
[37299.432447] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37299.432534] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37299.432538] [drm] GPU recovery disabled.
[37309.676431] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37309.676518] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37309.676522] [drm] GPU recovery disabled.
[37319.912444] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37319.912536] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37319.912541] [drm] GPU recovery disabled.
[37330.156619] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37330.156706] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37330.156710] [drm] GPU recovery disabled.
[37340.392424] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37340.392511] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37340.392515] [drm] GPU recovery disabled.
[37350.632424] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37350.632511] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37350.632514] [drm] GPU recovery disabled.
[37360.872417] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37360.872508] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37360.872511] [drm] GPU recovery disabled.
[37371.112436] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37371.112523] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37371.112527] [drm] GPU recovery disabled.
[37381.352427] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37381.352514] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37381.352517] [drm] GPU recovery disabled.
[37391.592410] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37391.592497] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37391.592500] [drm] GPU recovery disabled.
[37401.836426] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37401.836513] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37401.836517] [drm] GPU recovery disabled.
[37412.072433] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37412.072520] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37412.072524] [drm] GPU recovery disabled.
[37422.312442] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37422.312528] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37422.312532] [drm] GPU recovery disabled.
[37432.552428] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37432.552515] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37432.552519] [drm] GPU recovery disabled.
[37442.792418] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37442.792506] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37442.792510] [drm] GPU recovery disabled.
[37453.032397] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37453.032483] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37453.032487] [drm] GPU recovery disabled.
[37463.272534] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37463.272621] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37463.272624] [drm] GPU recovery disabled.
[37473.512589] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37473.512676] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37473.512680] [drm] GPU recovery disabled.
[37483.752954] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37483.753041] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37483.753044] [drm] GPU recovery disabled.
[37493.992566] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478
[37493.992654] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1287 thread Xorg:cs0 pid 1317
[37493.992657] [drm] GPU recovery disabled.
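
The repeated timeout entries above are easier to read when collapsed. The following is a minimal sketch, assuming the `ring ... timeout, signaled seq=..., emitted seq=...` wording seen in this log; the function name `summarize_timeouts` is hypothetical, and treating `emitted - signaled` as the number of submitted-but-unfinished fences is an interpretation rather than something the log states:

```python
import re
from collections import Counter

# Matches the amdgpu_job_timedout lines as they appear in the log above.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def summarize_timeouts(lines):
    """Count timeout reports per (ring, signaled seq, emitted seq) triple."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            key = (match["ring"], int(match["signaled"]), int(match["emitted"]))
            counts[key] += 1
    return counts


# Three lines copied from the log above.
sample = [
    "[37268.712339] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478",
    "[37268.712389] [drm] GPU recovery disabled.",
    "[37278.952537] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=602475, emitted seq=602478",
]

for (ring, signaled, emitted), n in summarize_timeouts(sample).items():
    print(f"{ring}: {n} timeouts, {emitted - signaled} fences outstanding")
```

Run over the full log, every timeout carries the same pair of sequence numbers (602475 and 602478), which suggests the gfx ring makes no progress between reports while GPU recovery is disabled.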

During this time the laptop continues to operate (it plays music and I can SSH in), but the display and all input (keyboard and mouse) do not respond. The caps lock light, for example, does not toggle. The only way to recover is a forced reboot by holding the power button.

I'm unable to provide steps to reproduce, as the issue happens at completely random times, whether performing different tasks or leaving the machine idle.

System specs:
Lenovo ThinkPad A485
AMD Ryzen 7 PRO 2700U with Radeon Vega Mobile Gfx
Linux Mint 19.1
Kernel 5.0.1 (installed via ukuu)

In Linux Kernel Bug Tracker #201957, anode.dev (anode.dev-linux-kernel-bugs) wrote on 2019-04-01:

#11

Tried linux-amd-staging-drm-next-git-5.1.811103.2acb851ad43b and dmesg still has a lot of warnings. Also tested YouTube in Chrome with UVD and got a minor freeze and a long (~30 s) system freeze.

Apr 01 21:01:03 kernel: amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring uvd_enc0 test failed (-110)
Apr 01 21:01:03 kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <uvd_v6_0> failed -110
Apr 01 21:01:03 kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-110).

Apr 01 20:26:59 kernel: [drm] amdgpu kernel modesetting enabled.
Apr 01 20:26:59 kernel: vga_switcheroo: detected switching method \_SB_.PCI0.VGA_.ATPX handle
Apr 01 20:26:59 kernel: [drm] initializing kernel modesetting (CARRIZO 0x1002:0x9874 0x1025:0x1201 0xCA).
Apr 01 20:26:59 kernel: [drm] register mmio base: 0xD1500000
Apr 01 20:26:59 kernel: [drm] register mmio size: 262144
Apr 01 20:26:59 kernel: [drm] add ip block number 0 <vi_common>
Apr 01 20:26:59 kernel: [drm] add ip block number 1 <gmc_v8_0>
Apr 01 20:26:59 kernel: [drm] add ip block number 2 <cz_ih>
Apr 01 20:26:59 kernel: [drm] add ip block number 3 <gfx_v8_0>
Apr 01 20:26:59 kernel: [drm] add ip block number 4 <sdma_v3_0>
Apr 01 20:26:59 kernel: [drm] add ip block number 5 <powerplay>
Apr 01 20:26:59 kernel: [drm] add ip block number 6 <dm>
Apr 01 20:26:59 kernel: [drm] add ip block number 7 <uvd_v6_0>
Apr 01 20:26:59 kernel: [drm] add ip block number 8 <vce_v3_0>
Apr 01 20:26:59 kernel: [drm] add ip block number 9 <acp_ip>
Apr 01 20:26:59 kernel: [drm] UVD is enabled in physical mode
Apr 01 20:26:59 kernel: [drm] VCE enabled in physical mode
Apr 01 20:26:59 kernel: ATOM BIOS: 113-C91400-007
Apr 01 20:26:59 kernel: [drm] RAS INFO: ras initialized successfully, hardware ability[0] ras_mask[0]
Apr 01 20:26:59 kernel: [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
Apr 01 20:26:59 kernel: amdgpu 0000:00:01.0: VRAM: 512M 0x000000F400000000 - 0x000000F41FFFFFFF (512M used)
Apr 01 20:26:59 kernel: amdgpu 0000:00:01.0: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
Apr 01 20:26:59 kernel: [drm] Detected VRAM RAM=512M, BAR=512M
Apr 01 20:26:59 kernel: [drm] RAM width 64bits UNKNOWN
Apr 01 20:26:59 kernel: [TTM] Zone  kernel: Available graphics memory: 3804974 KiB
Apr 01 20:26:59 kernel: [TTM] Zone   dma32: Available graphics memory: 2097152 KiB
Apr 01 20:26:59 kernel: [TTM] Initializing pool allocator
Apr 01 20:26:59 kernel: [TTM] Initializing DMA pool allocator
Apr 01 20:26:59 kernel: [drm] amdgpu: 512M of VRAM memory ready
Apr 01 20:26:59 kernel: [drm] amdgpu: 3072M of GTT memory ready.
Apr 01 20:26:59 kernel: [drm] GART: num cpu pages 262144, num gpu pages 262144
Apr 01 20:26:59 kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F4007E9000).
Apr 01 20:26:59 kernel: [drm] Found UVD firmware Version: 1.91 Family ID: 11
Apr 01 20:26:59 kernel: [drm] UVD ENC is disabled
Apr 01 20:26:59 kernel: [drm] Found VCE firmware Version: 52.4 Binary ID: 3
Apr 01 20:26:59 kernel: smu version 27.17.00
Apr 01 20:26:59 kernel: [drm] DM_PPLIB: values for Engine clock
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         300000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         480000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         533340
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         576000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         626090
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         685720
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         720000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         757900
Apr 01 20:26:59 kernel: [drm] DM_PPLIB: Validation clocks:
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:    engine_max_clock: 75790
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:    memory_max_clock: 93300
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:    level           : 8
Apr 01 20:26:59 kernel: [drm] DM_PPLIB: values for Display clock
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         300000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         400000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         496560
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         626090
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         685720
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         757900
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         800000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         847060
Apr 01 20:26:59 kernel: [drm] DM_PPLIB: Validation clocks:
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:    engine_max_clock: 75790
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:    memory_max_clock: 93300
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:    level           : 8
Apr 01 20:26:59 kernel: [drm] DM_PPLIB: values for Memory clock
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         667000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         933000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB: Validation clocks:
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:    engine_max_clock: 75790
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:    memory_max_clock: 93300
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:    level           : 8
Apr 01 20:26:59 kernel: [drm:construct [amdgpu]] *ERROR* construct: Invalid Connector ObjectID from Adapter Service for connector index:2! type 0 expected 3
Apr 01 20:26:59 kernel: [drm] Display Core initialized with v3.2.24!
Apr 01 20:26:59 kernel: [drm] SADs count is: -2, don't need to read it
Apr 01 20:26:59 kernel: [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
Apr 01 20:26:59 kernel: [drm] Driver supports precise vblank timestamp query.

Apr 01 20:26:59 kernel: [drm] UVD initialized successfully.
Apr 01 20:26:59 kernel: [drm] VCE initialized successfully.
Apr 01 20:26:59 kernel: kfd kfd: Allocated 3969056 bytes on gart
Apr 01 20:26:59 kernel: Topology: Add APU node [0x9874:0x1002]
Apr 01 20:26:59 kernel: kfd kfd: added device 1002:9874
Apr 01 20:26:59 kernel: [drm] fb mappable at 0x21FDCD000
Apr 01 20:26:59 kernel: [drm] vram apper at 0x21F000000
Apr 01 20:26:59 kernel: [drm] size 8294400
Apr 01 20:26:59 kernel: [drm] fb depth is 24
Apr 01 20:26:59 kernel: [drm]    pitch is 7680
Apr 01 20:26:59 kernel: fbcon: amdgpudrmfb (fb0) is primary device
Apr 01 20:26:59 kernel: Console: switching to colour frame buffer device 240x67
Apr 01 20:26:59 kernel: amdgpu 0000:00:01.0: fb0: amdgpudrmfb frame buffer device
Apr 01 20:26:59 kernel: [drm] Initialized amdgpu 3.31.0 20150101 for 0000:00:01.0 on minor 0
Apr 01 20:26:59 kernel: amdgpu 0000:03:00.0: enabling device (0002 -> 0003)
Apr 01 20:26:59 kernel: [drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1025:0x1210 0xC3).
Apr 01 20:26:59 kernel: [drm] register mmio base: 0xD1200000
Apr 01 20:26:59 kernel: [drm] register mmio size: 262144
Apr 01 20:26:59 kernel: [drm] add ip block number 0 <vi_common>
Apr 01 20:26:59 kernel: [drm] add ip block number 1 <gmc_v8_0>
Apr 01 20:26:59 kernel: [drm] add ip block number 2 <tonga_ih>
Apr 01 20:26:59 kernel: [drm] add ip block number 3 <gfx_v8_0>
Apr 01 20:26:59 kernel: [drm] add ip block number 4 <sdma_v3_0>
Apr 01 20:26:59 kernel: [drm] add ip block number 5 <powerplay>
Apr 01 20:26:59 kernel: [drm] add ip block number 6 <dm>
Apr 01 20:26:59 kernel: [drm] add ip block number 7 <uvd_v6_0>
Apr 01 20:26:59 kernel: [drm] add ip block number 8 <vce_v3_0>
Apr 01 20:26:59 kernel: kfd kfd: skipped device 1002:699f, PCI rejects atomics
Apr 01 20:26:59 kernel: [drm] UVD is enabled in VM mode
Apr 01 20:26:59 kernel: [drm] UVD ENC is enabled in VM mode
Apr 01 20:26:59 kernel: [drm] VCE enabled in VM mode
Apr 01 20:26:59 kernel: vga_switcheroo: enabled
Apr 01 20:26:59 kernel: ATOM BIOS: SWBRT23054.001
Apr 01 20:26:59 kernel: [drm] GPU posting now...
Apr 01 20:26:59 kernel: [drm] RAS INFO: ras initialized successfully, hardware ability[0] ras_mask[0]
Apr 01 20:26:59 kernel: [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
Apr 01 20:26:59 kernel: amdgpu 0000:03:00.0: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
Apr 01 20:26:59 kernel: amdgpu 0000:03:00.0: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
Apr 01 20:26:59 kernel: [drm] Detected VRAM RAM=2048M, BAR=256M
Apr 01 20:26:59 kernel: [drm] RAM width 128bits GDDR5
Apr 01 20:26:59 kernel: [drm] amdgpu: 2048M of VRAM memory ready
Apr 01 20:26:59 kernel: [drm] amdgpu: 3072M of GTT memory ready.
Apr 01 20:26:59 kernel: [drm] GART: num cpu pages 65536, num gpu pages 65536
Apr 01 20:26:59 kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Apr 01 20:26:59 kernel: [drm] Chained IB support enabled!
Apr 01 20:26:59 kernel: [drm] Found UVD firmware Version: 1.130 Family ID: 16
Apr 01 20:26:59 kernel: [drm] Found VCE firmware Version: 53.26 Binary ID: 3
Apr 01 20:26:59 kernel: amdgpu: [powerplay] Voltage value looks like a Leakage ID but it's not patched 
Apr 01 20:26:59 kernel: amdgpu: [powerplay] Voltage value looks like a Leakage ID but it's not patched 
Apr 01 20:26:59 kernel: amdgpu: [powerplay] Voltage value looks like a Leakage ID but it's not patched 
Apr 01 20:26:59 kernel: amdgpu: [powerplay] Voltage value looks like a Leakage ID but it's not patched 
Apr 01 20:26:59 kernel: amdgpu: [powerplay] Voltage value looks like a Leakage ID but it's not patched 
Apr 01 20:26:59 kernel: amdgpu: [powerplay] Voltage value looks like a Leakage ID but it's not patched 
Apr 01 20:26:59 kernel: amdgpu: [powerplay] Voltage value looks like a Leakage ID but it's not patched 
Apr 01 20:26:59 kernel: [drm] DM_PPLIB: values for Engine clock
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         214000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         551000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         734000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         921000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         980000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         1046000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB: Validation clocks:
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:    engine_max_clock: 104600
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:    memory_max_clock: 125000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:    level           : 8
Apr 01 20:26:59 kernel: [drm] DM_PPLIB: values for Memory clock
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         300000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         625000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:         1250000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB: Validation clocks:
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:    engine_max_clock: 104600
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:    memory_max_clock: 125000
Apr 01 20:26:59 kernel: [drm] DM_PPLIB:    level           : 8
Apr 01 20:26:59 kernel: [drm] Display Core initialized with v3.2.24!
Apr 01 20:26:59 kernel: [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
Apr 01 20:26:59 kernel: [drm] Driver supports precise vblank timestamp query.
Apr 01 20:26:59 kernel: [drm] UVD and UVD ENC initialized successfully.
Apr 01 20:26:59 kernel: [drm] VCE initialized successfully.
Apr 01 20:26:59 kernel: [drm] Initialized amdgpu 3.31.0 20150101 for 0000:03:00.0 on minor 1
Apr 01 20:26:59 kernel: amdgpu: [powerplay] 
                                failed to send message 15b ret is 0 
Apr 01 20:26:59 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 0
Apr 01 20:26:59 kernel: amdgpu: [powerplay] 
                                failed to send message 15a ret is 0 
Apr 01 20:26:59 kernel: [drm:amdgpu_device_ip_late_init_func_handler [amdgpu]] *ERROR* ib ring test failed (-110).
Apr 01 20:26:59 kernel: EXT4-fs (sda3): mounted filesystem with ordered data mode. Opts: (null)
Apr 01 20:26:59 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 0
Apr 01 20:26:59 kernel: amdgpu: [powerplay] 
                                failed to send message 155 ret is 0 
Apr 01 20:26:59 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 0
Apr 01 20:26:59 kernel: amdgpu: [powerplay] 
                                failed to send message 15b ret is 0
Apr 01 20:27:48 kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Apr 01 20:27:48 kernel: amdgpu: [powerplay] 
                                failed to send message 154 ret is 0 
Apr 01 20:27:49 kernel: amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring uvd_enc0 test failed (-110)
Apr 01 20:27:49 kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <uvd_v6_0> failed -110
Apr 01 20:27:49 kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-110).
Apr 01 20:27:50 kernel: amdgpu: [powerplay]
Apr 01 20:28:30 kernel: amdgpu: [powerplay] 
                                failed to send message 261 ret is 0 
Apr 01 20:28:31 kernel: [drm] Fence fallback timer expired on ring sdma0
Apr 01 20:28:31 kernel: [drm] Fence fallback timer expired on ring sdma0
Apr 01 20:28:32 kernel: [drm] Fence fallback timer expired on ring sdma0
Apr 01 20:28:32 kernel: [drm] Fence fallback timer expired on ring sdma0
Apr 01 20:28:33 kernel: [drm] Fence fallback timer expired on ring sdma0
Apr 01 20:28:33 kernel: [drm] Fence fallback timer expired on ring sdma0
Apr 01 20:28:34 kernel: [drm] Fence fallback timer expired on ring sdma0
Apr 01 20:28:34 kernel: [drm] Fence fallback timer expired on ring sdma0
Apr 01 20:28:35 kernel: [drm] Fence fallback timer expired on ring sdma0
Apr 01 20:28:35 kernel: [drm] Fence fallback timer expired on ring sdma0
Apr 01 20:28:36 kernel: [drm] Fence fallback timer expired on ring sdma0
Apr 01 20:28:36 kernel: [drm] Fence fallback timer expired on ring sdma0
Apr 01 20:28:37 kernel: [drm] Fence fallback timer expired on ring sdma0
Apr 01 20:28:37 kernel: [drm] Fence fallback timer expired on ring sdma0
Apr 01 20:28:38 kernel: [drm] Fence fallback timer expired on ring sdma0
Apr 01 20:28:39 kernel: amdgpu: [powerplay]

Apr 01 20:29:12 kernel: amdgpu: [powerplay] 
                                failed to send message 154 ret is 0 
Apr 01 20:29:13 kernel: amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring uvd_enc0 test failed (-110)
Apr 01 20:29:13 kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <uvd_v6_0> failed -110
Apr 01 20:29:13 kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-110).

Apr 01 20:30:06 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 0
Apr 01 20:30:06 kernel: amdgpu: [powerplay] 
                                failed to send message 135 ret is 0 
Apr 01 20:30:07 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 0
Apr 01 20:30:07 kernel: amdgpu: [powerplay] 
                                failed to send message 190 ret is 0 
Apr 01 20:30:08 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 0
Apr 01 20:30:08 kernel: amdgpu: [powerplay] 
                                failed to send message 63 ret is 0 
Apr 01 20:30:09 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 0
Apr 01 20:30:09 kernel: amdgpu: [powerplay] 
                                failed to send message 84 ret is 0 
Apr 01 20:30:09 kernel: amdgpu 0000:03:00.0: GPU pci config reset
Apr 01 20:34:17 kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Apr 01 20:34:18 kernel: amdgpu: [powerplay] 
                                failed to send message 154 ret is 0 
Apr 01 20:34:18 kernel: amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring uvd_enc0 test failed (-110)
Apr 01 20:34:18 kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <uvd_v6_0> failed -110
Apr 01 20:34:18 kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-110).
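
A note on reading the log above: the -110 in the "failed (-110)" lines is a negative Linux errno, and 110 is ETIMEDOUT, so the ring test and the IP block resume timed out rather than returning a hardware-specific code. A minimal Python sketch for decoding such lines follows. It is not part of the original report, and the names `LINUX_ERRNO`, `FAILED_CODE` and `decode_failure` are hypothetical helpers introduced here for illustration.

```python
import re

# A few Linux errno values (hypothetical helper table, deliberately incomplete).
LINUX_ERRNO = {
    12: "ENOMEM",
    16: "EBUSY",
    22: "EINVAL",
    110: "ETIMEDOUT",
}

# Matches a trailing code such as "failed (-110)", "failed -110" or "failed (-110)."
FAILED_CODE = re.compile(r"failed\s*\(?-(\d+)\)?\.?\s*$")


def decode_failure(line):
    """Return (code, name) for a log line ending in 'failed (-N)', else None."""
    match = FAILED_CODE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, LINUX_ERRNO.get(code, "unknown")
```

The powerplay lines ("failed to send message 154 ret is 0") carry no negative errno, so the sketch returns None for them.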

In Linux Kernel Bug Tracker #201957, anode.dev (anode.dev-linux-kernel-bugs) wrote on 2019-04-01:

#12

(In reply to Alex Deucher from comment #4)
> Can you bisect?

Unfortunately this is not possible: all recent kernels now ship with Display Core enabled by default and, as I said, the vanilla 4.14 kernel works like a charm on the same HW with the same mesa libs, with no lags, no stalls or freezes, and none of the warnings listed above. So it makes no sense to run "git bisect", because it is not a single commit that works incorrectly with the GPU. DC is completely new functionality that replaces the old amdgpu code.

In Linux Kernel Bug Tracker #201957, au1064 (au1064-linux-kernel-bugs) wrote on 2019-08-20:

#13

Hi, I have a very similar problem. My system works with 4.15 and with 5.1.16, but not with other 5.x kernels:

The system does not boot with the other 5.x kernels. With 5.1.16 the GUI sometimes freezes, but sshd and the mouse still work.

CPU: Ryzen 5 2400g, BOARD: AORUS B450 I PRO WIFI, X Server 1.19.6

Kernel 5.0.x not working (blank screen after boot)
Kernel 5.2.x ( x <= 9 ) is not working (blank screen after boot)

but Kernel 5.1.16 is working (mostly)!

Error LOG with 5.1.16:
[Mi Aug 14 14:22:21 2019] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[Mi Aug 14 14:22:21 2019] amdgpu 0000:09:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1848 thread Xorg:cs0 pid 1849)
[Mi Aug 14 14:22:21 2019] amdgpu 0000:09:00.0: in page starting at address 0x000080010c205000 from 27
[Mi Aug 14 14:22:21 2019] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[Mi Aug 14 14:22:31 2019] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=840738, emitted seq=840740
[Mi Aug 14 14:22:31 2019] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1848 thread Xorg:cs0 pid 1849
[Mi Aug 14 14:22:31 2019] [drm] GPU recovery disabled.
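
For readers comparing the "ring ... timeout" lines across reports: the two sequence numbers are, as far as I understand the driver, the last fence the ring completed ("signaled") and the last one submitted ("emitted"), so their difference is the number of jobs still outstanding when the timeout fired. A small Python sketch under that assumption; `TIMEOUT` and `pending_fences` are hypothetical names, not driver or tool APIs.

```python
import re

# Matches e.g. "ring gfx timeout, signaled seq=840738, emitted seq=840740"
TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def pending_fences(line):
    """Return (ring name, emitted - signaled) for a timeout line, else None."""
    match = TIMEOUT.search(line)
    if match is None:
        return None
    signaled = int(match.group("signaled"))
    emitted = int(match.group("emitted"))
    return match.group("ring"), emitted - signaled
```

Applied to the log above, this gives the ring name gfx with 2 jobs outstanding.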

In Linux Kernel Bug Tracker #201957, ungu_93 (ungu93-linux-kernel-bugs) wrote on 2019-09-11:

#14

Just got something similar while playing Left 4 Dead. The system simply froze with altered colors on the screen and the sound just looping over the last second or so. Cannot confirm SSH access.

journalctl -b -1 ends with

[drm:gfx_v8_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2225992, emitted seq=2225993
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process hl2_linux pid 12532 thread hl2_

OS: Ubuntu 19.04 on
Kernel: 5.0.0-27-generic
GPU: Radeon RX580
CPU: Ryzen 5 1600x

Thanks!

In Linux Kernel Bug Tracker #201957, anode.dev (anode.dev-linux-kernel-bugs) wrote on 2019-09-20:

#15

(In reply to Ungureanu Alexandru from comment #9)
> Just got something similar while playing Left 4 Dead. The system simply
> froze with altered colors on the screen and the sound just looping over the
> last second or so. Cannot confirm SSH access.

> Kernel: 5.0.0-27-generic
> GPU: Radeon RX580
> CPU: Ryzen 5 1600x

5.0 is a very outdated kernel; use the latest from kernel.org.

As for me, everything works perfectly in 5.3 (Polaris chip, RX540).
I finally no longer get any errors like these:
- ERROR* resume of IP block <uvd_v6_0> failed -110
- [drm] Fence fallback timer expired on ring sdma0
- last message was failed ret is **
- [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq...
- IP block:sdma_v3_0 is hung!
- Timeout, but no hardware hang detected.

Tested on youtube with HW accelerated video and in several games
Thank you guys from AMD a lot, I had to wait 1y+ to get these bugs fixed
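
To check whether a given kernel still produces the messages listed above, one option is to count those signatures in a saved kernel log. The following Python sketch is only an illustration: the substrings are taken from the list in this comment, and `SIGNATURES` and `count_signatures` are hypothetical names.

```python
# Substrings taken from the error list above (shortened so they match
# regardless of ring name, IP block or error code).
SIGNATURES = [
    "resume of IP block",
    "Fence fallback timer expired",
    "last message was failed",
    "timeout, signaled seq",
    "is hung!",
    "Timeout, but no hardware hang detected",
]


def count_signatures(log_text):
    """Count how many log lines contain each known error signature."""
    counts = {sig: 0 for sig in SIGNATURES}
    for line in log_text.splitlines():
        for sig in SIGNATURES:
            if sig in line:
                counts[sig] += 1
    return counts
```

The text passed in could be, for example, the saved output of `journalctl -k -b`. All counts staying at zero after a kernel upgrade would match what is described here for 5.3.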

In Linux Kernel Bug Tracker #201957, lekto (lekto-linux-kernel-bugs) wrote on 2019-10-02:

#16

Same problem here. It happens when I run looking-glass [1], but not every time. I tried downgrading my kernel from 5.3.1 to 5.2.11 (I'm pretty sure it worked then), downgrading mesa from 19.2.0 to 19.1.7 (I'm sure it worked with 19.2.0-rc) and downgrading my firmware to 2019-09-23 (oldest in repo).

When it happens, looking-glass starts blinking and sometimes my other monitor gets stuck, so that I can only move the cursor on it.

Spec:
Gentoo ~amd64
Ryzen 1600 (others have Ryzen too, coincidence?)
Linux GPU: R7 240 (with radeon driver)
Windows GPU: RX580
ASRock X370 Gaming X

[1] https://looking-glass.hostfission.com/

In Linux Kernel Bug Tracker #201957, mh (mh-linux-kernel-bugs) wrote on 2019-10-11:

#17

Hi,

I think I have the same bug and opened https://bugzilla.kernel.org/show_bug.cgi?id=204683.

At first it looked a bit different, because in newer kernels the error message has changed. But as you can see I did some testing and this seems to go way back. Sadly I couldn't test a 4.18 kernel.

Can somebody mark my report as duplicate? Because I think it is.

And would some more debug info help?

In Linux Kernel Bug Tracker #201957, mh (mh-linux-kernel-bugs) wrote on 2019-10-14:

#18

*** Bug 204683 has been marked as a duplicate of this bug. ***

In Linux Kernel Bug Tracker #201957, perk11 (perk11-linux-kernel-bugs) wrote on 2019-10-24:

#19

Also experiencing this with Radeon RX 5700 XT and amdgpu 19.1.0+git1910111930.b467d2~oibaf~b

Didn't have any heavy load for the GPU to do.

First, some artifacts appeared on the Plasma Hard Disk Monitor widget and the CPU Load widget (here is a screenshot: https://i.perk11.info/20191024_193152_kernel.png) while the PC was idle and the screen was locked, but everything else continued to work fine.

I checked the logs for the period when this could've happened, but the only logs from that period are from KScreen that start like this:

Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: RRNotify_OutputProperty (ignored)
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Output:  88
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Property:  EDID
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         State (newValue, Deleted):  1
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: RRNotify_OutputProperty (ignored)
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Output:  88
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Property:  EDID
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         State (newValue, Deleted):  1
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: RRNotify_OutputChange
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Output:  88
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         CRTC:  81
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Mode:  97
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Rotation:  "Rotate_0"
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Connection:  "Disconnected"
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Subpixel Order:  0
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: RRScreenChangeNotify
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Window: 18874373
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Root: 1744
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Rotation:  "Rotate_0"
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Size ID: 65535
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Size:  7280 1440
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         SizeMM:  1926 381
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: RRNotify_OutputChange
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Output:  88
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         CRTC:  81
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Mode:  97
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Rotation:  "Rotate_0"
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Connection:  "Disconnected"
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Subpixel Order:  0
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xrandr: XRandROutput 88 update
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          m_connected: 0
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          m_crtc XRandRCrtc(0x5655577da9f0)
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          CRTC: 81
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          MODE: 97
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          Connection: 1
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          Primary: false
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xrandr: Output 88 : connected = false , enabled = true
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xrandr: XRandROutput 88 update
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          m_connected: 1
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          m_crtc XRandRCrtc(0x5655577da9f0)
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          CRTC: 81
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          MODE: 97
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          Connection: 1
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          Primary: false

90 minutes later, the system became unresponsive while I was typing a message in Skype. The audio I had playing in Audacity continued to play and the cron jobs continued running normally for a few minutes while I tried, unsuccessfully, to get the system unstuck without rebooting it.

Here are the errors:

Oct 24 19:04:10 perk11-home kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Oct 24 19:04:10 perk11-home kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Oct 24 19:04:15 perk11-home kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Oct 24 19:04:15 perk11-home kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=3485981, emitted seq=3485983
Oct 24 19:04:15 perk11-home kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 2469 thread Xorg:cs0 pid 2491
Oct 24 19:04:15 perk11-home kernel: [drm] GPU recovery disabled.

In Linux Kernel Bug Tracker #201957, perk11 (perk11-linux-kernel-bugs) wrote on 2019-10-24:

#20

My kernel version is 5.3.7-050307-generic running KDE Neon User edition with latest updates.

In Linux Kernel Bug Tracker #201957, shallowaloe (shallowaloe-linux-kernel-bugs) wrote on 2019-10-27:

#21

Created attachment 285665
5 second video clip that triggers a crash

Hi,

I think I'm having the same problem as you guys. I run a mythbackend where I record cable television and those recordings often crash my system when hardware decoding is enabled. Usually it's just the screen that freezes and I can still ssh to it.

Kernel 5.1.6 was an exception for me too, with that kernel I'm able to restart the display manager and recover without having to reboot.

Attached is a short video that crashes my system. I can trigger the alert by running:

mpv --vo=vaapi out.ts

I'm wondering if it crashes your systems too and if it's related.

In Linux Kernel Bug Tracker #201957, jmstylr (jmstylr-linux-kernel-bugs) wrote on 2019-11-10:

#22

(In reply to shallowaloe from comment #16)
> Created attachment 285665 [details]
> 5 second video clip that triggers a crash
> 
> Hi,
> 
> I think I'm having the same problem as you guys.  I run a mythbackend where
> I record cable television and those recordings often crash my system when
> hardware decoding is enabled.  Usually it's just the screen that freezes and
> I can still ssh to it.  
> 
> Kernel 5.1.6 was an exception for me too, with that kernel I'm able to
> restart the display manager and recover without having to reboot.
> 
> Attached is a short video that crashes my system.  I can trigger the alert
> by running:
> 
> mpv --vo=vaapi out.ts
> 
> I'm wondering if it crashes your systems too and if it's related.

Just to add a data point, I tried running `mpv --vo=vaapi out.ts` against your file, and while it crashed the application, it did not freeze the system.

My hardware is a Ryzen 3700X with a Radeon RX 5700, running Ubuntu 19.10 with default kernel (5.3.0-19-generic).

The command did result in the following lines in /var/log/syslog repeated every 5 seconds:

Nov 10 07:04:23 redacted kernel: [ 2266.802162] gmc_v10_0_process_interrupt: 23900 callbacks suppressed
Nov 10 07:04:23 redacted kernel: [ 2266.802166] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802170] amdgpu 0000:0b:00.0:   at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802171] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0000213D
Nov 10 07:04:23 redacted kernel: [ 2266.802176] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802178] amdgpu 0000:0b:00.0:   at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802179] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Nov 10 07:04:23 redacted kernel: [ 2266.802566] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802568] amdgpu 0000:0b:00.0:   at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802569] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0000213D
Nov 10 07:04:23 redacted kernel: [ 2266.802573] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802575] amdgpu 0000:0b:00.0:   at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802576] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Nov 10 07:04:23 redacted kernel: [ 2266.802984] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802985] amdgpu 0000:0b:00.0:   at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802987] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0000213D
Nov 10 07:04:23 redacted kernel: [ 2266.802993] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802994] amdgpu 0000:0b:00.0:   at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802995] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Nov 10 07:04:23 redacted kernel: [ 2266.803403] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.803404] amdgpu 0000:0b:00.0:   at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.803406] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0000213D
Nov 10 07:04:23 redacted kernel: [ 2266.803410] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.803411] amdgpu 0000:0b:00.0:   at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.803412] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Nov 10 07:04:23 redacted kernel: [ 2266.803822] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.803824] amdgpu 0000:0b:00.0:   at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.803825] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0000213D
Nov 10 07:04:23 redacted kernel: [ 2266.803831] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.803833] amdgpu 0000:0b:00.0:   at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.803834] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
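
Since the same three-line fault record repeats many times, it may help to condense it before attaching it to a report. The following is only a minimal sketch, assuming the log keeps exactly the line layout shown above; the function name `summarize_page_faults` and the `SAMPLE` variable are made up for illustration and are not part of any amdgpu tooling.

```python
import re
from collections import Counter

# Six lines copied from the syslog excerpt above (two complete fault records).
SAMPLE = """\
Nov 10 07:04:23 redacted kernel: [ 2266.802166] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802170] amdgpu 0000:0b:00.0:   at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802171] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0000213D
Nov 10 07:04:23 redacted kernel: [ 2266.802176] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802178] amdgpu 0000:0b:00.0:   at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802179] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
"""

# Patterns are written against the exact text seen in this thread.
FAULT_RE = re.compile(
    r"\[(\w+)\] VMC page fault \(src_id:(\d+) ring:(\d+) vmid:(\d+) pasid:(\d+)\)"
)
STATUS_RE = re.compile(r"VM_L2_PROTECTION_FAULT_STATUS:(0x[0-9A-Fa-f]+)")


def summarize_page_faults(text):
    """Count fault headers per (hub, ring, vmid) and status register values."""
    faults = Counter()
    statuses = Counter()
    for line in text.splitlines():
        m = FAULT_RE.search(line)
        if m:
            hub, _src_id, ring, vmid, _pasid = m.groups()
            faults[(hub, int(ring), int(vmid))] += 1
            continue
        m = STATUS_RE.search(line)
        if m:
            statuses[m.group(1)] += 1
    return faults, statuses


faults, statuses = summarize_page_faults(SAMPLE)
for key, count in faults.items():
    print("fault", key, "x", count)
for value, count in statuses.items():
    print("status", value, "x", count)
```

Feeding the whole excerpt through the same function would show whether every fault comes from the same hub, ring and vmid, which is what the lines above suggest.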

In Linux Kernel Bug Tracker #201957, mh (mh-linux-kernel-bugs) wrote on 2019-11-25:

#23

Hi,

I recently built a 5.4.0-rc7 from drm-next (my HEAD was 17eee668b3cad423a47c090fe2275733c55db910) and also updated Mesa to 19.3.0-RC1.

Since then I didn't get any crashes. I have tested this for a few hours now, but it's entirely possible that I just didn't run into the bug for some reason, although it usually appeared after half an hour.

If possible please try this setup and see if it is fixed.

In Linux Kernel Bug Tracker #201957, j.cordoba (j.cordoba-linux-kernel-bugs) wrote on 2019-12-03:

#24

Hi,

This issue is still present in the latest kernels:

5.4.1, 5.4, 5.3.14

Last usable kernel for me is 4.20.17

System Specs

- Gigabyte b450-ds3h
- Ryzen 5 3400G (with RX Vega 11)
- Mesa 19.1.2 - padoka PPA (Stable)
- Ubuntu 18.04.3 LTS

In Linux Kernel Bug Tracker #201957, mh (mh-linux-kernel-bugs) wrote on 2019-12-03:

#25

Dear j.cordoba,

could you try building 5.4.0-rc7 from drm-next and testing it, as I mentioned in Comment 18?

I have been running this build for some time now and the bug should have appeared by now, so I'm getting more confident that it is fixed.

Best regards
Matthias

In Linux Kernel Bug Tracker #201957, lukasz (lukasz-linux-kernel-bugs) wrote on 2019-12-03:

#26

Same is happening to me on 5.4.1. No issue with 4.9.

[ 44.172714] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[ 49.292694] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 58.469316] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[ 63.586055] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 156.606591] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

In Linux Kernel Bug Tracker #201957, pierre-eric.pelloux-prayer (pierre-eric.pelloux-prayer-linux-kernel-bugs) wrote on 2019-12-04:

#27

(In reply to shallowaloe from comment #16)
> Created attachment 285665 [details]
> 5 second video clip that triggers a crash
>
> Hi,
>
> I think I'm having the same problem as you guys. I run a mythbackend where
> I record cable television and those recordings often crash my system when
> hardware decoding is enabled. Usually it's just the screen that freezes and
> I can still ssh to it.
>
> Kernel 5.1.6 was an exception for me too, with that kernel I'm able to
> restart the display manager and recover without having to reboot.
>
> Attached is a short video that crashes my system. I can trigger the alert
> by running:
>
> mpv --vo=vaapi out.ts
>
> I'm wondering if it crashes your systems too and if it's related.

This one is probably a Mesa issue, see https://gitlab.freedesktop.org/mesa/mesa/issues/2177

What Mesa version are you using?

In Linux Kernel Bug Tracker #201957, shallowaloe (shallowaloe-linux-kernel-bugs) wrote on 2019-12-08:

#28

Created attachment 286227
attachment-25111-0.html

Thanks for the link to the bug. I'm running an Ubuntu-based system and am using the oibaf PPA. The current version is 20.0.

In Linux Kernel Bug Tracker #201957, janpieter.sollie (janpieter.sollie-linux-kernel-bugs) wrote on 2020-01-02:

#29

Hi everyone,

I have the same issue with a Fiji Nano GPU: UVD6 and VCE3 time out in the ring buffer test at boot with the AMDGPU driver. Other rings seem to work correctly.
To make sure the hardware functions as it should and that this is not a HW error, where (in the amdgpu driver) can I increase the timeout value?

In Linux Kernel Bug Tracker #201957, janpieter.sollie (janpieter.sollie-linux-kernel-bugs) wrote on 2020-01-02:

#30

Created attachment 286575
kernel config 5.4.7 Fiji

Some additional info for my case:
- Running kernel 5.4.7 (vanilla), firmware 20191108 on Gentoo
- dmesg | grep -E "(drm)|(amdgpu)":
[ 3.930023] [drm] amdgpu kernel modesetting enabled.
[ 3.930217] amdgpu 0000:0a:00.0: remove_conflicting_pci_framebuffers: bar 0: 0xe0000000 -> 0xefffffff
[ 3.930219] amdgpu 0000:0a:00.0: remove_conflicting_pci_framebuffers: bar 2: 0xf0000000 -> 0xf01fffff
[ 3.930221] amdgpu 0000:0a:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xfce00000 -> 0xfce3ffff
[ 3.930224] fb0: switching to amdgpudrmfb from EFI VGA
[ 3.930475] [drm] initializing kernel modesetting (FIJI 0x1002:0x7300 0x1002:0x0B36 0xCA).
[ 3.930486] [drm] register mmio base: 0xFCE00000
[ 3.930486] [drm] register mmio size: 262144
[ 3.930495] [drm] add ip block number 0 <vi_common>
[ 3.930495] [drm] add ip block number 1 <gmc_v8_0>
[ 3.930496] [drm] add ip block number 2 <tonga_ih>
[ 3.930497] [drm] add ip block number 3 <gfx_v8_0>
[ 3.930498] [drm] add ip block number 4 <sdma_v3_0>
[ 3.930498] [drm] add ip block number 5 <powerplay>
[ 3.930499] [drm] add ip block number 6 <dm>
[ 3.930500] [drm] add ip block number 7 <uvd_v6_0>
[ 3.930500] [drm] add ip block number 8 <vce_v3_0>
[ 3.930715] [drm] UVD is enabled in physical mode
[ 3.930715] [drm] VCE enabled in physical mode
[ 3.930743] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[ 3.930751] amdgpu 0000:0a:00.0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
[ 3.930753] amdgpu 0000:0a:00.0: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
[ 3.930758] [drm] Detected VRAM RAM=4096M, BAR=256M
[ 3.930759] [drm] RAM width 512bits HBM
[ 3.930838] [drm] amdgpu: 4096M of VRAM memory ready
[ 3.930841] [drm] amdgpu: 4096M of GTT memory ready.
[ 3.930860] [drm] GART: num cpu pages 262144, num gpu pages 262144
[ 3.930928] [drm] PCIE GART of 1024M enabled (table at 0x000000F4001D5000).
[ 3.934174] [drm] Chained IB support enabled!
[ 3.940198] amdgpu: [powerplay] hwmgr_sw_init smu backed is fiji_smu
[ 3.941748] [drm] Found UVD firmware Version: 1.91 Family ID: 12
[ 3.941752] [drm] UVD ENC is disabled
[ 3.943542] [drm] Found VCE firmware Version: 55.2 Binary ID: 3
[ 4.009146] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
[ 4.040084] [drm] Display Core initialized with v3.2.48!
[ 4.040542] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 4.040543] [drm] Driver supports precise vblank timestamp query.
[ 4.067774] [drm] UVD initialized successfully.
[ 4.168780] [drm] VCE initialized successfully.
[ 4.170163] [drm] Cannot find any crtc or sizes
[ 4.171948] [drm] Initialized amdgpu 3.35.0 20150101 for 0000:0a:00.0 on minor 0
[ 7.280062] amdgpu 0000:0a:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on uvd (-110).
[ 8.400365] amdgpu 0000:0a:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vce0 (-110).
[ 8.400370] [drm:process_one_work] *ERROR* ib ring test failed (-110).
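
The `(-110)` in the last three lines is a negated Linux errno; 110 is `ETIMEDOUT` on Linux, which matches the reported timeout in the UVD and VCE ring tests. As a rough sketch (the helper name `failed_ib_tests` and the small lookup table are assumptions made for this example, not driver code), the failing rings can be pulled out of a dmesg dump like this:

```python
import re

# Linux errno numbers; only the value seen in the log above is listed here.
LINUX_ERRNO_NAMES = {110: "ETIMEDOUT"}

# The last three lines of the dmesg output above.
SAMPLE = """\
[ 7.280062] amdgpu 0000:0a:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on uvd (-110).
[ 8.400365] amdgpu 0000:0a:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vce0 (-110).
[ 8.400370] [drm:process_one_work] *ERROR* ib ring test failed (-110).
"""

# Matches only the per-ring failure lines, not the final summary line.
IB_FAIL_RE = re.compile(r"IB test failed on (\S+) \((-?\d+)\)")


def failed_ib_tests(text):
    """Return (ring, code, errno_name) for each per-ring IB test failure."""
    results = []
    for m in IB_FAIL_RE.finditer(text):
        ring, code = m.group(1), int(m.group(2))
        # The kernel reports errors as negative errno values.
        name = LINUX_ERRNO_NAMES.get(-code, "unknown")
        results.append((ring, code, name))
    return results


for ring, code, name in failed_ib_tests(SAMPLE):
    print(f"{ring}: {code} ({name})")
```

This only decodes what the log already says; it does not tell whether the timeout is caused by hardware, firmware or the driver.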

In Linux Kernel Bug Tracker #201957, delentef (delentef-linux-kernel-bugs) wrote on 2020-01-19:

#31

Hello, I have the same problem on a Huawei Matebook D laptop; the processor is an AMD Ryzen 5 with an integrated Radeon Vega Mobile GPU.

I use Fedora 31. The problem appeared when upgrading from the 5.3.16 kernel to the 5.4.6 kernel. Reverting to 5.3.16 solved the issue.

At some moments the UI (XFCE) freezes for about 5 seconds; I can move the mouse cursor but I can't get any keyboard input (not in X, not by switching console). Each time the freeze occurs dmesg shows the messages

[ 45.530374] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[ 50.139408] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

I include /proc/cpuinfo and lspci outputs.

In Linux Kernel Bug Tracker #201957, delentef (delentef-linux-kernel-bugs) wrote on 2020-01-19:

#32

Created attachment 286899
/proc/cpuinfo

In Linux Kernel Bug Tracker #201957, delentef (delentef-linux-kernel-bugs) wrote on 2020-01-19:

#33

Created attachment 286901
lspci output

In Linux Kernel Bug Tracker #201957, mh (mh-linux-kernel-bugs) wrote on 2020-01-19:

#34

Hi. I have already reported this bug here: https://gitlab.freedesktop.org/drm/amd/issues/953

If possible try a 5.5-rc kernel and see if it's fixed there. It's fixed - at least for me - in the drm-tree.

Best regards
Matthias

In Linux Kernel Bug Tracker #201957, sellis (sellis-linux-kernel-bugs) wrote on 2020-04-04:

#35

I'm seeing the same issue on Ubuntu 18.04 with

Upstream PPA "sudo add-apt-repository ppa:oibaf/graphics-drivers"

[ 321.412530] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[ 326.286306] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=4447, emitted seq=4449
[ 326.286395] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process mythfrontend.re pid 2410 thread mythfronte:cs0 pid 2880

AMDGPUPRO driver 19.50-967956

[20913.330563] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[20918.450513] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[20923.570306] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[20928.690699] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
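
The two drivers report the timeout differently: one prints `signaled seq` and `emitted seq` plus the offending process, the other only says the ring was soft recovered. A small sketch for pulling these out of a dmesg dump is given below. It assumes the message layout shown in this thread, and it treats `emitted - signaled` as the number of submissions still outstanding, which is an interpretation rather than something the log states. The names `parse_timeouts` and `SAMPLE` are illustrative only.

```python
import re

# Three lines copied from the two log excerpts above.
SAMPLE = """\
[ 326.286306] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=4447, emitted seq=4449
[ 326.286395] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process mythfrontend.re pid 2410 thread mythfronte:cs0 pid 2880
[20918.450513] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
"""

TIMEOUT_RE = re.compile(r"ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)")
SOFT_RE = re.compile(r"ring (\S+) timeout, but soft recovered")
PROC_RE = re.compile(r"Process information: process (\S+) pid (\d+)")


def parse_timeouts(text):
    """Return one dict per ring timeout, in log order."""
    events = []
    for line in text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            signaled, emitted = int(m.group(2)), int(m.group(3))
            events.append({
                "ring": m.group(1),
                "kind": "timeout",
                # Assumed meaning: submissions emitted but not yet signaled.
                "pending": emitted - signaled,
                "process": None,
            })
            continue
        m = SOFT_RE.search(line)
        if m:
            events.append({
                "ring": m.group(1),
                "kind": "soft_recovered",
                "pending": None,
                "process": None,
            })
            continue
        m = PROC_RE.search(line)
        # Attach the process line to the hard timeout directly before it.
        if m and events and events[-1]["kind"] == "timeout":
            events[-1]["process"] = (m.group(1), int(m.group(2)))
    return events


for event in parse_timeouts(SAMPLE):
    print(event)
```

Running this over a longer log gives a quick list of which rings time out and which processes were blamed, which makes reports from different kernels easier to compare.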

In Linux Kernel Bug Tracker #201957, mh (mh-linux-kernel-bugs) wrote on 2020-05-01:

#36

Hi,

for me this bug is fixed with a 5.5 kernel. And I'm wondering if this is fixed for all of you, too.

Best
Matthias

In Linux Kernel Bug Tracker #201957, j.cordoba (j.cordoba-linux-kernel-bugs) wrote on 2020-05-01:

#37

I agree. Fixed for me too

In Linux Kernel Bug Tracker #201957, udovdh (udovdh-linux-kernel-bugs) wrote on 2020-05-25:

#38

I still see them on 5.6.13:

[191571.372560] sd 11:0:0:0: [sde] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
[205796.424607] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=4518280, emitted seq=4518282
[205796.424637] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process mpv pid 488243 thread mpv:cs0 pid 488257
[205796.424640] amdgpu 0000:0a:00.0: GPU reset begin!
[205800.840504] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[205800.937565] amdgpu 0000:0a:00.0: GPU reset succeeded, trying to resume
[205800.938060] [drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
[205800.938849] [drm] PSP is resuming...
[205800.958729] [drm] reserve 0x400000 from 0xf47f800000 for PSP TMR
[205800.972414] [drm] psp command (0x5) failed and response status is (0xFFFF0007)
[205801.176411] amdgpu 0000:0a:00.0: RAS: ras ta ucode is not available
[205801.460775] [drm] kiq ring mec 2 pipe 1 q 0
[205801.460986] amdgpu 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0002 address=0x800002300 flags=0x0000]
[205801.516698] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[205801.516709] amdgpu 0000:0a:00.0: ring gfx uses VM inv eng 0 on hub 0
[205801.516713] amdgpu 0000:0a:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[205801.516717] amdgpu 0000:0a:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[205801.516720] amdgpu 0000:0a:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[205801.516724] amdgpu 0000:0a:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[205801.516727] amdgpu 0000:0a:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[205801.516730] amdgpu 0000:0a:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[205801.516733] amdgpu 0000:0a:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[205801.516736] amdgpu 0000:0a:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[205801.516740] amdgpu 0000:0a:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[205801.516743] amdgpu 0000:0a:00.0: ring sdma0 uses VM inv eng 0 on hub 1
[205801.516746] amdgpu 0000:0a:00.0: ring vcn_dec uses VM inv eng 1 on hub 1
[205801.516749] amdgpu 0000:0a:00.0: ring vcn_enc0 uses VM inv eng 4 on hub 1
[205801.516752] amdgpu 0000:0a:00.0: ring vcn_enc1 uses VM inv eng 5 on hub 1
[205801.516755] amdgpu 0000:0a:00.0: ring jpeg_dec uses VM inv eng 6 on hub 1
[205801.525996] [drm] recover vram bo from shadow start
[205801.525998] [drm] recover vram bo from shadow done
[205801.526008] [drm] Skip scheduling IBs!
[205801.526051] amdgpu 0000:0a:00.0: GPU reset(1) succeeded!
[205802.536444] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=4518342, emitted seq=4518344
[205802.536523] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 3825 thread gnome-shel:cs0 pid 3834
[205802.536531] amdgpu 0000:0a:00.0: GPU reset begin!
[205806.728558] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[205806.821326] amdgpu 0000:0a:00.0: GPU reset succeeded, trying to resume
[205806.821578] [drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
[205806.821899] [drm] PSP is resuming...
[205806.841769] [drm] reserve 0x400000 from 0xf47f800000 for PSP TMR
[205806.856213] [drm] psp command (0x5) failed and response status is (0xFFFF0007)
[205807.072210] amdgpu 0000:0a:00.0: RAS: ras ta ucode is not available
[205807.355997] [drm] kiq ring mec 2 pipe 1 q 0
[205807.356308] amdgpu 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0002 address=0x800072f00 flags=0x0000]
[205807.409389] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[205807.409401] amdgpu 0000:0a:00.0: ring gfx uses VM inv eng 0 on hub 0
[205807.409406] amdgpu 0000:0a:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[205807.409410] amdgpu 0000:0a:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[205807.409415] amdgpu 0000:0a:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[205807.409418] amdgpu 0000:0a:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[205807.409422] amdgpu 0000:0a:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[205807.409425] amdgpu 0000:0a:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[205807.409429] amdgpu 0000:0a:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[205807.409432] amdgpu 0000:0a:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[205807.409436] amdgpu 0000:0a:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[205807.409440] amdgpu 0000:0a:00.0: ring sdma0 uses VM inv eng 0 on hub 1
[205807.409444] amdgpu 0000:0a:00.0: ring vcn_dec uses VM inv eng 1 on hub 1
[205807.409447] amdgpu 0000:0a:00.0: ring vcn_enc0 uses VM inv eng 4 on hub 1
[205807.409451] amdgpu 0000:0a:00.0: ring vcn_enc1 uses VM inv eng 5 on hub 1
[205807.409454] amdgpu 0000:0a:00.0: ring jpeg_dec uses VM inv eng 6 on hub 1
[205807.418547] [drm] recover vram bo from shadow start
[205807.418549] [drm] recover vram bo from shadow done
[205807.418567] [drm] Skip scheduling IBs!
[205807.418569] [drm] Skip scheduling IBs!
[205807.418592] [drm] Skip scheduling IBs!
[205807.418613] amdgpu 0000:0a:00.0: GPU reset(2) succeeded!
[205808.428469] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[205809.458201] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=11463546, emitted seq=11463549
[205809.458282] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 3513 thread Xorg:cs0 pid 3514
[205809.458289] amdgpu 0000:0a:00.0: GPU reset begin!
[205812.872123] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[205812.981471] amdgpu 0000:0a:00.0: GPU reset succeeded, trying to resume
[205812.981823] [drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
[205812.982264] [drm] PSP is resuming...
[205813.002134] [drm] reserve 0x400000 from 0xf47f800000 for PSP TMR
[205813.012088] [drm] psp command (0x5) failed and response status is (0xFFFF0007)
[205813.208005] amdgpu 0000:0a:00.0: RAS: ras ta ucode is not available
[205813.497603] [drm] kiq ring mec 2 pipe 1 q 0
[205813.551494] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[205813.551506] amdgpu 0000:0a:00.0: ring gfx uses VM inv eng 0 on hub 0
[205813.551510] amdgpu 0000:0a:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[205813.551514] amdgpu 0000:0a:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[205813.551517] amdgpu 0000:0a:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[205813.551520] amdgpu 0000:0a:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[205813.551524] amdgpu 0000:0a:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[205813.551526] amdgpu 0000:0a:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[205813.551529] amdgpu 0000:0a:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[205813.551532] amdgpu 0000:0a:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[205813.551535] amdgpu 0000:0a:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[205813.551538] amdgpu 0000:0a:00.0: ring sdma0 uses VM inv eng 0 on hub 1
[205813.551541] amdgpu 0000:0a:00.0: ring vcn_dec uses VM inv eng 1 on hub 1
[205813.551543] amdgpu 0000:0a:00.0: ring vcn_enc0 uses VM inv eng 4 on hub 1
[205813.551546] amdgpu 0000:0a:00.0: ring vcn_enc1 uses VM inv eng 5 on hub 1
[205813.551549] amdgpu 0000:0a:00.0: ring jpeg_dec uses VM inv eng 6 on hub 1
[205902.384966] traps: Bluez D-Bus thr[409727] trap invalid opcode ip:555cd19202af sp:7f265cf9de10 error:0 in skypeforlinux[555ccfa02000+542a000]

In Linux Kernel Bug Tracker #201957, panospolychronis (panospolychronis-linux-kernel-bugs) wrote on 2020-06-19:

#39

The problem still exists with Linux Kernel 5.8-rc1 from git. (My graphics card is Radeon 5600XT)

[20581.087159] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2768656, emitted seq=2768658
[20581.087212] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process DOOMEternalx64v pid 8875 thread DOOMEternalx64v pid 8875
[20581.087217] amdgpu 0000:29:00.0: amdgpu: GPU reset begin!
[20583.381257] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[20585.087232] amdgpu 0000:29:00.0: amdgpu: failed to suspend display audio
[20585.156036] snd_hda_codec_hdmi hdaudioC0D0: HDMI: ELD buf size is 0, force 128
[20585.156052] snd_hda_codec_hdmi hdaudioC0D0: HDMI: invalid ELD data byte 0
[20585.463157] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[20585.463205] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[20585.694999] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[20585.695047] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[20585.926951] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[20588.045497] amdgpu 0000:29:00.0: amdgpu: GPU reset succeeded, trying to resume
[20588.045605] [drm] PCIE GART of 512M enabled (table at 0x0000008000E10000).
[20588.045682] [drm] VRAM is lost due to GPU reset!
[20588.048023] [drm] PSP is resuming...
[20588.218089] [drm] reserve 0x900000 from 0x817e400000 for PSP TMR
[20588.287093] amdgpu 0000:29:00.0: amdgpu: RAS: optional ras ta ucode is not available
[20588.293101] amdgpu: SMU is resuming...
[20588.295088] amdgpu: SMU is resumed successfully!
[20588.413155] [drm] kiq ring mec 2 pipe 1 q 0
[20588.417493] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[20588.417632] [drm] JPEG decode initialized successfully.
[20588.417690] amdgpu 0000:29:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[20588.417693] amdgpu 0000:29:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[20588.417697] amdgpu 0000:29:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[20588.417700] amdgpu 0000:29:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[20588.417703] amdgpu 0000:29:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[20588.417707] amdgpu 0000:29:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[20588.417709] amdgpu 0000:29:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[20588.417713] amdgpu 0000:29:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[20588.417716] amdgpu 0000:29:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[20588.417719] amdgpu 0000:29:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[20588.417721] amdgpu 0000:29:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[20588.417724] amdgpu 0000:29:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[20588.417726] amdgpu 0000:29:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 1
[20588.417728] amdgpu 0000:29:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 1
[20588.417730] amdgpu 0000:29:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 1
[20588.417732] amdgpu 0000:29:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
[20588.421588] [drm] recover vram bo from shadow start
[20588.427530] [drm] recover vram bo from shadow done
[20588.427534] [drm] Skip scheduling IBs!
[20588.427537] [drm] Skip scheduling IBs!
[20588.427565] [drm] Skip scheduling IBs!
[20588.427573] [drm] Skip scheduling IBs!
[20588.427583] [drm] Skip scheduling IBs!
[20588.427591] [drm] Skip scheduling IBs!
[20588.427597] [drm] Skip scheduling IBs!
[20588.427649] [drm] Skip scheduling IBs!
[20588.427669] [drm] Skip scheduling IBs!
[20588.427680] [drm] Skip scheduling IBs!
[20588.427692] [drm] Skip scheduling IBs!
[20588.427693] [drm] Skip scheduling IBs!
[20588.427699] [drm] Skip scheduling IBs!
[20588.427703] [drm] Skip scheduling IBs!
[20588.427710] [drm] Skip scheduling IBs!
[20588.427714] amdgpu 0000:29:00.0: amdgpu: GPU reset(2) succeeded!
[20588.427719] [drm] Skip scheduling IBs!
[20588.427721] [drm] Skip scheduling IBs!
[20588.427724] [drm] Skip scheduling IBs!
[20588.427726] [drm] Skip scheduling IBs!
[20600.095254] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2768668, emitted seq=2768669
[20600.095404] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmashell pid 1570 thread plasmashel:cs0 pid 1713
[20600.095413] amdgpu 0000:29:00.0: amdgpu: GPU reset begin!
[20604.095435] amdgpu 0000:29:00.0: amdgpu: failed to suspend display audio
[20604.448799] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[20604.448848] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[20604.681029] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[20604.681078] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[20604.913262] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[20605.288303] Disabling lock debugging due to kernel taint
[20605.288325] mce: [Hardware Error]: Machine check events logged
[20605.288327] [Hardware Error]: Uncorrected, software restartable error.
[20605.288330] [Hardware Error]: CPU:1 (17:8:2) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[20605.288335] [Hardware Error]: Error Addr: 0x00000000e8ac0000
[20605.288337] [Hardware Error]: IPID: 0x000000b000000000
[20605.288339] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[20605.288341] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[20605.288345] mce: Uncorrected hardware memory error in user-access at e8ac0000
[20605.288347] Memory failure: 0xe8ac0: memory outside kernel control
[20605.288348] mce: Memory error not recovered
[20605.288361] amdgpu 0000:29:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0003 address=0x8ac0000 flags=0x0000]
[20605.288375] amdgpu 0000:29:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0003 address=0x8ac0000 flags=0x0000]
[20607.031477] amdgpu 0000:29:00.0: amdgpu: GPU reset succeeded, trying to resume
[20607.031591] [drm] PCIE GART of 512M enabled (table at 0x0000008000E10000).
[20607.031613] [drm] VRAM is lost due to GPU reset!
[20607.034094] [drm] PSP is resuming...
[20607.204092] [drm] reserve 0x900000 from 0x817e400000 for PSP TMR
[20607.273093] amdgpu 0000:29:00.0: amdgpu: RAS: optional ras ta ucode is not available
[20607.279097] amdgpu: SMU is resuming...
[20607.281035] amdgpu: SMU is resumed successfully!
[20607.397649] [drm] kiq ring mec 2 pipe 1 q 0
[20607.402090] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[20607.402494] [drm] JPEG decode initialized successfully.
[20607.402540] amdgpu 0000:29:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[20607.402542] amdgpu 0000:29:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[20607.402544] amdgpu 0000:29:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[20607.402546] amdgpu 0000:29:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[20607.402548] amdgpu 0000:29:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[20607.402549] amdgpu 0000:29:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[20607.402551] amdgpu 0000:29:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[20607.402553] amdgpu 0000:29:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[20607.402554] amdgpu 0000:29:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[20607.402556] amdgpu 0000:29:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[20607.402558] amdgpu 0000:29:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[20607.402559] amdgpu 0000:29:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[20607.402561] amdgpu 0000:29:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 1
[20607.402563] amdgpu 0000:29:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 1
[20607.402564] amdgpu 0000:29:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 1
[20607.402566] amdgpu 0000:29:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
[20607.405742] [drm] recover vram bo from shadow start
[20607.409317] [drm] recover vram bo from shadow done
[20607.409320] [drm] Skip scheduling IBs!
[20607.409410] amdgpu 0000:29:00.0: amdgpu: GPU reset(4) succeeded!
[20607.493800] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* got no status for stream 00000000fbb3d792 on acrtc00000000bb69f545
[20607.494597] ------------[ cut here ]------------
[20607.494599] WARNING: CPU: 10 PID: 999 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:7429 amdgpu_dm_atomic_commit_tail+0x1ada/0x22b0 [amdgpu]
[20607.494599] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq fuse joydev mousedev input_leds hid_generic usbhid hid uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_usb_audio videobuf2_common videodev snd_usbmidi_lib snd_rawmidi snd_seq_device mc rfkill squashfs nls_iso8859_1 snd_hda_codec_realtek nls_cp437 vfat snd_hda_codec_generic fat ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg loop snd_hda_codec edac_mce_amd amd_energy snd_hda_core kvm_amd snd_hwdep kvm wmi_bmof snd_pcm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel snd_timer aesni_intel r8169 snd crypto_simd realtek cryptd ccp glue_helper sp5100_tco k10temp soundcore libphy i2c_piix4 rng_core pcspkr wmi evdev mac_hid pinctrl_amd gpio_amdpt acpi_cpufreq uinput sg crypto_user ip_tables x_tables xhci_pci xhci_pci_renesas xhci_hcd amdgpu gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm
[20607.494633] CPU: 10 PID: 999 Comm: Xorg Tainted: G   M              5.8.0-rc1-MANJARO+ #2
[20607.494634] Hardware name: Micro-Star International Co., Ltd. MS-7B79/X470 GAMING PLUS (MS-7B79), BIOS A.G0 11/11/2019
[20607.494635] RIP: 0010:amdgpu_dm_atomic_commit_tail+0x1ada/0x22b0 [amdgpu]
[20607.494636] Code: 8b bd e8 fc ff ff e8 d5 7f 10 00 48 85 c0 0f 85 23 e9 ff ff 49 8b b5 e8 01 00 00 4c 89 e2 48 c7 c7 e0 5c 91 c0 e8 f6 74 d0 ff <0f> 0b 49 8b 4f 08 e9 10 e9 ff ff 49 8b 45 00 48 8d b8 78 01 00 00
[20607.494637] RSP: 0018:ffffa6b781987838 EFLAGS: 00010246
[20607.494638] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[20607.494639] RDX: 0000000000000000 RSI: ffffffffaaf63047 RDI: 00000000ffffffff
[20607.494640] RBP: ffffa6b781987ba8 R08: 000000000000053e R09: 0000000000000001
[20607.494641] R10: 0000000000000000 R11: 0000000000000001 R12: ffff941201964000
[20607.494641] R13: ffff9410db79d400 R14: ffff94110b71bc00 R15: ffff9410fcc69880
[20607.494642] FS:  00007f87fbe2be80(0000) GS:ffff94120ea80000(0000) knlGS:0000000000000000
[20607.494643] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[20607.494644] CR2: 0000000000fb1fe8 CR3: 0000000402700000 CR4: 00000000003406e0
[20607.494644] Call Trace:
[20607.494644]  ? sched_clock+0x5/0x10
[20607.494645]  ? irqtime_account_irq+0x90/0xc0
[20607.494646]  ? preempt_count_add+0x68/0xa0
[20607.494646]  commit_tail+0x94/0x130 [drm_kms_helper]
[20607.494647]  drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
[20607.494648]  drm_atomic_helper_update_plane+0xe9/0x140 [drm_kms_helper]
[20607.494648]  drm_mode_cursor_universal+0x128/0x240 [drm]
[20607.494649]  drm_mode_cursor_common+0x102/0x230 [drm]
[20607.494650]  ? drm_mode_cursor_ioctl+0x70/0x70 [drm]
[20607.494650]  drm_ioctl_kernel+0xb2/0x100 [drm]
[20607.494651]  drm_ioctl+0x208/0x360 [drm]
[20607.494651]  ? drm_mode_cursor_ioctl+0x70/0x70 [drm]
[20607.494652]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[20607.494652]  ksys_ioctl+0x82/0xc0
[20607.494653]  __x64_sys_ioctl+0x16/0x20
[20607.494653]  do_syscall_64+0x44/0x70
[20607.494654]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[20607.494655] RIP: 0033:0x7f87fca658eb
[20607.494655] Code: Bad RIP value.
[20607.494656] RSP: 002b:00007ffc20a98628 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[20607.494657] RAX: ffffffffffffffda RBX: 00007ffc20a98660 RCX: 00007f87fca658eb
[20607.494658] RDX: 00007ffc20a98660 RSI: 00000000c02464bb RDI: 000000000000000d
[20607.494659] RBP: 00000000c02464bb R08: 000055c87121c270 R09: 000000000000007f
[20607.494659] R10: 0000000000000a00 R11: 0000000000000246 R12: 000055c87109aad0
[20607.494660] R13: 000000000000000d R14: 0000000000000004 R15: 000055c87109b210
[20607.494661] ---[ end trace 96f7cc95700c9634 ]---
[20610.652685] GpuWatchdog[5225]: segfault at 0 ip 000055f7f6e6f76d sp 00007fa63e0b05d0 error 6 in chrome[55f7f27c2000+785b000]
[20610.652696] Code: Bad RIP value.
[20610.652994] audit: type=1701 audit(1592593154.666:113): auid=1000 uid=1000 gid=1000 ses=2 subj==unconfined pid=5147 comm="GpuWatchdog" exe="/opt/google/chrome/chrome" sig=11 res=1
[20610.674438] audit: type=1334 audit(1592593154.687:114): prog-id=15 op=LOAD
[20610.674597] audit: type=1334 audit(1592593154.687:115): prog-id=16 op=LOAD
[20610.675951] audit: type=1130 audit(1592593154.688:116): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=systemd-coredump@0-10631-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[20611.663071] audit: type=1131 audit(1592593155.675:117): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=systemd-coredump@0-10631-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[20611.701231] audit: type=1334 audit(1592593155.714:118): prog-id=16 op=UNLOAD
[20611.701236] audit: type=1334 audit(1592593155.714:119): prog-id=15 op=UNLOAD
[20617.685151] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:62:crtc-0] flip_done timed out
[20617.694549] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[20627.925351] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:62:crtc-0] flip_done timed out
[20638.165634] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CONNECTOR:80:DP-2] flip_done timed out
[20648.405154] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:55:plane-5] flip_done timed out
[20658.645157] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:61:plane-7] flip_done timed out
[20658.646471] ------------[ cut here ]------------
[20658.646473] WARNING: CPU: 10 PID: 999 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:7016 amdgpu_dm_atomic_commit_tail+0x2139/0x22b0 [amdgpu]
[20658.646474] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq fuse joydev mousedev input_leds hid_generic usbhid hid uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_usb_audio videobuf2_common videodev snd_usbmidi_lib snd_rawmidi snd_seq_device mc rfkill squashfs nls_iso8859_1 snd_hda_codec_realtek nls_cp437 vfat snd_hda_codec_generic fat ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg loop snd_hda_codec edac_mce_amd amd_energy snd_hda_core kvm_amd snd_hwdep kvm wmi_bmof snd_pcm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel snd_timer aesni_intel r8169 snd crypto_simd realtek cryptd ccp glue_helper sp5100_tco k10temp soundcore libphy i2c_piix4 rng_core pcspkr wmi evdev mac_hid pinctrl_amd gpio_amdpt acpi_cpufreq uinput sg crypto_user ip_tables x_tables xhci_pci xhci_pci_renesas xhci_hcd amdgpu gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm
[20658.646503] CPU: 10 PID: 999 Comm: Xorg Tainted: G   M    W         5.8.0-rc1-MANJARO+ #2
[20658.646504] Hardware name: Micro-Star International Co., Ltd. MS-7B79/X470 GAMING PLUS (MS-7B79), BIOS A.G0 11/11/2019
[20658.646505] RIP: 0010:amdgpu_dm_atomic_commit_tail+0x2139/0x22b0 [amdgpu]
[20658.646506] Code: 22 ef ff ff 41 8b 4c 24 60 48 c7 c2 20 bc 89 c0 bf 02 00 00 00 48 c7 c6 88 58 91 c0 e8 e0 6d d0 ff 49 8b 4f 08 e9 8f e0 ff ff <0f> 0b e9 0a f0 ff ff 0f 0b 0f 0b e9 21 f0 ff ff 48 8b 85 f0 fc ff
[20658.646506] RSP: 0018:ffffa6b781987948 EFLAGS: 00010002
[20658.646507] RAX: 0000000000000286 RBX: 0000000000000bfc RCX: 0000000000000000
[20658.646508] RDX: 0000000000000002 RSI: 0000000000000206 RDI: 0000000000000000
[20658.646509] RBP: ffffa6b781987cb8 R08: 0000000000000005 R09: 0000000000000000
[20658.646509] R10: ffffa6b7819878b0 R11: ffffa6b7819878b4 R12: 0000000000000286
[20658.646510] R13: ffff941201964000 R14: ffff9410db79c000 R15: ffff9410fcc69600
[20658.646511] FS:  00007f87fbe2be80(0000) GS:ffff94120ea80000(0000) knlGS:0000000000000000
[20658.646511] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[20658.646512] CR2: 00001a0ee45cb008 CR3: 0000000402700000 CR4: 00000000003406e0
[20658.646512] Call Trace:
[20658.646513]  commit_tail+0x94/0x130 [drm_kms_helper]
[20658.646514]  drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
[20658.646514]  drm_mode_obj_set_property_ioctl+0x156/0x320 [drm]
[20658.646515]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
[20658.646515]  drm_ioctl_kernel+0xb2/0x100 [drm]
[20658.646516]  drm_ioctl+0x208/0x360 [drm]
[20658.646516]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
[20658.646517]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[20658.646517]  ksys_ioctl+0x82/0xc0
[20658.646518]  __x64_sys_ioctl+0x16/0x20
[20658.646518]  do_syscall_64+0x44/0x70
[20658.646519]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[20658.646519] RIP: 0033:0x7f87fca658eb
[20658.646520] Code: Bad RIP value.
[20658.646520] RSP: 002b:00007ffc20a995c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[20658.646521] RAX: ffffffffffffffda RBX: 00007ffc20a99600 RCX: 00007f87fca658eb
[20658.646522] RDX: 00007ffc20a99600 RSI: 00000000c01864ba RDI: 000000000000000d
[20658.646523] RBP: 00000000c01864ba R08: 000000000000006c R09: 00000000cccccccc
[20658.646523] R10: 0000000000000fff R11: 0000000000000246 R12: 000055c87121db90
[20658.646524] R13: 000000000000000d R14: 0000000000000000 R15: 0000000000000003
[20658.646525] ---[ end trace 96f7cc95700c9635 ]---
[20658.646525] ------------[ cut here ]------------
[20658.646526] WARNING: CPU: 10 PID: 999 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:6613 amdgpu_dm_atomic_commit_tail+0x2142/0x22b0 [amdgpu]
[20658.646527] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq fuse joydev mousedev input_leds hid_generic usbhid hid uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_usb_audio videobuf2_common videodev snd_usbmidi_lib snd_rawmidi snd_seq_device mc rfkill squashfs nls_iso8859_1 snd_hda_codec_realtek nls_cp437 vfat snd_hda_codec_generic fat ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg loop snd_hda_codec edac_mce_amd amd_energy snd_hda_core kvm_amd snd_hwdep kvm wmi_bmof snd_pcm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel snd_timer aesni_intel r8169 snd crypto_simd realtek cryptd ccp glue_helper sp5100_tco k10temp soundcore libphy i2c_piix4 rng_core pcspkr wmi evdev mac_hid pinctrl_amd gpio_amdpt acpi_cpufreq uinput sg crypto_user ip_tables x_tables xhci_pci xhci_pci_renesas xhci_hcd amdgpu gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm
[20658.646556] CPU: 10 PID: 999 Comm: Xorg Tainted: G   M    W         5.8.0-rc1-MANJARO+ #2
[20658.646557] Hardware name: Micro-Star International Co., Ltd. MS-7B79/X470 GAMING PLUS (MS-7B79), BIOS A.G0 11/11/2019
[20658.646557] RIP: 0010:amdgpu_dm_atomic_commit_tail+0x2142/0x22b0 [amdgpu]
[20658.646558] Code: 48 c7 c2 20 bc 89 c0 bf 02 00 00 00 48 c7 c6 88 58 91 c0 e8 e0 6d d0 ff 49 8b 4f 08 e9 8f e0 ff ff 0f 0b e9 0a f0 ff ff 0f 0b <0f> 0b e9 21 f0 ff ff 48 8b 85 f0 fc ff ff 48 8d 8d 64 fd ff ff 48
[20658.646559] RSP: 0018:ffffa6b781987948 EFLAGS: 00010082
[20658.646560] RAX: 0000000000000001 RBX: 0000000000000bfc RCX: 0000000000000000
[20658.646561] RDX: 0000000000000002 RSI: 0000000000000206 RDI: 0000000000000000
[20658.646561] RBP: ffffa6b781987cb8 R08: 0000000000000005 R09: 0000000000000000
[20658.646562] R10: ffffa6b7819878b0 R11: ffffa6b7819878b4 R12: 0000000000000286
[20658.646563] R13: ffff941201964000 R14: ffff9410db79c000 R15: ffff9410fcc69600
[20658.646563] FS:  00007f87fbe2be80(0000) GS:ffff94120ea80000(0000) knlGS:0000000000000000
[20658.646564] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[20658.646564] CR2: 00001a0ee45cb008 CR3: 0000000402700000 CR4: 00000000003406e0
[20658.646565] Call Trace:
[20658.646565]  commit_tail+0x94/0x130 [drm_kms_helper]
[20658.646566]  drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
[20658.646567]  drm_mode_obj_set_property_ioctl+0x156/0x320 [drm]
[20658.646567]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
[20658.646568]  drm_ioctl_kernel+0xb2/0x100 [drm]
[20658.646568]  drm_ioctl+0x208/0x360 [drm]
[20658.646569]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
[20658.646569]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[20658.646570]  ksys_ioctl+0x82/0xc0
[20658.646570]  __x64_sys_ioctl+0x16/0x20
[20658.646571]  do_syscall_64+0x44/0x70
[20658.646571]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[20658.646572] RIP: 0033:0x7f87fca658eb
[20658.646572] Code: Bad RIP value.
[20658.646573] RSP: 002b:00007ffc20a995c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[20658.646574] RAX: ffffffffffffffda RBX: 00007ffc20a99600 RCX: 00007f87fca658eb
[20658.646574] RDX: 00007ffc20a99600 RSI: 00000000c01864ba RDI: 000000000000000d
[20658.646575] RBP: 00000000c01864ba R08: 000000000000006c R09: 00000000cccccccc
[20658.646576] R10: 0000000000000fff R11: 0000000000000246 R12: 000055c87121db90
[20658.646576] R13: 000000000000000d R14: 0000000000000000 R15: 0000000000000003
[20658.646577] ---[ end trace 96f7cc95700c9636 ]---
[20668.885142] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:62:crtc-0] flip_done timed out
[20684.245559] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:62:crtc-0] flip_done timed out
[20694.485139] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:61:plane-7] flip_done timed out
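
The amdgpu_job_timedout lines in the log above carry the ring name plus the last signaled and last emitted sequence numbers, and the line that follows names the process. Below is a minimal Python sketch for pulling these out of a saved dmesg dump. The names RingTimeout and parse_ring_timeouts are hypothetical, and the regular expressions are only assumed to match the message format quoted in this report; other kernel versions may word these lines differently.

```python
import re
from typing import List, NamedTuple

# Patterns modelled on the "*ERROR* ring ... timeout" lines quoted in this report.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
PROCESS_RE = re.compile(r"Process information: process (?P<process>\S*) pid (?P<pid>\d+)")


class RingTimeout(NamedTuple):
    ring: str
    signaled: int
    emitted: int
    process: str  # empty when no "Process information" line followed

    @property
    def outstanding(self) -> int:
        # Assumption: emitted minus signaled is the number of submissions
        # that had not completed when the timeout fired.
        return self.emitted - self.signaled


def parse_ring_timeouts(log_text: str) -> List[RingTimeout]:
    """Collect ring timeout events from dmesg/journal text."""
    events: List[RingTimeout] = []
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            events.append(
                RingTimeout(m["ring"], int(m["signaled"]), int(m["emitted"]), "")
            )
            continue
        p = PROCESS_RE.search(line)
        # Attach the process name to the most recent timeout that lacks one.
        if p and events and not events[-1].process:
            events[-1] = events[-1]._replace(process=p["process"])
    return events


if __name__ == "__main__":
    import sys

    for ev in parse_ring_timeouts(sys.stdin.read()):
        print(ev.ring, ev.process or "(unknown)", ev.outstanding)
```

Fed the log above on standard input (for example by piping the output of dmesg into the script), it should list the DOOMEternalx64v timeout on gfx_0.0.0 first and the plasmashell one second.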

Revision history for this message

In Linux Kernel Bug Tracker #201957, randyk161 (randyk161-linux-kernel-bugs) wrote on 2020-08-10:

#40

I've been getting "ring gfx timeouts" for some time, mostly when the computer has not had any input for a while (while I'm away from it). When it freezes I can SSH into it, but when I run "shutdown -h now" it boots me out of SSH as it should, yet the computer never seems to actually shut down. The screen stays frozen with whatever was on the display when it froze. Any help would be greatly appreciated. Here is my info:

Mobo: AsRock AB350 Pro4 UEFI: 5.80
Video card: Sapphire Nitro+ RX580 (8GB)
Distro: Manjaro
Kernel: 5.7.9-1-MANJARO
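
The AER lines in the journal excerpt below report "error status/mask=00200000/00000000" followed by "[21] ACSViol". As a minimal sketch (the helper name set_bits is hypothetical and not part of any kernel tool), the status word can be split into bit positions to check that bit 21 is the only one set, and that the all-zero mask hides nothing:

```python
from typing import List


def set_bits(value: int) -> List[int]:
    """Return the positions of all bits set in value, lowest first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


# Values copied from the "error status/mask=00200000/00000000" line.
status = int("00200000", 16)
mask = int("00000000", 16)

print(set_bits(status))          # bit positions reported in the status word
print(set_bits(status & ~mask))  # positions left after dropping masked bits
```

Both lines print the same single position, which agrees with the "[21]" index the kernel prints next to ACSViol.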

Aug 09 21:33:06.054857 kernel: pcieport 0000:00:03.1: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:00.0
Aug 09 21:33:06.068305 kernel: pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
Aug 09 21:33:06.068636 kernel: pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00200000/00000000
Aug 09 21:33:06.068863 kernel: pcieport 0000:00:03.1: AER:    [21] ACSViol                (First)
Aug 09 21:33:06.069137 kernel: amdgpu 0000:0a:00.0: AER: can't recover (no error_detected callback)
Aug 09 21:33:06.069421 kernel: snd_hda_intel 0000:0a:00.1: AER: can't recover (no error_detected callback)
Aug 09 21:33:06.069633 kernel: pcieport 0000:00:03.1: AER: device recovery failed
Aug 09 21:33:16.258283 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=9087, emitted seq=9089
Aug 09 21:33:16.258412 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Aug 09 21:33:16.258446 kernel: amdgpu 0000:0a:00.0: GPU reset begin!
Aug 09 21:33:16.258741 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Aug 09 21:33:16.258773 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.258803 kernel: amdgpu: [powerplay] 
                                failed to send message 261 ret is 65535 
Aug 09 21:33:16.258835 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.258869 kernel: amdgpu: [powerplay] 
                                failed to send message 261 ret is 65535 
Aug 09 21:33:16.258896 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.258925 kernel: amdgpu: [powerplay] 
                                failed to send message 261 ret is 65535 
Aug 09 21:33:16.258951 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.258977 kernel: amdgpu: [powerplay] 
                                failed to send message 261 ret is 65535 
Aug 09 21:33:16.259009 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.259035 kernel: amdgpu: [powerplay] 
                                failed to send message 261 ret is 65535 
Aug 09 21:33:16.259060 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.259084 kernel: amdgpu: [powerplay] 
                                failed to send message 261 ret is 65535 
Aug 09 21:33:16.259108 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.259131 kernel: amdgpu: [powerplay] 
                                failed to send message 261 ret is 65535 
Aug 09 21:33:16.259156 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.259186 kernel: amdgpu: [powerplay] 
                                failed to send message 261 ret is 65535 
Aug 09 21:33:16.259213 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.259242 kernel: amdgpu: [powerplay] 
                                failed to send message 261 ret is 65535 
Aug 09 21:33:16.259272 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.259298 kernel: amdgpu: [powerplay] 
                                failed to send message 261 ret is 65535 
Aug 09 21:33:16.259324 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.259350 kernel: amdgpu: [powerplay] 
                                failed to send message 261 ret is 65535 
Aug 09 21:33:16.259373 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.259400 kernel: amdgpu: [powerplay] 
                                failed to send message 261 ret is 65535 
Aug 09 21:33:16.259426 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.259456 kernel: amdgpu: [powerplay] 
                                failed to send message 261 ret is 65535 
Aug 09 21:33:16.259483 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.259509 kernel: amdgpu: [powerplay] 
                                failed to send message 261 ret is 65535 
Aug 09 21:33:16.259540 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.259566 kernel: amdgpu: [powerplay] 
                                failed to send message 261 ret is 65535 
Aug 09 21:33:16.259592 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.259617 kernel: amdgpu: [powerplay] 
                                failed to send message 261 ret is 65535 
Aug 09 21:33:16.259642 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.259671 kernel: amdgpu: [powerplay] 
                                failed to send message 261 ret is 65535 
Aug 09 21:33:16.259697 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.259723 kernel: amdgpu: [powerplay] 
                                failed to send message 306 ret is 65535 
Aug 09 21:33:16.259754 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.259785 kernel: amdgpu: [powerplay] 
                                failed to send message 5e ret is 65535 
Aug 09 21:33:16.259816 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.259860 kernel: amdgpu: [powerplay] 
                                failed to send message 145 ret is 65535 
Aug 09 21:33:16.259913 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.259947 kernel: amdgpu: [powerplay] 
                                failed to send message 146 ret is 65535 
Aug 09 21:33:16.259976 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.260003 kernel: amdgpu: [powerplay] 
                                failed to send message 148 ret is 65535 
Aug 09 21:33:16.260034 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.260061 kernel: amdgpu: [powerplay] 
                                failed to send message 145 ret is 65535 
Aug 09 21:33:16.260088 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:16.260114 kernel: amdgpu: [powerplay] 
                                failed to send message 146 ret is 65535 
Aug 09 21:33:16.291929 kernel: [drm] REG_WAIT timeout 10us * 3000 tries - dce110_stream_encoder_dp_blank line:955
Aug 09 21:33:16.292012 kernel: ------------[ cut here ]------------
Aug 09 21:33:16.292044 kernel: WARNING: CPU: 3 PID: 154 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:526 generic_reg_wait.cold+0x26/0x2d [amdgpu]
Aug 09 21:33:16.292070 kernel: Modules linked in: snd_seq_dummy snd_hrtimer snd_seq fuse nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_mangle iptable_raw iptable_security nfnetlink ip6table_filter ip6_tables iptable_filter squashfs loop nls_iso8859_1 nls_cp437 vfat fat uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common snd_usb_audio videodev snd_usbmidi_lib snd_rawmidi snd_seq_device mc joydev mousedev input_leds wmi_bmof amdgpu snd_hda_codec_realtek snd_hda_codec_generic wl(POE) ledtrig_audio snd_hda_codec_hdmi snd_hda_intel gpu_sched i2c_algo_bit edac_mce_amd snd_intel_dspcfg ttm snd_hda_codec kvm_amd drm_kms_helper r8169 snd_hda_core kvm cfg80211 snd_hwdep snd_pcm cec realtek irqbypass rc_core snd_timer libphy syscopyarea snd rfkill sysfillrect k10temp
Aug 09 21:33:16.292112 kernel:  pcspkr sysimgblt sp5100_tco i2c_piix4 fb_sys_fops soundcore wmi evdev mac_hid gpio_amdpt pinctrl_amd acpi_cpufreq drm uinput sg crypto_user agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 dm_crypt dm_mod uas usb_storage hid_logitech ff_memless hid_generic usbhid hid crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper ccp xhci_pci mpt3sas rng_core xhci_hcd raid_class scsi_transport_sas
Aug 09 21:33:16.292141 kernel: CPU: 3 PID: 154 Comm: kworker/3:1 Tainted: P           OE     5.7.9-1-MANJARO #1
Aug 09 21:33:16.292164 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AB350 Pro4, BIOS P5.80 06/14/2019
Aug 09 21:33:16.292188 kernel: Workqueue: events drm_sched_job_timedout [gpu_sched]
Aug 09 21:33:16.292213 kernel: RIP: 0010:generic_reg_wait.cold+0x26/0x2d [amdgpu]
Aug 09 21:33:16.292240 kernel: Code: a7 41 fd ff 44 8b 44 24 24 48 8b 4c 24 18 89 ee 48 c7 c7 08 14 cd c1 8b 54 24 20 e8 7a 91 d2 f9 83 7b 20 01 0f 84 c3 52 fd ff <0f> 0b e9 bc 52 fd ff 48 c7 c7 fd 4c c8 c1 e8 f3 c2 12 fa e8 4a 29
Aug 09 21:33:16.292263 kernel: RSP: 0018:ffffab9b806c3610 EFLAGS: 00010297
Aug 09 21:33:16.292284 kernel: RAX: 0000000000000052 RBX: ffff92334ad7fa40 RCX: 0000000000000000
Aug 09 21:33:16.292306 kernel: RDX: 0000000000000000 RSI: ffff92334e8d9ac8 RDI: 00000000ffffffff
Aug 09 21:33:16.292335 kernel: RBP: 000000000000000a R08: 0000000000000561 R09: 0000000000000001
Aug 09 21:33:16.292356 kernel: R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
Aug 09 21:33:16.292376 kernel: R13: 0000000000010000 R14: 0000000000004ea4 R15: 0000000000000bb9
Aug 09 21:33:16.292398 kernel: FS:  0000000000000000(0000) GS:ffff92334e8c0000(0000) knlGS:0000000000000000
Aug 09 21:33:16.292421 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 09 21:33:16.292446 kernel: CR2: 00007f494fc04000 CR3: 00000003af1ce000 CR4: 00000000003406e0
Aug 09 21:33:16.292466 kernel: Call Trace:
Aug 09 21:33:16.292485 kernel:  dce110_stream_encoder_dp_blank+0xea/0x140 [amdgpu]
Aug 09 21:33:16.292507 kernel:  core_link_disable_stream+0x9c/0x200 [amdgpu]
Aug 09 21:33:16.292525 kernel:  dce110_reset_hw_ctx_wrap+0xbe/0x240 [amdgpu]
Aug 09 21:33:16.292543 kernel:  dce110_apply_ctx_to_hw+0x4f/0x570 [amdgpu]
Aug 09 21:33:16.292560 kernel:  ? hwmgr_handle_task+0x98/0xf0 [amdgpu]
Aug 09 21:33:16.292578 kernel:  ? pp_dpm_dispatch_tasks+0x45/0x60 [amdgpu]
Aug 09 21:33:16.292598 kernel:  ? dm_pp_apply_display_requirements+0x19e/0x1c0 [amdgpu]
Aug 09 21:33:16.292619 kernel:  dc_commit_state+0x323/0x970 [amdgpu]
Aug 09 21:33:16.292640 kernel:  amdgpu_dm_atomic_commit_tail+0x38c/0x2310 [amdgpu]
Aug 09 21:33:16.292662 kernel:  ? free_one_page+0x57/0xd0
Aug 09 21:33:16.292680 kernel:  ? kfree+0x219/0x250
Aug 09 21:33:16.292698 kernel:  ? bw_calcs+0xa30/0x4380 [amdgpu]
Aug 09 21:33:16.292718 kernel:  ? dc_validate_global_state+0x2f2/0x390 [amdgpu]
Aug 09 21:33:16.292736 kernel:  commit_tail+0x94/0x130 [drm_kms_helper]
Aug 09 21:33:16.292757 kernel:  drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
Aug 09 21:33:16.292776 kernel:  drm_atomic_helper_disable_all+0x175/0x190 [drm_kms_helper]
Aug 09 21:33:16.292792 kernel:  drm_atomic_helper_suspend+0x78/0x150 [drm_kms_helper]
Aug 09 21:33:16.292810 kernel:  dm_suspend+0x1c/0x60 [amdgpu]
Aug 09 21:33:16.292869 kernel:  amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
Aug 09 21:33:16.292889 kernel:  ? _raw_spin_lock+0x13/0x30
Aug 09 21:33:16.292908 kernel:  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
Aug 09 21:33:16.292926 kernel:  amdgpu_device_pre_asic_reset+0x16b/0x182 [amdgpu]
Aug 09 21:33:16.292944 kernel:  amdgpu_device_gpu_recover.cold+0x42a/0xc74 [amdgpu]
Aug 09 21:33:16.292962 kernel:  amdgpu_job_timedout+0x105/0x130 [amdgpu]
Aug 09 21:33:16.292981 kernel:  drm_sched_job_timedout+0x64/0xe0 [gpu_sched]
Aug 09 21:33:16.293001 kernel:  process_one_work+0x1da/0x3d0
Aug 09 21:33:16.293017 kernel:  worker_thread+0x4d/0x3e0
Aug 09 21:33:16.293036 kernel:  ? rescuer_thread+0x3f0/0x3f0
Aug 09 21:33:16.293057 kernel:  kthread+0x13e/0x160
Aug 09 21:33:16.293074 kernel:  ? __kthread_bind_mask+0x60/0x60
Aug 09 21:33:16.293097 kernel:  ret_from_fork+0x22/0x40
Aug 09 21:33:16.293123 kernel: ---[ end trace aa4b924a702f7188 ]---
Aug 09 21:33:26.298272 kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 10secs aborting
Aug 09 21:33:26.298425 kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DB6E (len 824, WS 0, PS 0) @ 0xDCEE
Aug 09 21:33:26.298470 kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DA28 (len 326, WS 0, PS 0) @ 0xDB18
Aug 09 21:33:26.298505 kernel: [drm:dce110_link_encoder_disable_output [amdgpu]] *ERROR* dce110_link_encoder_disable_output: Failed to execute VBIOS command table!
Aug 09 21:33:26.298535 kernel: ------------[ cut here ]------------
Aug 09 21:33:26.298571 kernel: WARNING: CPU: 3 PID: 154 at drivers/gpu/drm/amd/amdgpu/../display/dc/dce/dce_link_encoder.c:1099 dce110_link_encoder_disable_output+0x141/0x150 [amdgpu]
Aug 09 21:33:26.298607 kernel: Modules linked in: snd_seq_dummy snd_hrtimer snd_seq fuse nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_mangle iptable_raw iptable_security nfnetlink ip6table_filter ip6_tables iptable_filter squashfs loop nls_iso8859_1 nls_cp437 vfat fat uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common snd_usb_audio videodev snd_usbmidi_lib snd_rawmidi snd_seq_device mc joydev mousedev input_leds wmi_bmof amdgpu snd_hda_codec_realtek snd_hda_codec_generic wl(POE) ledtrig_audio snd_hda_codec_hdmi snd_hda_intel gpu_sched i2c_algo_bit edac_mce_amd snd_intel_dspcfg ttm snd_hda_codec kvm_amd drm_kms_helper r8169 snd_hda_core kvm cfg80211 snd_hwdep snd_pcm cec realtek irqbypass rc_core snd_timer libphy syscopyarea snd rfkill sysfillrect k10temp
Aug 09 21:33:26.298656 kernel:  pcspkr sysimgblt sp5100_tco i2c_piix4 fb_sys_fops soundcore wmi evdev mac_hid gpio_amdpt pinctrl_amd acpi_cpufreq drm uinput sg crypto_user agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 dm_crypt dm_mod uas usb_storage hid_logitech ff_memless hid_generic usbhid hid crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper ccp xhci_pci mpt3sas rng_core xhci_hcd raid_class scsi_transport_sas
Aug 09 21:33:26.298691 kernel: CPU: 3 PID: 154 Comm: kworker/3:1 Tainted: P        W  OE     5.7.9-1-MANJARO #1
Aug 09 21:33:26.298722 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AB350 Pro4, BIOS P5.80 06/14/2019
Aug 09 21:33:26.298753 kernel: Workqueue: events drm_sched_job_timedout [gpu_sched]
Aug 09 21:33:26.298783 kernel: RIP: 0010:dce110_link_encoder_disable_output+0x141/0x150 [amdgpu]
Aug 09 21:33:26.298811 kernel: Code: 44 24 38 65 48 2b 04 25 28 00 00 00 75 20 48 83 c4 40 5b 5d 41 5c c3 48 c7 c6 60 4a c4 c1 48 c7 c7 30 f2 cb c1 e8 4f 2c bd fe <0f> 0b eb d0 e8 76 01 db f9 66 0f 1f 44 00 00 0f 1f 44 00 00 41 57
Aug 09 21:33:26.298840 kernel: RSP: 0018:ffffab9b806c3600 EFLAGS: 00010246
Aug 09 21:33:26.298865 kernel: RAX: 0000000000000000 RBX: 0000000000000020 RCX: 0000000000000000
Aug 09 21:33:26.298896 kernel: RDX: 0000000000000000 RSI: 0000000000000086 RDI: 00000000ffffffff
Aug 09 21:33:26.298926 kernel: RBP: ffff923349574720 R08: 0000000000000598 R09: 0000000000000001
Aug 09 21:33:26.298954 kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffffab9b806c3604
Aug 09 21:33:26.298979 kernel: R13: 0000000000000000 R14: ffff923251500000 R15: ffff92334c016900
Aug 09 21:33:26.299004 kernel: FS:  0000000000000000(0000) GS:ffff92334e8c0000(0000) knlGS:0000000000000000
Aug 09 21:33:26.299032 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 09 21:33:26.299059 kernel: CR2: 00007f494fc04000 CR3: 000000038dd62000 CR4: 00000000003406e0
Aug 09 21:33:26.299087 kernel: Call Trace:
Aug 09 21:33:26.299111 kernel:  dp_disable_link_phy+0x83/0x150 [amdgpu]
Aug 09 21:33:26.299142 kernel:  disable_link+0x4f/0xa0 [amdgpu]
Aug 09 21:33:26.299170 kernel:  core_link_disable_stream+0xd6/0x200 [amdgpu]
Aug 09 21:33:26.299203 kernel:  dce110_reset_hw_ctx_wrap+0xbe/0x240 [amdgpu]
Aug 09 21:33:26.299231 kernel:  dce110_apply_ctx_to_hw+0x4f/0x570 [amdgpu]
Aug 09 21:33:26.299258 kernel:  ? hwmgr_handle_task+0x98/0xf0 [amdgpu]
Aug 09 21:33:26.299283 kernel:  ? pp_dpm_dispatch_tasks+0x45/0x60 [amdgpu]
Aug 09 21:33:26.299309 kernel:  ? dm_pp_apply_display_requirements+0x19e/0x1c0 [amdgpu]
Aug 09 21:33:26.299361 kernel:  dc_commit_state+0x323/0x970 [amdgpu]
Aug 09 21:33:26.299392 kernel:  amdgpu_dm_atomic_commit_tail+0x38c/0x2310 [amdgpu]
Aug 09 21:33:26.299421 kernel:  ? free_one_page+0x57/0xd0
Aug 09 21:33:26.299448 kernel:  ? kfree+0x219/0x250
Aug 09 21:33:26.299476 kernel:  ? bw_calcs+0xa30/0x4380 [amdgpu]
Aug 09 21:33:26.299502 kernel:  ? dc_validate_global_state+0x2f2/0x390 [amdgpu]
Aug 09 21:33:26.299532 kernel:  commit_tail+0x94/0x130 [drm_kms_helper]
Aug 09 21:33:26.299555 kernel:  drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
Aug 09 21:33:26.299581 kernel:  drm_atomic_helper_disable_all+0x175/0x190 [drm_kms_helper]
Aug 09 21:33:26.299606 kernel:  drm_atomic_helper_suspend+0x78/0x150 [drm_kms_helper]
Aug 09 21:33:26.299633 kernel:  dm_suspend+0x1c/0x60 [amdgpu]
Aug 09 21:33:26.299660 kernel:  amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
Aug 09 21:33:26.299685 kernel:  ? _raw_spin_lock+0x13/0x30
Aug 09 21:33:26.299710 kernel:  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
Aug 09 21:33:26.299736 kernel:  amdgpu_device_pre_asic_reset+0x16b/0x182 [amdgpu]
Aug 09 21:33:26.299761 kernel:  amdgpu_device_gpu_recover.cold+0x42a/0xc74 [amdgpu]
Aug 09 21:33:26.299787 kernel:  amdgpu_job_timedout+0x105/0x130 [amdgpu]
Aug 09 21:33:26.299818 kernel:  drm_sched_job_timedout+0x64/0xe0 [gpu_sched]
Aug 09 21:33:26.299844 kernel:  process_one_work+0x1da/0x3d0
Aug 09 21:33:26.299872 kernel:  worker_thread+0x4d/0x3e0
Aug 09 21:33:26.299898 kernel:  ? rescuer_thread+0x3f0/0x3f0
Aug 09 21:33:26.299925 kernel:  kthread+0x13e/0x160
Aug 09 21:33:26.299951 kernel:  ? __kthread_bind_mask+0x60/0x60
Aug 09 21:33:26.299979 kernel:  ret_from_fork+0x22/0x40
Aug 09 21:33:26.300004 kernel: ---[ end trace aa4b924a702f7189 ]---
Aug 09 21:33:36.301609 kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 10secs aborting
Aug 09 21:33:36.301729 kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C51A (len 62, WS 0, PS 0) @ 0xC536
Aug 09 21:33:36.334815 kernel: [drm] REG_WAIT timeout 10us * 3000 tries - dce110_stream_encoder_dp_blank line:955
Aug 09 21:33:46.338270 kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 10secs aborting
Aug 09 21:33:46.338400 kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DB6E (len 824, WS 0, PS 0) @ 0xDCEE
Aug 09 21:33:46.338434 kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DA28 (len 326, WS 0, PS 0) @ 0xDB18
Aug 09 21:33:46.338466 kernel: [drm:dce110_link_encoder_disable_output [amdgpu]] *ERROR* dce110_link_encoder_disable_output: Failed to execute VBIOS command table!
Aug 09 21:33:56.339196 plasmashell[1403]: qrc:/plasma/plasmoids/org.kde.plasma.volume/contents/ui/ListItemBase.qml:151: TypeError: Cannot read property 'ports' of undefined
Aug 09 21:33:56.346378 kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 10secs aborting
Aug 09 21:33:56.346481 kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C51A (len 62, WS 0, PS 0) @ 0xC536
Aug 09 21:33:56.346519 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:56.346572 kernel: amdgpu: [powerplay] 
                                failed to send message 148 ret is 65535 
Aug 09 21:33:56.346606 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:56.346632 kernel: amdgpu: [powerplay] 
                                failed to send message 145 ret is 65535 
Aug 09 21:33:56.346659 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:56.346692 kernel: amdgpu: [powerplay] 
                                failed to send message 146 ret is 65535 
Aug 09 21:33:56.345571 plasmashell[1403]: qrc:/plasma/plasmoids/org.kde.plasma.volume/contents/ui/main.qml:550:39: QML DeviceListItem: Binding loop detected for property "width"
Aug 09 21:33:56.591481 kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vce_v3_0> failed -110
Aug 09 21:33:57.054823 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:57.054914 kernel: amdgpu: [powerplay] 
                                failed to send message 133 ret is 65535 
Aug 09 21:33:57.054952 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:57.054971 kernel: amdgpu: [powerplay] 
                                failed to send message 306 ret is 65535 
Aug 09 21:33:57.054990 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:57.055010 kernel: amdgpu: [powerplay] 
                                failed to send message 5e ret is 65535 
Aug 09 21:33:57.055027 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:57.055047 kernel: amdgpu: [powerplay] 
                                failed to send message 145 ret is 65535 
Aug 09 21:33:57.055064 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:57.055080 kernel: amdgpu: [powerplay] 
                                failed to send message 146 ret is 65535 
Aug 09 21:33:57.055097 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:57.055113 kernel: amdgpu: [powerplay] 
                                failed to send message 148 ret is 65535 
Aug 09 21:33:57.055134 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:57.055151 kernel: amdgpu: [powerplay] 
                                failed to send message 145 ret is 65535 
Aug 09 21:33:57.055165 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:57.055180 kernel: amdgpu: [powerplay] 
                                failed to send message 146 ret is 65535 
Aug 09 21:33:57.055195 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:57.055208 kernel: amdgpu: [powerplay] 
                                failed to send message 16a ret is 65535 
Aug 09 21:33:57.055225 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:57.055238 kernel: amdgpu: [powerplay] 
                                failed to send message 186 ret is 65535 
Aug 09 21:33:57.055253 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:57.055267 kernel: amdgpu: [powerplay] 
                                failed to send message 54 ret is 65535 
Aug 09 21:33:57.558146 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:57.558240 kernel: amdgpu: [powerplay] 
                                failed to send message 26b ret is 65535 
Aug 09 21:33:57.558260 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:57.558279 kernel: amdgpu: [powerplay] 
                                failed to send message 13d ret is 65535 
Aug 09 21:33:57.558297 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:57.558313 kernel: amdgpu: [powerplay] 
                                failed to send message 14f ret is 65535 
Aug 09 21:33:57.558329 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:57.558342 kernel: amdgpu: [powerplay] 
                                failed to send message 151 ret is 65535 
Aug 09 21:33:57.558356 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:57.558369 kernel: amdgpu: [powerplay] 
                                failed to send message 135 ret is 65535 
Aug 09 21:33:57.558384 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:57.558398 kernel: amdgpu: [powerplay] 
                                failed to send message 190 ret is 65535 
Aug 09 21:33:57.558415 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:57.558428 kernel: amdgpu: [powerplay] 
                                failed to send message 63 ret is 65535 
Aug 09 21:33:57.558442 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:33:57.558454 kernel: amdgpu: [powerplay] 
                                failed to send message 84 ret is 65535 
Aug 09 21:33:57.558468 kernel: amdgpu: [powerplay] Failed to force to switch arbf0!
Aug 09 21:33:57.558485 kernel: amdgpu: [powerplay] [disable_dpm_tasks] Failed to disable DPM!
Aug 09 21:33:57.558502 kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -22
Aug 09 21:33:57.811494 kernel: amdgpu 0000:0a:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Aug 09 21:33:57.811816 kernel: [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Aug 09 21:33:58.314928 kernel: cp is busy, skip halt cp
Aug 09 21:33:58.564823 kernel: rlc is busy, skip halt rlc
Aug 09 21:33:58.818145 kernel: amdgpu 0000:0a:00.0: GPU BACO reset
Aug 09 21:34:59.601512 kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 10secs aborting
Aug 09 21:34:59.601664 kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C51A (len 62, WS 0, PS 0) @ 0xC536
Aug 09 21:34:59.601700 kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing ADA0 (len 142, WS 0, PS 8) @ 0xADBB
Aug 09 21:34:59.601732 kernel: [drm] asic atom init failed!
Aug 09 21:34:59.601767 kernel: amdgpu 0000:0a:00.0: GPU reset succeeded, trying to resume
Aug 09 21:34:59.851491 kernel: amdgpu 0000:0a:00.0: Wait for MC idle timedout !
Aug 09 21:35:00.101588 kernel: amdgpu 0000:0a:00.0: Wait for MC idle timedout !
Aug 09 21:35:00.104823 kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F4007E9000).
Aug 09 21:35:00.104893 kernel: [drm] VRAM is lost due to GPU reset!
Aug 09 21:35:00.121493 kernel: amdgpu: [powerplay] Failed to send Message.
Aug 09 21:35:00.121580 kernel: amdgpu: [powerplay] SMC address must be 4 byte aligned.
Aug 09 21:35:00.121616 kernel: amdgpu: [powerplay] [AVFS][Polaris10_SetupGfxLvlStruct] Problems copying VRConfig value over to SMC
Aug 09 21:35:00.121645 kernel: amdgpu: [powerplay] [AVFS][Polaris10_AVFSEventMgr] Could not Copy Graphics Level table over to SMU
Aug 09 21:35:00.121672 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:35:00.121706 kernel: amdgpu: [powerplay] 
                                failed to send message 252 ret is 65535 
Aug 09 21:35:00.121740 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:35:00.121767 kernel: amdgpu: [powerplay] 
                                failed to send message 253 ret is 65535 
Aug 09 21:35:00.121796 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:35:00.121822 kernel: amdgpu: [powerplay] 
                                failed to send message 250 ret is 65535 
Aug 09 21:35:00.121853 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:35:00.121879 kernel: amdgpu: [powerplay] 
                                failed to send message 251 ret is 65535 
Aug 09 21:35:00.121911 kernel: amdgpu: [powerplay] 
                                last message was failed ret is 65535
Aug 09 21:35:00.121940 kernel: amdgpu: [powerplay] 
                                failed to send message 254 ret is 65535 
Aug 09 21:35:00.374824 kernel: [drm] Timeout wait for RLC serdes 0,0
Aug 09 21:35:00.624828 kernel: amdgpu 0000:0a:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
Aug 09 21:35:00.625100 kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -110
Aug 09 21:35:00.625130 kernel: [drm] Skip scheduling IBs!
Aug 09 21:35:00.625152 kernel: [drm] Skip scheduling IBs!
Aug 09 21:35:00.625166 kernel: [drm] Skip scheduling IBs!
Aug 09 21:35:00.625180 kernel: amdgpu 0000:0a:00.0: GPU reset(2) failed
Aug 09 21:35:00.625307 kernel: [drm] Skip scheduling IBs!
Aug 09 21:35:00.625320 kernel: amdgpu 0000:0a:00.0: GPU reset end with ret = -110
Aug 09 21:35:10.818142 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=9089, emitted seq=9089
Aug 09 21:35:10.818255 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Aug 09 21:35:10.818280 kernel: amdgpu 0000:0a:00.0: GPU reset begin!

Revision history for this message

In Linux Kernel Bug Tracker #201957, dushistov (dushistov-linux-kernel-bugs) wrote on 2020-09-01:

#41

Linux kernel 5.4.61 (amd64) with a Radeon RX 560 hit the same problem today:

[86631.543134] [drm] Fence fallback timer expired on ring gfx
[86642.133543] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1349762, emitted seq=1349767
[86642.133628] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 8032 thread Xorg:cs0 pid 8155
[86642.133634] amdgpu 0000:41:00.0: GPU reset begin!
[86642.134073] amdgpu: [powerplay]
last message was failed ret is 65535
[86642.134075] amdgpu: [powerplay]
failed to send message 281 ret is 65535

I have never seen a similar problem before.
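
Since most reports in this thread quote the same "ring ... timeout" line, here is a rough sketch for pulling the ring name and the two sequence numbers out of it. The helper name `parse_ring_timeout` and the `pending` field are my own. Reading `emitted - signaled` as the number of fences submitted but not yet completed is an assumption about what the driver prints, not something confirmed in this thread.

```python
import re

# Pattern for lines such as:
# "... *ERROR* ring gfx timeout, signaled seq=1349762, emitted seq=1349767"
TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return ring name and sequence numbers from a timeout line, or None."""
    m = TIMEOUT.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    return {
        "ring": m.group("ring"),
        "signaled": signaled,
        "emitted": emitted,
        # Assumed meaning: work emitted to the ring but not yet signaled.
        "pending": emitted - signaled,
    }


# Line copied from the report above.
line = ("[86642.133543] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx "
        "timeout, signaled seq=1349762, emitted seq=1349767")
info = parse_ring_timeout(line)
print(info)
```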

Revision history for this message

In Linux Kernel Bug Tracker #201957, juan.zenos (juan.zenos-linux-kernel-bugs) wrote on 2020-09-13:

#42

I have this problem with two different brand-new RX 580s, in both a brand-new ASUS prime-p x570 and an old ASUS p9x79, on Ubuntu 20.04 with various kernels (5.4.x - 5.8.x - ...).

I wanted to play these games on Linux so badly; the heartbreaking solution is to purchase a Windows license... ;_;

Revision history for this message

In Linux Kernel Bug Tracker #201957, majorgonzo (majorgonzo-linux-kernel-bugs) wrote on 2020-11-23:

#43

I have a similar problem, a cascade of errors that typically starts with one of these:
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1093546, emitted seq=1093548

This used to occur only when playing Dauntless, and only after my MSI Radeon RX580 ran hot for a while. Warframe never crashed. The two games are run in totally different ways (Dauntless = Lutris and Epic Games Store, Warframe = Steam and Proton). Something then changed after one of the updates within the last month, and now it crashes on both Warframe and Dauntless well before the card reaches a high temperature. Basically, I can't run either for more than about 5 minutes.

I was running Ubuntu 18.04, so I figured maybe a newer kernel would fix this, but updating to 20.10 did nothing but waste a couple of days of reloading everything.

System: Ryzen 5 3600 on Gigabyte x570 UD with a MSI Radeon RX580 8GB

I'm willing to work with whoever can look into this, and to send whatever info/logs are necessary to get it fixed.

Revision history for this message

In Linux Kernel Bug Tracker #201957, randyk161 (randyk161-linux-kernel-bugs) wrote on 2021-01-24:

#44

There doesn't appear to be any progress on this bug. Does anyone have any suggestions on how to fix this issue?

Revision history for this message

In Linux Kernel Bug Tracker #201957, j.cordoba (j.cordoba-linux-kernel-bugs) wrote on 2021-01-24:

#45

(In reply to Randune from comment #39)
> There doesn't appear to be any progress on this bug, does anyone have any
> suggestions with regards on how to fix this issue?

Try adding iommu=pt as a kernel parameter.

Revision history for this message

In Linux Kernel Bug Tracker #201957, panospolychronis (panospolychronis-linux-kernel-bugs) wrote on 2021-01-24:

#46

(In reply to j.cordoba from comment #40)
> (In reply to Randune from comment #39)
> > There doesn't appear to be any progress on this bug, does anyone have any
> > suggestions with regards on how to fix this issue?
>
> Try to add iommu=pt as parameter

I'm running Linux kernel 5.10.9 with these kernel parameters: "amdgpu.ppfeaturemask=0xffffbffb amdgpu.noretry=0 amdgpu.lockup_timeout=0 amdgpu.gpu_recovery=1 amdgpu.audio=0 amdgpu.deep_color=1 amd_iommu=on iommu=pt". My graphics card is a Radeon 5600XT, and I can confirm that this issue still exists :)
Meanwhile, I looked at https://lists.freedesktop.org/archives/amd-gfx/2021-January/date.html and there are some patches about ring timeouts which I think aren't yet merged for the next Linux kernel release. Alex Deucher will probably merge them later.
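
To double-check which of these options are actually active on a running system, the kernel command line can be read back from `/proc/cmdline`. A minimal sketch follows. The helper names `parse_cmdline` and `read_cmdline` are hypothetical, and the whitespace split assumes no parameter value contains quoted spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read().strip()


# Parameter string quoted in the comment above.
example = (
    "amdgpu.ppfeaturemask=0xffffbffb amdgpu.noretry=0 amdgpu.lockup_timeout=0 "
    "amdgpu.gpu_recovery=1 amdgpu.audio=0 amdgpu.deep_color=1 "
    "amd_iommu=on iommu=pt"
)

params = parse_cmdline(example)
amdgpu_opts = {k: v for k, v in params.items() if k.startswith("amdgpu.")}

print("iommu =", params.get("iommu"))
for name, value in amdgpu_opts.items():
    print(name, "=", value)
```

On a live machine, `parse_cmdline(read_cmdline())` would show what the kernel was really booted with, which helps rule out a bootloader config that was edited but never regenerated.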

Revision history for this message

In Linux Kernel Bug Tracker #201957, majorgonzo (majorgonzo-linux-kernel-bugs) wrote on 2021-01-24:

#47

I made a change a while back. I added:
amdgpu.gpu_recovery=1
as a grub parameter. I have no other (of the many suggested) parameters set:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.ppfeaturemask=0xfffd7fff amdgpu.gpu_recovery=1"

The feature mask was used to enable reducing the top speed of my video card to reduce heating, and I was using corectrl for that. However, it was something I had to set manually after each boot. Of course, I forgot to do so, and yet it still stopped occurring. So in reality, I don't think I need that anymore, either.
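
As a side note on the two feature masks mentioned in this thread (0xfffd7fff here and 0xffffbffb a few comments up), a quick sketch shows which bits each one leaves cleared. It assumes the mask is read as a 32-bit bitfield. Which power feature each bit corresponds to depends on the kernel version, so the bits are deliberately left unnamed here.

```python
def cleared_bits(mask, width=32):
    """Return the positions of bits that are 0 in mask, lowest bit first."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


# The two amdgpu.ppfeaturemask values quoted in this thread.
for mask in (0xfffd7fff, 0xffffbffb):
    bits = cleared_bits(mask)
    print(hex(mask), "clears bits", bits, "=", [hex(1 << b) for b in bits])
```

The two masks differ in which bits they clear, so they are not interchangeable settings.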

Just checked my linux logs grepping for "ring gfx". Before the change, I had multiples each day up to Dec 10th. Since then, I've had 3.

Also of note - for the last two, it was when I WASN'T playing. Well, I was playing a game, but I was AFK. It seemed when I returned and did something, it went black then.

Lastly, just to confirm, I checked my change log (my own log), and I did, indeed, make that change on 10 Dec.

Revision history for this message

In Linux Kernel Bug Tracker #201957, randyk161 (randyk161-linux-kernel-bugs) wrote on 2021-01-25:

#48

(In reply to Panagiotis Polychronis from comment #41)
> (In reply to j.cordoba from comment #40)
> > (In reply to Randune from comment #39)
> > > There doesn't appear to be any progress on this bug, does anyone have any
> > > suggestions with regards on how to fix this issue?
> >
> > Try to add iommu=pt as parameter
>
> I'm running Linux Kernel 5.10.9 with those kernel parameters
> "amdgpu.ppfeaturemask=0xffffbffb amdgpu.noretry=0 amdgpu.lockup_timeout=0
> amdgpu.gpu_recovery=1 amdgpu.audio=0 amdgpu.deep_color=1 amd_iommu=on
> iommu=pt" My graphics card is Radeon 5600XT and i can confirm that this
> issue still exist :)
> Meanwhile i looked at
> https://lists.freedesktop.org/archives/amd-gfx/2021-January/date.html and
> there are some patches about ring timeout which i think they aren't yet
> merged for the next Linux Kernel release. Probably Alex Deucher will merge
> them later.

Thanks for the suggestion, Panagiotis Polychronis. I've tried that in the past and it didn't seem to help. I'm currently running Manjaro on the Linux 5.11-rc3 kernel, as it supposedly has many AMDGPU changes (I'm not sure how many of them affect my RX580), but it's worth a shot. I'm basically shooting in the dark at this point :).

Revision history for this message

In Linux Kernel Bug Tracker #201957, majorgonzo (majorgonzo-linux-kernel-bugs) wrote on 2021-01-26:

#49

Here's another thing I tried which also may have made a difference. Gonna sound weird, but worth a try. I had a 675VA UPS that my system was plugged into. One time, it started shrieking (weird beepish sounds) as I was doing heavy gaming with lots of visual effects going on. I looked it up, and it seems that if your UPS, or your power strip, can't deliver enough power, it can cause the issues with these GPU cards. I mentioned Dec 10th as the date I made the change for my boot parameters, but it's also the date I plugged my system directly into the wall. Responding yesterday reminded me I have a new, more powerful UPS and I plugged my system into that today. I'll see if it changes anything.

P.S. I know the argument...power is power...but it's not. If the surge protector or UPS has cheap, thin wiring, then that restricts the amount of amps that can flow through it.

Revision history for this message

In Linux Kernel Bug Tracker #201957, playdohcrow (playdohcrow-linux-kernel-bugs) wrote on 2021-02-14:

#50

I still have this issue when I play "Interstellar Marines"

kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=163824, emitted seq=163826
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process InterstellarMar pid 4378 thread Interstell:cs0 pid 4382

kernel: 5.10.14-200.fc33.

videocard: Radeon HD7770

When this happens, the image freezes, the system stops responding to keypresses but the background music plays for a few minutes and I have to hit <reset>.

Revision history for this message

In Linux Kernel Bug Tracker #201957, fice (fice-linux-kernel-bugs) wrote on 2021-02-28:

#51

(In reply to MajorGonzo from comment #44)
> Here's another thing I tried which also may have made a difference.  Gonna
> sound weird, but worth a try.  I had a 675VA UPS that my system was plugged
> into.  One time, it started shrieking (weird beepish sounds) as I was doing
> heavy gaming with lots of visual effects going on.  I looked it up, and it
> seems that if your UPS, or your power strip, can't deliver enough power, it
> can cause the issues with these GPU cards.  I mentioned Dec 10th as the date
> I made the change for my boot parameters, but it's also the date I plugged
> my system directly into the wall.  Responding yesterday reminded me I have a
> new, more powerful UPS and I plugged my system into that today.  I'll see if
> it changes anything.
> 
> P.S.  I know the argument...power is power...but it's not.  If the surge
> protector, or UPS has cheap, thin wiring, then that restricts the amount of
> amps that can flow through them.

I had an old PSU, which was repaired once, so I replaced it. That did not resolve the issue. The PSU is connected directly to the wall socket.

Kernel 5.10.18-200.fc33
AMD Ryzen 3 2200G with Radeon Vega Graphics

The bug is most often triggered when using Firefox.

[42174.187004] amdgpu 0000:06:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32772, for process firefox pid 21156 thread firefox:cs0 pid 21244)
[42174.187007] amdgpu 0000:06:00.0: amdgpu:   in page starting at address 0x0000000000200000 from client 27
[42174.187008] amdgpu 0000:06:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00100431
[42174.187009] amdgpu 0000:06:00.0: amdgpu: 	 Faulty UTCL2 client ID: IA (0x2)
[42174.187010] amdgpu 0000:06:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[42174.187010] amdgpu 0000:06:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[42174.187011] amdgpu 0000:06:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[42174.187012] amdgpu 0000:06:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[42174.187012] amdgpu 0000:06:00.0: amdgpu: 	 RW: 0x0
... (the above messages are repeated many times)
[42184.187655] amdgpu 0000:06:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32772, for process firefox pid 21156 thread firefox:cs0 pid 21244)
[42184.187656] amdgpu 0000:06:00.0: amdgpu:   in page starting at address 0x0000000000200000 from client 27
[42184.187656] amdgpu 0000:06:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00100431
[42184.187657] amdgpu 0000:06:00.0: amdgpu: 	 Faulty UTCL2 client ID: IA (0x2)
[42184.187657] amdgpu 0000:06:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[42184.187658] amdgpu 0000:06:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[42184.187658] amdgpu 0000:06:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[42184.187659] amdgpu 0000:06:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[42184.187660] amdgpu 0000:06:00.0: amdgpu: 	 RW: 0x0
[42184.328388] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=109568, emitted seq=109570
[42184.328538] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 21156 thread firefox:cs0 pid 21244
[42184.328542] amdgpu 0000:06:00.0: amdgpu: GPU reset begin!
[42184.330868] amdgpu 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x10cd079a0 flags=0x0070]
[42184.330878] amdgpu 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x10cd079c0 flags=0x0070]
[42184.330894] amdgpu 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x10cd40000 flags=0x0070]
[42184.330901] amdgpu 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x10cd079e0 flags=0x0070]
[42184.330909] amdgpu 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x10cd40000 flags=0x0070]
[42184.330917] amdgpu 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x10cd07a00 flags=0x0070]
[42184.330924] amdgpu 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x10cd40000 flags=0x0070]
[42184.330942] amdgpu 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x10cd07a20 flags=0x0070]
[42184.330950] amdgpu 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x10cd40000 flags=0x0070]
[42184.330966] amdgpu 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x10cd07a40 flags=0x0070]
[42184.421882] [drm] free PSP TMR buffer
[42184.451954] amdgpu 0000:06:00.0: amdgpu: GPU reset succeeded, trying to resume
[42184.452090] [drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
[42184.452275] [drm] PSP is resuming...
[42184.472305] [drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
[42184.537825] amdgpu 0000:06:00.0: amdgpu: RAS: optional ras ta ucode is not available
[42184.546811] amdgpu 0000:06:00.0: amdgpu: RAP: optional rap ta ucode is not available
[42184.759724] [drm] kiq ring mec 2 pipe 1 q 0
[42184.958535] amdgpu 0000:06:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[42184.958584] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
[42184.958597] amdgpu 0000:06:00.0: amdgpu: GPU reset(2) failed
[42184.958667] amdgpu 0000:06:00.0: amdgpu: GPU reset end with ret = -110
[42195.061025] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[42205.292585] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[42243.148200] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=21546, emitted seq=21548
[42243.148346] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[42243.148351] amdgpu 0000:06:00.0: amdgpu: GPU reset begin!
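
As a cross-check of the fault lines above, the following Python sketch splits the raw `VM_L2_PROTECTION_FAULT_STATUS` value into the fields the driver prints. The bit positions are an assumption (a gfx9-style layout written down from memory); they have only been checked against the values in this log, so treat the table as illustrative rather than authoritative.

```python
# Field name -> (bit offset, width). ASSUMED layout, verified only against
# the decoded values printed in the log above.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status):
    """Extract each bitfield from the raw status register value."""
    decoded = {}
    for name, (shift, width) in FIELDS.items():
        mask = (1 << width) - 1
        decoded[name] = (status >> shift) & mask
    return decoded


for name, value in decode_fault_status(0x00100431).items():
    print(f"{name}: {value:#x}")
```

With this layout, 0x00100431 gives MORE_FAULTS 0x1, PERMISSION_FAULTS 0x3, client ID 0x2 and VMID 1, which agrees with the "IA (0x2)" and "vmid:1" entries the kernel logged.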

In Linux Kernel Bug Tracker #201957, csaba.timar01 (csaba.timar01-linux-kernel-bugs) wrote on 2021-03-28:

#52

I have something very similar with my Vega56. I can reproduce it with Win10 too. 
I think it's an AMD Hw issue.

march 28 15:07:35 PC-home kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
march 28 15:07:35 PC-home kernel: qcm fence wait loop timeout expired
march 28 15:07:35 PC-home kernel: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
march 28 15:07:35 PC-home kernel: amdgpu: Failed to evict process queues
march 28 15:07:35 PC-home kernel: amdgpu 0000:0a:00.0: amdgpu: GPU reset begin!
march 28 15:07:35 PC-home kernel: amdgpu: Failed to quiesce KFD
march 28 15:07:35 PC-home kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=567492, emitted seq=567494
march 28 15:07:35 PC-home kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process vkcube pid 7677 thread vkcube pid 7677
march 28 15:07:35 PC-home kernel: amdgpu 0000:0a:00.0: amdgpu: GPU reset begin!
march 28 15:07:35 PC-home kernel: amdgpu 0000:0a:00.0: amdgpu: Bailing on TDR for s_job:869c2, as another already in progress
march 28 15:07:36 PC-home kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page1 timeout, signaled seq=20352, emitted seq=20353
march 28 15:07:36 PC-home kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
march 28 15:07:36 PC-home kernel: amdgpu 0000:0a:00.0: amdgpu: GPU reset begin!
march 28 15:07:36 PC-home kernel: amdgpu 0000:0a:00.0: amdgpu: Bailing on TDR for s_job:4f80, as another already in progress
march 28 15:07:39 PC-home kernel: amdgpu 0000:0a:00.0: amdgpu: failed to suspend display audio
march 28 15:07:39 PC-home kernel: BUG: unable to handle page fault for address: ffffa9c54bb4f910
march 28 15:07:39 PC-home kernel: #PF: supervisor write access in kernel mode
march 28 15:07:39 PC-home kernel: #PF: error_code(0x0002) - not-present page
march 28 15:07:39 PC-home kernel: PGD 100000067 P4D 100000067 PUD 1001b9067 PMD 1cdabb067 PTE 0
march 28 15:07:39 PC-home kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
march 28 15:07:39 PC-home kernel: CPU: 9 PID: 8586 Comm: kworker/9:0 Tainted: G           OE     5.11.6-1-MANJARO #1

march 28 15:07:39 PC-home kernel: Hardware name: System manufacturer System Product Name/PRIME A320M-K, BIOS 5603 10/14/2020
march 28 15:07:39 PC-home kernel: Workqueue: events kfd_process_hw_exception [amdgpu]
march 28 15:07:39 PC-home kernel: RIP: 0010:amdgpu_device_lock_adev+0x2b/0x83 [amdgpu]
march 28 15:07:39 PC-home kernel: Code: 1f 44 00 00 31 c0 ba 01 00 00 00 f0 0f b1 97 f4 77 01 00 45 31 c0 85 c0 75 64 53 48 89 fb 48 8d bf 00 78 01 00 e8 e7 16 27 c9 <f0> ff 83 40 >
march 28 15:07:39 PC-home kernel: RSP: 0018:ffffa9c54c73be00 EFLAGS: 00010246
march 28 15:07:39 PC-home kernel: RAX: ffff951f0c155dc0 RBX: ffffa9c54bb495d0 RCX: 0000000000000001
march 28 15:07:39 PC-home kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffa9c54bb60dd0
march 28 15:07:39 PC-home kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
march 28 15:07:39 PC-home kernel: R10: 0000000000000003 R11: 0000000000000000 R12: ffffa9c54bb495d0
march 28 15:07:39 PC-home kernel: R13: ffff951e19160000 R14: ffff951e19170e30 R15: 00000000000000e0
march 28 15:07:39 PC-home kernel: FS:  0000000000000000(0000) GS:ffff95210ea40000(0000) knlGS:0000000000000000
march 28 15:07:39 PC-home kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
march 28 15:07:39 PC-home kernel: CR2: ffffa9c54bb4f910 CR3: 0000000385410000 CR4: 00000000003506e0
march 28 15:07:39 PC-home kernel: Call Trace:
march 28 15:07:39 PC-home kernel:  amdgpu_device_gpu_recover.cold+0x180/0x95d [amdgpu]
march 28 15:07:39 PC-home kernel:  ? amdgpu_device_doorbell_init.part.0+0x71/0xc0 [amdgpu]
march 28 15:07:39 PC-home kernel:  process_one_work+0x214/0x3e0
march 28 15:07:39 PC-home kernel:  worker_thread+0x4d/0x3d0
march 28 15:07:39 PC-home kernel:  ? rescuer_thread+0x3c0/0x3c0
march 28 15:07:39 PC-home kernel:  kthread+0x142/0x160
march 28 15:07:39 PC-home kernel:  ? __kthread_bind_mask+0x60/0x60
march 28 15:07:39 PC-home kernel:  ret_from_fork+0x22/0x30
march 28 15:07:39 PC-home kernel: Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg bnep btusb btrtl btbcm btintel bluetooth ecdh_generic ecc uas usb_storage mousedev>
march 28 15:07:39 PC-home kernel:  gpio_amdpt acpi_cpufreq drm uinput sg fuse crypto_user agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 crc32c_intel xhci_pci
march 28 15:07:39 PC-home kernel: CR2: ffffa9c54bb4f910
march 28 15:07:39 PC-home kernel: ---[ end trace 2eaf88bedaabd891 ]---
march 28 15:07:39 PC-home kernel: RIP: 0010:amdgpu_device_lock_adev+0x2b/0x83 [amdgpu]
march 28 15:07:39 PC-home kernel: Code: 1f 44 00 00 31 c0 ba 01 00 00 00 f0 0f b1 97 f4 77 01 00 45 31 c0 85 c0 75 64 53 48 89 fb 48 8d bf 00 78 01 00 e8 e7 16 27 c9 <f0> ff 83 40 >
march 28 15:07:39 PC-home kernel: RSP: 0018:ffffa9c54c73be00 EFLAGS: 00010246
march 28 15:07:39 PC-home kernel: RAX: ffff951f0c155dc0 RBX: ffffa9c54bb495d0 RCX: 0000000000000001
march 28 15:07:39 PC-home kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffa9c54bb60dd0
march 28 15:07:39 PC-home kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
march 28 15:07:39 PC-home kernel: R10: 0000000000000003 R11: 0000000000000000 R12: ffffa9c54bb495d0
march 28 15:07:39 PC-home kernel: R13: ffff951e19160000 R14: ffff951e19170e30 R15: 00000000000000e0
march 28 15:07:39 PC-home kernel: FS:  0000000000000000(0000) GS:ffff95210ea40000(0000) knlGS:0000000000000000
march 28 15:07:39 PC-home kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
march 28 15:07:39 PC-home kernel: CR2: ffffa9c54bb4f910 CR3: 00000002fa6de000 CR4: 00000000003506e0

In Linux Kernel Bug Tracker #201957, i-am-not-a-robot (i-am-not-a-robot-linux-kernel-bugs) wrote on 2021-08-22:

#53

This seems to be a firmware(-related) problem. After downgrading to linux firmware 2020-09-18, I've been running for 6 days without a crash on the same workloads. (I was getting multiple crashes per day before.)

My GPU is Vega8 Mobile (ThinkPad A485). Currently running 5.13.11.

An extensive discussion of different firmware versions in the context of a similar issue on Arch Forums: https://bbs.archlinux.org/viewtopic.php?id=266358&p=5

In Linux Kernel Bug Tracker #201957, qydwhotmail (qydwhotmail-linux-kernel-bugs) wrote on 2021-11-17:

#54

Ryzen 4700U same error. openSUSE Tumbleweed

X11

Kernel version is 5.14.14

Mesa version is 21.2.5-293.2

Firmware version is 20211027-1.1

In Linux Kernel Bug Tracker #201957, aussir (aussir-linux-kernel-bugs) wrote on 2021-11-26:

#55

(In reply to i-am-not-a-robot from comment #48)
> This seems to be a firmware(-related) problem. After downgrading to linux
> firmware 2020-09-18, I'm running 6 days without a crash on the same work
> loads. (I was getting multiple crashes per day before).

Did you test any other versions? Was 09-18 the last working release?

In Linux Kernel Bug Tracker #201957, aussir (aussir-linux-kernel-bugs) wrote on 2021-12-12:

#56

A possible workaround is to pass
amdgpu.dpm=0
as a kernel boot parameter.

However, this kills fps in many games and probably in anything else that depends on the GPU for rendering.
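
To confirm that the option actually reached the kernel after a reboot, something like the Python sketch below can be used. The helper name and the example command line are hypothetical; on a running system the real string can be read from /proc/cmdline.

```python
def get_kernel_param(cmdline, name):
    """Return the value of name=value from a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name and sep:
            # If the option is repeated, this sketch keeps the last one.
            value = val
    return value


# Hypothetical command line, for illustration only.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet amdgpu.dpm=0"
print(get_kernel_param(example, "amdgpu.dpm"))

# On a live system:
# with open("/proc/cmdline") as f:
#     print(get_kernel_param(f.read(), "amdgpu.dpm"))
```

A result of None means the parameter is not on the command line at all, in which case the driver uses its default.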

In Linux Kernel Bug Tracker #201957, coolx67 (coolx67-linux-kernel-bugs) wrote on 2021-12-22:

#57

I can confirm that amdgpu.dpm=0 removes the issue on an AMD Radeon PRO FIJI (Dual Fury).
kernel: 5.15.10 | FW: 20211027.1d00989-1 | mesa: 21.3.2-1

Works perfectly fine in Gnome as long as there is no application accessing the 2nd GPU.

When opening Radeon-profile, there is no issue as long as card0 is selected, but as soon as I select card1 I instantly get:
Dec 22 21:15:46 Workstation kernel: amdgpu: 
                                     failed to send message 171 ret is 0 
Dec 22 21:15:49 Workstation kernel: amdgpu: 
                                     last message was failed ret is 0

The application Radeon-profile freezes but desktop is still responsive.

When opening CS:GO with mangohud and configuring either

pci_dev = 0000:3d:00.0 # primary card works fine
or 
pci_dev = 0000:3e:00.0 # secondary card, errors from above occur and CS:GO loads super slow and after menu is visible it is stuck

When CSM is disabled in BIOS I have 2 GPUs

Dec 22 20:45:50 Workstation kernel: [drm] amdgpu kernel modesetting enabled.
Dec 22 20:45:50 Workstation kernel: amdgpu: CRAT table not found
Dec 22 20:45:50 Workstation kernel: amdgpu: Virtual CRAT table created for CPU
Dec 22 20:45:50 Workstation kernel: amdgpu: Topology: Add CPU node
Dec 22 20:45:50 Workstation kernel: amdgpu 0000:3d:00.0: vgaarb: deactivate vga console
Dec 22 20:45:50 Workstation kernel: amdgpu 0000:3d:00.0: enabling device (0106 -> 0107)
Dec 22 20:45:50 Workstation kernel: amdgpu 0000:3d:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
Dec 22 20:45:50 Workstation kernel: amdgpu 0000:3d:00.0: amdgpu: Fetched VBIOS from ROM BAR
Dec 22 20:45:50 Workstation kernel: amdgpu: ATOM BIOS: 113-C88801MS-102
Dec 22 20:45:50 Workstation kernel: amdgpu 0000:3d:00.0: amdgpu: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
Dec 22 20:45:50 Workstation kernel: amdgpu 0000:3d:00.0: amdgpu: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
Dec 22 20:45:50 Workstation kernel: [drm] amdgpu: 4096M of VRAM memory ready
Dec 22 20:45:50 Workstation kernel: [drm] amdgpu: 4096M of GTT memory ready.
Dec 22 20:45:50 Workstation kernel: amdgpu: hwmgr_sw_init smu backed is fiji_smu
Dec 22 20:45:50 Workstation kernel: snd_hda_intel 0000:3d:00.1: bound 0000:3d:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
Dec 22 20:45:50 Workstation kernel: [drm:retrieve_link_cap [amdgpu]] *ERROR* retrieve_link_cap: Read receiver caps dpcd data failed.
Dec 22 20:45:50 Workstation kernel: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
Dec 22 20:45:50 Workstation kernel: amdgpu: Virtual CRAT table created for GPU
Dec 22 20:45:50 Workstation kernel: amdgpu: Topology: Add dGPU node [0x7300:0x1002]
Dec 22 20:45:50 Workstation kernel: kfd kfd: amdgpu: added device 1002:7300
Dec 22 20:45:50 Workstation kernel: amdgpu 0000:3d:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 16, active_cu_number 64
Dec 22 20:45:50 Workstation kernel: fbcon: amdgpu (fb0) is primary device
Dec 22 20:45:51 Workstation kernel: amdgpu 0000:3d:00.0: [drm] fb0: amdgpu frame buffer device
Dec 22 20:45:51 Workstation kernel: amdgpu 0000:3d:00.0: amdgpu: Using BACO for runtime pm
Dec 22 20:45:51 Workstation kernel: [drm] Initialized amdgpu 3.42.0 20150101 for 0000:3d:00.0 on minor 0
Dec 22 20:45:51 Workstation kernel: amdgpu 0000:3e:00.0: enabling device (0106 -> 0107)
Dec 22 20:45:51 Workstation kernel: amdgpu 0000:3e:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
Dec 22 20:45:51 Workstation kernel: amdgpu 0000:3e:00.0: amdgpu: Fetched VBIOS from ROM BAR
Dec 22 20:45:51 Workstation kernel: amdgpu: ATOM BIOS: 113-C88801SL-102
Dec 22 20:45:51 Workstation kernel: amdgpu 0000:3e:00.0: amdgpu: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
Dec 22 20:45:51 Workstation kernel: amdgpu 0000:3e:00.0: amdgpu: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
Dec 22 20:45:51 Workstation kernel: [drm] amdgpu: 4096M of VRAM memory ready
Dec 22 20:45:51 Workstation kernel: [drm] amdgpu: 4096M of GTT memory ready.
Dec 22 20:45:51 Workstation kernel: amdgpu: hwmgr_sw_init smu backed is fiji_smu
Dec 22 20:45:51 Workstation kernel: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
Dec 22 20:45:51 Workstation kernel: amdgpu: Virtual CRAT table created for GPU
Dec 22 20:45:51 Workstation kernel: amdgpu: Topology: Add dGPU node [0x7300:0x1002]
Dec 22 20:45:51 Workstation kernel: kfd kfd: amdgpu: added device 1002:7300
Dec 22 20:45:51 Workstation kernel: amdgpu 0000:3e:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 16, active_cu_number 64
Dec 22 20:45:51 Workstation kernel: amdgpu 0000:3e:00.0: amdgpu: Using BACO for runtime pm
Dec 22 20:45:51 Workstation kernel: [drm] Initialized amdgpu 3.42.0 20150101 for 0000:3e:00.0 on minor 1
Dec 22 20:45:53 Workstation gnome-shell[1988]: Added device '/dev/dri/card0' (amdgpu) using atomic mode setting.
Dec 22 20:45:53 Workstation gnome-shell[1988]: Added device '/dev/dri/card1' (amdgpu) using atomic mode setting.
Dec 22 20:45:55 Workstation gnome-shell[1988]: Disabling DMA buffer screen sharing for driver 'amdgpu'.
Dec 22 20:46:03 Workstation gnome-shell[2527]: Added device '/dev/dri/card0' (amdgpu) using atomic mode setting.
Dec 22 20:46:04 Workstation gnome-shell[2527]: Added device '/dev/dri/card1' (amdgpu) using atomic mode setting.
Dec 22 20:46:05 Workstation gnome-shell[2527]: Disabling DMA buffer screen sharing for driver 'amdgpu'.

With CSM enabled, only the primary GPU is available:
Dec 17 18:17:51 Workstation kernel: [drm] amdgpu kernel modesetting enabled.
Dec 17 18:17:51 Workstation kernel: amdgpu: CRAT table not found
Dec 17 18:17:51 Workstation kernel: amdgpu: Virtual CRAT table created for CPU
Dec 17 18:17:51 Workstation kernel: amdgpu: Topology: Add CPU node
Dec 17 18:17:51 Workstation kernel: fb0: switching to amdgpu from EFI VGA
Dec 17 18:17:51 Workstation kernel: amdgpu 0000:3d:00.0: vgaarb: deactivate vga console
Dec 17 18:17:51 Workstation kernel: amdgpu 0000:3d:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
Dec 17 18:17:51 Workstation kernel: amdgpu 0000:3d:00.0: No more image in the PCI ROM
Dec 17 18:17:51 Workstation kernel: amdgpu 0000:3d:00.0: amdgpu: Fetched VBIOS from ROM BAR
Dec 17 18:17:51 Workstation kernel: amdgpu: ATOM BIOS: 113-C88801MS-102
Dec 17 18:17:51 Workstation kernel: amdgpu 0000:3d:00.0: BAR 2: releasing [mem 0xb0000000-0xb01fffff 64bit pref]
Dec 17 18:17:51 Workstation kernel: amdgpu 0000:3d:00.0: BAR 0: releasing [mem 0xa0000000-0xafffffff 64bit pref]
Dec 17 18:17:51 Workstation kernel: amdgpu 0000:3d:00.0: BAR 0: assigned [mem 0x388000000000-0x3880ffffffff 64bit pref]
Dec 17 18:17:51 Workstation kernel: amdgpu 0000:3d:00.0: BAR 2: assigned [mem 0x388100000000-0x3881001fffff 64bit pref]
Dec 17 18:17:51 Workstation kernel: amdgpu 0000:3d:00.0: amdgpu: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
Dec 17 18:17:51 Workstation kernel: amdgpu 0000:3d:00.0: amdgpu: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
Dec 17 18:17:51 Workstation kernel: [drm] amdgpu: 4096M of VRAM memory ready
Dec 17 18:17:51 Workstation kernel: [drm] amdgpu: 4096M of GTT memory ready.
Dec 17 18:17:51 Workstation kernel: amdgpu: hwmgr_sw_init smu backed is fiji_smu
Dec 17 18:17:51 Workstation kernel: snd_hda_intel 0000:3d:00.1: bound 0000:3d:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
Dec 17 18:17:51 Workstation kernel: [drm:retrieve_link_cap [amdgpu]] *ERROR* retrieve_link_cap: Read receiver caps dpcd data failed.
Dec 17 18:17:51 Workstation kernel: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
Dec 17 18:17:51 Workstation kernel: amdgpu: Virtual CRAT table created for GPU
Dec 17 18:17:51 Workstation kernel: amdgpu: Topology: Add dGPU node [0x7300:0x1002]
Dec 17 18:17:51 Workstation kernel: kfd kfd: amdgpu: added device 1002:7300
Dec 17 18:17:51 Workstation kernel: amdgpu 0000:3d:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 16, active_cu_number 64
Dec 17 18:17:51 Workstation kernel: fbcon: amdgpu (fb0) is primary device
Dec 17 18:17:51 Workstation kernel: amdgpu 0000:3d:00.0: [drm] fb0: amdgpu frame buffer device
Dec 17 18:17:51 Workstation kernel: amdgpu 0000:3d:00.0: amdgpu: Using BACO for runtime pm
Dec 17 18:17:51 Workstation kernel: [drm] Initialized amdgpu 3.42.0 20150101 for 0000:3d:00.0 on minor 0
Dec 17 18:17:51 Workstation kernel: amdgpu 0000:3e:00.0: enabling device (0100 -> 0103)
Dec 17 18:17:51 Workstation kernel: amdgpu 0000:3e:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
Dec 17 18:17:52 Workstation kernel: amdgpu 0000:3e:00.0: amdgpu: Fetched VBIOS from ROM BAR
Dec 17 18:17:52 Workstation kernel: amdgpu: ATOM BIOS: 113-C88801SL-102
Dec 17 18:17:52 Workstation kernel: amdgpu 0000:3e:00.0: BAR 2: releasing [??? 0x00000000 flags 0x0]
Dec 17 18:17:52 Workstation kernel: amdgpu 0000:3e:00.0: BAR 0: releasing [??? 0x00000000 flags 0x0]
Dec 17 18:17:52 Workstation kernel: [drm:amdgpu_device_resize_fb_bar [amdgpu]] *ERROR* Problem resizing BAR0 (-16).
Dec 17 18:17:52 Workstation kernel: [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* sw_init of IP block <gmc_v8_0> failed -19
Dec 17 18:17:52 Workstation kernel: amdgpu 0000:3e:00.0: amdgpu: amdgpu_device_ip_init failed
Dec 17 18:17:52 Workstation kernel: amdgpu 0000:3e:00.0: amdgpu: Fatal error during GPU init
Dec 17 18:17:52 Workstation kernel: amdgpu 0000:3e:00.0: amdgpu: amdgpu: finishing device.
Dec 17 18:18:00 Workstation gnome-shell[1921]: Added device '/dev/dri/card0' (amdgpu) using atomic mode setting.
Dec 17 18:18:02 Workstation gnome-shell[1921]: Disabling DMA buffer screen sharing for driver 'amdgpu'.
Dec 17 18:18:13 Workstation gnome-shell[2410]: Added device '/dev/dri/card0' (amdgpu) using atomic mode setting.
Dec 17 18:18:14 Workstation gnome-shell[2410]: Disabling DMA buffer screen sharing for driver 'amdgpu'.

Hopefully @Alex can do/forward this, since this is a P1 blocking issue that has been open for 3 years.

In Linux Kernel Bug Tracker #201957, smp (smp-linux-kernel-bugs) wrote on 2022-01-01:

#58

(In reply to roman from comment #52)
> I can confirm that
> amdgpu.dpm=0
> removes the issue
> on an AMD Radeon PRO FIJI (Dual Fury) kernel: 5.15.10|FW:
> 20211027.1d00989-1|mesa: 21.3.2-1
>
> Works perfectly fine in Gnome as long as there is no application accessing
> the 2nd GPU.

In Source games it works fine for me, but in many non-Source games it'll just fucking die.
Anyway, now I can't boot without dpm: it freezes, meaning that Source games will crash, along with Risk of Rain 2 and others.

> Hopefully @Alex can do/forward this since this is a P1 blocking issue and
> open for 3 years.

I can only hope it gets fixed one day soon.

In Linux Kernel Bug Tracker #201957, james.a.elian (james.a.elian-linux-kernel-bugs) wrote on 2022-01-09:

#59

I can confirm as well that disabling dynamic power management with the amdgpu.dpm=0 kernel parameter removes the issue with Dishonored 2 on Ubuntu 21.10, kernel 5.13.0, Radeon RX 580 with Mesa 21.2.2.

Same boat as Spencer: hope it gets fixed one day.
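
For anyone verifying that the workaround is actually in effect, here is a minimal Python sketch that pulls amdgpu.* options out of a kernel command line string. The helper name module_params and the sample string are illustrative assumptions; on a running Linux system the live command line can be read from /proc/cmdline and passed in instead.

```python
def module_params(cmdline, module="amdgpu"):
    """Collect `module.name=value` options from a kernel command line string.

    `module_params` is a hypothetical helper for illustration only.
    """
    params = {}
    prefix = module + "."
    for token in cmdline.split():
        # Only tokens of the form "amdgpu.<name>=<value>" are of interest.
        if token.startswith(prefix) and "=" in token:
            name, value = token[len(prefix):].split("=", 1)
            params[name] = value
    return params


# Sample command line (assumed); replace with the contents of /proc/cmdline.
sample = "root=/dev/sda2 ro quiet amdgpu.dpm=0"
print(module_params(sample))  # {'dpm': '0'}
```

An empty result means no amdgpu option was passed on the command line, so the driver is running with its defaults.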

In Linux Kernel Bug Tracker #201957, techxgames (techxgames-linux-kernel-bugs) wrote on 2022-01-22:

#60

I don't know if it's related, but my display freaks out before shutting off. It's still on, and it doesn't reboot when I do it by SSH. I have to do it on the desktop itself.

Jan 22 06:17:30 Y4M1-II kernel: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
Jan 22 06:17:30 Y4M1-II kernel: [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
Jan 22 06:17:30 Y4M1-II kernel: [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
Jan 22 06:17:30 Y4M1-II kernel: [drm] PSP is resuming...
Jan 22 06:17:30 Y4M1-II kernel: [drm] VRAM is lost due to GPU reset!
Jan 22 06:17:30 Y4M1-II kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000753000).
Jan 22 06:17:30 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset succeeded, trying to resume
Jan 22 06:17:26 Y4M1-II kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 22 06:17:19 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: GPU smu mode1 reset
Jan 22 06:17:19 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: GPU mode1 reset
Jan 22 06:17:19 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: MODE1 reset
Jan 22 06:17:19 Y4M1-II kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <psp> failed -22
Jan 22 06:17:19 Y4M1-II kernel: [drm:psp_suspend [amdgpu]] *ERROR* Failed to terminate ras ta
Jan 22 06:17:19 Y4M1-II kernel: [drm] psp gfx command UNLOAD_TA(0x2) failed and response status is (0x0)
Jan 22 06:17:16 Y4M1-II kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 22 06:17:16 Y4M1-II kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 22 06:17:15 Y4M1-II kernel: [drm] REG_WAIT timeout 1us * 200 tries - hubp2_set_blank line:950
Jan 22 06:17:15 Y4M1-II kernel: [drm] REG_WAIT timeout 1us * 200 tries - hubp2_set_blank line:950
Jan 22 06:17:15 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: Failed to disable gfxoff!
Jan 22 06:17:15 Y4M1-II kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:80:crtc-1] flip_done timed out
Jan 22 06:17:15 Y4M1-II kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:77:crtc-0] flip_done timed out
Jan 22 06:17:10 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: Bailing on TDR for s_job:18e3f, as another already in progress
Jan 22 06:17:10 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
Jan 22 06:17:10 Y4M1-II kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1688 thread Xorg:cs0 pid 1731
Jan 22 06:17:10 Y4M1-II kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=112513, emitted seq=112515
Jan 22 06:17:10 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
Jan 22 06:17:10 Y4M1-II kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Jan 22 06:17:10 Y4M1-II kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=7058, emitted seq=7059
Jan 22 06:17:10 Y4M1-II kernel: [drm:amdgpu_dm_commit_planes [amdgpu]] *ERROR* Waiting for fences timed out!
Jan 22 06:17:05 Y4M1-II kernel: [drm:amdgpu_dm_commit_planes [amdgpu]] *ERROR* Waiting for fences timed out!
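
The "ring ... timeout" lines in a journal like the one above carry two fence sequence numbers. As a rough sketch, assuming "emitted" is the last fence submitted to the ring and "signaled" the last one that completed, the difference is the number of jobs still outstanding when the timeout fired. The function name ring_timeouts is hypothetical; the two sample lines are copied from the log.

```python
import re

# Matches e.g. "ring gfx_0.0.0 timeout, signaled seq=112513, emitted seq=112515"
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def ring_timeouts(lines):
    """Yield (ring, signaled, emitted, pending) for every ring timeout line."""
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            signaled = int(m.group("signaled"))
            emitted = int(m.group("emitted"))
            yield m.group("ring"), signaled, emitted, emitted - signaled


log = [
    "Jan 22 06:17:10 Y4M1-II kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "ring gfx_0.0.0 timeout, signaled seq=112513, emitted seq=112515",
    "Jan 22 06:17:10 Y4M1-II kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "ring sdma1 timeout, signaled seq=7058, emitted seq=7059",
]
for ring, signaled, emitted, pending in ring_timeouts(log):
    print(f"{ring}: {pending} job(s) emitted but not signaled")
```

Feeding it the full output of journalctl -k makes it easier to see which rings (gfx, sdma) stall and how often.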

In Linux Kernel Bug Tracker #201957, techxgames (techxgames-linux-kernel-bugs) wrote on 2022-01-22:

#61

In another instance, when my desktop had been idle for a while and the display had been off for a while, the display wouldn't come back on. Here's the journal entry I think is relevant to this:

Jan 22 08:07:58 Y4M1-II kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 22 08:07:58 Y4M1-II kernel:       Tainted: G           OE     5.15.11-76051511-generic #202112220937~1640185481~21.10~b3a2c21
Jan 22 08:07:58 Y4M1-II kernel: INFO: task Xorg:1692 blocked for more than 120 seconds.
Jan 22 08:07:58 Y4M1-II kernel:  </TASK>
Jan 22 08:07:58 Y4M1-II kernel:  ret_from_fork+0x22/0x30
Jan 22 08:07:58 Y4M1-II kernel:  ? set_kthread_struct+0x50/0x50
Jan 22 08:07:58 Y4M1-II kernel:  ? process_one_work+0x3d0/0x3d0
Jan 22 08:07:58 Y4M1-II kernel:  kthread+0x11e/0x140
Jan 22 08:07:58 Y4M1-II kernel:  worker_thread+0x53/0x420
Jan 22 08:07:58 Y4M1-II kernel:  process_one_work+0x22b/0x3d0
Jan 22 08:07:58 Y4M1-II kernel:  drm_sched_job_timedout+0x6f/0x110 [gpu_sched]
Jan 22 08:07:58 Y4M1-II kernel:  amdgpu_job_timedout+0x14f/0x170 [amdgpu]
Jan 22 08:07:58 Y4M1-II kernel:  amdgpu_device_gpu_recover.cold+0x6ec/0x8f8 [amdgpu]
Jan 22 08:07:58 Y4M1-II kernel:  ? drm_fb_helper_set_suspend_unlocked+0x33/0xa0 [drm_kms_helper]
Jan 22 08:07:58 Y4M1-II kernel:  amdgpu_device_pre_asic_reset+0xdd/0x480 [amdgpu]
Jan 22 08:07:58 Y4M1-II kernel:  amdgpu_device_ip_suspend+0x21/0x70 [amdgpu]
Jan 22 08:07:58 Y4M1-II kernel:  amdgpu_device_ip_suspend_phase1+0xa3/0x180 [amdgpu]
Jan 22 08:07:58 Y4M1-II kernel:  ? amdgpu_device_set_cg_state+0x12f/0x280 [amdgpu]
Jan 22 08:07:58 Y4M1-II kernel:  ? nv_common_set_clockgating_state+0x9f/0xb0 [amdgpu]
Jan 22 08:07:58 Y4M1-II kernel:  dm_suspend+0xaa/0x270 [amdgpu]
Jan 22 08:07:58 Y4M1-II kernel:  mutex_lock+0x34/0x40
Jan 22 08:07:58 Y4M1-II kernel:  __mutex_lock_slowpath+0x13/0x20
Jan 22 08:07:58 Y4M1-II kernel:  __mutex_lock.constprop.0+0x263/0x490
Jan 22 08:07:58 Y4M1-II kernel:  schedule_preempt_disabled+0xe/0x10
Jan 22 08:07:58 Y4M1-II kernel:  schedule+0x4e/0xb0
Jan 22 08:07:58 Y4M1-II kernel:  __schedule+0x23d/0x590
Jan 22 08:07:58 Y4M1-II kernel:  <TASK>
Jan 22 08:07:58 Y4M1-II kernel: Call Trace:
Jan 22 08:07:58 Y4M1-II kernel: Workqueue: events drm_sched_job_timedout [gpu_sched]
Jan 22 08:07:58 Y4M1-II kernel: task:kworker/12:1    state:D stack:    0 pid:  246 ppid:     2 flags:0x00004000
Jan 22 08:07:58 Y4M1-II kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 22 08:07:58 Y4M1-II kernel:       Tainted: G           OE     5.15.11-76051511-generic #202112220937~1640185481~21.10~b3a2c21
Jan 22 08:07:58 Y4M1-II kernel: INFO: task kworker/12:1:246 blocked for more than 120 seconds.
Jan 22 08:05:24 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: Bailing on TDR for s_job:1123, as another already in progress
Jan 22 08:05:24 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: Bailing on TDR for s_job:43c, as another already in progress
Jan 22 08:05:24 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
Jan 22 08:05:24 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
Jan 22 08:05:24 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
Jan 22 08:05:24 Y4M1-II kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Jan 22 08:05:24 Y4M1-II kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Jan 22 08:05:24 Y4M1-II kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Jan 22 08:05:24 Y4M1-II kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=4303, emitted seq=4305
Jan 22 08:05:24 Y4M1-II kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma3 timeout, signaled seq=1084, emitted seq=1086
Jan 22 08:05:24 Y4M1-II kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma2 timeout, signaled seq=4379, emitted seq=4381
Jan 22 08:05:20 Y4M1-II kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 22 08:05:20 Y4M1-II kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 22 08:05:19 Y4M1-II kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 22 08:05:19 Y4M1-II kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 22 08:05:19 Y4M1-II kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 22 08:05:19 Y4M1-II kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 22 08:05:19 Y4M1-II kernel: amdgpu_cs_ioctl: 59 callbacks suppressed
Jan 22 08:05:14 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset end with ret = -62
Jan 22 08:05:14 Y4M1-II kernel: snd_hda_intel 0000:0c:00.1: CORB reset timeout#2, CORBRP = 65535
Jan 22 08:05:14 Y4M1-II kernel: snd_hda_intel 0000:0c:00.1: refused to change power state from D3hot to D0
Jan 22 08:05:14 Y4M1-II kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 22 08:05:14 Y4M1-II kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 22 08:05:14 Y4M1-II kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 22 08:05:14 Y4M1-II kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 22 08:05:14 Y4M1-II kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 22 08:05:14 Y4M1-II kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 22 08:05:14 Y4M1-II kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 22 08:05:14 Y4M1-II kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 22 08:05:14 Y4M1-II kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 22 08:05:14 Y4M1-II kernel: [drm] Skip scheduling IBs!
Jan 22 08:05:14 Y4M1-II kernel: [drm] Skip scheduling IBs!
Jan 22 08:05:14 Y4M1-II kernel: [drm] Skip scheduling IBs!
Jan 22 08:05:14 Y4M1-II kernel: [drm] Skip scheduling IBs!
Jan 22 08:05:14 Y4M1-II kernel: [drm] Skip scheduling IBs!
Jan 22 08:05:14 Y4M1-II kernel: [drm] Skip scheduling IBs!
Jan 22 08:05:14 Y4M1-II kernel: [drm] Skip scheduling IBs!
Jan 22 08:05:14 Y4M1-II kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 22 08:05:14 Y4M1-II kernel: [drm] Skip scheduling IBs!
Jan 22 08:05:14 Y4M1-II kernel: [drm] Skip scheduling IBs!
...
Jan 22 08:05:14 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset(2) failed
Jan 22 08:05:14 Y4M1-II kernel: [drm] Skip scheduling IBs!
Jan 22 08:05:14 Y4M1-II kernel: [drm] Skip scheduling IBs!
Jan 22 08:05:14 Y4M1-II kernel: [drm] Skip scheduling IBs!
Jan 22 08:05:14 Y4M1-II kernel: [drm] Skip scheduling IBs!
Jan 22 08:05:14 Y4M1-II kernel: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
Jan 22 08:05:14 Y4M1-II kernel: [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
Jan 22 08:05:14 Y4M1-II kernel: [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
Jan 22 08:05:14 Y4M1-II kernel: [drm] PSP is resuming...
Jan 22 08:05:14 Y4M1-II kernel: [drm] VRAM is lost due to GPU reset!
Jan 22 08:05:14 Y4M1-II kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000753000).
Jan 22 08:05:14 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset succeeded, trying to resume
Jan 22 08:05:03 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: ASIC reset failed with error, -62 for drm dev, 0000:0c:00.0
Jan 22 08:05:03 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: GPU mode1 reset failed
Jan 22 08:05:03 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: SMU: I'm not done with your previous command!
Jan 22 08:04:58 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: GPU smu mode1 reset
Jan 22 08:04:58 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: GPU mode1 reset
Jan 22 08:04:58 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: MODE1 reset
Jan 22 08:04:58 Y4M1-II kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <psp> failed -22
Jan 22 08:04:58 Y4M1-II kernel: [drm:psp_suspend [amdgpu]] *ERROR* Failed to terminate ras ta
Jan 22 08:04:58 Y4M1-II kernel: [drm] psp gfx command UNLOAD_TA(0x2) failed and response status is (0x0)
Jan 22 08:04:56 Y4M1-II kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -62
Jan 22 08:04:56 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: Fail to disable dpm features!
Jan 22 08:04:56 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: Failed to disable smu features.
Jan 22 08:04:51 Y4M1-II kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Jan 22 08:04:51 Y4M1-II kernel: amdgpu 0000:0c:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Jan 22 08:04:50 Y4M1-II kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
Jan 22 08:04:50 Y4M1-II kernel: amdgpu 0000:0c:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Jan 22 08:04:50 Y4M1-II kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
Jan 22 08:04:50 Y4M1-II kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1692 thread Xorg:cs0 pid 1745
Jan 22 08:04:50 Y4M1-II kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=570767, emitted seq=570769
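
The negative numbers scattered through these logs (-62, -110, -125, ...) are Linux errno values. The Python sketch below tallies them by symbolic name. The table covers only the codes that appear in this thread and is hard-coded with the Linux values, so it does not depend on the host OS. The regular expression is a heuristic assumption about how the codes are printed, and error_codes is a hypothetical helper name.

```python
import re
from collections import Counter

# Linux errno values for the codes seen in this thread (not a complete table).
LINUX_ERRNO = {
    16: "EBUSY",
    19: "ENODEV",
    22: "EINVAL",
    62: "ETIME",
    95: "EOPNOTSUPP",
    110: "ETIMEDOUT",
    125: "ECANCELED",
}

# A minus sign followed by 1-3 digits, not glued to a preceding word or dot,
# so kernel version strings such as "5.15.11-76051511-generic" are skipped.
CODE_RE = re.compile(r"(?<![\w.])-(\d{1,3})\b")


def error_codes(lines):
    """Count errno names found in the given log lines."""
    counts = Counter()
    for line in lines:
        for code in CODE_RE.findall(line):
            counts[LINUX_ERRNO.get(int(code), "errno " + code)] += 1
    return counts


log = [
    "[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!",
    "amdgpu 0000:0c:00.0: amdgpu: GPU reset end with ret = -62",
    "amdgpu 0000:0c:00.0: amdgpu: ASIC reset failed with error, -62 for drm dev, 0000:0c:00.0",
    "amdgpu 0000:0c:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)",
    "Tainted: G OE 5.15.11-76051511-generic",
]
for name, count in sorted(error_codes(log).items()):
    print(name, count)
```

In this sample, -125 (ECANCELED) shows up on the amdgpu_cs_ioctl line, while -62 (ETIME) and -110 (ETIMEDOUT) come from the reset and ring test messages.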

In Linux Kernel Bug Tracker #201957, smp (smp-linux-kernel-bugs) wrote on 2022-01-24:

#62

Created attachment 300315
Kernel config

OS: Gentoo
Kernel: 5.15.16, config attached, built with make -j12
Launch options: root=/dev/sda2 ro quiet

I'd like to be able to boot with amdgpu.dpm=0, as this seems to fix the bug with minor tradeoffs. However, when I boot with dpm disabled, my screen freezes and leaves this nice little stinker to ruin my day:

Jan 24 16:33:05 [kernel] [    2.572474] Loading firmware: amdgpu/navi10_pfp.bin
Jan 24 16:33:05 [kernel] [    2.572475] Loading firmware: amdgpu/navi10_me.bin
Jan 24 16:33:05 [kernel] [    2.572476] Loading firmware: amdgpu/navi10_ce.bin
Jan 24 16:33:05 [kernel] [    2.572477] Loading firmware: amdgpu/navi10_rlc.bin
Jan 24 16:33:05 [kernel] [    2.572477] Loading firmware: amdgpu/navi10_mec.bin
Jan 24 16:33:05 [kernel] [    2.572478] Loading firmware: amdgpu/navi10_mec2.bin
Jan 24 16:33:05 [kernel] [    2.572968] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: discard. Quota mode: none.
Jan 24 16:33:05 [kernel] [    2.573030] Loading firmware: amdgpu/navi10_sdma.bin
Jan 24 16:33:05 [kernel] [    2.573032] Loading firmware: amdgpu/navi10_sdma1.bin
Jan 24 16:33:05 [kernel] [    2.573071] Loading firmware: amdgpu/navi10_vcn.bin
Jan 24 16:33:05 [kernel] [    2.573072] [drm] Found VCN firmware Version ENC: 1.14 DEC: 5 VEP: 0 Revision: 20
Jan 24 16:33:05 [kernel] [    2.573075] amdgpu 0000:28:00.0: amdgpu: Will use PSP to load VCN firmware
Jan 24 16:33:05 [kernel] [    2.747244] [drm] reserve 0x900000 from 0x817e400000 for PSP TMR
Jan 24 16:33:05 [kernel] [    2.785931] amdgpu 0000:28:00.0: amdgpu: RAS: optional ras ta ucode is not available
Jan 24 16:33:05 [kernel] [    2.790137] amdgpu 0000:28:00.0: amdgpu: RAP: optional rap ta ucode is not available
Jan 24 16:33:05 [kernel] [    2.790138] amdgpu 0000:28:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Jan 24 16:33:05 [kernel] [    2.790140] amdgpu: smu firmware loading failed
Jan 24 16:33:05 [kernel] [    2.790141] amdgpu 0000:28:00.0: amdgpu: amdgpu_device_ip_init failed
Jan 24 16:33:05 [kernel] [    2.790143] amdgpu 0000:28:00.0: amdgpu: Fatal error during GPU init
Jan 24 16:33:05 [kernel] [    2.790144] amdgpu 0000:28:00.0: amdgpu: amdgpu: finishing device.
Jan 24 16:33:05 [kernel] [    2.793726] [drm] free PSP TMR buffer
Jan 24 16:33:05 [kernel] [    2.825874] amdgpu: probe of 0000:28:00.0 failed with error -95
Jan 24 16:33:05 [kernel] [    2.825951] BUG: unable to handle page fault for address: ffffa4af5100d000
Jan 24 16:33:05 [kernel] [    2.825954] #PF: supervisor write access in kernel mode
Jan 24 16:33:05 [kernel] [    2.825955] #PF: error_code(0x0002) - not-present page
Jan 24 16:33:05 [kernel] [    2.825957] PGD 100000067 P4D 100000067 PUD 100104067 PMD 0
Jan 24 16:33:05 [kernel] [    2.825960] Oops: 0002 [#1] SMP NOPTI
Jan 24 16:33:05 [kernel] [    2.825962] CPU: 6 PID: 759 Comm: systemd-udevd Not tainted 5.15.16-gentoo #8
Jan 24 16:33:05 [kernel] [    2.825965] Hardware name: Micro-Star International Co., Ltd MS-7B86/B450 GAMING PLUS MAX (MS-7B86), BIOS H.60 04/18/2020
Jan 24 16:33:05 [kernel] [    2.825967] RIP: 0010:vcn_v2_0_sw_fini+0x65/0x80 [amdgpu]
Jan 24 16:33:05 [kernel] [    2.826139] Code: 89 ef e8 fe 1b ff ff 85 c0 75 08 48 89 ef e8 42 1a ff ff 48 8b 54 24 08 65 48 2b 14 25 28 00 00 00 75 18 48 83 c4 10 5b 5d c3 <c7> 03 00 00 00 00 8b 7c 24 04 e8 4c c4 4d e9 eb bc e8 15 cd ab e9
Jan 24 16:33:05 [kernel] [    2.826142] RSP: 0018:ffffa4af40bc7c30 EFLAGS: 00010202

TL;DR: amdgpu: smu firmware loading failed
What it means exactly, I know not, but I know it means my screen is frozen.
Is there a trick or a workaround for this?
If I left out some info, ask for it and I'll fetch it.

In Linux Kernel Bug Tracker #201957, andrewammerlaan (andrewammerlaan-linux-kernel-bugs) wrote on 2022-01-25:

#63

> Jan 24 16:33:05 [kernel] [ 2.785931] amdgpu 0000:28:00.0: amdgpu: RAS:
> optional ras ta ucode is not available
> Jan 24 16:33:05 [kernel] [ 2.790137] amdgpu 0000:28:00.0: amdgpu: RAP:
> optional rap ta ucode is not available
> Jan 24 16:33:05 [kernel] [ 2.790138] amdgpu 0000:28:00.0: amdgpu:
> SECUREDISPLAY: securedisplay ta ucode is not available
> Jan 24 16:33:05 [kernel] [ 2.790140] amdgpu: smu firmware loading failed
> Jan 24 16:33:05 [kernel] [ 2.790141] amdgpu 0000:28:00.0: amdgpu:
> amdgpu_device_ip_init failed
> Jan 24 16:33:05 [kernel] [ 2.790143] amdgpu 0000:28:00.0: amdgpu: Fatal
> error during GPU init

Is this a custom-built kernel? Is amdgpu built into the kernel or enabled as a module? In the former case, is all required firmware also built into the kernel? In the latter case, is all required firmware available on the initramfs (if amdgpu is incorporated in the initramfs)? The required firmware files are listed here: https://wiki.gentoo.org/wiki/AMDGPU#Known_firmware_blobs

In Linux Kernel Bug Tracker #201957, smp (smp-linux-kernel-bugs) wrote on 2022-01-25:

#64

>Is this a custom built kernel? Is amdgpu built into the kernel or enabled as a
>module? In the former case, is all required firmware also built into the
>kernel? In the later case, is all required firmware available on the initramfs
>(if amdgpu is incorporated in the initramfs)? The required firmware files are
>listed here:

It's a custom kernel, but I have them all built in.
>grep navi10 .config && echo
>amdgpu/navi10_{asd,ce,gpu_info,me,mec2,mec,pfp,rlc,sdma1,sdma,smc,sos,ta,vcn}.bin
amdgpu/navi10_asd.bin amdgpu/navi10_ce.bin amdgpu/navi10_gpu_info.bin amdgpu/navi10_me.bin amdgpu/navi10_mec2.bin amdgpu/navi10_mec.bin amdgpu/navi10_pfp.bin amdgpu/navi10_rlc.bin amdgpu/navi10_sdma1.bin amdgpu/navi10_sdma.bin amdgpu/navi10_smc.bin amdgpu/navi10_sos.bin amdgpu/navi10_ta.bin amdgpu/navi10_vcn.bin
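
A check like the grep above can be made a little stricter. The Python sketch below assumes the blobs are listed in the kernel's CONFIG_EXTRA_FIRMWARE option (the thread does not say which option the grep matched) and reports any navi10 file from the brace expansion above that is missing from it. The helper name missing_firmware and the sample config text are illustrative.

```python
import re

# Blob names taken from the brace expansion quoted in the comment above.
REQUIRED = {
    "amdgpu/navi10_" + part + ".bin"
    for part in "asd ce gpu_info me mec2 mec pfp rlc sdma1 sdma smc sos ta vcn".split()
}


def missing_firmware(config_text, required=REQUIRED):
    """Return required blobs that CONFIG_EXTRA_FIRMWARE does not list."""
    # The `="` right after the option name keeps CONFIG_EXTRA_FIRMWARE_DIR out.
    m = re.search(r'^CONFIG_EXTRA_FIRMWARE="([^"]*)"', config_text, re.MULTILINE)
    listed = set(m.group(1).split()) if m else set()
    return sorted(set(required) - listed)


# Assumed sample: every blob except navi10_smc.bin is built in.
sample_config = (
    'CONFIG_EXTRA_FIRMWARE="'
    + " ".join(sorted(REQUIRED - {"amdgpu/navi10_smc.bin"}))
    + '"\nCONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware"\n'
)
print(missing_firmware(sample_config))  # ['amdgpu/navi10_smc.bin']
```

This only matters when amdgpu is built into the kernel; with a modular driver the files need to be reachable on the root filesystem or initramfs instead.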

In Linux Kernel Bug Tracker #201957, smp (smp-linux-kernel-bugs) wrote on 2022-01-25:

#65

As an addendum to both comments, a working boot prints this:

Loading firmware: amdgpu/navi10_sos.bin
Loading firmware: amdgpu/navi10_asd.bin
Loading firmware: amdgpu/navi10_ta.bin
amdgpu 0000:28:00.0: amdgpu: PSP runtime database doesn't exist
Loading firmware: amdgpu/navi10_smc.bin
Loading firmware: amdgpu/navi10_pfp.bin
Loading firmware: amdgpu/navi10_me.bin
Loading firmware: amdgpu/navi10_ce.bin
Loading firmware: amdgpu/navi10_rlc.bin
Loading firmware: amdgpu/navi10_mec.bin
Loading firmware: amdgpu/navi10_mec2.bin
Loading firmware: amdgpu/navi10_sdma.bin
Loading firmware: amdgpu/navi10_sdma1.bin
Loading firmware: amdgpu/navi10_vcn.bin
amdgpu 0000:28:00.0: amdgpu: Will use PSP to load VCN firmware
amdgpu 0000:28:00.0: amdgpu: RAS: optional ras ta ucode is not available
amdgpu 0000:28:00.0: amdgpu: RAP: optional rap ta ucode is not available
amdgpu 0000:28:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
amdgpu 0000:28:00.0: amdgpu: use vbios provided pptable
amdgpu 0000:28:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.5
amdgpu 0000:28:00.0: amdgpu: SMU is initialized successfully!
kfd kfd: amdgpu: Allocated 3969056 bytes on gart
amdgpu: HMM registered 6128MB device memory
amdgpu: SRAT table not found
amdgpu: Virtual CRAT table created for GPU
amdgpu: Topology: Add dGPU node [0x731f:0x1002]
kfd kfd: amdgpu: added device 1002:731f
amdgpu 0000:28:00.0: amdgpu: SE 2, SH per SE 2, CU per SH 10, active_cu_number 36
fbcon: amdgpudrmfb (fb0) is primary device
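
A minimal sketch (Python, not part of the original comment) of how a boot log like the one above could be compared against the firmware list quoted in the previous comment. The names `EXPECTED` and `loaded_firmware` are hypothetical helpers introduced here, and the sample text is copied from the `Loading firmware:` lines above.

```python
import re

# Firmware names from the brace expansion quoted in the previous comment.
EXPECTED = {
    f"amdgpu/navi10_{part}.bin"
    for part in "asd ce gpu_info me mec2 mec pfp rlc sdma1 sdma smc sos ta vcn".split()
}


def loaded_firmware(log_text):
    """Return the set of firmware paths named in 'Loading firmware:' lines."""
    return set(re.findall(r"Loading firmware: (\S+)", log_text))


# The 'Loading firmware:' lines from the working boot shown above.
sample_log = "\n".join(
    f"Loading firmware: amdgpu/navi10_{part}.bin"
    for part in "sos asd ta smc pfp me ce rlc mec mec2 sdma sdma1 vcn".split()
)

missing = EXPECTED - loaded_firmware(sample_log)
print(sorted(missing))
```

For the excerpt above this reports only `amdgpu/navi10_gpu_info.bin` as unmatched, which may simply mean the excerpt starts after that file was loaded.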

In Linux Kernel Bug Tracker #201957, kernelorg (kernelorg-linux-kernel-bugs) wrote on 2022-02-02:

#66

Chiming in as another victim of:
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

Radeon RX 6700 XT (NAVY_FLOUNDER, DRM 3.42.0, 5.15.15-76051515-generic, LLVM 12.0.1)
AMD Ryzen 9 5900X
Ubuntu Mate
Mesa 21.2.2

Haven't attempted the amdgpu.dpm=0 workaround because the side effects of it appear to be bad.

In Linux Kernel Bug Tracker #201957, randyk161 (randyk161-linux-kernel-bugs) wrote on 2022-02-03:

#67

I've been getting "ring gfx timeouts" for some time (see comment 35). Most of the time it happens when the computer has not had any input for a while (while I'm away from it). When it freezes I can SSH into it, but when I run "shutdown -h now" it boots me out of SSH as it should, yet the computer never seems to actually shut down.

I've tried many different kernel parameters but had no luck so far. I'm now trying amdgpu.runpm=0 as suggested here: https://wiki.archlinux.org/title/AMDGPU (at the very bottom of the page: Issues with power management / dynamic re-activation of a discrete amdgpu graphics card). I haven't seen any performance repercussions yet. I'll just have to wait it out and see.

For my system specs see my previous comment 35.

In Linux Kernel Bug Tracker #201957, randyk161 (randyk161-linux-kernel-bugs) wrote on 2022-02-03:

#68

(In reply to Jon from comment #61)
> Chiming in as another victim of:
> [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
>
> Radeon RX 6700 XT (NAVY_FLOUNDER, DRM 3.42.0, 5.15.15-76051515-generic, LLVM
> 12.0.1)
> AMD Ryzen 9 5900X
> Ubuntu Mate
> Mesa 21.2.2
>
> Haven't attempted the amdgpu.dpm=0 workaround because the side effects of it
> appear to be bad.

I've tried amdgpu.dpm=0 and it seriously kills the frame rate, in SuperTuxKart at least.

In Linux Kernel Bug Tracker #201957, alexdeucher (alexdeucher-linux-kernel-bugs) wrote on 2022-02-03:

#69

(In reply to Jon from comment #61)
> Chiming in as another victim of:
> [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
>

This is just a symptom of an application trying to use the GPU after a GPU reset without re-initializing its context. The cause of a GPU reset can be a lot of things. If you have different hardware from other people on this ticket, it's not likely the same issue.
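
As a small illustration (a sketch added here, not something from the original comment): the `-125` in that message is a negated errno value, and on Linux errno 125 is `ECANCELED`, which fits the description of work being rejected after a reset. The helper name `parser_error_code` is hypothetical.

```python
import errno
import os
import re


def parser_error_code(line):
    """Extract the positive errno value from a 'Failed to initialize parser -N!' line."""
    match = re.search(r"Failed to initialize parser (-?\d+)!", line)
    return abs(int(match.group(1))) if match else None


line = "[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!"
code = parser_error_code(line)

# errno.errorcode maps numbers to symbolic names for the platform Python runs on,
# so the name printed here is only meaningful when run on Linux.
print(code, errno.errorcode.get(code, "unknown"), os.strerror(code))
```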

In Linux Kernel Bug Tracker #201957, inferrna (inferrna-linux-kernel-bugs) wrote on 2022-02-11:

#70

I have the same bug with Firefox (it has happened once a day, starting about a week ago).

[ 4409.071226] BUG: unable to handle page fault for address: fffffffffffffff8
[ 4409.071234] #PF: supervisor read access in kernel mode
[ 4409.071235] #PF: error_code(0x0000) - not-present page
[ 4409.071237] PGD 427e12067 P4D 427e12067 PUD 427e14067 PMD 0 
[ 4409.071240] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 4409.071242] CPU: 18 PID: 191 Comm: uvd Tainted: G           OE     5.16.8uksm #1
[ 4409.071245] Hardware name: Hewlett-Packard HP Z420 Workstation/1589, BIOS J61 v03.96 10/29/2019
[ 4409.071246] RIP: 0010:swake_up_locked+0x17/0x40
[ 4409.071251] Code: ff ff ff eb ad 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00 00 48 8b 57 08 48 8d 47 08 48 39 c2 74 25 53 48 8b 5f 08 <48> 8b 7b f8 e8 80 7f fe ff 48 8b 13 48 8b 43 08 48 89 42 08 48 89
[ 4409.071253] RSP: 0018:ffffbbdf012b7e70 EFLAGS: 00010007
[ 4409.071254] RAX: ffff9719549270b0 RBX: 0000000000000000 RCX: 0000000000000000
[ 4409.071256] RDX: 0000000000000000 RSI: ffff97185d547250 RDI: ffff9719549270a8
[ 4409.071257] RBP: ffff9719549270a8 R08: ffff9716473efec0 R09: ffff9716473efed8
[ 4409.071258] R10: ffff971646cc3000 R11: ffff971646cc3000 R12: 0000000000000286
[ 4409.071259] R13: ffff9716473eebe0 R14: ffff9716ee901bc0 R15: ffff9719549270a0
[ 4409.071260] FS:  0000000000000000(0000) GS:ffff97213fc80000(0000) knlGS:0000000000000000
[ 4409.071262] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4409.071263] CR2: fffffffffffffff8 CR3: 0000000427e10006 CR4: 00000000001706e0
[ 4409.071264] Call Trace:
[ 4409.071267]  <TASK>
[ 4409.071269]  complete+0x2f/0x40
[ 4409.071271]  drm_sched_main+0x24b/0x450
[ 4409.071274]  ? wait_woken+0x70/0x70
[ 4409.071289]  ? drm_sched_job_done.isra.0+0x130/0x130
[ 4409.071290]  kthread+0x169/0x190
[ 4409.071294]  ? set_kthread_struct+0x40/0x40
[ 4409.071297]  ret_from_fork+0x1f/0x30
[ 4409.071301]  </TASK>
[ 4409.071302] Modules linked in: xt_conntrack nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter cmac rfcomm vboxnetadp(OE) vboxnetflt(OE) iptable_mangle xt_CHECKSUM xt_tcpudp iptable_nat xt_comment xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge stp llc overlay iptable_filter vboxdrv(OE) bnep cpufreq_powersave zram binfmt_misc squashfs snd_emu10k1_synth snd_hda_codec_realtek snd_emux_synth snd_seq_midi_emul snd_seq_virmidi snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel intel_rapl_msr snd_intel_dspcfg intel_rapl_common snd_emu10k1 snd_hda_codec snd_util_mem snd_ac97_codec snd_hda_core nls_iso8859_1 hp_wmi nls_cp866 ac97_bus platform_profile sparse_keymap snd_hwdep wmi_bmof btusb snd_pcm sb_edac btrtl x86_pkg_temp_thermal intel_powerclamp snd_seq_midi btbcm snd_seq_midi_event btintel snd_rawmidi kvm_intel bluetooth input_leds snd_seq kvm ecdh_generic snd_seq_device snd_timer irqbypass emu10k1_gp serio_raw snd gameport ioatdma soundcore dca
[ 4409.071342]  wmi mac_hid xpad ff_memless coretemp mei_me mei hwmon_vid i5500_temp msr ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq zstd_compress libcrc32c hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid crc32_pclmul ghash_clmulni_intel aesni_intel e1000e psmouse crypto_simd cryptd ahci i2c_i801 libahci lpc_ich i2c_smbus [last unloaded: cpuid]
[ 4409.071362] CR2: fffffffffffffff8
[ 4409.071364] ---[ end trace a6d18badbe55bb92 ]---
[ 4409.071365] RIP: 0010:swake_up_locked+0x17/0x40
[ 4409.071367] Code: ff ff ff eb ad 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00 00 48 8b 57 08 48 8d 47 08 48 39 c2 74 25 53 48 8b 5f 08 <48> 8b 7b f8 e8 80 7f fe ff 48 8b 13 48 8b 43 08 48 89 42 08 48 89
[ 4409.071368] RSP: 0018:ffffbbdf012b7e70 EFLAGS: 00010007
[ 4409.071370] RAX: ffff9719549270b0 RBX: 0000000000000000 RCX: 0000000000000000
[ 4409.071371] RDX: 0000000000000000 RSI: ffff97185d547250 RDI: ffff9719549270a8
[ 4409.071372] RBP: ffff9719549270a8 R08: ffff9716473efec0 R09: ffff9716473efed8
[ 4409.071373] R10: ffff971646cc3000 R11: ffff971646cc3000 R12: 0000000000000286
[ 4409.071374] R13: ffff9716473eebe0 R14: ffff9716ee901bc0 R15: ffff9719549270a0
[ 4409.071375] FS:  0000000000000000(0000) GS:ffff97213fc80000(0000) knlGS:0000000000000000
[ 4409.071377] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4409.071378] CR2: fffffffffffffff8 CR3: 0000000427e10006 CR4: 00000000001706e0
[ 4409.071379] note: uvd[191] exited with preempt_count 1
[ 4419.193226] [drm:amdgpu_job_timedout] *ERROR* ring uvd timeout, signaled seq=14, emitted seq=14
[ 4419.193237] [drm:amdgpu_job_timedout] *ERROR* Process information: process RDD Process pid 37880 thread firefox:cs0 pid 46445
[ 4419.193242] amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
[ 4419.193305] ------------[ cut here ]------------
[ 4419.193307] WARNING: CPU: 18 PID: 45938 at kernel/kthread.c:596 kthread_park+0x6d/0x90
[ 4419.193312] Modules linked in: xt_conntrack nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter cmac rfcomm vboxnetadp(OE) vboxnetflt(OE) iptable_mangle xt_CHECKSUM xt_tcpudp iptable_nat xt_comment xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge stp llc overlay iptable_filter vboxdrv(OE) bnep cpufreq_powersave zram binfmt_misc squashfs snd_emu10k1_synth snd_hda_codec_realtek snd_emux_synth snd_seq_midi_emul snd_seq_virmidi snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel intel_rapl_msr snd_intel_dspcfg intel_rapl_common snd_emu10k1 snd_hda_codec snd_util_mem snd_ac97_codec snd_hda_core nls_iso8859_1 hp_wmi nls_cp866 ac97_bus platform_profile sparse_keymap snd_hwdep wmi_bmof btusb snd_pcm sb_edac btrtl x86_pkg_temp_thermal intel_powerclamp snd_seq_midi btbcm snd_seq_midi_event btintel snd_rawmidi kvm_intel bluetooth input_leds snd_seq kvm ecdh_generic snd_seq_device snd_timer irqbypass emu10k1_gp serio_raw snd gameport ioatdma soundcore dca
[ 4419.193358]  wmi mac_hid xpad ff_memless coretemp mei_me mei hwmon_vid i5500_temp msr ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq zstd_compress libcrc32c hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid crc32_pclmul ghash_clmulni_intel aesni_intel e1000e psmouse crypto_simd cryptd ahci i2c_i801 libahci lpc_ich i2c_smbus [last unloaded: cpuid]
[ 4419.193380] CPU: 18 PID: 45938 Comm: kworker/18:1 Tainted: G      D    OE     5.16.8uksm #1
[ 4419.193383] Hardware name: Hewlett-Packard HP Z420 Workstation/1589, BIOS J61 v03.96 10/29/2019
[ 4419.193384] Workqueue: events drm_sched_job_timedout
[ 4419.193388] RIP: 0010:kthread_park+0x6d/0x90
[ 4419.193391] Code: 20 e8 a7 50 dd 00 be 40 00 00 00 48 89 ef e8 7a 1d 01 00 48 85 c0 74 25 31 c0 5b 5d c3 0f 0b a8 04 48 8b 9d a0 05 00 00 74 b2 <0f> 0b b8 da ff ff ff 5b 5d c3 0f 0b b8 f0 ff ff ff eb dd 0f 0b eb
[ 4419.193394] RSP: 0018:ffffbbdf30497d10 EFLAGS: 00010202
[ 4419.193396] RAX: 000000000020804c RBX: ffff97164124c780 RCX: 0000000000000001
[ 4419.193397] RDX: 0000000000000000 RSI: ffff97185d547000 RDI: ffff971646e38000
[ 4419.193399] RBP: ffff971646e38000 R08: 0000000000000000 R09: ffff97213fcaab70
[ 4419.193400] R10: ffff971646e3c1e8 R11: ffff971646e3c1d8 R12: ffff9716473eea68
[ 4419.193401] R13: 0000000000000060 R14: ffff971642540000 R15: ffff9716473eebd0
[ 4419.193403] FS:  0000000000000000(0000) GS:ffff97213fc80000(0000) knlGS:0000000000000000
[ 4419.193404] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4419.193406] CR2: 00007fad273687f0 CR3: 0000000427e10002 CR4: 00000000001706e0
[ 4419.193408] Call Trace:
[ 4419.193410]  <TASK>
[ 4419.193413]  drm_sched_stop+0x31/0x160
[ 4419.193416]  amdgpu_device_gpu_recover.cold+0xa34/0xa6c
[ 4419.193422]  amdgpu_job_timedout+0x145/0x170
[ 4419.193425]  drm_sched_job_timedout+0x63/0x100
[ 4419.193427]  process_one_work+0x1d8/0x3b0
[ 4419.193430]  worker_thread+0x4d/0x3d0
[ 4419.193431]  ? rescuer_thread+0x360/0x360
[ 4419.193433]  kthread+0x169/0x190
[ 4419.193436]  ? set_kthread_struct+0x40/0x40
[ 4419.193439]  ret_from_fork+0x1f/0x30
[ 4419.193444]  </TASK>
[ 4419.193445] ---[ end trace a6d18badbe55bb93 ]---

Also, there are no problems with 3D games.
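
One way to read the faulting address in the oops above, sketched in Python (an added illustration, not part of the original report): `fffffffffffffff8` is -8 when interpreted as a signed 64-bit value, and RBX is 0 in the register dump. The marked instruction bytes `48 8b 7b f8` encode a load from offset -8 relative to RBX, so the fault is consistent with dereferencing a NULL pointer at offset -8. The helper `as_signed64` is a hypothetical name.

```python
def as_signed64(value):
    """Interpret an unsigned 64-bit value as a two's-complement signed integer."""
    return value - (1 << 64) if value & (1 << 63) else value


cr2 = 0xFFFFFFFFFFFFFFF8  # faulting address from the CR2 line
rbx = 0x0                 # RBX from the register dump

offset = as_signed64(cr2)
print(offset)                             # -8
print((rbx + offset) % (1 << 64) == cr2)  # True
```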

In Linux Kernel Bug Tracker #201957, randyk161 (randyk161-linux-kernel-bugs) wrote on 2022-02-24:

#71

So I've been running for about 2.5 weeks now using the amdgpu.runpm=0 kernel parameter and I've had no crashes or freezes so far. I'm cautiously optimistic that for me at least this may have solved the problem. So far I haven't noticed any side effects (performance degradation etc.).

I understand that amdgpu.runpm=0 is related to power management but I don't know the specifics. Possibly Alex Deucher can chime in and specify exactly what this parameter does?

See my previous comments for some context:
comment 35
comment 62
comment 63

In Linux Kernel Bug Tracker #201957, alexdeucher (alexdeucher-linux-kernel-bugs) wrote on 2022-02-25:

#72

(In reply to Randune from comment #66)
>
> I understand that amdgpu.runpm=0 is related to power management but I don't
> know the specifics. Possibly Alex Deucher can chime in and specify exactly
> what this parameter does?

The runpm parameter allows you to disable runtime power management which powers down dGPUs at runtime if they are not being used (e.g., hybrid graphics laptops or desktop systems with multiple GPUs) to save power. It does not affect dynamic power management while the chip is powered up. Disabling it will increase idle power usage.
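
A minimal sketch (added here, with hypothetical names) for checking which `amdgpu.*` options a kernel command line actually carries, for example to confirm that `amdgpu.runpm=0` took effect after editing the bootloader configuration. On a running system the string would come from `/proc/cmdline`; a literal sample is used so the snippet runs anywhere.

```python
def amdgpu_params(cmdline):
    """Return the amdgpu.* options from a kernel command line as a dict."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            key, value = token.split("=", 1)
            params[key[len("amdgpu."):]] = value
    return params


# Illustrative command line; the non-amdgpu tokens are made up for the example.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet amdgpu.runpm=0"
print(amdgpu_params(sample))  # {'runpm': '0'}
```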

In Linux Kernel Bug Tracker #201957, s48gs.w (s48gs.w-linux-kernel-bugs) wrote on 2022-05-05:

#73

I had this problem with a Ryzen 3 3200 CPU (integrated Vega 8) on an A320M-DVS R4.0 motherboard.
microcode: CPU: patch_level=0x08108109
microcode: Microcode Update Driver: v2.2.

I had a scenario that triggered the freeze 100% of the time:
1. Play a video (in a web browser or video player; it should stay visible, so don't hide the tab or minimize the window).
2. Open the Shadertoy website (any shader; keep it rendering and keep the window visible).
3. Open any OpenGL or Vulkan application (that uses the integrated GPU).
4. Start pressing the fullscreen/un-fullscreen button on the Shadertoy shader (about 5 times is enough to trigger the bug; the system will gradually slow down over the next 10-20 minutes until it freezes, which is visible on the Shadertoy FPS counter, so just wait).
... and freeze

I have used this PC for 2 years, and every Linux kernel had this "freeze" when using the integrated GPU. The current kernel is openSUSE 5.17.4-1-default.
(My solution all this time was the obvious one: disable the integrated GPU in the BIOS and use only the discrete one, and everything works.)

Today I checked the motherboard website, https://asrock.com/MB/AMD/A320M-DVS%20R4.0/index.asp#BIOS, which has BIOS versions 7.00 and 7.10; I was on BIOS 4.00.
So I updated the BIOS to 7.00 and then 7.10 (now)... and everything works, no freezes anymore.
So it was a firmware problem (at least for me) that was fixed by a BIOS update.

In Linux Kernel Bug Tracker #201957, s48gs.w (s48gs.w-linux-kernel-bugs) wrote on 2022-05-05:

#74

Edit: I got a freeze after using the PC for 4 hours. Before, 20 minutes was the longest I could use the integrated GPU, so it looks like it is not fixed completely, just somewhat improved (or I just got lucky). I'm back to using the discrete GPU.

In Linux Kernel Bug Tracker #201957, martin.von.wittich (martin.von.wittich-linux-kernel-bugs) wrote on 2022-06-11:

#75

My Ubuntu 20.04 desktop is crashing several times per day due to this bug since I've upgraded my computer from an old Intel Xeon to an AMD Ryzen 9 5900X on a B550 mainboard. I've had the same AMD RX Vega 56 graphics card in both computers, so I assume this is probably more related to the mainboard/CPU than to the graphics card.

The crashes from today:

```
martin@martin ~ % grep amdgpu /var/log/syslog | grep ERROR | grep -v 'Failed to initialize parser'
Jun 11 03:15:33 martin kernel: [21494.642889] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1750601, emitted seq=1750603
Jun 11 03:15:33 martin kernel: [21494.643055] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 5037 thread firefox:cs0 pid 5123
Jun 11 03:15:50 martin kernel: [21511.795007] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1750605, emitted seq=1750608
Jun 11 03:15:50 martin kernel: [21511.795174] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 5037 thread firefox:cs0 pid 5123
Jun 11 15:56:07 martin kernel: [ 1477.069969] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=216293, emitted seq=216295
Jun 11 15:56:07 martin kernel: [ 1477.070140] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 5237 thread firefox:cs0 pid 5302
Jun 11 15:56:22 martin kernel: [ 1492.174077] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=216297, emitted seq=216300
Jun 11 15:56:22 martin kernel: [ 1492.174248] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Jun 11 16:03:28 martin kernel: [ 1918.161101] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=264406, emitted seq=264408
Jun 11 16:03:28 martin kernel: [ 1918.161271] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 10569 thread firefox:cs0 pid 10633
Jun 11 16:03:49 martin kernel: [ 1938.385307] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=264410, emitted seq=264413
Jun 11 16:03:49 martin kernel: [ 1938.385479] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 10569 thread firefox:cs0 pid 10633
Jun 11 23:28:12 martin kernel: [25491.854294] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2390985, emitted seq=2390987
Jun 11 23:28:12 martin kernel: [25491.854460] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 4922 thread firefox:cs0 pid 4989
Jun 11 23:28:28 martin kernel: [25507.982446] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2390989, emitted seq=2390992
Jun 11 23:28:28 martin kernel: [25507.982613] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Jun 11 23:29:51 martin kernel: [25591.333483] amdgpu 0000:2d:00.0: amdgpu:       WALKER_ERROR: 0x0
Jun 11 23:29:51 martin kernel: [25591.333485] amdgpu 0000:2d:00.0: amdgpu:       MAPPING_ERROR: 0x0
Jun 11 23:30:01 martin kernel: [25601.412838] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring uvd_0 timeout, signaled seq=308, emitted seq=310
Jun 11 23:30:01 martin kernel: [25601.413009] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process mpv pid 44110 thread mpv:cs0 pid 44122
Jun 11 23:30:16 martin kernel: [25616.014983] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2409182, emitted seq=2409185
Jun 11 23:30:16 martin kernel: [25616.015151] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 42941 thread firefox:cs0 pid 43005
```
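
A rough sketch (added here, not from the original comment) of how log lines like these could be summarized per ring, together with the gap between emitted and signaled sequence numbers at each timeout. The regular expression is based only on the message format visible above, and `summarize_timeouts` is a hypothetical helper.

```python
import re
from collections import Counter

TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def summarize_timeouts(lines):
    """Count timeouts per ring and record emitted - signaled for each one."""
    per_ring = Counter()
    pending = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            per_ring[match.group("ring")] += 1
            pending.append(int(match.group("emitted")) - int(match.group("signaled")))
    return per_ring, pending


# A few lines copied from the log above (timestamps and host prefix trimmed).
sample = [
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1750601, emitted seq=1750603",
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 5037 thread firefox:cs0 pid 5123",
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring uvd_0 timeout, signaled seq=308, emitted seq=310",
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2409182, emitted seq=2409185",
]

per_ring, pending = summarize_timeouts(sample)
print(dict(per_ring), pending)
```

On a real system the lines could be read from the syslog file used in the `grep` command above instead of the hard-coded sample.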

When I upgraded my computer at the end of 2021, I had to switch from the default Ubuntu 20.04 kernel `linux-image-generic` (5.4.0) to `linux-image-generic-hwe-20.04` (5.11.0) because of some hardware issues with the new computer (I don't remember what exactly didn't work, IIRC the network).

I'm not exactly sure when the crashes started, but I changed from `linux-image-generic-hwe-20.04` (5.14) to `linux-image-oem-20.04d` (5.14) on 2022-04-30 in the hopes that that might resolve the issue, but unfortunately it didn't help.

I tried the `amdgpu.runpm=0` workaround today which also didn't help.

I can also confirm that the attached video "5 second video clip that triggers a crash" successfully triggers the crash on my system.

The main other thing that seems to trigger the crash is to open new tabs in Firefox (in that not every new tab I open causes the crash, but when it crashes, it's usually when I was trying to open a new tab).

In Linux Kernel Bug Tracker #201957, panospolychronis (panospolychronis-linux-kernel-bugs) wrote on 2022-06-13:

#76

(In reply to Martin von Wittich from comment #70)
> My Ubuntu 20.04 desktop is crashing several times per day due to this bug
> since I've upgraded my computer from an old Intel Xeon to an AMD Ryzen 9
> 5900X on a B550 mainboard. I've had the same AMD RX Vega 56 graphics card in
> both computers, so I assume this is probably more related to the
> mainboard/CPU than to the graphics card.
> 
> The crashes from today:
> 
> ```
> martin@martin ~ % grep amdgpu /var/log/syslog | grep ERROR | grep -v 'Failed
> to initialize parser'
> Jun 11 03:15:33 martin kernel: [21494.642889] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1750601, emitted seq=1750603
> Jun 11 03:15:33 martin kernel: [21494.643055] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process firefox pid 5037 thread
> firefox:cs0 pid 5123
> Jun 11 03:15:50 martin kernel: [21511.795007] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1750605, emitted seq=1750608
> Jun 11 03:15:50 martin kernel: [21511.795174] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process firefox pid 5037 thread
> firefox:cs0 pid 5123
> Jun 11 15:56:07 martin kernel: [ 1477.069969] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=216293, emitted seq=216295
> Jun 11 15:56:07 martin kernel: [ 1477.070140] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process firefox pid 5237 thread
> firefox:cs0 pid 5302
> Jun 11 15:56:22 martin kernel: [ 1492.174077] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=216297, emitted seq=216300
> Jun 11 15:56:22 martin kernel: [ 1492.174248] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
> Jun 11 16:03:28 martin kernel: [ 1918.161101] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=264406, emitted seq=264408
> Jun 11 16:03:28 martin kernel: [ 1918.161271] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process firefox pid 10569 thread
> firefox:cs0 pid 10633
> Jun 11 16:03:49 martin kernel: [ 1938.385307] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=264410, emitted seq=264413
> Jun 11 16:03:49 martin kernel: [ 1938.385479] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process firefox pid 10569 thread
> firefox:cs0 pid 10633
> Jun 11 23:28:12 martin kernel: [25491.854294] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2390985, emitted seq=2390987
> Jun 11 23:28:12 martin kernel: [25491.854460] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process firefox pid 4922 thread
> firefox:cs0 pid 4989
> Jun 11 23:28:28 martin kernel: [25507.982446] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2390989, emitted seq=2390992
> Jun 11 23:28:28 martin kernel: [25507.982613] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
> Jun 11 23:29:51 martin kernel: [25591.333483] amdgpu 0000:2d:00.0: amdgpu:  
> WALKER_ERROR: 0x0
> Jun 11 23:29:51 martin kernel: [25591.333485] amdgpu 0000:2d:00.0: amdgpu:  
> MAPPING_ERROR: 0x0
> Jun 11 23:30:01 martin kernel: [25601.412838] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring uvd_0 timeout, signaled seq=308, emitted seq=310
> Jun 11 23:30:01 martin kernel: [25601.413009] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process mpv pid 44110 thread mpv:cs0
> pid 44122
> Jun 11 23:30:16 martin kernel: [25616.014983] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2409182, emitted seq=2409185
> Jun 11 23:30:16 martin kernel: [25616.015151] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process firefox pid 42941 thread
> firefox:cs0 pid 43005
> ```
> 
> When I upgraded my computer at the end of 2021, I had to switch from the
> default Ubuntu 20.04 kernel `linux-image-generic` (5.4.0) to
> `linux-image-generic-hwe-20.04` (5.11.0) because of some hardware issues
> with the new computer (I don't remember what exactly didn't work, IIRC the
> network).
> 
> I'm not exactly sure when the crashes started, but I changed from
> `linux-image-generic-hwe-20.04` (5.14) to `linux-image-oem-20.04d` (5.14) on
> 2022-04-30 in the hopes that that might resolve the issue, but unfortunately
> it didn't help.
> 
> I tried the `amdgpu.runpm=0` workaround today which also didn't help.
> 
> I can also confirm that the attached video "5 second video clip that
> triggers a crash" successfully triggers the crash on my system.
> 
> The main other thing that seems to trigger the crash is to open new tabs in
> Firefox (in that not every new tab I open causes the crash, but when it
> crashes, it's usually when I was trying to open a new tab).

Did you try with the latest Linux kernel? I had a lot of GPU lockups like this. Also try these kernel parameters: "amdgpu.ppfeaturemask=0xffffbffb amdgpu.noretry=0 amdgpu.lockup_timeout=0 amdgpu.gpu_recovery=1 amdgpu.audio=0 amdgpu.deep_color=1 amd_iommu=on iommu=pt" (you might also try amdgpu.ppfeaturemask=0xfffd7fff or amdgpu.ppfeaturemask=0xffffffff).
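To see how the three suggested `amdgpu.ppfeaturemask` values differ, it can help to list which bits each one leaves cleared. The sketch below assumes only that the mask is a 32-bit value in which each set bit enables one power feature. The feature names belonging to each bit are defined in the kernel source and are not reproduced here, and `cleared_bits` is a hypothetical helper name.

```python
def cleared_bits(mask: int, width: int = 32) -> list[int]:
    """Return the positions of the 0 bits in `mask` (bit 0 is the least significant)."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


# The three values suggested in the comment above.
for value in ("0xffffbffb", "0xfffd7fff", "0xffffffff"):
    print(value, cleared_bits(int(value, 16)))
# 0xffffbffb [2, 14]
# 0xfffd7fff [15, 17]
# 0xffffffff []
```

So the first value turns off bits 2 and 14, the second turns off bits 15 and 17, and the third leaves every bit enabled.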

Revision history for this message

In Linux Kernel Bug Tracker #201957, martin.von.wittich (martin.von.wittich-linux-kernel-bugs) wrote on 2022-06-20:

#77

I can confirm that adding "amdgpu.dpm=0" to the kernel command line seems to resolve this issue - I enabled that option on 2022-06-12 13:24, and my system didn't crash at all on 2022-06-12 - 2022-06-14 (I was on vacation from 2022-06-15 on and didn't use my computer from then on).

I don't use Linux for gaming and therefore can't comment on how badly this affects gaming performance, but I did notice that mpv could no longer play 1080p x264 video without stuttering when it defaults to --vo=gpu. Using another --vo like sdl seems to be a viable workaround.

> Did you try with the latest Linux Kernel? I had a lot of gpu lockups like this. Also try these kernel parameters : "amdgpu.ppfeaturemask=0xffffbffb amdgpu.noretry=0 amdgpu.lockup_timeout=0 amdgpu.gpu_recovery=1 amdgpu.audio=0 amdgpu.deep_color=1 amd_iommu=on iommu=pt" ( you might also try with amdgpu.ppfeaturemask=0xfffd7fff or amdgpu.ppfeaturemask=0xffffffff )

I'll try these next.

Revision history for this message

In Linux Kernel Bug Tracker #201957, martin.von.wittich (martin.von.wittich-linux-kernel-bugs) wrote on 2022-06-20:

#78

Sorry, forgot to mention in my last post and now can't edit: interestingly enough, the attached video "5 second video clip that triggers a crash" still successfully triggers the crash.

Seems to me like the root issue isn't actually in the dynamic power management code, but somewhere else, and the DPM is just one of several things that can trigger it?

Revision history for this message

In Linux Kernel Bug Tracker #201957, martin.von.wittich (martin.von.wittich-linux-kernel-bugs) wrote on 2022-06-22:

#79

> Did you try with the latest Linux Kernel? I had a lot of gpu lockups like this. Also try these kernel parameters : "amdgpu.ppfeaturemask=0xffffbffb amdgpu.noretry=0 amdgpu.lockup_timeout=0 amdgpu.gpu_recovery=1 amdgpu.audio=0 amdgpu.deep_color=1 amd_iommu=on iommu=pt" ( you might also try with amdgpu.ppfeaturemask=0xfffd7fff or amdgpu.ppfeaturemask=0xffffffff )

I can confirm that at least on the current Ubuntu linux-image-oem-20.04d kernel, these options do not resolve the issue:

```
martin@martin ~ % uname -a
Linux martin 5.14.0-1042-oem #47-Ubuntu SMP Fri Jun 3 18:17:11 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
martin@martin ~ % cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.14.0-1042-oem root=UUID=1bd000ac-1487-4457-be1a-5ea901ded9e9 ro amdgpu.ppfeaturemask=0xffffbffb amdgpu.noretry=0 amdgpu.lockup_timeout=0 amdgpu.gpu_recovery=1 amdgpu.audio=0 amdgpu.deep_color=1 amd_iommu=on iommu=pt quiet
martin@martin ~ % dmesg -T | grep 'ring gfx timeout'
[Mi Jun 22 14:48:07 2022] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1820983, emitted seq=1820985
[Mi Jun 22 14:48:18 2022] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1820987, emitted seq=1820990
```
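To double-check which `amdgpu` options actually reached the kernel, the command line can be split into its tokens. The following is a minimal sketch: `amdgpu_params` is a hypothetical helper, and the string is copied from the `cat /proc/cmdline` output above rather than read from a live system.

```python
def amdgpu_params(cmdline: str) -> dict[str, str]:
    """Collect `amdgpu.<name>=<value>` tokens from a kernel command line."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            key, value = token.split("=", 1)
            params[key[len("amdgpu."):]] = value
    return params


# Copied from the /proc/cmdline output quoted in this comment.
CMDLINE = (
    "BOOT_IMAGE=/vmlinuz-5.14.0-1042-oem "
    "root=UUID=1bd000ac-1487-4457-be1a-5ea901ded9e9 ro "
    "amdgpu.ppfeaturemask=0xffffbffb amdgpu.noretry=0 "
    "amdgpu.lockup_timeout=0 amdgpu.gpu_recovery=1 amdgpu.audio=0 "
    "amdgpu.deep_color=1 amd_iommu=on iommu=pt quiet"
)

for name, value in amdgpu_params(CMDLINE).items():
    print(f"{name} = {value}")
```

On a running machine the same function could be fed the contents of `/proc/cmdline` instead of the pasted string.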

I had enabled these options on 2022-06-20 14:14 UTC+2; this is the first crash I've encountered since then.

I have no idea how to build the latest kernel and therefore haven't tested that yet.

I'll now revert back to amdgpu.dpm=0.
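When comparing crashes across kernels or option sets, the `ring ... timeout` lines can be pulled apart with a small script. This is only a sketch: the regular expression is written against the message format seen in this thread, `parse_timeouts` is a hypothetical helper, and reading `emitted - signaled` as the number of submissions still outstanding is an assumption rather than something the log states.

```python
import re

# Matches e.g. "ring gfx timeout, signaled seq=1820983, emitted seq=1820985"
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_timeouts(log_text: str) -> list[dict]:
    """Extract ring name and sequence numbers from amdgpu timeout messages."""
    events = []
    for match in TIMEOUT_RE.finditer(log_text):
        signaled = int(match["signaled"])
        emitted = int(match["emitted"])
        events.append({
            "ring": match["ring"],
            "signaled": signaled,
            "emitted": emitted,
            # Assumption: the gap is the count of not-yet-signaled submissions.
            "gap": emitted - signaled,
        })
    return events


# The two lines from the dmesg output quoted in this comment.
SAMPLE = """\
[Mi Jun 22 14:48:07 2022] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1820983, emitted seq=1820985
[Mi Jun 22 14:48:18 2022] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1820987, emitted seq=1820990
"""

for event in parse_timeouts(SAMPLE):
    print(event)
```

The same function also picks up other rings, such as the `uvd_0` timeout quoted earlier in the thread, because the ring name is captured rather than hard-coded.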

Revision history for this message

In Linux Kernel Bug Tracker #201957, s48gs.w (s48gs.w-linux-kernel-bugs) wrote on 2022-06-23:

#80

> Did you try with the latest Linux Kernel? I had a lot of gpu lockups like
> this. Also try these kernel parameters : "amdgpu.ppfeaturemask=0xffffbffb
> amdgpu.noretry=0 amdgpu.lockup_timeout=0 amdgpu.gpu_recovery=1
> amdgpu.audio=0 amdgpu.deep_color=1 amd_iommu=on iommu=pt" ( you might also
> try with amdgpu.ppfeaturemask=0xfffd7fff or amdgpu.ppfeaturemask=0xffffffff )

I tried.

my kernel:
"Linux 5.17.4-1-default #1 SMP PREEMPT Wed Apr 20 07:43:03 UTC 2022 (75e9961) x86_64 x86_64 x86_64 GNU/Linux"

(The video linked above was not able to freeze the integrated AMD GPU for me; I mean earlier, when I tested with no kernel parameters.)

The result is surprising: no crash or freeze for 4+ hours already, and I launched lots of the apps that used to cause the freeze.

As I described above (https://bugzilla.kernel.org/show_bug.cgi?id=201957#c68), for me this freeze happened only when I used OpenGL/Vulkan with video playing in the background (everything on the integrated GPU). From the user's side it looked like this: when the bug triggered (randomly), the FPS slowly dropped lower and lower. Apps that had been running at 60 FPS in fullscreen dropped to 5 FPS, video also dropped to 5-10 FPS (the UI was still responsive), and the freeze followed within the next few seconds or minutes.

Full kernel boot options now: "splash=silent quiet amdgpu.ppfeaturemask=0xffffbffb amdgpu.noretry=0 amdgpu.lockup_timeout=0 amdgpu.gpu_recovery=1 amdgpu.audio=0 amdgpu.deep_color=1 amd_iommu=on iommu=pt"

Now, after booting with these options, I see the following.

Just after boot, everything works (OpenGL/Vulkan acceleration on the integrated GPU) with the expected performance.

After trying to trigger the bug (opening multiple OpenGL apps together with Vulkan and WebGL and playing many videos), OpenGL and Vulkan drop to 20 FPS (constant, even for a single fullscreen triangle) and WebGL2 no longer works in the web browser (even after a browser restart), but video still plays at 60 FPS with no lag, and the system UI does not lag either.

So it looks like GPU graphics acceleration just drops into a very low performance mode, while everything else works fine. (Launching native graphics apps on the Nvidia GPU also still works at 60 FPS, as expected.)

Interestingly, since the FPS dropped to 20 I can no longer launch anything in Wine (any version, including Proton), although it worked right after boot. I launched a few apps after boot and checked them again once the GPU FPS had dropped; Wine always crashes with:
"wine: Unhandled page fault on execute access to 00007F894E200460 at address 00007F894E200460 (thread 0070), starting debugger..."
(Not being able to use Wine is a big disadvantage.)

Revision history for this message

In Linux Kernel Bug Tracker #201957, s48gs.w (s48gs.w-linux-kernel-bugs) wrote on 2022-06-23:

#81

The Wine problem happened because the '/usr/share/vulkan/icd.d/nvidia_icd.json' file had been deleted. I have no idea how or why this happened when the AMD GPU dropped its FPS (the file obviously exists when I use just the Nvidia GPU with the integrated AMD GPU disabled).

So the fix for Wine is: "VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json winecfg"

Super weird, but I think the Wine problem is fixed.

Revision history for this message

In Linux Kernel Bug Tracker #201957, s48gs.w (s48gs.w-linux-kernel-bugs) wrote on 2022-06-23:

#82

But even creating nvidia_icd.json with this content:
{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "/usr/lib64/libGLX_nvidia.so.0",
        "api_version" : "1.3.211"
    }
}

does not help Wine: it still crashes with the same error when it tries to use/initialize Nvidia.
I can still use Nvidia outside of Wine from native apps (and Vulkan works), so it must be related to the AMD GPU driver somehow. (This was not happening before; it is the first time I have seen Wine crash this way, compared with the previous times I tested the integrated AMD GPU.)

P.S. I have a second PC with the same integrated AMD Vega 8 GPU, and there it works fine (it has never crashed or frozen even once). That PC has a different motherboard, which is why I originally thought this was a motherboard problem, but the current boot options make the integrated GPU stable on this PC.

Revision history for this message

In Linux Kernel Bug Tracker #201957, s48gs.w (s48gs.w-linux-kernel-bugs) wrote on 2022-06-23:

#83

(I made a small mistake in organizing my files; creating nvidia_icd.json with the content listed above is enough to fix Wine for me. Everything works now.)
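A hand-written manifest like the one above is easy to get subtly wrong, so a quick parse can confirm that it is valid JSON and carries the keys used in this thread. This is a sketch only: `icd_summary` is a hypothetical helper, the manifest text is the one posted in comment #82, and the check says nothing about whether the Vulkan loader will accept the file or whether the library exists on a given system.

```python
import json


def icd_summary(manifest_text: str) -> dict:
    """Parse a Vulkan ICD manifest and return the fields shown in this thread.

    Raises json.JSONDecodeError on malformed JSON and KeyError on a missing key.
    """
    manifest = json.loads(manifest_text)
    icd = manifest["ICD"]
    return {
        "file_format_version": manifest["file_format_version"],
        "library_path": icd["library_path"],
        "api_version": icd["api_version"],
    }


# Manifest content as posted in comment #82.
MANIFEST = """
{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "/usr/lib64/libGLX_nvidia.so.0",
        "api_version" : "1.3.211"
    }
}
"""

print(icd_summary(MANIFEST))
```

If the file lives somewhere else by mistake, as happened here, the parse still succeeds, so the file location has to be checked separately.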

Revision history for this message

In Linux Kernel Bug Tracker #201957, s48gs.w (s48gs.w-linux-kernel-bugs) wrote on 2022-06-23:

#84

Updated to kernel 5.18.4-1-default #1 SMP PREEMPT_DYNAMIC Wed Jun 15 06:00:33 UTC 2022 (ed6345d) x86_64 x86_64 x86_64 GNU/Linux (the latest for openSUSE right now).

The freeze of my integrated AMD GPU seems to be completely fixed, even without the previous boot options (on 5.17 it froze without them). The integrated GPU also no longer drops into the "low performance mode forever" that I saw with the boot options; it keeps working for hours at full performance, without the earlier slowdown.

... but now the Nvidia GPU no longer works when the integrated AMD GPU is the main GPU. This is with the Nvidia 515.48.07 driver (the latest right now), on both X11 and Wayland. The Nvidia driver is correctly installed and the device is visible (nvidia-smi works and vulkaninfo --summary lists the Nvidia GPU correctly), but any application crashes when it creates a Vulkan surface on the Nvidia device. (I just tested disabling the integrated AMD GPU and booting with Nvidia: everything works there, Vulkan included.)

So fixing the integrated AMD GPU results in Nvidia no longer working... okay (I'm back to using only the discrete Nvidia GPU).

Revision history for this message

In Linux Kernel Bug Tracker #201957, jrch2k10 (jrch2k10-linux-kernel-bugs) wrote on 2022-06-29:

#85

Same issue here with the following kernel (and with the LTS kernel as well):

Linux archlinux 5.18.7-262-tkg-pds #1 TKG SMP PREEMPT_DYNAMIC Mon, 27 Jun 2022 15:50:06 +0000 x86_64 GNU/Linux

[11090.086287] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11090.086296] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11090.086302] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11090.195133] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11090.195139] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11090.195143] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11090.195150] [drm] Cannot get clockgating state when UVD is powergated.
[11090.195152] [drm] Cannot get clockgating state when VCE is powergated.
[11090.695288] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11090.699331] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11091.194893] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11091.194898] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11091.194901] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11091.194908] [drm] Cannot get clockgating state when UVD is powergated.
[11091.194909] [drm] Cannot get clockgating state when VCE is powergated.
[11091.695473] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11092.194965] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11092.194969] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11092.194973] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11092.194979] [drm] Cannot get clockgating state when UVD is powergated.
[11092.194980] [drm] Cannot get clockgating state when VCE is powergated.
[11092.695749] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11093.195046] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11093.195050] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11093.195053] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11093.195060] [drm] Cannot get clockgating state when UVD is powergated.
[11093.195061] [drm] Cannot get clockgating state when VCE is powergated.
[11093.695004] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11094.195065] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11094.195070] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11094.195074] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11094.195082] [drm] Cannot get clockgating state when UVD is powergated.
[11094.195083] [drm] Cannot get clockgating state when VCE is powergated.
[11094.695286] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11095.131026] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[11095.195055] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11095.195061] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11095.195065] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11095.195071] [drm] Cannot get clockgating state when UVD is powergated.
[11095.195072] [drm] Cannot get clockgating state when VCE is powergated.
[11095.695232] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11096.195132] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11096.195137] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11096.195140] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11096.195146] [drm] Cannot get clockgating state when UVD is powergated.
[11096.195147] [drm] Cannot get clockgating state when VCE is powergated.
[11096.694900] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11097.195057] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11097.195061] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11097.195064] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11097.195070] [drm] Cannot get clockgating state when UVD is powergated.
[11097.195071] [drm] Cannot get clockgating state when VCE is powergated.
[11097.695156] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11098.195054] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11098.195058] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11098.195062] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11098.195068] [drm] Cannot get clockgating state when UVD is powergated.
[11098.195069] [drm] Cannot get clockgating state when VCE is powergated.
[11098.695226] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11099.195056] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11099.195060] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11099.195064] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11099.195070] [drm] Cannot get clockgating state when UVD is powergated.
[11099.195071] [drm] Cannot get clockgating state when VCE is powergated.
[11099.695224] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11100.175702] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2678111, emitted seq=2678113
[11100.175937] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ArcheAge.exe pid 702264 thread ArcheAge.e:cs0 pid 703382
[11100.176120] amdgpu 0000:02:00.0: amdgpu: GPU reset begin!
[11104.176155] amdgpu 0000:02:00.0: amdgpu: failed to suspend display audio
[11104.176290] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176294] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176296] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176298] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176299] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176301] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176303] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176305] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176307] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176309] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176311] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176312] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176314] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176316] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176318] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176320] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176321] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176417] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176420] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176421] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176423] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176425] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11104.176427] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11118.768958] audit: type=1100 audit(1656469160.416:402): pid=707085 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:authentication grantors=pam_shells,pam_faillock,pam_permit,pam_faillock acct="junior" exe="/usr/bin/sshd" hostname=192.168.10.47 addr=192.168.10.47 terminal=ssh res=success'
[11118.769433] audit: type=1101 audit(1656469160.416:403): pid=707085 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:accounting grantors=pam_access,pam_unix,pam_permit,pam_time acct="junior" exe="/usr/bin/sshd" hostname=192.168.10.47 addr=192.168.10.47 terminal=ssh res=success'
[11118.769972] audit: type=1103 audit(1656469160.418:404): pid=707085 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:setcred grantors=pam_shells,pam_faillock,pam_permit,pam_faillock acct="junior" exe="/usr/bin/sshd" hostname=192.168.10.47 addr=192.168.10.47 terminal=ssh res=success'
[11118.770029] audit: type=1006 audit(1656469160.418:405): pid=707085 uid=0 old-auid=4294967295 auid=1000 tty=(none) old-ses=4294967295 ses=5 res=1
[11118.770038] audit: type=1300 audit(1656469160.418:405): arch=c000003e syscall=1 success=yes exit=4 a0=3 a1=7ffd3b3d22d0 a2=4 a3=7ffd3b3d1fe4 items=0 ppid=759 pid=707085 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=5 comm="sshd" exe="/usr/bin/sshd" key=(null)
[11118.770040] audit: type=1327 audit(1656469160.418:405): proctitle=737368643A206A756E696F72205B707269765D
[11118.785798] audit: type=1105 audit(1656469160.434:406): pid=707085 uid=0 auid=1000 ses=5 msg='op=PAM:session_open grantors=pam_loginuid,pam_keyinit,pam_systemd_home,pam_limits,pam_unix,pam_permit,pam_mail,pam_systemd,pam_env acct="junior" exe="/usr/bin/sshd" hostname=192.168.10.47 addr=192.168.10.47 terminal=ssh res=success'
[11118.786714] audit: type=1103 audit(1656469160.434:407): pid=707087 uid=0 auid=1000 ses=5 msg='op=PAM:setcred grantors=pam_shells,pam_faillock,pam_permit,pam_faillock acct="junior" exe="/usr/bin/sshd" hostname=192.168.10.47 addr=192.168.10.47 terminal=ssh res=success'
[11124.189733] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
[11124.189930] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D718 (len 824, WS 0, PS 0) @ 0xD898
[11124.190079] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D5D2 (len 326, WS 0, PS 0) @ 0xD6C2
[11124.190230] [drm:dce110_link_encoder_disable_output [amdgpu]] *ERROR* dce110_link_encoder_disable_output: Failed to execute VBIOS command table!
[11126.469943] audit: type=1101 audit(1656469168.118:408): pid=707219 uid=1000 auid=1000 ses=5 msg='op=PAM:accounting grantors=pam_unix,pam_permit,pam_time acct="junior" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
[11126.470552] audit: type=1110 audit(1656469168.118:409): pid=707219 uid=1000 auid=1000 ses=5 msg='op=PAM:setcred grantors=pam_faillock,pam_permit,pam_env,pam_faillock acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
[11126.472793] audit: type=1105 audit(1656469168.120:410): pid=707219 uid=1000 auid=1000 ses=5 msg='op=PAM:session_open grantors=pam_systemd_home,pam_limits,pam_unix,pam_permit acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
[11126.492151] audit: type=1106 audit(1656469168.139:411): pid=707219 uid=1000 auid=1000 ses=5 msg='op=PAM:session_close grantors=pam_systemd_home,pam_limits,pam_unix,pam_permit acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
[11126.492202] audit: type=1104 audit(1656469168.139:412): pid=707219 uid=1000 auid=1000 ses=5 msg='op=PAM:setcred grantors=pam_faillock,pam_permit,pam_env,pam_faillock acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
[11144.191100] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
[11144.191292] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C16E (len 62, WS 0, PS 0) @ 0xC18A
[11164.192468] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
[11164.192658] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing B190 (len 1227, WS 8, PS 8) @ 0xB418
[11164.192828] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.192831] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.192833] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.201396] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vce_v3_0> failed -110
[11164.216360] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.216364] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.216366] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.216368] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.216370] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.216371] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.216373] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.216375] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.216377] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.216378] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.436229] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.436234] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.436236] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.436238] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.436240] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.436241] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.436243] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.436246] amdgpu 0000:02:00.0: amdgpu: 
               last message was failed ret is 65535
[11164.436248] amdgpu: Failed to force to switch arbf0!
[11164.436249] amdgpu: [disable_dpm_tasks] Failed to disable DPM!
[11164.436250] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -22
[11164.546720] amdgpu 0000:02:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[11164.546864] [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[11164.767164] amdgpu: cp is busy, skip halt cp
[11164.877251] amdgpu: rlc is busy, skip halt rlc
[11164.988549] CPU: 2 PID: 705317 Comm: kworker/u48:4 Tainted: G           OE     5.18.7-262-tkg-pds #1 ab3a1701b6bb2d2603e5fe14656a947bbae77de2
[11164.988553] Hardware name: ATERMITER ZX-99EV3/ZX-99EV3, BIOS X99AT011 10/15/2020
[11164.988554] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
[11164.988561] Call Trace:
[11164.988562]  <TASK>
[11164.988563]  dump_stack_lvl+0x48/0x5d
[11164.988570]  amdgpu_do_asic_reset+0x2a/0x470 [amdgpu d2028a110b701082c428a38d2a7699ba96e2f894]
[11164.988790]  amdgpu_device_gpu_recover_imp.cold+0x537/0x8cc [amdgpu d2028a110b701082c428a38d2a7699ba96e2f894]
[11164.989002]  amdgpu_job_timedout+0x18c/0x1c0 [amdgpu d2028a110b701082c428a38d2a7699ba96e2f894]
[11164.989183]  drm_sched_job_timedout+0x76/0x100 [gpu_sched ca892a3eb32539b04f830de75b342015ecf19774]
[11164.989188]  process_one_work+0x1c7/0x380
[11164.989192]  worker_thread+0x51/0x380
[11164.989195]  ? rescuer_thread+0x3a0/0x3a0
[11164.989197]  kthread+0xde/0x110
[11164.989200]  ? kthread_complete_and_exit+0x20/0x20
[11164.989203]  ret_from_fork+0x22/0x30
[11164.989208]  </TASK>
[11164.989212] amdgpu 0000:02:00.0: amdgpu: BACO reset
[drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:53:crtc-0] hw_done or flip_done timed out
[11187.893035] radeon-profile[54935]: segfault at 0 ip 00007fe553eee6ef sp 00007ffc8035f9e0 error 4 in libQt5Core.so.5.15.5[7fe553e9f000+2d6000]
[11187.893049] Code: 38 64 48 8b 04 25 28 00 00 00 48 89 44 24 28 31 c0 e8 d5 98 ff ff 48 85 c0 0f 84 f2 3c fb ff 48 89 c3 4c 8d 68 50 48 8b 40 50 <49> 63 2c 24 3b 68 04 7d 78 8b 10 83 fa 01 76 26 8b 70 08 81 e6 ff

[drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
[11206.839405] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C16E (len 62, WS 0, PS 0) @ 0xC18A
[11206.839546] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing AB18 (len 142, WS 0, PS 8) @ 0xAB33
[11206.839688] amdgpu 0000:02:00.0: amdgpu: asic atom init failed!
[11206.839725] amdgpu 0000:02:00.0: amdgpu: GPU reset(2) failed
[11206.839746] amdgpu 0000:02:00.0: amdgpu: GPU reset end with ret = -22
[11206.839748] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -22

[11216.913239] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2678113, emitted seq=2678113
[11216.913503] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ArcheAge.exe pid 702264 thread ArcheAge.e:cs0 pid 703382
[11216.913700] amdgpu 0000:02:00.0: amdgpu: GPU reset begin!
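
For readers who are not used to kernel return codes, the negative numbers in the log above (-22, -110) are negated Linux errno values. A minimal sketch of decoding them follows, assuming the usual Linux numbering; `ERRNO_NAMES`, `CODE_RE` and `decode_errors` are hypothetical names made up for this illustration and are not part of any amdgpu tooling.

```python
import re

# Linux errno values that appear in the log above (assumed generic Linux numbering).
ERRNO_NAMES = {
    22: "EINVAL (invalid argument)",
    110: "ETIMEDOUT (timed out)",
}

# Matches the negative codes in "failed -22", "failed (-110)",
# "Failed: -22" and "ret = -22". Other numbers on the line are ignored.
CODE_RE = re.compile(r"(?:failed|Failed:|ret =)\s*\(?-(\d+)\)?")


def decode_errors(line):
    """Return readable names for the negative errno codes found in one log line."""
    return [
        ERRNO_NAMES.get(int(code), f"unknown errno {code}")
        for code in CODE_RE.findall(line)
    ]
```

Read this way, the "test failed (-110)" line reports a timeout, while "GPU Recovery Failed: -22" reports an invalid-argument error returned by the reset path.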

In Linux Kernel Bug Tracker #201957, s48gs.w (s48gs.w-linux-kernel-bugs) wrote on 2022-07-14:

#86

Nvidia released the 515.57 drivers, which fix "Nvidia being broken when used as second GPU in Linux" (my bug above).
The Nvidia GPU works again when the AMD GPU is the main GPU.

In Linux Kernel Bug Tracker #201957, s48gs.w (s48gs.w-linux-kernel-bugs) wrote on 2022-07-17:

#87

After using this PC for a few days with the AMD Vega 8 (integrated) as the main GPU, I see no freezes at all. (In 2021 it froze every 10-20 minutes, so I had to use Nvidia as the main GPU.)
(It works with and without the kernel boot option listed above.)

I use openSUSE kernel 5.18.4-1-default (I am not going to update for some time, because it works).

Maybe it is fixed only for my motherboard and CPU combination. My hardware:
Ryzen 3 3200 CPU (Vega 8 integrated) on an A320M-DVS R4.0 motherboard.
microcode: CPU: patch_level=0x08108109
microcode: Microcode Update Driver: v2.2.

Wayland and X11 both work, with Nvidia as the second GPU.
Wayland slows down (the whole UI drops to about 1-2 FPS) once after a few hours of use, but this is fixed just by switching to a system terminal (Ctrl+Alt+F1) and back. Nothing crashes; video apps and graphics keep working.

Integrated GPU performance still goes down (randomly, after 2-6 hours of PC use) and never comes back, but that is fine, since I have the Nvidia second GPU for complex graphics. Vega 8 performance drops only in complex shaders: FPS falls from 60 fullscreen (1080p) to 10-20 on complex raymarching shaders. For the system UI (Wayland/X11, GNOME 42) this is not noticeable, and video plays at 60 FPS as expected. (Sleep mode also works, not every time (because of Nvidia) but most of the time, the same as when I used Nvidia as the main GPU.)

In Linux Kernel Bug Tracker #201957, s48gs.w (s48gs.w-linux-kernel-bugs) wrote on 2022-07-17:

#88

Log for what I described above ("fixed just by switching to a system terminal (Ctrl+Alt+F1)"): nothing crashes, and even GPU apps keep working. There is just a huge mouse and UI freeze, and switching to the F1 terminal and back fixes it (Wayland).
Logs:

Jul 17 22:54:04 home-danil kernel: amdgpu 0000:07:00.0: amdgpu: Failed to send Message 7.
Jul 17 22:54:09 home-danil kernel: amdgpu 0000:07:00.0: amdgpu: Failed to send Message 7.
Jul 17 22:54:12 home-danil kernel: ------------[ cut here ]------------
Jul 17 22:54:12 home-danil kernel: WARNING: CPU: 1 PID: 1100 at drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dcn10/rv1_clk_mgr_vbios_smu.c:120 rv1_vbios_smu_send_msg_with_param+0xa3/0xb0 [amdgpu]
Jul 17 22:54:12 home-danil kernel: Modules linked in: dm_crypt essiv authenc trusted asn1_encoder tee nvidia_uvm(POE) nvidia_modeset(POE) nvidia(POE) snd_seq_dummy snd_hrtimer snd_seq snd_seq_device af_packet nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set iscsi_ibft iscsi_boot_sysfs nfnetlink rfkill ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter qrtr vboxnetadp(O) vboxnetflt(O) vboxdrv(O) dmi_sysfs joydev intel_rapl_msr intel_rapl_common snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio edac_mce_amd snd_hda_intel snd_intel_dspcfg kvm_amd snd_intel_sdw_acpi snd_hda_codec r8169 pcspkr snd_hda_core kvm realtek snd_hwdep snd_pcm wmi_bmof mdio_devres snd_timer
Jul 17 22:54:12 home-danil kernel:  libphy irqbypass snd soundcore efi_pstore i2c_piix4 gpio_amdpt gpio_generic acpi_cpufreq k10temp tiny_power_button nls_iso8859_1 squashfs nls_cp437 loop ext4 mbcache vfat jbd2 fat fuse configfs ip_tables x_tables hid_generic usbhid uas usb_storage amdgpu crct10dif_pclmul crc32_pclmul ghash_clmulni_intel drm_ttm_helper ttm iommu_v2 gpu_sched i2c_algo_bit drm_dp_helper drm_kms_helper aesni_intel crypto_simd syscopyarea sysfillrect sysimgblt fb_sys_fops cryptd drm cec xhci_pci xhci_pci_renesas sp5100_tco ccp rc_core xhci_hcd usbcore wmi video button btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efivarfs
Jul 17 22:54:12 home-danil kernel: CPU: 1 PID: 1100 Comm: systemd-logind Tainted: P           OE     5.18.4-1-default #1 openSUSE Tumbleweed 59778fa2462c9ee971468464596d3fbe14e51d2e
Jul 17 22:54:12 home-danil kernel: Hardware name: To Be Filled By O.E.M. A320M-DVS R4.0/A320M-DVS R4.0, BIOS P7.10 12/23/2021
Jul 17 22:54:12 home-danil kernel: RIP: 0010:rv1_vbios_smu_send_msg_with_param+0xa3/0xb0 [amdgpu]
Jul 17 22:54:12 home-danil kernel: Code: 62 01 00 e8 8f 4e f5 ff 85 c0 74 d8 83 f8 01 75 19 48 8b 7d 00 5b be 93 62 01 00 48 c7 c2 00 99 cd c0 5d 41 5c e9 6d 4e f5 ff <0f> 0b eb e3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 81 c6 e7 03
Jul 17 22:54:12 home-danil kernel: RSP: 0018:ffff9f0a00b1f580 EFLAGS: 00010246
Jul 17 22:54:12 home-danil kernel: RAX: 00007570227d95d8 RBX: 0000000000000000 RCX: 0000000000000001
Jul 17 22:54:12 home-danil kernel: RDX: 0000000000009288 RSI: 0000000000008b82 RDI: 00007570227d0350
Jul 17 22:54:12 home-danil kernel: RBP: ffff8b0388bf3c00 R08: 0000000000002700 R09: 0000000000002700
Jul 17 22:54:12 home-danil kernel: R10: ffff9f0a00b1f630 R11: 0000000000000003 R12: 0000000000000097
Jul 17 22:54:12 home-danil kernel: R13: ffff8b0386ec98a0 R14: ffff8b0388bf3c00 R15: ffff8b03c04a0000
Jul 17 22:54:12 home-danil kernel: FS:  00007fb68308cb40(0000) GS:ffff8b06c0a40000(0000) knlGS:0000000000000000
Jul 17 22:54:12 home-danil kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 17 22:54:12 home-danil kernel: CR2: 00003e74003afe38 CR3: 000000018ef3c000 CR4: 00000000003506e0
Jul 17 22:54:12 home-danil kernel: Call Trace:
Jul 17 22:54:12 home-danil kernel:  <TASK>
Jul 17 22:54:12 home-danil kernel:  rv1_vbios_smu_set_dispclk+0x46/0xb0 [amdgpu e7857b98c028928796f1e71af6f4284e57f7c0e3]
Jul 17 22:54:12 home-danil kernel:  rv1_update_clocks+0x254/0x500 [amdgpu e7857b98c028928796f1e71af6f4284e57f7c0e3]
Jul 17 22:54:12 home-danil kernel:  dcn10_prepare_bandwidth+0x6b/0x130 [amdgpu e7857b98c028928796f1e71af6f4284e57f7c0e3]
Jul 17 22:54:12 home-danil kernel:  dc_commit_updates_for_stream+0x1b69/0x1f90 [amdgpu e7857b98c028928796f1e71af6f4284e57f7c0e3]
Jul 17 22:54:12 home-danil kernel:  ? mutex_lock+0xe/0x30
Jul 17 22:54:12 home-danil kernel:  ? flush_workqueue+0x177/0x3a0
Jul 17 22:54:12 home-danil kernel:  amdgpu_dm_atomic_commit_tail+0x1627/0x2720 [amdgpu e7857b98c028928796f1e71af6f4284e57f7c0e3]
Jul 17 22:54:12 home-danil kernel:  ? ttm_resource_compat+0x23/0x50 [ttm 63072f655d2dc7ed260c9d980e7b7104612ede60]
Jul 17 22:54:12 home-danil kernel:  commit_tail+0x94/0x120 [drm_kms_helper 9e4d316863dffca879cbc8a3a12d452ad7e0a149]
Jul 17 22:54:12 home-danil kernel:  drm_atomic_helper_commit+0x10f/0x140 [drm_kms_helper 9e4d316863dffca879cbc8a3a12d452ad7e0a149]
Jul 17 22:54:12 home-danil kernel:  drm_client_modeset_commit_atomic+0x1e4/0x220 [drm 93e548a999b532667e8d1d66f85cd72b61d212a3]
Jul 17 22:54:12 home-danil kernel:  drm_client_modeset_commit_locked+0x56/0x150 [drm 93e548a999b532667e8d1d66f85cd72b61d212a3]
Jul 17 22:54:12 home-danil kernel:  drm_fb_helper_set_par+0x78/0xd0 [drm_kms_helper 9e4d316863dffca879cbc8a3a12d452ad7e0a149]
Jul 17 22:54:12 home-danil kernel:  fb_set_var+0x19d/0x380
Jul 17 22:54:12 home-danil kernel:  ? update_load_avg+0x7e/0x730
Jul 17 22:54:12 home-danil kernel:  ? update_load_avg+0x7e/0x730
Jul 17 22:54:12 home-danil kernel:  fbcon_blank+0x206/0x2c0
Jul 17 22:54:12 home-danil kernel:  do_unblank_screen+0xa7/0x150
Jul 17 22:54:12 home-danil kernel:  complete_change_console+0x54/0x120
Jul 17 22:54:12 home-danil kernel:  vt_ioctl+0x12c8/0x13b0
Jul 17 22:54:12 home-danil kernel:  ? __x64_sys_ioctl+0x8d/0xc0
Jul 17 22:54:12 home-danil kernel:  tty_ioctl+0x283/0x860
Jul 17 22:54:12 home-danil kernel:  ? __sys_sendmsg+0x57/0xa0
Jul 17 22:54:12 home-danil kernel:  ? __seccomp_filter+0x314/0x4d0
Jul 17 22:54:12 home-danil kernel:  __x64_sys_ioctl+0x8d/0xc0
Jul 17 22:54:12 home-danil kernel:  do_syscall_64+0x5b/0x80
Jul 17 22:54:12 home-danil kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
Jul 17 22:54:12 home-danil kernel: RIP: 0033:0x7fb683be145f
Jul 17 22:54:12 home-danil kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
Jul 17 22:54:12 home-danil kernel: RSP: 002b:00007ffd5c30c340 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jul 17 22:54:12 home-danil kernel: RAX: ffffffffffffffda RBX: 0000000000000017 RCX: 00007fb683be145f
Jul 17 22:54:12 home-danil kernel: RDX: 0000000000000001 RSI: 0000000000005605 RDI: 0000000000000017
Jul 17 22:54:12 home-danil kernel: RBP: 0000000000000000 R08: 00007ffd5c30c340 R09: 000055c0f8a6f55e
Jul 17 22:54:12 home-danil kernel: R10: 00007ffd5c30c380 R11: 0000000000000246 R12: 000055c0f8a45430
Jul 17 22:54:12 home-danil kernel: R13: 00007ffd5c30c420 R14: 00007ffd5c30c418 R15: 0000000000000006
Jul 17 22:54:12 home-danil kernel:  </TASK>
Jul 17 22:54:12 home-danil kernel: ---[ end trace 0000000000000000 ]---
Jul 17 22:54:15 home-danil kernel: amdgpu 0000:07:00.0: amdgpu: Failed to send Message 7.
Jul 17 22:54:15 home-danil kernel: rfkill: input handler enabled
Jul 17 22:54:20 home-danil systemd[1]: Started Getty on tty2.
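
To see how often the SMU message failures recur in a longer journal, the "Failed to send Message" lines can be tallied per device. This is a rough sketch that assumes the line format shown in the log above; `MSG_RE` and `count_failed_messages` are hypothetical names for this illustration.

```python
import re
from collections import Counter

# Captures the PCI address and the message number from lines such as
# "amdgpu 0000:07:00.0: amdgpu: Failed to send Message 7."
MSG_RE = re.compile(r"amdgpu (\S+): amdgpu: Failed to send Message (\d+)\.")


def count_failed_messages(lines):
    """Count failed SMU messages, keyed by (pci_address, message_number)."""
    counts = Counter()
    for line in lines:
        match = MSG_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts
```

Feeding it saved journal output (for example `count_failed_messages(open("journal.txt"))`, where `journal.txt` is a placeholder file name) gives one counter entry per device and message number.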

In Linux Kernel Bug Tracker #201957, 291765088 (291765088-linux-kernel-bugs) wrote on 2022-08-11:

#89

This is an AMD driver problem. You can contact me and I'll give you the final solution; email <email address hidden>. Communication may be more efficient in China.

In Linux Kernel Bug Tracker #201957, hcarter1112 (hcarter1112-linux-kernel-bugs) wrote on 2023-01-11:

#90

[67760.805903] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=19820784, emitted seq=19820786
[67760.806285] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process valheim.x86_64 pid 464107 thread valheim.x8:cs0 pid 464109
[67760.806667] amdgpu 0000:0d:00.0: amdgpu: GPU reset begin!
[67761.257012] amdgpu 0000:0d:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[67761.257232] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[67761.307862] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:80:crtc-1] hw_done or flip_done timed out
[67761.516374] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[67761.542980] [drm] free PSP TMR buffer
[67761.587266] amdgpu 0000:0d:00.0: amdgpu: MODE1 reset
[67761.587269] amdgpu 0000:0d:00.0: amdgpu: GPU mode1 reset
[67761.587329] amdgpu 0000:0d:00.0: amdgpu: GPU smu mode1 reset
[67762.091974] amdgpu 0000:0d:00.0: amdgpu: GPU reset succeeded, trying to resume
[67762.092156] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[67762.092219] [drm] VRAM is lost due to GPU reset!
[67762.092220] [drm] PSP is resuming...
[67762.168492] [drm] reserve 0xa00000 from 0x8001000000 for PSP TMR
[67762.269801] amdgpu 0000:0d:00.0: amdgpu: RAS: optional ras ta ucode is not available
[67762.283510] amdgpu 0000:0d:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[67762.283513] amdgpu 0000:0d:00.0: amdgpu: SMU is resuming...
[67762.283516] amdgpu 0000:0d:00.0: amdgpu: smu driver if version = 0x0000000e, smu fw if version = 0x00000012, smu fw program = 0, version = 0x00413900 (65.57.0)
[67762.283519] amdgpu 0000:0d:00.0: amdgpu: SMU driver if version not matched
[67762.283549] amdgpu 0000:0d:00.0: amdgpu: use vbios provided pptable
[67762.343739] amdgpu 0000:0d:00.0: amdgpu: SMU is resumed successfully!
[67762.345104] [drm] DMUB hardware initialized: version=0x02020017
[67762.615558] [drm] kiq ring mec 2 pipe 1 q 0
[67762.618728] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[67762.618910] [drm] JPEG decode initialized successfully.
[67762.618918] amdgpu 0000:0d:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[67762.618921] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[67762.618922] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[67762.618924] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[67762.618925] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[67762.618926] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[67762.618927] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[67762.618929] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[67762.618930] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[67762.618931] amdgpu 0000:0d:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[67762.618933] amdgpu 0000:0d:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[67762.618934] amdgpu 0000:0d:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[67762.618936] amdgpu 0000:0d:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
[67762.618937] amdgpu 0000:0d:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
[67762.618938] amdgpu 0000:0d:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
[67762.618940] amdgpu 0000:0d:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
[67762.622875] amdgpu 0000:0d:00.0: amdgpu: recover vram bo from shadow start
[67762.622989] amdgpu 0000:0d:00.0: amdgpu: recover vram bo from shadow done
[67762.622991] [drm] Skip scheduling IBs!
[67762.622993] [drm] Skip scheduling IBs!
[67762.623004] amdgpu 0000:0d:00.0: amdgpu: GPU reset(2) succeeded!
[67762.623027] [drm] Skip scheduling IBs!
[67762.623044] [drm] Skip scheduling IBs!
[67762.623052] [drm] Skip scheduling IBs!
[67762.623057] [drm] Skip scheduling IBs!
[67762.623058] [drm] Skip scheduling IBs!
[67762.623064] [drm] Skip scheduling IBs!
[67762.623067] [drm] Skip scheduling IBs!
[67762.623069] [drm] Skip scheduling IBs!
[67762.623071] [drm] Skip scheduling IBs!
[67762.623073] [drm] Skip scheduling IBs!
[67762.623076] [drm] Skip scheduling IBs!
[67762.623076] [drm] Skip scheduling IBs!
[67762.623080] [drm] Skip scheduling IBs!
[67762.623082] [drm] Skip scheduling IBs!
[67762.623083] [drm] Skip scheduling IBs!
[67762.623086] [drm] Skip scheduling IBs!
[67762.623086] [drm] Skip scheduling IBs!
[67762.623090] [drm] Skip scheduling IBs!
[67762.623091] [drm] Skip scheduling IBs!
[67762.623093] [drm] Skip scheduling IBs!
[67762.623096] [drm] Skip scheduling IBs!
[67762.623097] [drm] Skip scheduling IBs!
[67762.623100] [drm] Skip scheduling IBs!
[67762.623101] [drm] Skip scheduling IBs!
[67762.623104] [drm] Skip scheduling IBs!
[67762.623107] [drm] Skip scheduling IBs!
[67762.623107] [drm] Skip scheduling IBs!
[67762.623111] [drm] Skip scheduling IBs!
[67762.623112] [drm] Skip scheduling IBs!
[67762.623114] [drm] Skip scheduling IBs!
[67762.623117] [drm] Skip scheduling IBs!
[67762.623117] [drm] Skip scheduling IBs!
[67762.623121] [drm] Skip scheduling IBs!
[67762.623122] [drm] Skip scheduling IBs!
[67762.623124] [drm] Skip scheduling IBs!
[67762.623127] [drm] Skip scheduling IBs!
[67762.623127] [drm] Skip scheduling IBs!
[67762.623130] [drm] Skip scheduling IBs!
[67762.623132] [drm] Skip scheduling IBs!
[67762.623133] [drm] Skip scheduling IBs!
[67762.623136] [drm] Skip scheduling IBs!
[67762.623139] [drm] Skip scheduling IBs!
[67762.623143] [drm] Skip scheduling IBs!
[67762.623144] [drm] Skip scheduling IBs!
[67762.623148] [drm] Skip scheduling IBs!
[67762.623148] [drm] Skip scheduling IBs!
[67762.623152] [drm] Skip scheduling IBs!
[67762.623153] [drm] Skip scheduling IBs!
[67762.623157] [drm] Skip scheduling IBs!
[67762.623158] [drm] Skip scheduling IBs!
[67762.623161] [drm] Skip scheduling IBs!
[67762.623163] [drm] Skip scheduling IBs!
[67762.623166] [drm] Skip scheduling IBs!
[67762.623168] [drm] Skip scheduling IBs!
[67762.623170] [drm] Skip scheduling IBs!
[67762.623173] [drm] Skip scheduling IBs!
[67762.623174] [drm] Skip scheduling IBs!
[67762.623177] [drm] Skip scheduling IBs!
[67762.623178] [drm] Skip scheduling IBs!
[67762.623182] [drm] Skip scheduling IBs!
[67762.623182] [drm] Skip scheduling IBs!
[67762.623187] [drm] Skip scheduling IBs!
[67762.623187] [drm] Skip scheduling IBs!
[67762.623192] [drm] Skip scheduling IBs!
[67762.623192] [drm] Skip scheduling IBs!
[67762.623197] [drm] Skip scheduling IBs!
[67762.623197] [drm] Skip scheduling IBs!
[67762.623202] [drm] Skip scheduling IBs!
[67762.623202] [drm] Skip scheduling IBs!
[67762.623206] [drm] Skip scheduling IBs!
[67762.623207] [drm] Skip scheduling IBs!
[67762.623210] [drm] Skip scheduling IBs!
[67762.623212] [drm] Skip scheduling IBs!
[67762.623214] [drm] Skip scheduling IBs!
[67762.623216] [drm] Skip scheduling IBs!
[67762.623217] [drm] Skip scheduling IBs!
[67762.623221] [drm] Skip scheduling IBs!
[67762.623221] [drm] Skip scheduling IBs!
[67762.623225] [drm] Skip scheduling IBs!
[67762.623226] [drm] Skip scheduling IBs!
[67762.623230] [drm] Skip scheduling IBs!
[67762.623230] [drm] Skip scheduling IBs!
[67762.623233] [drm] Skip scheduling IBs!
[67762.623234] [drm] Skip scheduling IBs!
[67762.623236] [drm] Skip scheduling IBs!
[67762.623239] [drm] Skip scheduling IBs!
[67762.623243] [drm] Skip scheduling IBs!
[67762.623246] [drm] Skip scheduling IBs!
[67762.623250] [drm] Skip scheduling IBs!
[67762.623254] [drm] Skip scheduling IBs!
[67762.623257] [drm] Skip scheduling IBs!
[67762.623260] [drm] Skip scheduling IBs!
[67762.623263] [drm] Skip scheduling IBs!
[67762.623267] [drm] Skip scheduling IBs!
[67762.623270] [drm] Skip scheduling IBs!
[67762.623273] [drm] Skip scheduling IBs!
[67762.623277] [drm] Skip scheduling IBs!
[67762.623280] [drm] Skip scheduling IBs!
[67762.623284] [drm] Skip scheduling IBs!
[67762.623287] [drm] Skip scheduling IBs!
[67762.623290] [drm] Skip scheduling IBs!
[67762.623293] [drm] Skip scheduling IBs!
[67762.623298] [drm] Skip scheduling IBs!
[67762.623301] [drm] Skip scheduling IBs!
[67762.623305] [drm] Skip scheduling IBs!
[67762.623309] [drm] Skip scheduling IBs!
[67762.623312] [drm] Skip scheduling IBs!
[67762.623316] [drm] Skip scheduling IBs!
[67762.623319] [drm] Skip scheduling IBs!
[67762.623321] [drm] Skip scheduling IBs!
[67762.623324] [drm] Skip scheduling IBs!
[67762.623327] [drm] Skip scheduling IBs!
[67762.623331] [drm] Skip scheduling IBs!
[67762.623334] [drm] Skip scheduling IBs!
[67762.623337] [drm] Skip scheduling IBs!
[67762.623340] [drm] Skip scheduling IBs!
[67762.623343] [drm] Skip scheduling IBs!
[67762.623345] [drm] Skip scheduling IBs!
[67762.623349] [drm] Skip scheduling IBs!
[67762.623353] [drm] Skip scheduling IBs!
[67762.623356] [drm] Skip scheduling IBs!
[67762.623359] [drm] Skip scheduling IBs!
[67762.623362] [drm] Skip scheduling IBs!
[67762.623366] [drm] Skip scheduling IBs!
[67762.623369] [drm] Skip scheduling IBs!
[67762.623373] [drm] Skip scheduling IBs!
[67762.623376] [drm] Skip scheduling IBs!
[67762.623379] [drm] Skip scheduling IBs!
[67762.623382] [drm] Skip scheduling IBs!
[67762.623385] [drm] Skip scheduling IBs!
[67762.623388] [drm] Skip scheduling IBs!
[67762.623392] [drm] Skip scheduling IBs!
[67762.623395] [drm] Skip scheduling IBs!
[67762.623398] [drm] Skip scheduling IBs!
[67762.623401] [drm] Skip scheduling IBs!
[67762.623404] [drm] Skip scheduling IBs!
[67762.623407] [drm] Skip scheduling IBs!
[67762.623411] [drm] Skip scheduling IBs!
[67762.623414] [drm] Skip scheduling IBs!
[67762.623417] [drm] Skip scheduling IBs!
[67762.623420] [drm] Skip scheduling IBs!
[67762.623423] [drm] Skip scheduling IBs!
[67762.623426] [drm] Skip scheduling IBs!
[67762.623429] [drm] Skip scheduling IBs!
[67762.623433] [drm] Skip scheduling IBs!
[67762.623437] [drm] Skip scheduling IBs!
[67762.623440] [drm] Skip scheduling IBs!
[67762.623443] [drm] Skip scheduling IBs!
[67762.623446] [drm] Skip scheduling IBs!
[67762.623450] [drm] Skip scheduling IBs!
[67762.623453] [drm] Skip scheduling IBs!
[67762.623456] [drm] Skip scheduling IBs!
[67762.623460] [drm] Skip scheduling IBs!
[67762.623463] [drm] Skip scheduling IBs!
[67762.623466] [drm] Skip scheduling IBs!
[67762.623469] [drm] Skip scheduling IBs!
[67762.623473] [drm] Skip scheduling IBs!
[67762.623476] [drm] Skip scheduling IBs!
[67762.623479] [drm] Skip scheduling IBs!
[67762.623482] [drm] Skip scheduling IBs!
[67762.623485] [drm] Skip scheduling IBs!
[67762.623489] [drm] Skip scheduling IBs!
[67762.623492] [drm] Skip scheduling IBs!
[67762.623495] [drm] Skip scheduling IBs!
[67762.623498] [drm] Skip scheduling IBs!
[67762.623501] [drm] Skip scheduling IBs!
[67762.623505] [drm] Skip scheduling IBs!
[67762.623508] [drm] Skip scheduling IBs!
[67762.623511] [drm] Skip scheduling IBs!
[67762.623515] [drm] Skip scheduling IBs!
[67762.623518] [drm] Skip scheduling IBs!
[67762.623522] [drm] Skip scheduling IBs!
[67762.623525] [drm] Skip scheduling IBs!
[67762.623529] [drm] Skip scheduling IBs!
[67762.623533] [drm] Skip scheduling IBs!
[67762.623537] [drm] Skip scheduling IBs!
[67762.623541] [drm] Skip scheduling IBs!
[67762.623544] [drm] Skip scheduling IBs!
[67762.623546] amdgpu_cs_ioctl: 7 callbacks suppressed
[67762.623548] [drm] Skip scheduling IBs!
[67762.623553] [drm] Skip scheduling IBs!
[67762.623557] [drm] Skip scheduling IBs!
[67762.623560] [drm] Skip scheduling IBs!
[67762.623565] [drm] Skip scheduling IBs!
[67762.623568] [drm] Skip scheduling IBs!
[67762.623572] [drm] Skip scheduling IBs!
[67762.623575] [drm] Skip scheduling IBs!
[67762.623549] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[67762.636312] traps: xss-lock[2346] trap int3 ip:7f86599e4e51 sp:7ffc0f5bdc20 error:0 in libglib-2.0.so.0.7200.3[7f86599a8000+91000]
[67762.645640] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[67762.862396] qtile[2274]: segfault at 7fa29b6baae0 ip 00007fa29b6baae0 sp 00007fff9d25d758 error 14 in libgobject-2.0.so.0.7200.3[7fa29b70f000+e000]
[67762.862415] Code: Unable to access opcode bytes at RIP 0x7fa29b6baab6.
[67765.682610] rfkill: input handler disabled
[67766.056883] usb 4-2: current rate 16000 is different from the runtime rate 48000
[67766.120888] usb 4-2: current rate 16000 is different from the runtime rate 48000
[67766.184883] usb 4-2: current rate 16000 is different from the runtime rate 48000
[67774.117179] rfkill: input handler enabled
------------------------------------------------------------------------------------------
I am having this same issue, with the following hardware and only while gaming. When I am doing anything other than gaming, everything is fine. I don't game often, but it usually happens in Overwatch and Valheim, in case that helps.
-----------------------------------------------------------------------------------------
OS: Nobara Linux 36 (Thirty Six) x86_64 
Kernel: 6.0.14-201.fsync.fc36.x86_64 
CPU: AMD Ryzen 5 3600 (12) @ 3.600GHz 
GPU: AMD 6700 XT
Memory: 5382MiB / 32002MiB 
MOBO: Asus Prime MA Wifi II

0d:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M] (rev c1) (prog-if 00 [VGA controller])
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0e36
	Flags: bus master, fast devsel, latency 0, IRQ 104, IOMMU group 18
	Memory at d0000000 (64-bit, prefetchable) [size=256M]
	Memory at e0000000 (64-bit, prefetchable) [size=2M]
	I/O ports at e000 [size=256]
	Memory at fc900000 (32-bit, non-prefetchable) [size=1M]
	Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: <access denied>
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu

Revision history for this message

In Linux Kernel Bug Tracker #201957, smf-linux (smf-linux-linux-kernel-bugs) wrote on 2023-05-23:

#91

Created attachment 304307
Started testing kernel 6.4-rc3; got the same problem.

Revision history for this message

In Linux Kernel Bug Tracker #201957, smf-linux (smf-linux-linux-kernel-bugs) wrote on 2023-05-24:

#92

Is it worth the effort of bisecting this, as it seems to occur on a lot of kernel versions?

thanks

Revision history for this message

In Linux Kernel Bug Tracker #201957, kernel.org (kernel.org-linux-kernel-bugs) wrote on 2023-08-15:

#93

Status = NEW after nearly 5 years?
I have the same problem

Aug 15 14:18:19 nb-tz kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=3442457, emitted seq=3442459
Aug 15 14:18:19 nb-tz kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 2628 thread gnome-shel:cs0 pid 2679
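When comparing reports like the ones in this thread, it can help to pull the ring name and the two sequence numbers out of each timeout line. The following is a minimal sketch, assuming the message keeps the "ring <name> timeout, signaled seq=N, emitted seq=M" shape seen in the logs here. The names `RingTimeout`, `parse_ring_timeout` and `pending` are hypothetical helpers, not part of any amdgpu tooling, and reading `emitted - signaled` as the number of outstanding submissions is an interpretation rather than something the driver states.

```python
import re
from typing import NamedTuple, Optional

# Matches the timeout message as it appears in the logs quoted in this thread.
# The exact wording is an assumption and may differ between kernel versions.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


class RingTimeout(NamedTuple):
    ring: str
    signaled: int
    emitted: int

    @property
    def pending(self) -> int:
        # Submissions emitted to the ring but not yet signaled (interpretation).
        return self.emitted - self.signaled


def parse_ring_timeout(line: str) -> Optional[RingTimeout]:
    """Return the parsed timeout, or None if the line is not a ring timeout."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return RingTimeout(m["ring"], int(m["signaled"]), int(m["emitted"]))
```

Feeding it the output of `journalctl -k` line by line and grouping by `ring` would show whether a machine hits `gfx_low`, `gfx_high` or `gfx_0.0.0`, which differs between the reports above.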

Revision history for this message

In Linux Kernel Bug Tracker #201957, priit (priit-linux-kernel-bugs) wrote on 2023-08-24:

#94

AMD Vega 64 (vega10 chip)
kernel: 6.4.9

linux-firmware: 20230724

# graphical session died and had to log in again, computer didn't boot though...
aug 20 02:11:06 Zen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=368426139, emitted seq=368426141
aug 20 02:11:06 Zen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 414636 thread firefox:cs0 pid 414712

linux-firmware: 20230810 (upgraded it, although there were no "vega10" changes in between)

# just a freeze for about 30s, and then it got unstuck again.
aug 23 23:09:24 Zen kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:60:crtc-0] hw_done or flip_done timed out
aug 23 23:09:34 Zen kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:63:crtc-1] hw_done or flip_done timed out
aug 23 23:09:44 Zen kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:66:crtc-2] hw_done or flip_done timed out

Revision history for this message

In Linux Kernel Bug Tracker #201957, graham.oconnor (graham.oconnor-linux-kernel-bugs) wrote on 2023-09-21:

#95

AMD Ryzen 3700U APU (Vega 10)

This issue has recently started happening, mostly when firing up games or graphically intensive tasks. One case of lockup during normal desktop use.

Worked fine on 6.4.X series (currently running on 6.4.12). However, all kernels in the 6.5 series cause the following:

[ 112.727138] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=9861, emitted seq=9863
[ 112.728214] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xwayland pid 919 thread Xwayland:cs0 pid 928
[ 112.729270] amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
[ 112.885652] amdgpu 0000:04:00.0: amdgpu: MODE2 reset
[ 112.885709] amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 112.886024] [drm] PCIE GART of 1024M enabled.
[ 112.886027] [drm] PTB located at 0x000000F400A00000
[ 112.886143] [drm] PSP is resuming...
[ 112.906168] [drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
[ 112.985033] amdgpu 0000:04:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 112.992320] amdgpu 0000:04:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 113.733685] [drm] kiq ring mec 2 pipe 1 q 0
[ 113.998619] amdgpu 0000:04:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[ 113.999249] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
[ 113.999957] amdgpu 0000:04:00.0: amdgpu: GPU reset(2) failed
[ 114.000006] amdgpu 0000:04:00.0: amdgpu: GPU reset end with ret = -110
[ 114.000010] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -110

Revision history for this message

In Linux Kernel Bug Tracker #201957, kcohar (kcohar-linux-kernel-bugs) wrote on 2023-09-23:

#96

I can confirm this bug

Experiencing it on an AMD Ryzen 5 3500U (Vega 8), Fedora 39 beta, kernel 6.5.2.
Also on Arch (kernel 6.5.2).
No problems on Fedora 38 (kernel 6.2.x).

In my case it happens frequently with normal desktop use on Fedora and Arch.

Sep 23 03:39:34 jackdaw kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=10067, emitted seq=10069
Sep 23 03:39:34 jackdaw kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process nautilus pid 5981 thread nautilus:cs0 pid 6173
Sep 23 03:39:34 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
Sep 23 03:39:34 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: MODE2 reset
Sep 23 03:39:34 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset succeeded, trying to resume
Sep 23 03:39:34 jackdaw kernel: [drm] PCIE GART of 1024M enabled.
Sep 23 03:39:34 jackdaw kernel: [drm] PTB located at 0x000000F400A00000
Sep 23 03:39:34 jackdaw kernel: [drm] PSP is resuming...
Sep 23 03:39:34 jackdaw kernel: [drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
Sep 23 03:39:34 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
Sep 23 03:39:34 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
Sep 23 03:39:34 jackdaw kernel: [drm] kiq ring mec 2 pipe 1 q 0
Sep 23 03:39:35 jackdaw kernel: amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
Sep 23 03:39:35 jackdaw kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
Sep 23 03:39:35 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset(2) failed
Sep 23 03:39:35 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset end with ret = -110
Sep 23 03:39:35 jackdaw kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -110
Sep 23 03:39:35 jackdaw kernel: [drm] Skip scheduling IBs!
Sep 23 03:39:45 jackdaw kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_high timeout, signaled seq=9114, emitted seq=9116
Sep 23 03:39:45 jackdaw kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 2206 thread gnome-shel:cs0 pid 2258
Sep 23 03:39:45 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset begin!

Revision history for this message

In Linux Kernel Bug Tracker #201957, aros (aros-linux-kernel-bugs) wrote on 2023-09-30:

#97

AMDGPU development is on its own bug tracker:

https://gitlab.freedesktop.org/drm/amd/-/issues

If you're still affected, check for existing bug reports and if there are none, please repost over there.

Revision history for this message

In Linux Kernel Bug Tracker #201957, aspicer (aspicer-linux-kernel-bugs) wrote on 2023-09-30:

#98

I have also been having this issue. It started occurring recently (last 2-3 months). No other changes.

Mostly lockups while gaming (yuzu), one lockup because of chrome.

I was able to fix this issue by switching from HDMI to DP or DVI.

Revision history for this message

In Linux Kernel Bug Tracker #201957, kcohar (kcohar-linux-kernel-bugs) wrote on 2023-09-30:

#99

Created attachment 305165
attachment-27613-0.html

In my case the fix was adding amdgpu.mcbp=0 to the kernel parameters.

Revision history for this message

In Linux Kernel Bug Tracker #201957, aspicer (aspicer-linux-kernel-bugs) wrote on 2023-09-30:

#100

(In reply to KC from comment #94)

Did you have it set to 1 previously? If not, I'm not sure if that was the silver bullet, because it looks like it defaults to 0. https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html

mcbp (int)

It is used to enable mid command buffer preemption. (0 = disabled (default), 1 = enabled)

Revision history for this message

In Linux Kernel Bug Tracker #201957, kcohar (kcohar-linux-kernel-bugs) wrote on 2023-09-30:

#101

Created attachment 305166
attachment-16816-0.html

The default is now -1.
https://unix.stackexchange.com/questions/756281/kernel-6-5-2-seems-to-have-amdgpu-crash-on-no-retry-page-fault
https://www.kernel.org/doc/html/v6.5/gpu/amdgpu/module-parameters.html

I set it to zero and I haven't had a single crash since (Fedora 39 beta,
Linux 6.5.5).
The change of this one default had made my system entirely unusable (it would crash
very quickly after booting).
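
Since the default changed, it is worth checking whether amdgpu.mcbp is actually present on the running kernel's command line before concluding that the workaround does or does not help. The following is a minimal sketch; `mcbp_from_cmdline` is a hypothetical helper name, and keeping the last occurrence when the option appears more than once is an assumption of this sketch.

```python
from typing import Optional


def mcbp_from_cmdline(cmdline: str) -> Optional[int]:
    """Return the amdgpu.mcbp value found on a kernel command line.

    Returns None when the option is absent, which means the driver's
    built-in default applies.
    """
    value: Optional[int] = None
    for token in cmdline.split():
        key, sep, raw = token.partition("=")
        if key != "amdgpu.mcbp" or not sep:
            continue
        try:
            value = int(raw)  # assumption: the last occurrence wins
        except ValueError:
            continue  # ignore malformed values such as "amdgpu.mcbp=off"
    return value
```

On a live system the input would be the contents of /proc/cmdline. On most systems the effective value can also be read from /sys/module/amdgpu/parameters/mcbp once the module is loaded.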

Revision history for this message

Pirouette Cacahuète (lissyx) wrote on 2023-10-19:

#1

AlsaInfo.txt (91.4 KiB, text/plain; charset="utf-8")
AudioDevicesInUse.txt (669 bytes, text/plain; charset="utf-8")
CRDA.txt (5.8 KiB, text/plain; charset="utf-8")
CurrentDmesg.txt (156.1 KiB, text/plain; charset="utf-8")
Dependencies.txt (3.3 KiB, text/plain; charset="utf-8")
IwConfig.txt (733 bytes, text/plain; charset="utf-8")
Lspci.txt (84.9 KiB, text/plain; charset="utf-8")
Lspci-vt.txt (2.6 KiB, text/plain; charset="utf-8")
Lsusb.txt (1.5 KiB, text/plain; charset="utf-8")
Lsusb-t.txt (3.0 KiB, text/plain; charset="utf-8")
Lsusb-v.txt (143.6 KiB, text/plain; charset="utf-8")
ProcCpuinfo.txt (24.6 KiB, text/plain; charset="utf-8")
ProcCpuinfoMinimal.txt (1.5 KiB, text/plain; charset="utf-8")
ProcInterrupts.txt (23.2 KiB, text/plain; charset="utf-8")
ProcModules.txt (11.0 KiB, text/plain; charset="utf-8")
RfKill.txt (250 bytes, text/plain; charset="utf-8")
UdevDb.txt (454.8 KiB, text/plain; charset="utf-8")
WifiSyslog.txt (230.4 KiB, text/plain; charset="utf-8")
acpidump.txt (1.0 MiB, text/plain; charset="utf-8")

Revision history for this message

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2023-10-19: Status changed to Confirmed

#2

This change was made by a bot.

Changed in linux (Ubuntu):
status:	New → Confirmed

Revision history for this message

Pirouette Cacahuète (lissyx) wrote on 2023-10-19:

#3

https://gitlab.freedesktop.org/drm/amd/-/issues/2848#note_2108686

Revision history for this message

Erich Eickmeyer (eeickmeyer) wrote on 2023-10-19 (last edit on 2023-10-19):

#4

Working with Pirouette on IRC, we determined this may be related to https://bugzilla.kernel.org/show_bug.cgi?id=201957#c94 in which the solution, sadly, was to add amdgpu.mcbp=0 to the kernel boot parameters. Per that bug report, it does appear as though this might be the result of a regression in the 6.5 kernel as they did not experience this issue in prior kernels or Ubuntu 23.04.

They also found mentions of https://gitlab.freedesktop.org/drm/amd/-/issues/2848 where Kernel 6.6 has a fix which we could pull a patch from, and we might have a patch for mesa at https://gitlab.freedesktop.org/drm/amd/-/issues/2848#note_2095536.

Revision history for this message

Mario Limonciello (superm1) wrote on 2023-10-20:

#102

6.5.6 has the fix for the preemption issue; it should get fixed when stable updates come in Mantic.

Revision history for this message

Pirouette Cacahuète (lissyx) wrote on 2023-10-20:

#103

Thanks, I'll try and keep you updated, however I am also facing bug 2039958 (probably a dupe of bug 2034619), so I might still need GNOME 45.1 to be released.

Revision history for this message

In Linux Kernel Bug Tracker #201957, jer (jer-linux-kernel-bugs) wrote on 2023-10-21:

#104

Hello, I'm having this same issue with my ThinkPad Z16 laptop, Ryzen 6850H and Radeon RX 6500M graphics card.

I do not use the laptop for gaming but for audio and video editing. I have not had trouble with any video editing software, but I can easily reproduce the issue by loading up Ardour or Mixbus32C and either leaving it alone or working. After 15 minutes the screen freezes, although audio will continue for a time. At this point Ardour or Mixbus will close and I can continue using the machine. If I load up either program again it will fail again, usually within a couple of minutes, and the whole laptop will freeze up until I press Ctrl-Alt-F2 to get to a terminal prompt.

The issue always happens when I'm recording audio with an HDMI device attached, and 90% of the time without HDMI.

I will attempt to set this kernel parameter amdgpu.mcbp=0 and report back.

Revision history for this message

In Linux Kernel Bug Tracker #201957, jer (jer-linux-kernel-bugs) wrote on 2023-10-22:

#105

(In reply to jeremy boyd from comment #97)

I can confirm that setting amdgpu.mcbp=0 did not solve my problem. I tested my system for several hours with no issue and thought that perhaps it had been solved, but while doing a LibreOffice presentation with my audio software running it happened again. Here is the error from journalctl:

Oct 22 09:40:01 fedora kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=433823, emitted seq=433825
Oct 22 09:40:01 fedora kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 2189 thread Xorg:cs0 pid 2319
Oct 22 09:40:01 fedora kernel: amdgpu 0000:67:00.0: amdgpu: GPU reset begin!
Oct 22 09:40:02 fedora kernel: amdgpu 0000:67:00.0: amdgpu: MODE2 reset
Oct 22 09:40:02 fedora kernel: amdgpu 0000:67:00.0: amdgpu: GPU reset succeeded, trying to resume

Revision history for this message

In Linux Kernel Bug Tracker #201957, mario.limonciello (mario.limonciello-linux-kernel-bugs) wrote on 2023-10-23:

#106

#98

amdgpu.mcbp=0 will only help GFX9 products. For GFX10 this is a different problem; please open an issue at the AMD GitLab.
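
To tell which case a given machine falls into, the graphics IP generation can be read from the driver's own log lines, for example `<gfx_v9_0>` in the resume errors or `<gfx_v10_0>` in the "add ip block" lines quoted in this thread. The following is a minimal sketch under that assumption; `gfx_major` and `mcbp_workaround_applies` are hypothetical helper names, and the GFX9-only rule is taken from the comment above rather than from the driver source.

```python
import re
from typing import Optional

# Matches IP block names such as <gfx_v9_0> or <gfx_v10_0> in amdgpu log lines.
GFX_BLOCK_RE = re.compile(r"<gfx_v(\d+)_\d+")


def gfx_major(log_text: str) -> Optional[int]:
    """Return the major GFX IP version found in the log text, or None."""
    m = GFX_BLOCK_RE.search(log_text)
    return int(m.group(1)) if m else None


def mcbp_workaround_applies(log_text: str) -> Optional[bool]:
    """True for GFX9, False for other generations, None if no GFX block is found."""
    major = gfx_major(log_text)
    if major is None:
        return None
    return major == 9
```

Passing the output of `dmesg` to `mcbp_workaround_applies` gives a quick hint as to whether amdgpu.mcbp=0 is worth trying or whether a separate report is the better route.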

Revision history for this message

Launchpad Janitor (janitor) wrote on 2024-01-11:

#107

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in mesa (Ubuntu):
status:	New → Confirmed

Revision history for this message

Pirouette Cacahuète (lissyx) wrote on 2024-01-25:

#108

There's a 6.5.0-15 package incoming on mantic-update; does it contain the fix?

Revision history for this message

Timo Aaltonen (tjaalton) wrote on 2024-01-25:

#109

no, -17 does

Revision history for this message

In Linux Kernel Bug Tracker #201957, mastercatz (mastercatz-linux-kernel-bugs) wrote on 2024-06-04:

#110

I am pretty sure I have amdgpu.mcbp=0 set,

and after moving to Ubuntu 24.04 LTS, doing just about anything crashes the GPU.

Opening a web browser = crash, then I have to SSH in and restart the desktop session.

GL_VENDOR:     AMD
    GL_RENDERER:   AMD Radeon RX 6800 XT (radeonsi, navi21, LLVM 15.0.7, DRM 3.57, 6.8.0-31-generic)
    GL_VERSION:    4.6 (Compatibility Profile) Mesa 24.2~git2406010600.71d455~oibaf~j (git-71d455b 2024-06-01 jammy-oi

6.8.0-31-generic

[   26.417827] [drm] amdgpu kernel modesetting enabled.
[   26.431708] amdgpu: Virtual CRAT table created for CPU
[   26.431727] amdgpu: Topology: Add CPU node
[   26.431934] [drm] initializing kernel modesetting (SIENNA_CICHLID 0x1002:0x73BF 0x1043:0x04F0 0xC1).
[   26.431949] [drm] register mmio base: 0xFC900000
[   26.431951] [drm] register mmio size: 1048576
[   26.435975] [drm] add ip block number 0 <nv_common>
[   26.435978] [drm] add ip block number 1 <gmc_v10_0>
[   26.435980] [drm] add ip block number 2 <navi10_ih>
[   26.435982] [drm] add ip block number 3 <psp>
[   26.435983] [drm] add ip block number 4 <smu>
[   26.435985] [drm] add ip block number 5 <dm>
[   26.435986] [drm] add ip block number 6 <gfx_v10_0>
[   26.435988] [drm] add ip block number 7 <sdma_v5_2>
[   26.435990] [drm] add ip block number 8 <vcn_v3_0>
[   26.435996] [drm] add ip block number 9 <jpeg_v3_0>
[   26.436013] amdgpu 0000:0e:00.0: No more image in the PCI ROM
[   26.436028] amdgpu 0000:0e:00.0: amdgpu: Fetched VBIOS from ROM BAR
[   26.436031] amdgpu: ATOM BIOS: 115-D412BS0-101
[   26.473962] [drm] VCN(0) decode is enabled in VM mode
[   26.473965] [drm] VCN(1) decode is enabled in VM mode
[   26.473967] [drm] VCN(0) encode is enabled in VM mode
[   26.473968] [drm] VCN(1) encode is enabled in VM mode
[   26.477565] [drm] JPEG decode is enabled in VM mode
[   26.477596] amdgpu 0000:0e:00.0: vgaarb: deactivate vga console
[   26.478479] Console: switching to colour dummy device 80x25
[   26.478490] amdgpu 0000:0e:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[   26.478548] amdgpu 0000:0e:00.0: amdgpu: MEM ECC is not presented.
[   26.478550] amdgpu 0000:0e:00.0: amdgpu: SRAM ECC is not presented.
[   26.478570] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[   26.478577] amdgpu 0000:0e:00.0: amdgpu: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
[   26.478580] amdgpu 0000:0e:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[   26.478588] [drm] Detected VRAM RAM=16368M, BAR=256M
[   26.478589] [drm] RAM width 256bits GDDR6
[   26.478734] [drm] amdgpu: 16368M of VRAM memory ready
[   26.478739] [drm] amdgpu: 64363M of GTT memory ready.
[   26.478768] [drm] GART: num cpu pages 131072, num gpu pages 131072
[   26.478919] [drm] PCIE GART of 512M enabled (table at 0x0000008000900000).
[   27.968739] amdgpu 0000:0e:00.0: amdgpu: STB initialized to 2048 entries
[   27.969354] [drm] Loading DMUB firmware via PSP: version=0x02020020
[   27.969777] [drm] use_doorbell being set to: [true]
[   27.969791] [drm] use_doorbell being set to: [true]
[   27.969803] [drm] use_doorbell being set to: [true]
[   27.969815] [drm] use_doorbell being set to: [true]
[   27.969830] [drm] Found VCN firmware Version ENC: 1.30 DEC: 3 VEP: 0 Revision: 4
[   27.969842] amdgpu 0000:0e:00.0: amdgpu: Will use PSP to load VCN firmware
[   28.036225] [drm] reserve 0xa00000 from 0x83fd000000 for PSP TMR
[   28.184762] amdgpu 0000:0e:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[   28.184784] amdgpu 0000:0e:00.0: amdgpu: smu driver if version = 0x00000040, smu fw if version = 0x00000041, smu fw program = 0, version = 0x003a5900 (58.89.0)
[   28.184788] amdgpu 0000:0e:00.0: amdgpu: SMU driver if version not matched
[   28.184816] amdgpu 0000:0e:00.0: amdgpu: use vbios provided pptable
[   28.257551] amdgpu 0000:0e:00.0: amdgpu: SMU is initialized successfully!
[   28.257835] [drm] Display Core v3.2.266 initialized on DCN 3.0
[   28.257837] [drm] DP-HDMI FRL PCON supported
[   28.259090] [drm] DMUB hardware initialized: version=0x02020020
[   28.261811] snd_hda_intel 0000:0e:00.1: bound 0000:0e:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[   28.390798] [drm] kiq ring mec 2 pipe 1 q 0
[   28.398526] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[   28.398700] [drm] JPEG decode initialized successfully.
[   28.471332] amdgpu: HMM registered 16368MB device memory
[   28.473409] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[   28.473425] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[   28.473606] amdgpu: Virtual CRAT table created for GPU
[   28.474183] amdgpu: Topology: Add dGPU node [0x73bf:0x1002]
[   28.474186] kfd kfd: amdgpu: added device 1002:73bf
[   28.474214] amdgpu 0000:0e:00.0: amdgpu: SE 4, SH per SE 2, CU per SH 10, active_cu_number 72

[   28.474238] amdgpu 0000:0e:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 
[   28.475274] amdgpu 0000:0e:00.0: amdgpu: Using BACO for runtime pm
[   28.476312] [drm] Initialized amdgpu 3.57.0 20150101 for 0000:0e:00.0 on minor 0
[   28.495327] fbcon: amdgpudrmfb (fb0) is primary device

[ 1823.317612] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)
[ 1823.317622] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x000080010e57a000 from client 0x1b (UTCL2)
[ 1823.317626] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
[ 1823.317628] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
[ 1823.317631] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[ 1823.317633] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 1823.317635] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[ 1823.317637] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 1823.317639] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[ 1823.317644] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)
[ 1823.317648] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x000080010e569000 from client 0x1b (UTCL2)
[ 1823.317651] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
[ 1823.317653] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
[ 1823.317655] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[ 1823.317657] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 1823.317659] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[ 1823.317661] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 1823.317663] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[ 1823.317668] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)
[ 1823.317672] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x000080010e568000 from client 0x1b (UTCL2)
[ 1823.317674] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
[ 1823.317676] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
[ 1823.317679] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[ 1823.317681] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 1823.317683] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[ 1823.317685] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 1823.317687] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[ 1823.317692] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)
[ 1823.317695] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x000080010e578000 from client 0x1b (UTCL2)
[ 1823.317697] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
[ 1823.317700] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
[ 1823.317702] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[ 1823.317704] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 1823.317706] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[ 1823.317708] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 1823.317710] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[ 1823.317715] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)
[ 1823.317718] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x000080010e56d000 from client 0x1b (UTCL2)
[ 1823.317721] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 1823.317723] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[ 1823.317725] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[ 1823.317727] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 1823.317729] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[ 1823.317731] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 1823.317733] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[ 1823.317738] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)
[ 1823.317742] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x000080010e57c000 from client 0x1b (UTCL2)
[ 1823.317744] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 1823.317746] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[ 1823.317748] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[ 1823.317750] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 1823.317752] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[ 1823.317754] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 1823.317756] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[ 1823.317761] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)
[ 1823.317765] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x000080010e579000 from client 0x1b (UTCL2)
[ 1823.317767] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 1823.317769] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[ 1823.317771] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[ 1823.317773] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 1823.317775] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[ 1823.317777] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 1823.317779] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[ 1823.317784] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)
[ 1823.317788] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x000080010e56a000 from client 0x1b (UTCL2)
[ 1823.317790] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 1823.317792] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[ 1823.317794] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[ 1823.317796] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 1823.317799] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[ 1823.317801] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 1823.317803] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[ 1823.317809] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)
[ 1823.317812] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x000080010e595000 from client 0x1b (UTCL2)
[ 1823.317814] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 1823.317816] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)

[ 1823.317841] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[ 1823.317843] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 1823.317845] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[ 1823.317847] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 1823.317849] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[ 1833.613761] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
[ 1946.888252] kauditd_printk_skb: 113 callbacks suppressed
[ 1946.888257] audit: type=1326 audit(1717405290.252:120): auid=1000 uid=1000 gid=1000 ses=2 pid=145934 comm="firefox" exe="/snap/firefox/1075/usr/lib/firefox/firefox" sig=0 arch=c000003e syscall=314 compat=0 ip=0x74c6a555489d code=0x50000

[ 3461.404320] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)
[ 3461.404337] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x000080011516d000 from client 0x1b (UTCL2)
[ 3461.404344] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
[ 3461.404350] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
[ 3461.404354] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[ 3461.404359] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[ 3461.404363] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[ 3461.404368] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[ 3461.404372] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[ 3461.404381] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)

[41374.040191] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x0000800101561000 from client 0x1b (UTCL2)
[41374.040199] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
[41374.040204] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
[41374.040208] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[41374.040213] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[41374.040217] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[41374.040220] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[41374.040224] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[41374.040232] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)
[41374.040239] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x0000800101563000 from client 0x1b (UTCL2)
[41374.040244] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
[41374.040249] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
[41374.040253] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[41374.040258] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[41374.040262] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[41374.040266] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[41374.040269] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[41374.040277] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)
[41374.040283] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x0000800101563000 from client 0x1b (UTCL2)
[41374.040288] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
[41374.040292] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
[41374.040296] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[41374.040300] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[41374.040304] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[41374.040308] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[41374.040311] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[41374.040320] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)
[41374.040326] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x000080010159d000 from client 0x1b (UTCL2)
[41374.040331] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[41374.040335] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[41374.040339] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[41374.040343] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[41374.040347] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[41374.040351] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[41374.040355] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[41374.040362] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)
[41374.040368] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x000080010159c000 from client 0x1b (UTCL2)
[41374.040373] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[41374.040378] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[41374.040382] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[41374.040386] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[41374.040390] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[41374.040394] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[41374.040397] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[41374.040405] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)
[41374.040411] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x000080010159d000 from client 0x1b (UTCL2)
[41374.040416] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[41374.040420] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[41374.040424] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[41374.040427] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[41374.040431] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[41374.040435] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[41374.040440] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[41374.040447] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)
[41374.040454] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x0000800101565000 from client 0x1b (UTCL2)
[41374.040458] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[41374.040462] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[41374.040466] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[41374.040470] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[41374.040474] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[41374.040478] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[41374.040481] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[41374.040489] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)
[41374.040495] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x000080010159c000 from client 0x1b (UTCL2)
[41374.040501] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[41374.040505] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[41374.040509] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[41374.040513] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[41374.040517] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[41374.040520] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[41374.040524] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[41374.040531] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)
[41374.040538] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x000080010159e000 from client 0x1b (UTCL2)
[41374.040542] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[41374.040546] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[41374.040550] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[41374.040554] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[41374.040558] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[41374.040563] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[41374.040567] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[41374.040574] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32781, for process opera pid 137805 thread opera:cs0 pid 137824)
[41374.040580] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x00008001015ce000 from client 0x1b (UTCL2)
[41374.040585] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[41374.040589] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[41374.040593] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[41374.040596] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[41374.040600] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[41374.040604] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[41374.040608] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[41384.250697] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
[41389.531399] show_signal_msg: 117 callbacks suppressed
[41389.531402] GpuWatchdog[137840]: segfault at 0 ip 000055f397dc977a sp 00007e1c3ddff490 error 6 in opera[55f393dee000+663c000] likely on CPU 8 (core 10, socket 0)
[41389.531415] Code: 3d c9 52 63 fb be 01 00 00 00 ba 07 00 00 00 e8 ec 1f b5 fe 48 8d 3d 1f 62 64 fb be 01 00 00 00 ba 03 00 00 00 e8 d6 1f b5 fe <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 e4 3b e3 02 01 80 bd 7f ff
[41394.500673] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=22350749, emitted seq=22350752
[41394.501241] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[41394.501783] amdgpu 0000:0e:00.0: amdgpu: GPU reset begin!
[41394.831305] amdgpu 0000:0e:00.0: amdgpu: MODE1 reset
[41394.831316] amdgpu 0000:0e:00.0: amdgpu: GPU mode1 reset
[41394.831409] amdgpu 0000:0e:00.0: amdgpu: GPU smu mode1 reset
[41395.338820] amdgpu 0000:0e:00.0: amdgpu: GPU reset succeeded, trying to resume
[41395.339691] [drm] PCIE GART of 512M enabled (table at 0x0000008000900000).
[41395.339815] [drm] VRAM is lost due to GPU reset!
[41395.339818] [drm] PSP is resuming...
[41395.419395] [drm] reserve 0xa00000 from 0x83fd000000 for PSP TMR
[41395.560285] amdgpu 0000:0e:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[41395.560296] amdgpu 0000:0e:00.0: amdgpu: SMU is resuming...
[41395.560303] amdgpu 0000:0e:00.0: amdgpu: smu driver if version = 0x00000040, smu fw if version = 0x00000041, smu fw program = 0, version = 0x003a5900 (58.89.0)
[41395.560310] amdgpu 0000:0e:00.0: amdgpu: SMU driver if version not matched
[41395.560342] amdgpu 0000:0e:00.0: amdgpu: use vbios provided pptable
[41395.637776] amdgpu 0000:0e:00.0: amdgpu: SMU is resumed successfully!
[41395.639052] [drm] DMUB hardware initialized: version=0x02020020
[41395.988694] [drm] kiq ring mec 2 pipe 1 q 0
[41395.995849] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[41395.996030] [drm] JPEG decode initialized successfully.
[41395.996047] amdgpu 0000:0e:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[41395.996053] amdgpu 0000:0e:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[41395.996057] amdgpu 0000:0e:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[41395.996060] amdgpu 0000:0e:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[41395.996064] amdgpu 0000:0e:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[41395.996067] amdgpu 0000:0e:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[41395.996071] amdgpu 0000:0e:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[41395.996075] amdgpu 0000:0e:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[41395.996079] amdgpu 0000:0e:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[41395.996083] amdgpu 0000:0e:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[41395.996087] amdgpu 0000:0e:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[41395.996090] amdgpu 0000:0e:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[41395.996094] amdgpu 0000:0e:00.0: amdgpu: ring sdma2 uses VM inv eng 14 on hub 0
[41395.996097] amdgpu 0000:0e:00.0: amdgpu: ring sdma3 uses VM inv eng 15 on hub 0
[41395.996101] amdgpu 0000:0e:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[41395.996104] amdgpu 0000:0e:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[41395.996108] amdgpu 0000:0e:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[41395.996111] amdgpu 0000:0e:00.0: amdgpu: ring vcn_dec_1 uses VM inv eng 5 on hub 8
[41395.996114] amdgpu 0000:0e:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 6 on hub 8
[41395.996118] amdgpu 0000:0e:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv eng 7 on hub 8
[41395.996121] amdgpu 0000:0e:00.0: amdgpu: ring jpeg_dec uses VM inv eng 8 on hub 8
[41396.004859] amdgpu 0000:0e:00.0: amdgpu: recover vram bo from shadow start
[41396.040181] amdgpu 0000:0e:00.0: amdgpu: recover vram bo from shadow done
[41396.040221] [drm] Skip scheduling IBs!
[41396.040225] amdgpu 0000:0e:00.0: amdgpu: GPU reset(6) succeeded!

[41396.040387] [drm] Skip scheduling IBs!
...
[41396.043200] [drm] Skip scheduling IBs!
[42183.025204] amdgpu 0000:0e:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:4 pasid:32775, for process opera pid 2915753 thread opera:cs0 pid 2915831)
[42183.025217] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address

ring:24 vmid:6 pasid:32775, for process opera pid 2968669 thread opera:cs0 pid 2968686)
[81468.642520] amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x0000800125f90000 from client 0x1b (UTCL2)
[81468.642525] amdgpu 0000:0e:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[81468.642528] amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB/DB (0x0)
[81468.642532] amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[81468.642536] amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[81468.642540] amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[81468.642544] amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[81468.642547] amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
[81479.122186] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
[81759.945649] kauditd_printk_skb: 61 callbacks suppressed
[81759.945654] audit: type=1326 audit(1717485104.012:384): auid=1000 uid=1000 gid=1000 ses=2 pid=830708 comm="firefox" exe="/snap/firefox/1075/usr/lib/firefox/firefox" sig=0 arch=c000003e syscall=314 compat=0 ip=0x79433d64289d code=0x50000
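
For anyone reading these logs, the per-field lines printed after each GCVM_L2_PROTECTION_FAULT_STATUS value can be reproduced from the raw register value. The following is a minimal sketch, not driver code: the bit positions in FIELDS are an assumption about the GFX10 register layout, checked only against the values printed in the log above (0x00701031 gives CID 0x8, MORE_FAULTS 0x1, PERMISSION_FAULTS 0x3, and VMID 7, matching "vmid:7"). The function name decode_fault_status is hypothetical.

```python
# Assumed bit layout of GCVM_L2_PROTECTION_FAULT_STATUS: (shift, width).
# This layout is an assumption, validated only against the log excerpt above.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),    # printed in the log as "Faulty UTCL2 client ID"
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status):
    """Split a raw fault status value into its named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Status value taken from the log above.
    for name, value in decode_fault_status(0x00701031).items():
        print(f"{name}: {value:#x}")
```

Under this layout, a status of 0x00000000 decodes to all-zero fields, so the "CB/DB (0x0)" client shown for those entries is just CID 0.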

In Linux Kernel Bug Tracker #201957, mario.limonciello (mario.limonciello-linux-kernel-bugs) wrote on 2024-06-04:

#111

#100:

You have a GFX10 product, which is not affected by amdgpu.mcbp=0/1; that setting only matters for GFX9. Please open your own issue for it. Also, in the kernel bug tracker please only report issues with mainline kernels; 6.8 is already EoL.

In Linux Kernel Bug Tracker #201957, mastercatz (mastercatz-linux-kernel-bugs) wrote on 2024-06-05:

#112

The issue seems to occur only with Xorg; I used Wayland today and could not trigger it.

In Linux Kernel Bug Tracker #201957, mastercatz (mastercatz-linux-kernel-bugs) wrote on 2024-06-05:

#113

Kernel 6.9.3 also crashed.

	Status	Importance	Assigned to
Linux	Unknown	Unknown	linux-kernel-bugs #201957
linux (Ubuntu)	Confirmed	Undecided	Unassigned
mesa (Ubuntu)	Confirmed	Undecided	Unassigned

