[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout

Bug #1782716 reported by Joel Stanley
This bug affects 3 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Running the 4.17.0-5-generic kernel on a ppc64le machine with a Radeon R9 Fury GPU:

0033:01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev ff)

[ 2361.958847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=8777, last emitted seq=8778
[ 2362.080397] EEH: Frozen PHB#33-PE#0 detected
[ 2362.080470] EEH: PE location: CPU2 Slot1 (16x), PHB location: N/A
[ 2362.080568] CPU: 53 PID: 874 Comm: kworker/53:1 Not tainted 4.17.0-5-generic #6-Ubuntu
[ 2362.080575] Workqueue: events drm_sched_job_timedout [gpu_sched]
[ 2362.080577] Call Trace:
[ 2362.080584] [c0000000fb7078f0] [c000000000d275ac] dump_stack+0xb0/0xf4 (unreliable)
[ 2362.080590] [c0000000fb707930] [c00000000003ba0c] eeh_dev_check_failure+0x5bc/0x5e0
[ 2362.080593] [c0000000fb7079e0] [c00000000003babc] eeh_check_failure+0x8c/0xd0
[ 2362.080628] [c0000000fb707a20] [c00800000cfa1b88] amdgpu_mm_rreg+0x280/0x2a0 [amdgpu]
[ 2362.080676] [c0000000fb707a70] [c00800000d04cf68] gmc_v8_0_check_soft_reset+0x30/0xe0 [amdgpu]
[ 2362.080711] [c0000000fb707aa0] [c00800000cfa1194] amdgpu_device_ip_check_soft_reset.part.1+0x8c/0x140 [amdgpu]
[ 2362.080745] [c0000000fb707b30] [c00800000cfa649c] amdgpu_device_gpu_recover+0x854/0xa40 [amdgpu]
[ 2362.080799] [c0000000fb707c00] [c00800000d0b97a4] amdgpu_job_timedout+0x5c/0x80 [amdgpu]
[ 2362.080805] [c0000000fb707c70] [c00800000c8f0040] drm_sched_job_timedout+0x38/0x60 [gpu_sched]
[ 2362.080810] [c0000000fb707c90] [c000000000137928] process_one_work+0x298/0x580
[ 2362.080813] [c0000000fb707d20] [c000000000137c98] worker_thread+0x88/0x610
[ 2362.080817] [c0000000fb707dc0] [c000000000140958] kthread+0x1a8/0x1b0
[ 2362.080822] [c0000000fb707e30] [c00000000000b658] ret_from_kernel_thread+0x5c/0x84
[ 2362.080827] [drm] IP block:gmc_v8_0 is hung!
[ 2362.080832] [drm] IP block:tonga_ih is hung!
[ 2362.080843] [drm] IP block:gfx_v8_0 is hung!
[ 2362.080845] EEH: Detected PCI bus error on PHB#33-PE#0
[ 2362.080847] EEH: This PCI device has failed 1 times in the last hour
[ 2362.080849] EEH: Notify device drivers to shutdown
[ 2362.080850] [drm] IP block:sdma_v3_0 is hung!
[ 2362.080856] [drm] IP block:uvd_v6_0 is hung!
[ 2362.080858] EEH: Collect temporary log
[ 2362.080866] [drm] IP block:vce_v3_0 is hung!
[ 2362.080867] [drm] GPU recovery disabled.
[ 2362.080903] EEH: of node=0033:01:00.1
[ 2362.080905] EEH: PCI device/vendor: ffffffff
[ 2362.080907] EEH: PCI cmd/status register: ffffffff
[ 2362.080908] EEH: PCI-E capabilities and status follow:
[ 2362.080915] EEH: PCI-E 00: ffffffff ffffffff ffffffff ffffffff
[ 2362.080920] EEH: PCI-E 10: ffffffff ffffffff ffffffff ffffffff
[ 2362.080921] EEH: PCI-E 20: ffffffff
[ 2362.080922] EEH: PCI-E AER capability register set follows:
[ 2362.080928] EEH: PCI-E AER 00: ffffffff ffffffff ffffffff ffffffff
[ 2362.080933] EEH: PCI-E AER 10: ffffffff ffffffff ffffffff ffffffff
[ 2362.080938] EEH: PCI-E AER 20: ffffffff ffffffff ffffffff ffffffff
[ 2362.080940] EEH: PCI-E AER 30: ffffffff ffffffff
[ 2362.080941] EEH: of node=0033:01:00.0
[ 2362.080943] EEH: PCI device/vendor: ffffffff
[ 2362.080945] EEH: PCI cmd/status register: ffffffff
[ 2362.080945] EEH: PCI-E capabilities and status follow:
[ 2362.080951] EEH: PCI-E 00: ffffffff ffffffff ffffffff ffffffff
[ 2362.080956] EEH: PCI-E 10: ffffffff ffffffff ffffffff ffffffff
[ 2362.080957] EEH: PCI-E 20: ffffffff
[ 2362.080958] EEH: PCI-E AER capability register set follows:
[ 2362.080964] EEH: PCI-E AER 00: ffffffff ffffffff ffffffff ffffffff
[ 2362.080969] EEH: PCI-E AER 10: ffffffff ffffffff ffffffff ffffffff
[ 2362.080974] EEH: PCI-E AER 20: ffffffff ffffffff ffffffff ffffffff
[ 2362.080975] EEH: PCI-E AER 30: ffffffff ffffffff
[ 2362.080977] PHB4 PHB#51 Diag-data (Version: 1)
[ 2362.080978] brdgCtl: 00000002
[ 2362.080979] RootSts: 00060020 00402000 c1010008 00100107 00000000
[ 2362.080980] RootErrSts: 00000000 00000020 00000000
[ 2362.080981] PhbSts: 0000001c00000000 0000001c00000000
[ 2362.080982] Lem: 0000000100000000 0000000000000000 0000000100000000
[ 2362.080983] PhbErr: 000000c000000000 0000008000000000 2148000098000240 a008400000000000
[ 2362.080984] RegbErr: 0090000000000000 0010000000000000 4800003c00000000 0000000000000200
[ 2362.080985] PE[000] A/B: 8000000000000000 8000000000000000
[ 2362.080987] PE[..1fe] A/B: as above
[ 2362.080988] PE[1ff] A/B: b740002a01000000 8000000000000000
[ 2362.080988] EEH: Reset with hotplug activity
[ 2362.579139] iommu: Removing device 0033:01:00.1 from group 3
[ 2362.579206] pci 0033:01:00.1: Dropping the link to 0033:01:00.0
[ 2362.579665] [drm] amdgpu: finishing device.
[ 2363.495059] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, last signaled seq=8052, last emitted seq=8054
[ 2363.495192] [drm] IP block:gmc_v8_0 is hung!
[ 2363.495197] [drm] IP block:tonga_ih is hung!
[ 2363.495208] [drm] IP block:gfx_v8_0 is hung!
[ 2363.495212] [drm] IP block:sdma_v3_0 is hung!
[ 2363.495217] [drm] IP block:uvd_v6_0 is hung!
[ 2363.495225] [drm] IP block:vce_v3_0 is hung!
[ 2363.495226] [drm] GPU recovery disabled.
[ 2372.712463] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:43:crtc-0] hw_done or flip_done timed out
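The two timeout lines in the log above each carry a pair of fence sequence numbers. As a rough aid for reading them, here is a minimal Python sketch that extracts the ring name and both sequence numbers and reports their difference. The function name `parse_ring_timeouts` and the reading of the difference as "jobs emitted but not yet signaled" are assumptions made for illustration; they are not taken from the amdgpu driver.

```python
import re

# Matches the amdgpu timeout lines quoted in this report.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(log_text):
    """Return (ring, signaled, emitted, pending) for each timeout line.

    'pending' is simply emitted - signaled; interpreting it as the number
    of outstanding jobs is an assumption of this sketch.
    """
    results = []
    for match in RING_TIMEOUT.finditer(log_text):
        signaled = int(match["signaled"])
        emitted = int(match["emitted"])
        results.append((match["ring"], signaled, emitted, emitted - signaled))
    return results


# Two lines copied from the dmesg output above.
SAMPLE = """\
[ 2361.958847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=8777, last emitted seq=8778
[ 2363.495059] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, last signaled seq=8052, last emitted seq=8054
"""

if __name__ == "__main__":
    for ring, signaled, emitted, pending in parse_ring_timeouts(SAMPLE):
        print(f"{ring}: signaled={signaled} emitted={emitted} pending={pending}")
```

Feeding it the full output of `dmesg` instead of `SAMPLE` should list every ring that timed out in that boot.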

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1782716

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: cosmic
Joel Stanley (shenki) wrote :

After that, it fails to recover:

[ 2372.712463] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:43:crtc-0] hw_done or flip_done timed out
[ 2538.367847] INFO: task kworker/u257:2:8785 blocked for more than 120 seconds.
[ 2538.367917] Not tainted 4.17.0-5-generic #6-Ubuntu
[ 2538.367968] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2538.368053] kworker/u257:2 D 0 8785 2 0x00000800
[ 2538.368067] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 2538.368069] Call Trace:
[ 2538.368072] [c000000fcfb33460] [0000001400000000] 0x1400000000 (unreliable)
[ 2538.368078] [c000000fcfb33630] [c00000000001cd3c] __switch_to+0x2ec/0x4c0
[ 2538.368081] [c000000fcfb33690] [c000000000d42550] __schedule+0x330/0xa90
[ 2538.368083] [c000000fcfb33760] [c000000000d42cf0] schedule+0x40/0xc0
[ 2538.368086] [c000000fcfb33780] [c000000000d47b88] schedule_timeout+0x258/0x4f0
[ 2538.368090] [c000000fcfb33880] [c000000000923b90] dma_fence_default_wait+0x2b0/0x370
[ 2538.368093] [c000000fcfb338f0] [c000000000922f64] dma_fence_wait_timeout+0x74/0x190
[ 2538.368096] [c000000fcfb33930] [c000000000925fc0] reservation_object_wait_timeout_rcu+0x2f0/0x3e0
[ 2538.368141] [c000000fcfb339b0] [c00800000d10d108] amdgpu_dm_do_flip+0x130/0x3b0 [amdgpu]
[ 2538.368184] [c000000fcfb33b00] [c00800000d1113c8] amdgpu_dm_atomic_commit_tail+0xcb0/0xf90 [amdgpu]
[ 2538.368191] [c000000fcfb33c60] [c00800000c762b94] commit_tail+0x6c/0xe0 [drm_kms_helper]
[ 2538.368194] [c000000fcfb33c90] [c000000000137928] process_one_work+0x298/0x580
[ 2538.368197] [c000000fcfb33d20] [c000000000137c98] worker_thread+0x88/0x610
[ 2538.368200] [c000000fcfb33dc0] [c000000000140958] kthread+0x1a8/0x1b0
[ 2538.368203] [c000000fcfb33e30] [c00000000000b658] ret_from_kernel_thread+0x5c/0x84
[ 2659.214902] INFO: task kworker/u257:2:8785 blocked for more than 120 seconds.
[ 2659.214976] Not tainted 4.17.0-5-generic #6-Ubuntu
[ 2659.215019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2659.215126] kworker/u257:2 D 0 8785 2 0x00000800
[ 2659.215141] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 2659.215142] Call Trace:
[ 2659.215145] [c000000fcfb33460] [0000001400000000] 0x1400000000 (unreliable)
[ 2659.215151] [c000000fcfb33630] [c00000000001cd3c] __switch_to+0x2ec/0x4c0
[ 2659.215154] [c000000fcfb33690] [c000000000d42550] __schedule+0x330/0xa90
[ 2659.215157] [c000000fcfb33760] [c000000000d42cf0] schedule+0x40/0xc0
[ 2659.215160] [c000000fcfb33780] [c000000000d47b88] schedule_timeout+0x258/0x4f0
[ 2659.215163] [c000000fcfb33880] [c000000000923b90] dma_fence_default_wait+0x2b0/0x370
[ 2659.215166] [c000000fcfb338f0] [c000000000922f64] dma_fence_wait_timeout+0x74/0x190
[ 2659.215169] [c000000fcfb33930] [c000000000925fc0] reservation_object_wait_timeout_rcu+0x2f0/0x3e0
[ 2659.215217] [c000000fcfb339b0] [c00800000d10d108] amdgpu_dm_do_flip+0x130/0x3b0 [amdgpu]
[ 2659.215264] [c000000fcfb33b00] [c00800000d1113c8] amdgpu_dm_atomic_commit_tail+0xcb0/0xf90 [amdgpu]
[ 2659.215272] [c000000fcfb33c60] [c00800000c762b94] commit_tail+0x6c/0xe0 [drm_kms_helper]
[ 2659.215275] [c000000fcfb33c90] [c000000000137928] process_on...

Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.18 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.18-rc6

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key
Joel Stanley (shenki) wrote :

With upstream kernels I get this (and a frozen desktop):

[ 2604.488694] [drm] dce_get_required_clocks_state: clocks unsupported disp_clk 681000 pix_clk 154000
[ 2634.551719] cfg80211: Loading compiled-in X.509 certificates for regulatory database
[ 2634.554170] cfg80211: Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
[ 3060.974388] [drm] dce_get_required_clocks_state: clocks unsupported disp_clk 681000 pix_clk 154000
[ 3510.632708] [drm] dce_get_required_clocks_state: clocks unsupported disp_clk 681000 pix_clk 154000
[ 3527.956089] [drm] dce_get_required_clocks_state: clocks unsupported disp_clk 681000 pix_clk 154000
[ 4992.501324] [drm] dce_get_required_clocks_state: clocks unsupported disp_clk 681000 pix_clk 154000
[ 5015.179529] [drm] dce_get_required_clocks_state: clocks unsupported disp_clk 681000 pix_clk 154000
[ 5189.342133] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=4657, last emitted seq=4658
[ 5189.342233] [drm] GPU recovery disabled.
[ 5317.867388] INFO: task kworker/u257:3:54387 blocked for more than 120 seconds.
[ 5317.867471] Not tainted 4.18.0-041800rc6-generic #201807221830
[ 5317.867548] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5317.867656] kworker/u257:3 D 0 54387 2 0x00000808
[ 5317.867675] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 5317.867677] Call Trace:
[ 5317.867680] [c000000fe3447460] [0000002a00000000] 0x2a00000000 (unreliable)
[ 5317.867688] [c000000fe3447630] [c00000000001c430] __switch_to+0x260/0x4c0
[ 5317.867694] [c000000fe3447690] [c000000000d67b44] __schedule+0x304/0xad0
[ 5317.867697] [c000000fe3447760] [c000000000d68358] schedule+0x48/0xc0
[ 5317.867701] [c000000fe3447780] [c000000000d6d1b8] schedule_timeout+0x348/0x510
[ 5317.867707] [c000000fe3447880] [c000000000928b60] dma_fence_default_wait+0x2b0/0x350
[ 5317.867710] [c000000fe34478f0] [c00000000092780c] dma_fence_wait_timeout+0x6c/0x1b0
[ 5317.867714] [c000000fe3447930] [c00000000092aeb0] reservation_object_wait_timeout_rcu+0x320/0x3d0
[ 5317.867774] [c000000fe34479b0] [c00800000d5fc220] amdgpu_dm_do_flip+0x138/0x3b0 [amdgpu]
[ 5317.867831] [c000000fe3447b00] [c00800000d6001a0] amdgpu_dm_atomic_commit_tail+0x7f8/0xf20 [amdgpu]
[ 5317.867840] [c000000fe3447c60] [c00800000cb72da4] commit_tail+0x6c/0xe0 [drm_kms_helper]
[ 5317.867846] [c000000fe3447c90] [c000000000138720] process_one_work+0x2b0/0x560
[ 5317.867850] [c000000fe3447d20] [c000000000138a58] worker_thread+0x88/0x610
[ 5317.867854] [c000000fe3447dc0] [c0000000001416fc] kthread+0x1ac/0x1c0
[ 5317.867859] [c000000fe3447e30] [c00000000000b65c] ret_from_kernel_thread+0x5c/0x80
[ 5438.711397] INFO: task kworker/u257:3:54387 blocked for more than 120 seconds.
[ 5438.711473] Not tainted 4.18.0-041800rc6-generic #201807221830
[ 5438.711552] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

If I kill the Wayland session:

[ 7012.419912] EEH: Frozen PHB#33-PE#0 detected
[ 7012.419919] EEH: PE location: CPU2 Slot1 (16x), PHB location: N/A
[ 7012.419923] CPU: 74 PID: 126541 Comm: pulseaudio Not tainted 4.18.0-041800rc6-generic #201807221830
[ 7012.419924] Call Trace:
[ 7012.41...

Joel Stanley (shenki) wrote :

With -rc7:

[ 333.596521] EEH: PHB#33 failure detected, location: N/A
[ 333.596563] CPU: 12 PID: 811 Comm: kworker/u257:1 Not tainted 4.18.0-041800rc7-generic #201807292230
[ 333.596576] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 333.596578] Call Trace:
[ 333.596582] [c000000fec1036c0] [c000000000d4d6fc] dump_stack+0xb0/0xf4 (unreliable)
[ 333.596587] [c000000fec103700] [c00000000003b114] eeh_dev_check_failure+0x514/0x5e0
[ 333.596589] [c000000fec1037a0] [c00000000003b26c] eeh_check_failure+0x8c/0xd0
[ 333.596616] [c000000fec1037e0] [c00800000d5119f8] amdgpu_mm_rreg+0x240/0x2a0 [amdgpu]
[ 333.596649] [c000000fec103840] [c00800000d623250] amdgpu_cgs_read_register+0x28/0x50 [amdgpu]
[ 333.596685] [c000000fec103860] [c00800000d6ce81c] dce110_timing_generator_get_vblank_counter+0x44/0x70 [amdgpu]
[ 333.596717] [c000000fec103880] [c00800000d6f4430] dc_stream_get_vblank_counter+0x88/0xb0 [amdgpu]
[ 333.596752] [c000000fec1038a0] [c00800000d67f5f4] dm_vblank_get_counter+0x4c/0xa8 [amdgpu]
[ 333.596774] [c000000fec103900] [c00800000d518630] amdgpu_get_vblank_counter_kms+0xa8/0x250 [amdgpu]
[ 333.596808] [c000000fec1039b0] [c00800000d67c1b8] amdgpu_dm_do_flip+0xd0/0x3b0 [amdgpu]
[ 333.596844] [c000000fec103b00] [c00800000d6801a0] amdgpu_dm_atomic_commit_tail+0x7f8/0xf20 [amdgpu]
[ 333.596850] [c000000fec103c60] [c00800000cb72da4] commit_tail+0x6c/0xe0 [drm_kms_helper]
[ 333.596853] [c000000fec103c90] [c0000000001385f0] process_one_work+0x2b0/0x560
[ 333.596855] [c000000fec103d20] [c000000000138928] worker_thread+0x88/0x610
[ 333.596858] [c000000fec103dc0] [c0000000001415dc] kthread+0x1ac/0x1c0
[ 333.596861] [c000000fec103e30] [c00000000000b65c] ret_from_kernel_thread+0x5c/0x80
[ 333.596886] EEH: Detected error on PHB#33
[ 333.596890] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[ 333.596891] EEH: Notify device drivers to shutdown
[ 333.596895] EEH: Beginning: 'error_detected(IO frozen)'
[ 333.596898] EEH: PE#1fe (PCI 0033:00:00.0): no driver
[ 333.596900] EEH: PE#0 (PCI 0033:01:00.1): driver not EEH aware
[ 333.596902] EEH: PE#0 (PCI 0033:01:00.0): driver not EEH aware
[ 333.596904] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'none'
[ 333.596908] EEH: Collect temporary log
[ 333.596910] PHB4 PHB#51 Diag-data (Version: 1)
[ 333.596911] brdgCtl: 00000002
[ 333.596913] RootSts: 00000020 00402000 e9010008 00100107 00000000
[ 333.596915] nFir: 0000800000000000 0030001c00000000 0000800000000000
[ 333.596916] PhbSts: 0000001800000000 0000001800000000
[ 333.596918] Lem: 0004000100000100 0000000000000000 0000000100000000
[ 333.596919] PhbErr: 000005a000000000 0000008000000000 2148000098000240 a008400000000000
[ 333.596921] PhbTxeErr: 0000200000000000 0000200000000000 4000000000000000 0000000000000000
[ 333.596923] RxeMrgErr: 0000000000000001 0000000000000001 0000000000000000 0000000000000000
[ 333.596925] PblErr: 0000000000000800 0000000000000800 0000000000000000 00000000028de410
[ 333.596926] RegbErr: 0010001000000000 0010000000000000 4800003c00000000 0000000000000200
[ 333.596929...
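The EEH lines in the -rc7 log above report which devices in the frozen PE have a driver that is "not EEH aware". A minimal Python sketch for pulling those devices out of a saved log follows. The function name `devices_without_eeh_support` is hypothetical, and the pattern is based only on the line format quoted in this report, so other kernels may word the message differently.

```python
import re

# Pattern based on the "driver not EEH aware" lines quoted in this report.
NOT_AWARE = re.compile(
    r"EEH: PE#(?P<pe>[0-9a-f]+) \(PCI (?P<dev>[0-9a-f:.]+)\): driver not EEH aware"
)


def devices_without_eeh_support(log_text):
    """Return (pe, pci_address) pairs for drivers reported as not EEH aware."""
    return [(m["pe"], m["dev"]) for m in NOT_AWARE.finditer(log_text)]


# Three lines copied from the dmesg output above.
SAMPLE = """\
[ 333.596898] EEH: PE#1fe (PCI 0033:00:00.0): no driver
[ 333.596900] EEH: PE#0 (PCI 0033:01:00.1): driver not EEH aware
[ 333.596902] EEH: PE#0 (PCI 0033:01:00.0): driver not EEH aware
"""

if __name__ == "__main__":
    for pe, dev in devices_without_eeh_support(SAMPLE):
        print(f"PE#{pe}: {dev}")
```

In this report both functions of the Fiji card (0033:01:00.0 and 0033:01:00.1) are listed, while the "no driver" line is deliberately not matched.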

Joel Stanley (shenki) wrote :

Reproduced with 4.17-rc7

tags: added: kernel-bug-exists-upstream
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Alexandr Kara (alexandr-kara) wrote :

Reproduced on Fedora (with kernel 4.18.16-300.fc29.x86_64).

Jakub Paś (jakubpas) wrote :

I have a similar error on a MateBook D (Ryzen + AMD GPU):

Linux matebook 4.18.0-25-generic #26~18.04.1-Ubuntu SMP Thu Jun 27 07:28:31 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

The same happens on kernel 4.15. It basically causes the whole system to hang, and I need to physically restart it:

Aug 21 10:41:44 matebook kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=760524, last emitted seq=760524
Aug 21 10:41:44 matebook kernel: [drm] GPU recovery disabled.
Aug 21 10:41:57 matebook kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [indicator-cpufr:4128]

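Every log in this report that shows a ring timeout is followed by "GPU recovery disabled." A minimal Python sketch for checking how that setting is configured on a running system is given below. It assumes that the amdgpu module exposes a `gpu_recovery` parameter at `/sys/module/amdgpu/parameters/gpu_recovery` and that the values mean -1 = auto, 0 = disabled, 1 = enabled; both are assumptions to verify against your kernel, and the function name `describe_gpu_recovery` is hypothetical.

```python
from pathlib import Path

# Assumed sysfs location of the amdgpu module parameter; verify on your kernel.
PARAM_PATH = Path("/sys/module/amdgpu/parameters/gpu_recovery")

# Assumed meanings of the parameter values; treat as a guide only.
MEANINGS = {-1: "auto", 0: "disabled", 1: "enabled"}


def describe_gpu_recovery(path=PARAM_PATH):
    """Return a short description of the gpu_recovery setting found at path."""
    try:
        raw = Path(path).read_text().strip()
    except OSError:
        return "unknown (parameter not readable; is amdgpu loaded?)"
    try:
        value = int(raw)
    except ValueError:
        return f"unrecognised value {raw!r}"
    return MEANINGS.get(value, f"unrecognised value {value}")


if __name__ == "__main__":
    print("amdgpu gpu_recovery:", describe_gpu_recovery())
```

Whether enabling recovery would change the outcome of the hangs described here is not established anywhere in this report.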
Paul Graydon (twirrim) wrote :

Seeing the same on Ubuntu 18.04.3 with the HWE kernel, 5.0.0-25-generic

AMD A6-9225 RADEON R4

This appears to be tied to problems resuming the laptop from suspend (black screen, flashing cursor).
