Ubuntu
linux package

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout

Bug #1782716 reported by Joel Stanley on 2018-07-20

This bug affects 3 people

Affects		Status	Importance	Assigned to	Milestone
	linux (Ubuntu)	Confirmed	Medium	Unassigned

Bug Description

Running the 4.17.0-5-generic kernel on a ppc64le machine with a Radeon R9 Fury GPU:

0033:01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev ff)

[ 2361.958847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=8777, last emitted seq=8778
[ 2362.080397] EEH: Frozen PHB#33-PE#0 detected
[ 2362.080470] EEH: PE location: CPU2 Slot1 (16x), PHB location: N/A
[ 2362.080568] CPU: 53 PID: 874 Comm: kworker/53:1 Not tainted 4.17.0-5-generic #6-Ubuntu
[ 2362.080575] Workqueue: events drm_sched_job_timedout [gpu_sched]
[ 2362.080577] Call Trace:
[ 2362.080584] [c0000000fb7078f0] [c000000000d275ac] dump_stack+0xb0/0xf4 (unreliable)
[ 2362.080590] [c0000000fb707930] [c00000000003ba0c] eeh_dev_check_failure+0x5bc/0x5e0
[ 2362.080593] [c0000000fb7079e0] [c00000000003babc] eeh_check_failure+0x8c/0xd0
[ 2362.080628] [c0000000fb707a20] [c00800000cfa1b88] amdgpu_mm_rreg+0x280/0x2a0 [amdgpu]
[ 2362.080676] [c0000000fb707a70] [c00800000d04cf68] gmc_v8_0_check_soft_reset+0x30/0xe0 [amdgpu]
[ 2362.080711] [c0000000fb707aa0] [c00800000cfa1194] amdgpu_device_ip_check_soft_reset.part.1+0x8c/0x140 [amdgpu]
[ 2362.080745] [c0000000fb707b30] [c00800000cfa649c] amdgpu_device_gpu_recover+0x854/0xa40 [amdgpu]
[ 2362.080799] [c0000000fb707c00] [c00800000d0b97a4] amdgpu_job_timedout+0x5c/0x80 [amdgpu]
[ 2362.080805] [c0000000fb707c70] [c00800000c8f0040] drm_sched_job_timedout+0x38/0x60 [gpu_sched]
[ 2362.080810] [c0000000fb707c90] [c000000000137928] process_one_work+0x298/0x580
[ 2362.080813] [c0000000fb707d20] [c000000000137c98] worker_thread+0x88/0x610
[ 2362.080817] [c0000000fb707dc0] [c000000000140958] kthread+0x1a8/0x1b0
[ 2362.080822] [c0000000fb707e30] [c00000000000b658] ret_from_kernel_thread+0x5c/0x84
[ 2362.080827] [drm] IP block:gmc_v8_0 is hung!
[ 2362.080832] [drm] IP block:tonga_ih is hung!
[ 2362.080843] [drm] IP block:gfx_v8_0 is hung!
[ 2362.080845] EEH: Detected PCI bus error on PHB#33-PE#0
[ 2362.080847] EEH: This PCI device has failed 1 times in the last hour
[ 2362.080849] EEH: Notify device drivers to shutdown
[ 2362.080850] [drm] IP block:sdma_v3_0 is hung!
[ 2362.080856] [drm] IP block:uvd_v6_0 is hung!
[ 2362.080858] EEH: Collect temporary log
[ 2362.080866] [drm] IP block:vce_v3_0 is hung!
[ 2362.080867] [drm] GPU recovery disabled.
[ 2362.080903] EEH: of node=0033:01:00.1
[ 2362.080905] EEH: PCI device/vendor: ffffffff
[ 2362.080907] EEH: PCI cmd/status register: ffffffff
[ 2362.080908] EEH: PCI-E capabilities and status follow:
[ 2362.080915] EEH: PCI-E 00: ffffffff ffffffff ffffffff ffffffff
[ 2362.080920] EEH: PCI-E 10: ffffffff ffffffff ffffffff ffffffff
[ 2362.080921] EEH: PCI-E 20: ffffffff
[ 2362.080922] EEH: PCI-E AER capability register set follows:
[ 2362.080928] EEH: PCI-E AER 00: ffffffff ffffffff ffffffff ffffffff
[ 2362.080933] EEH: PCI-E AER 10: ffffffff ffffffff ffffffff ffffffff
[ 2362.080938] EEH: PCI-E AER 20: ffffffff ffffffff ffffffff ffffffff
[ 2362.080940] EEH: PCI-E AER 30: ffffffff ffffffff
[ 2362.080941] EEH: of node=0033:01:00.0
[ 2362.080943] EEH: PCI device/vendor: ffffffff
[ 2362.080945] EEH: PCI cmd/status register: ffffffff
[ 2362.080945] EEH: PCI-E capabilities and status follow:
[ 2362.080951] EEH: PCI-E 00: ffffffff ffffffff ffffffff ffffffff
[ 2362.080956] EEH: PCI-E 10: ffffffff ffffffff ffffffff ffffffff
[ 2362.080957] EEH: PCI-E 20: ffffffff
[ 2362.080958] EEH: PCI-E AER capability register set follows:
[ 2362.080964] EEH: PCI-E AER 00: ffffffff ffffffff ffffffff ffffffff
[ 2362.080969] EEH: PCI-E AER 10: ffffffff ffffffff ffffffff ffffffff
[ 2362.080974] EEH: PCI-E AER 20: ffffffff ffffffff ffffffff ffffffff
[ 2362.080975] EEH: PCI-E AER 30: ffffffff ffffffff
[ 2362.080977] PHB4 PHB#51 Diag-data (Version: 1)
[ 2362.080978] brdgCtl: 00000002
[ 2362.080979] RootSts: 00060020 00402000 c1010008 00100107 00000000
[ 2362.080980] RootErrSts: 00000000 00000020 00000000
[ 2362.080981] PhbSts: 0000001c00000000 0000001c00000000
[ 2362.080982] Lem: 0000000100000000 0000000000000000 0000000100000000
[ 2362.080983] PhbErr: 000000c000000000 0000008000000000 2148000098000240 a008400000000000
[ 2362.080984] RegbErr: 0090000000000000 0010000000000000 4800003c00000000 0000000000000200
[ 2362.080985] PE[000] A/B: 8000000000000000 8000000000000000
[ 2362.080987] PE[..1fe] A/B: as above
[ 2362.080988] PE[1ff] A/B: b740002a01000000 8000000000000000
[ 2362.080988] EEH: Reset with hotplug activity
[ 2362.579139] iommu: Removing device 0033:01:00.1 from group 3
[ 2362.579206] pci 0033:01:00.1: Dropping the link to 0033:01:00.0
[ 2362.579665] [drm] amdgpu: finishing device.
[ 2363.495059] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, last signaled seq=8052, last emitted seq=8054
[ 2363.495192] [drm] IP block:gmc_v8_0 is hung!
[ 2363.495197] [drm] IP block:tonga_ih is hung!
[ 2363.495208] [drm] IP block:gfx_v8_0 is hung!
[ 2363.495212] [drm] IP block:sdma_v3_0 is hung!
[ 2363.495217] [drm] IP block:uvd_v6_0 is hung!
[ 2363.495225] [drm] IP block:vce_v3_0 is hung!
[ 2363.495226] [drm] GPU recovery disabled.
[ 2372.712463] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:43:crtc-0] hw_done or flip_done timed out
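Each "ring ... timeout" line reports the last fence sequence number seen as signaled on that ring and the last one emitted, so the difference is the number of fences (roughly, submissions) still outstanding when the scheduler gave up. The following is a minimal Python sketch for pulling those values out of a dmesg capture. It assumes the exact message format shown in this log; the regex, the RingTimeout type and parse_ring_timeouts are hypothetical names for illustration, not part of amdgpu or any Ubuntu tooling.

```python
import re
from typing import List, NamedTuple

# Hypothetical pattern, assuming the message format in this report, e.g.
# [ 2361.958847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=8777, last emitted seq=8778
RING_TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\].*\*ERROR\* ring (?P<ring>\S+) timeout, "
    r"last signaled seq=(?P<signaled>\d+), last emitted seq=(?P<emitted>\d+)"
)


class RingTimeout(NamedTuple):
    timestamp: float  # seconds since boot, as printed by dmesg
    ring: str         # ring name, e.g. "gfx" or "sdma1"
    signaled: int     # last fence sequence number reported as signaled
    emitted: int      # last fence sequence number reported as emitted

    @property
    def pending(self) -> int:
        """Fences emitted but not yet signaled when the timeout fired."""
        return self.emitted - self.signaled


def parse_ring_timeouts(log: str) -> List[RingTimeout]:
    """Return every amdgpu ring timeout found in a dmesg capture, in order."""
    return [
        RingTimeout(float(m["ts"]), m["ring"], int(m["signaled"]), int(m["emitted"]))
        for m in RING_TIMEOUT_RE.finditer(log)
    ]


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names): python ring_timeouts.py < dmesg.txt
    for t in parse_ring_timeouts(sys.stdin.read()):
        print(f"[{t.timestamp:.6f}] ring {t.ring}: {t.pending} fence(s) pending")
```

On the two timeout lines above, this gives 1 pending fence on gfx (8778 - 8777) and 2 on sdma1 (8054 - 8052).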

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2018-07-20: Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel), please enter the following command in a terminal window:

apport-collect 1782716

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete
tags:	added: cosmic

Joel Stanley (shenki) wrote on 2018-07-20:

After that, it fails to recover:

[ 2372.712463] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:43:crtc-0] hw_done or flip_done timed out
[ 2538.367847] INFO: task kworker/u257:2:8785 blocked for more than 120 seconds.
[ 2538.367917]       Not tainted 4.17.0-5-generic #6-Ubuntu
[ 2538.367968] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2538.368053] kworker/u257:2  D    0  8785      2 0x00000800
[ 2538.368067] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 2538.368069] Call Trace:
[ 2538.368072] [c000000fcfb33460] [0000001400000000] 0x1400000000 (unreliable)
[ 2538.368078] [c000000fcfb33630] [c00000000001cd3c] __switch_to+0x2ec/0x4c0
[ 2538.368081] [c000000fcfb33690] [c000000000d42550] __schedule+0x330/0xa90
[ 2538.368083] [c000000fcfb33760] [c000000000d42cf0] schedule+0x40/0xc0
[ 2538.368086] [c000000fcfb33780] [c000000000d47b88] schedule_timeout+0x258/0x4f0
[ 2538.368090] [c000000fcfb33880] [c000000000923b90] dma_fence_default_wait+0x2b0/0x370
[ 2538.368093] [c000000fcfb338f0] [c000000000922f64] dma_fence_wait_timeout+0x74/0x190
[ 2538.368096] [c000000fcfb33930] [c000000000925fc0] reservation_object_wait_timeout_rcu+0x2f0/0x3e0
[ 2538.368141] [c000000fcfb339b0] [c00800000d10d108] amdgpu_dm_do_flip+0x130/0x3b0 [amdgpu]
[ 2538.368184] [c000000fcfb33b00] [c00800000d1113c8] amdgpu_dm_atomic_commit_tail+0xcb0/0xf90 [amdgpu]
[ 2538.368191] [c000000fcfb33c60] [c00800000c762b94] commit_tail+0x6c/0xe0 [drm_kms_helper]
[ 2538.368194] [c000000fcfb33c90] [c000000000137928] process_one_work+0x298/0x580
[ 2538.368197] [c000000fcfb33d20] [c000000000137c98] worker_thread+0x88/0x610
[ 2538.368200] [c000000fcfb33dc0] [c000000000140958] kthread+0x1a8/0x1b0
[ 2538.368203] [c000000fcfb33e30] [c00000000000b658] ret_from_kernel_thread+0x5c/0x84
[ 2659.214902] INFO: task kworker/u257:2:8785 blocked for more than 120 seconds.
[ 2659.214976]       Not tainted 4.17.0-5-generic #6-Ubuntu
[ 2659.215019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2659.215126] kworker/u257:2  D    0  8785      2 0x00000800
[ 2659.215141] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 2659.215142] Call Trace:
[ 2659.215145] [c000000fcfb33460] [0000001400000000] 0x1400000000 (unreliable)
[ 2659.215151] [c000000fcfb33630] [c00000000001cd3c] __switch_to+0x2ec/0x4c0
[ 2659.215154] [c000000fcfb33690] [c000000000d42550] __schedule+0x330/0xa90
[ 2659.215157] [c000000fcfb33760] [c000000000d42cf0] schedule+0x40/0xc0
[ 2659.215160] [c000000fcfb33780] [c000000000d47b88] schedule_timeout+0x258/0x4f0
[ 2659.215163] [c000000fcfb33880] [c000000000923b90] dma_fence_default_wait+0x2b0/0x370
[ 2659.215166] [c000000fcfb338f0] [c000000000922f64] dma_fence_wait_timeout+0x74/0x190
[ 2659.215169] [c000000fcfb33930] [c000000000925fc0] reservation_object_wait_timeout_rcu+0x2f0/0x3e0
[ 2659.215217] [c000000fcfb339b0] [c00800000d10d108] amdgpu_dm_do_flip+0x130/0x3b0 [amdgpu]
[ 2659.215264] [c000000fcfb33b00] [c00800000d1113c8] amdgpu_dm_atomic_commit_tail+0xcb0/0xf90 [amdgpu]
[ 2659.215272] [c000000fcfb33c60] [c00800000c762b94] commit_tail+0x6c/0xe0 [drm_kms_helper]
[ 2659.215275] [c000000fcfb33c90] [c000000000137928] process_one_work+0x298/0x580
[ 2659.215278] [c000000fcfb33d20] [c000000000137c98] worker_thread+0x88/0x610
[ 2659.215281] [c000000fcfb33dc0] [c000000000140958] kthread+0x1a8/0x1b0
[ 2659.215284] [c000000fcfb33e30] [c00000000000b658] ret_from_kernel_thread+0x5c/0x84
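The same worker (kworker/u257:2:8785) is reported again about two minutes later, each time parked in dma_fence_wait_timeout under amdgpu_dm_do_flip. That pattern suggests the page flip is waiting on a fence that never signals once GPU recovery is disabled. A minimal Python sketch, assuming the hung-task message format shown here, that groups these reports by task and measures the interval between them; hung_task_reports and report_gaps are hypothetical helper names.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Hypothetical pattern for the hung-task detector's report line, e.g.
# [ 2538.367847] INFO: task kworker/u257:2:8785 blocked for more than 120 seconds.
HUNG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] INFO: task (?P<task>\S+) "
    r"blocked for more than \d+ seconds\."
)


def hung_task_reports(log: str) -> Dict[str, List[float]]:
    """Map each reported "comm:pid" to the timestamps at which it was reported."""
    reports: Dict[str, List[float]] = defaultdict(list)
    for m in HUNG_RE.finditer(log):
        reports[m["task"]].append(float(m["ts"]))
    return dict(reports)


def report_gaps(timestamps: List[float]) -> List[float]:
    """Seconds between consecutive reports of the same task."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names): python hung_tasks.py < dmesg.txt
    for task, stamps in hung_task_reports(sys.stdin.read()).items():
        gaps = ", ".join(f"{g:.1f}s" for g in report_gaps(stamps))
        print(f"{task}: {len(stamps)} report(s), gaps: {gaps or 'n/a'}")
```

For the two reports above, report_gaps yields a single gap of about 120.85 s, consistent with the 120 second threshold printed in the message.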

Joseph Salisbury (jsalisbury) wrote on 2018-07-24:

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.18 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag: 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.18-rc6

Changed in linux (Ubuntu):
importance:	Undecided → Medium
tags:	added: kernel-da-key

Joel Stanley (shenki) wrote on 2018-07-30:

With upstream kernels I get this (and a frozen desktop):

[ 2604.488694] [drm] dce_get_required_clocks_state: clocks unsupported disp_clk 681000 pix_clk 154000
[ 2634.551719] cfg80211: Loading compiled-in X.509 certificates for regulatory database
[ 2634.554170] cfg80211: Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
[ 3060.974388] [drm] dce_get_required_clocks_state: clocks unsupported disp_clk 681000 pix_clk 154000
[ 3510.632708] [drm] dce_get_required_clocks_state: clocks unsupported disp_clk 681000 pix_clk 154000
[ 3527.956089] [drm] dce_get_required_clocks_state: clocks unsupported disp_clk 681000 pix_clk 154000
[ 4992.501324] [drm] dce_get_required_clocks_state: clocks unsupported disp_clk 681000 pix_clk 154000
[ 5015.179529] [drm] dce_get_required_clocks_state: clocks unsupported disp_clk 681000 pix_clk 154000
[ 5189.342133] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=4657, last emitted seq=4658
[ 5189.342233] [drm] GPU recovery disabled.
[ 5317.867388] INFO: task kworker/u257:3:54387 blocked for more than 120 seconds.
[ 5317.867471]       Not tainted 4.18.0-041800rc6-generic #201807221830
[ 5317.867548] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5317.867656] kworker/u257:3  D    0 54387      2 0x00000808
[ 5317.867675] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 5317.867677] Call Trace:
[ 5317.867680] [c000000fe3447460] [0000002a00000000] 0x2a00000000 (unreliable)
[ 5317.867688] [c000000fe3447630] [c00000000001c430] __switch_to+0x260/0x4c0
[ 5317.867694] [c000000fe3447690] [c000000000d67b44] __schedule+0x304/0xad0
[ 5317.867697] [c000000fe3447760] [c000000000d68358] schedule+0x48/0xc0
[ 5317.867701] [c000000fe3447780] [c000000000d6d1b8] schedule_timeout+0x348/0x510
[ 5317.867707] [c000000fe3447880] [c000000000928b60] dma_fence_default_wait+0x2b0/0x350
[ 5317.867710] [c000000fe34478f0] [c00000000092780c] dma_fence_wait_timeout+0x6c/0x1b0
[ 5317.867714] [c000000fe3447930] [c00000000092aeb0] reservation_object_wait_timeout_rcu+0x320/0x3d0
[ 5317.867774] [c000000fe34479b0] [c00800000d5fc220] amdgpu_dm_do_flip+0x138/0x3b0 [amdgpu]
[ 5317.867831] [c000000fe3447b00] [c00800000d6001a0] amdgpu_dm_atomic_commit_tail+0x7f8/0xf20 [amdgpu]
[ 5317.867840] [c000000fe3447c60] [c00800000cb72da4] commit_tail+0x6c/0xe0 [drm_kms_helper]
[ 5317.867846] [c000000fe3447c90] [c000000000138720] process_one_work+0x2b0/0x560
[ 5317.867850] [c000000fe3447d20] [c000000000138a58] worker_thread+0x88/0x610
[ 5317.867854] [c000000fe3447dc0] [c0000000001416fc] kthread+0x1ac/0x1c0
[ 5317.867859] [c000000fe3447e30] [c00000000000b65c] ret_from_kernel_thread+0x5c/0x80
[ 5438.711397] INFO: task kworker/u257:3:54387 blocked for more than 120 seconds.
[ 5438.711473]       Not tainted 4.18.0-041800rc6-generic #201807221830
[ 5438.711552] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

If I kill the Wayland session:

[ 7012.419912] EEH: Frozen PHB#33-PE#0 detected
[ 7012.419919] EEH: PE location: CPU2 Slot1 (16x), PHB location: N/A
[ 7012.419923] CPU: 74 PID: 126541 Comm: pulseaudio Not tainted 4.18.0-041800rc6-generic #201807221830
[ 7012.419924] Call Trace:
[ 7012.419932] [c000200b36333300] [c000000000d4ce3c] dump_stack+0xb0/0xf4 (unreliable)
[ 7012.419936] [c000200b36333340] [c00000000003b0ac] eeh_dev_check_failure+0x4ac/0x5e0
[ 7012.419938] [c000200b363333e0] [c00000000003b26c] eeh_check_failure+0x8c/0xd0
[ 7012.419945] [c000200b36333420] [c008000016342ae8] pci_azx_readw+0x80/0xb0 [snd_hda_intel]
[ 7012.419950] [c000200b36333450] [c0080000161c5790] snd_hdac_bus_send_cmd+0x78/0x210 [snd_hda_core]
[ 7012.419956] [c000200b363334a0] [c0080000162a20ec] azx_send_cmd+0x34/0x390 [snd_hda_codec]
[ 7012.419959] [c000200b36333530] [c0080000161c0274] snd_hdac_bus_exec_verb_unlocked+0x7c/0x280 [snd_hda_core]
[ 7012.419964] [c000200b36333590] [c00800001629240c] codec_exec_verb+0xb4/0x1f0 [snd_hda_codec]
[ 7012.419967] [c000200b36333630] [c0080000161c1a10] snd_hdac_exec_verb+0x38/0x90 [snd_hda_core]
[ 7012.419971] [c000200b36333650] [c0080000161c4158] hda_reg_write+0x120/0x3b0 [snd_hda_core]
[ 7012.419974] [c000200b363336c0] [c0000000008c87e8] _regmap_write+0x98/0x190
[ 7012.419977] [c000200b36333710] [c0000000008ca5b4] regmap_write+0x74/0xc0
[ 7012.419981] [c000200b36333750] [c0080000161c47e4] snd_hdac_regmap_write_raw+0x4c/0x130 [snd_hda_core]
[ 7012.419985] [c000200b36333790] [c008000016485d80] hdmi_pcm_open+0x168/0x4a0 [snd_hda_codec_hdmi]
[ 7012.419989] [c000200b36333820] [c0080000162a12e8] azx_pcm_open+0x1b0/0x3d0 [snd_hda_codec]
[ 7012.419995] [c000200b36333890] [c0080000160ab3dc] snd_pcm_open_substream+0xb4/0x1a0 [snd_pcm]
[ 7012.419998] [c000200b36333920] [c0080000160ab5d4] snd_pcm_open+0x10c/0x2e0 [snd_pcm]
[ 7012.420002] [c000200b363339b0] [c0080000160ab8c4] snd_pcm_playback_open+0x6c/0xa8 [snd_pcm]
[ 7012.420008] [c000200b363339f0] [c00800000f9c0750] snd_open+0x108/0x240 [snd]
[ 7012.420011] [c000200b36333a90] [c000000000401ee8] chrdev_open+0x128/0x270
[ 7012.420015] [c000200b36333af0] [c0000000003f4f10] do_dentry_open+0x1e0/0x450
[ 7012.420017] [c000200b36333b50] [c0000000004123e8] do_last+0x318/0xa40
[ 7012.420018] [c000200b36333c00] [c000000000412c04] path_openat+0xf4/0x3f0
[ 7012.420020] [c000200b36333c80] [c0000000004147b0] do_filp_open+0x80/0x100
[ 7012.420022] [c000200b36333db0] [c0000000003f7268] do_sys_open+0x228/0x2f0
[ 7012.420025] [c000200b36333e30] [c00000000000b288] system_call+0x5c/0x70
[ 7012.420055] EEH: Detected PCI bus error on PHB#33-PE#0
[ 7012.420059] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[ 7012.420063] EEH: Notify device drivers to shutdown
[ 7012.420072] EEH: Beginning: 'error_detected(IO frozen)'
[ 7012.420102] EEH: PE#0 (PCI 0033:01:00.1): driver not EEH aware
[ 7012.420104] EEH: PE#0 (PCI 0033:01:00.0): driver not EEH aware
[ 7012.420106] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'none'
[ 7012.420116] EEH: Collect temporary log
[ 7012.420163] EEH: of node=0033:01:00.1
[ 7012.420166] EEH: PCI device/vendor: ffffffff
[ 7012.420168] EEH: PCI cmd/status register: ffffffff
[ 7012.420170] EEH: PCI-E capabilities and status follow:
[ 7012.420179] EEH: PCI-E 00: ffffffff ffffffff ffffffff ffffffff 
[ 7012.420187] EEH: PCI-E 10: ffffffff ffffffff ffffffff ffffffff 
[ 7012.420188] EEH: PCI-E 20: ffffffff 
[ 7012.420189] EEH: PCI-E AER capability register set follows:
[ 7012.420197] EEH: PCI-E AER 00: ffffffff ffffffff ffffffff ffffffff 
[ 7012.420204] EEH: PCI-E AER 10: ffffffff ffffffff ffffffff ffffffff 
[ 7012.420211] EEH: PCI-E AER 20: ffffffff ffffffff ffffffff ffffffff 
[ 7012.420214] EEH: PCI-E AER 30: ffffffff ffffffff 
[ 7012.420216] EEH: of node=0033:01:00.0
[ 7012.420218] EEH: PCI device/vendor: ffffffff
[ 7012.420220] EEH: PCI cmd/status register: ffffffff
[ 7012.420221] EEH: PCI-E capabilities and status follow:
[ 7012.420229] EEH: PCI-E 00: ffffffff ffffffff ffffffff ffffffff 
[ 7012.420236] EEH: PCI-E 10: ffffffff ffffffff ffffffff ffffffff 
[ 7012.420237] EEH: PCI-E 20: ffffffff 
[ 7012.420238] EEH: PCI-E AER capability register set follows:
[ 7012.420246] EEH: PCI-E AER 00: ffffffff ffffffff ffffffff ffffffff 
[ 7012.420253] EEH: PCI-E AER 10: ffffffff ffffffff ffffffff ffffffff 
[ 7012.420261] EEH: PCI-E AER 20: ffffffff ffffffff ffffffff ffffffff 
[ 7012.420267] EEH: PCI-E AER 30: ffffffff ffffffff 
[ 7012.420270] PHB4 PHB#51 Diag-data (Version: 1)
[ 7012.420271] brdgCtl:    00000002
[ 7012.420273] RootSts:    00060020 00402000 c1010008 00100107 00000000
[ 7012.420274] RootErrSts: 00000000 00000020 00000000
[ 7012.420276] PhbSts:     0000001c00000000 0000001c00000000
[ 7012.420277] Lem:        0000000100000000 0000000000000000 0000000100000000
[ 7012.420278] PhbErr:     000000c000000000 0000008000000000 2148000098000240 a008400000000000
[ 7012.420280] RegbErr:    0090000000000000 0010000000000000 4800003c00000000 0000000000000200
[ 7012.420282] PE[000] A/B: 8000000000000000 8000000000000000
[ 7012.420285] PE[..1fe] A/B: as above
[ 7012.420286] PE[1ff] A/B: b740002a01000000 8000000000000000
[ 7012.420287] EEH: Reset with hotplug activity
[ 7012.817635] iommu: Removing device 0033:01:00.1 from group 3
[ 7012.817682] pci 0033:01:00.1: Dropping the link to 0033:01:00.0
[ 7012.818009] [drm] amdgpu: finishing device.
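Both EEH events start from an ordinary register read (amdgpu_mm_rreg in the first trace, pci_azx_readw here) that ends up in eeh_check_failure, and every config-space word in the "EEH: PCI-E" and "EEH: PCI-E AER" dumps reads ffffffff. Reads from a frozen PE return all ones, so these dumps most likely say that 0033:01:00.0 and 0033:01:00.1 were unreachable rather than reporting specific error bits. A minimal Python sketch, assuming the dump layout in this log, that flags such nodes and reads the remaining failure budget from the "permanently disabled after N failures" message; all_ones_nodes and failures_before_disable are hypothetical names.

```python
import re
from typing import Dict, List, Optional

NODE_RE = re.compile(r"EEH: of node=(?P<node>\S+)")
# Register rows such as "EEH: PCI-E AER 10: ffffffff ffffffff ffffffff ffffffff"
ROW_RE = re.compile(r"EEH: PCI-E (?:AER )?[0-9a-f]{2}:(?P<words>(?:\s+[0-9a-f]{8})+)")
BUDGET_RE = re.compile(
    r"failed (?P<n>\d+) times in the last hour and will be "
    r"permanently disabled after (?P<limit>\d+) failures"
)


def all_ones_nodes(log: str) -> List[str]:
    """Return the PCI nodes whose dumped PCI-E/AER words are all 0xffffffff."""
    words: Dict[str, List[int]] = {}
    current: Optional[str] = None
    for line in log.splitlines():
        node = NODE_RE.search(line)
        if node:
            current = node["node"]
            words[current] = []
            continue
        row = ROW_RE.search(line)
        if row and current is not None:
            words[current].extend(int(w, 16) for w in row["words"].split())
    return [n for n, ws in words.items() if ws and all(w == 0xFFFFFFFF for w in ws)]


def failures_before_disable(log: str) -> Optional[int]:
    """Failures left before EEH permanently disables the device, per the last report.

    Returns None when the log only has the shorter message without a limit.
    """
    matches = list(BUDGET_RE.finditer(log))
    if not matches:
        return None
    last = matches[-1]
    return int(last["limit"]) - int(last["n"])
```

On the dump above, both nodes come back as all ones and failures_before_disable returns 4. The 4.17 log prints only "failed 1 times in the last hour", so the function returns None there.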

Joel Stanley (shenki) wrote on 2018-07-30:

With -rc7:

[ 333.596521] EEH: PHB#33 failure detected, location: N/A
[ 333.596563] CPU: 12 PID: 811 Comm: kworker/u257:1 Not tainted 4.18.0-041800rc7-generic #201807292230
[ 333.596576] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 333.596578] Call Trace:
[ 333.596582] [c000000fec1036c0] [c000000000d4d6fc] dump_stack+0xb0/0xf4 (unreliable)
[ 333.596587] [c000000fec103700] [c00000000003b114] eeh_dev_check_failure+0x514/0x5e0
[ 333.596589] [c000000fec1037a0] [c00000000003b26c] eeh_check_failure+0x8c/0xd0
[ 333.596616] [c000000fec1037e0] [c00800000d5119f8] amdgpu_mm_rreg+0x240/0x2a0 [amdgpu]
[ 333.596649] [c000000fec103840] [c00800000d623250] amdgpu_cgs_read_register+0x28/0x50 [amdgpu]
[ 333.596685] [c000000fec103860] [c00800000d6ce81c] dce110_timing_generator_get_vblank_counter+0x44/0x70 [amdgpu]
[ 333.596717] [c000000fec103880] [c00800000d6f4430] dc_stream_get_vblank_counter+0x88/0xb0 [amdgpu]
[ 333.596752] [c000000fec1038a0] [c00800000d67f5f4] dm_vblank_get_counter+0x4c/0xa8 [amdgpu]
[ 333.596774] [c000000fec103900] [c00800000d518630] amdgpu_get_vblank_counter_kms+0xa8/0x250 [amdgpu]
[ 333.596808] [c000000fec1039b0] [c00800000d67c1b8] amdgpu_dm_do_flip+0xd0/0x3b0 [amdgpu]
[ 333.596844] [c000000fec103b00] [c00800000d6801a0] amdgpu_dm_atomic_commit_tail+0x7f8/0xf20 [amdgpu]
[ 333.596850] [c000000fec103c60] [c00800000cb72da4] commit_tail+0x6c/0xe0 [drm_kms_helper]
[ 333.596853] [c000000fec103c90] [c0000000001385f0] process_one_work+0x2b0/0x560
[ 333.596855] [c000000fec103d20] [c000000000138928] worker_thread+0x88/0x610
[ 333.596858] [c000000fec103dc0] [c0000000001415dc] kthread+0x1ac/0x1c0
[ 333.596861] [c000000fec103e30] [c00000000000b65c] ret_from_kernel_thread+0x5c/0x80
[ 333.596886] EEH: Detected error on PHB#33
[ 333.596890] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[ 333.596891] EEH: Notify device drivers to shutdown
[ 333.596895] EEH: Beginning: 'error_detected(IO frozen)'
[ 333.596898] EEH: PE#1fe (PCI 0033:00:00.0): no driver
[ 333.596900] EEH: PE#0 (PCI 0033:01:00.1): driver not EEH aware
[ 333.596902] EEH: PE#0 (PCI 0033:01:00.0): driver not EEH aware
[ 333.596904] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'none'
[ 333.596908] EEH: Collect temporary log
[ 333.596910] PHB4 PHB#51 Diag-data (Version: 1)
[ 333.596911] brdgCtl: 00000002
[ 333.596913] RootSts: 00000020 00402000 e9010008 00100107 00000000
[ 333.596915] nFir: 0000800000000000 0030001c00000000 0000800000000000
[ 333.596916] PhbSts: 0000001800000000 0000001800000000
[ 333.596918] Lem: 0004000100000100 0000000000000000 0000000100000000
[ 333.596919] PhbErr: 000005a000000000 0000008000000000 2148000098000240 a008400000000000
[ 333.596921] PhbTxeErr: 0000200000000000 0000200000000000 4000000000000000 0000000000000000
[ 333.596923] RxeMrgErr: 0000000000000001 0000000000000001 0000000000000000 0000000000000000
[ 333.596925] PblErr: 0000000000000800 0000000000000800 0000000000000000 00000000028de410
[ 333.596926] RegbErr: 0010001000000000 0010000000000000 4800003c00000000 0000000000000200
[ 333.596929...
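In both the rc6 and rc7 EEH events, error_detected reports "driver not EEH aware" for both functions of the card and finishes with an aggregate recovery state of 'none', so EEH has no driver-level recovery to run and, in the earlier log, falls back to "Reset with hotplug activity". A minimal Python sketch, assuming the message format shown here, that lists each device's reported status and the aggregate state; eeh_driver_status, not_eeh_aware and aggregate_state are hypothetical names.

```python
import re
from typing import Dict, List, Optional

# e.g. "EEH: PE#0 (PCI 0033:01:00.0): driver not EEH aware"
REPORT_RE = re.compile(
    r"EEH: PE#(?P<pe>[0-9a-f]+) \(PCI (?P<addr>[^)]+)\): (?P<status>.+?)\s*$"
)
# e.g. "... with aggregate recovery state:'none'"
AGG_RE = re.compile(r"aggregate recovery state:'(?P<state>[^']*)'")


def eeh_driver_status(log: str) -> Dict[str, str]:
    """Map each PCI address to the status EEH printed during error_detected."""
    status: Dict[str, str] = {}
    for line in log.splitlines():
        m = REPORT_RE.search(line)
        if m:
            status[m["addr"]] = m["status"]
    return status


def not_eeh_aware(log: str) -> List[str]:
    """PCI addresses whose driver was reported as not EEH aware."""
    return [a for a, s in eeh_driver_status(log).items() if s == "driver not EEH aware"]


def aggregate_state(log: str) -> Optional[str]:
    """The aggregate recovery state from the first 'Finished' line, if any."""
    m = AGG_RE.search(log)
    return m["state"] if m else None
```

Run on the rc7 excerpt above, not_eeh_aware returns both 0033:01:00.1 and 0033:01:00.0, the root port 0033:00:00.0 shows "no driver", and aggregate_state returns 'none'.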

tags:	added: kernel-bug-exists-upstream
Changed in linux (Ubuntu):
status:	Incomplete → Confirmed
