amdgpu hangs from time to time with *ERROR* Waiting for fences timed out!

Bug #1883493 reported by Florian Hars
This bug affects 11 people
Affects          Status    Importance   Assigned to   Milestone
Linux            Unknown   Unknown
linux (Ubuntu)   Invalid   Undecided    Unassigned

Bug Description

Jun 15 08:30:42 alhazen kernel: [ 1566.155810] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Jun 15 08:30:47 alhazen kernel: [ 1566.159792] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Jun 15 08:30:47 alhazen kernel: [ 1571.020144] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=535493, emitted seq=535495
Jun 15 08:30:47 alhazen kernel: [ 1571.020216] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 3664 thread Xorg:cs0 pid 3694
Jun 15 08:30:47 alhazen kernel: [ 1571.020218] [drm] GPU recovery disabled.

Mouse pointer still moves, but apart from that the display is frozen. Music keeps playing.
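
The ring timeout line in the log above carries two sequence numbers. A minimal sketch of pulling them out, assuming the message format stays as shown here; the function name parse_ring_timeout is hypothetical, and reading "emitted minus signaled" as the number of fences still outstanding is an interpretation, not something stated in this report:

```python
import re

# Pattern for the amdgpu_job_timedout message as it appears in the log above.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, outstanding), or None if the line does not match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    # Assumption: the difference is the number of fences emitted but not yet signaled.
    return m.group("ring"), signaled, emitted, emitted - signaled


line = (
    "Jun 15 08:30:47 alhazen kernel: [ 1571.020144] [drm:amdgpu_job_timedout [amdgpu]] "
    "*ERROR* ring gfx_0.0.0 timeout, signaled seq=535493, emitted seq=535495"
)
print(parse_ring_timeout(line))  # ('gfx_0.0.0', 535493, 535495, 2)
```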

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-5.4.0-37-generic 5.4.0-37.41
ProcVersionSignature: Ubuntu 5.4.0-37.41-generic 5.4.41
Uname: Linux 5.4.0-37-generic x86_64
ApportVersion: 2.20.11-0ubuntu27.2
Architecture: amd64
CasperMD5CheckResult: skip
CurrentDesktop: ubuntu:GNOME
Date: Mon Jun 15 09:09:56 2020
InstallationDate: Installed on 2020-05-28 (17 days ago)
InstallationMedia: Ubuntu 20.04 LTS "Focal Fossa" - Release amd64 (20200423)
IwConfig:
 enp4s0 no wireless extensions.

 lo no wireless extensions.
MachineType: System manufacturer System Product Name
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.4.0-37-generic root=/dev/mapper/vgubuntu-root ro quiet splash acpi-enforce-resources=lax vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-5.4.0-37-generic N/A
 linux-backports-modules-5.4.0-37-generic N/A
 linux-firmware 1.187
RfKill:

SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 07/02/2019
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 0604
dmi.board.asset.tag: Default string
dmi.board.name: PRIME X570-PRO
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: Rev X.0x
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr0604:bd07/02/2019:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnPRIMEX570-PRO:rvrRevX.0x:cvnDefaultstring:ct3:cvrDefaultstring:
dmi.product.family: To be filled by O.E.M.
dmi.product.name: System Product Name
dmi.product.sku: SKU
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

Balint Harmath (bharmath) wrote :

Could you please describe the circumstances in a bit more detail?
Which program(s) are you running when this happens? Is anything else that makes significant use of graphics running alongside the program that triggers it, e.g. a picture editor?

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Florian Hars (hars) wrote :

This may or may not be the same issue as https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1875459 but this is not reliably reproducible, and I don't even have chrome installed.

Florian Hars (hars) wrote :

What happens is that I am doing something and suddenly nothing moves except the mouse pointer, and video calls become audio calls that I can only leave by switching the computer off. I tend to have obs-studio 25.0.3+dfsg1-2 running. For the last two days I had it running with AMD_DEBUG=nongg and it didn't freeze in that time. That is moderate evidence (from the behaviour in the preceding days I'd have expected more than 0.5 freezes in that time) for the suspicion that it may indeed be related to the other issue and the upstream bug linked from there.
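
To put a number on "moderate evidence": a minimal sketch, assuming freezes arrive independently at a constant rate (a Poisson model, which is an assumption added here and not part of the report). Under that model the chance of seeing no freeze at all is exp(-expected), so an expected count only slightly above 0.5 would still produce a freeze-free period most of the time:

```python
import math


def prob_no_freeze(expected_freezes):
    """Poisson probability of observing zero freezes, given the expected count."""
    return math.exp(-expected_freezes)


# 0.5 is the lower bound mentioned in the comment; 1.0 and 2.0 are illustrative values.
for expected in (0.5, 1.0, 2.0):
    print(f"expected {expected}: P(no freeze) = {prob_no_freeze(expected):.2f}")
# expected 0.5: P(no freeze) = 0.61
# expected 1.0: P(no freeze) = 0.37
# expected 2.0: P(no freeze) = 0.14
```

The higher the expected count really was, the stronger the evidence that AMD_DEBUG=nongg made a difference.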

Florian Hars (hars) wrote :

After running everything with AMD_DEBUG=nongg, the "Waiting for fences" errors seem to be mostly gone, and I now get sdma0 timeouts. So this seems to be part of the general cluster of failures that have plagued the Linux Navi drivers since the beginning.

Jun 22 07:24:43 alhazen kernel: [ 748.740480] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=134116, emitted seq=134118
Jun 22 07:24:43 alhazen kernel: [ 748.740549] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 3589 thread Xorg:cs0 pid 3591
Jun 22 07:24:43 alhazen kernel: [ 748.740552] [drm] GPU recovery disabled.
Jun 22 07:25:28 alhazen kernel: [ 794.386634] GpuWatchdog[5797]: segfault at 0 ip 0000556cdabbccb9 sp 00007f6a540a06c0 error 6 in chrome[556cd6a4e000+7095000]
Jun 22 07:25:28 alhazen kernel: [ 794.386642] Code: 00 79 09 48 8b 7d c0 e8 d5 14 2b fc c7 45 c0 aa aa aa aa 0f ae f0 41 8b 84 24 e0 00 00 00 89 45 c0 48 8d 7d c0 e8 b7 31 e9 fb <c7> 04 25 00 00 00 00 37 13 00 00 48 83 c4 38 5b 41 5c 41 5d 41 5e
Jun 22 07:25:38 alhazen kernel: [ 804.548718] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=29322, emitted seq=29326
Jun 22 07:25:38 alhazen kernel: [ 804.548788] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
Jun 22 07:25:38 alhazen kernel: [ 804.548791] [drm] GPU recovery disabled.
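
For anyone who wants to repeat the AMD_DEBUG=nongg test on a single program rather than the whole session, here is a minimal sketch. The helper names env_with_nongg and run_with_nongg are hypothetical, the child command is only a stand-in, and the sketch overwrites any AMD_DEBUG value that is already set:

```python
import os
import subprocess
import sys


def env_with_nongg(base=None):
    """Copy an environment mapping and set AMD_DEBUG=nongg in the copy."""
    env = dict(os.environ if base is None else base)
    env["AMD_DEBUG"] = "nongg"  # replaces any existing AMD_DEBUG value
    return env


def run_with_nongg(argv):
    """Run a command with AMD_DEBUG=nongg set for that process only."""
    return subprocess.run(argv, env=env_with_nongg(), check=False)


# Stand-in command: a child Python process that echoes the variable it received.
run_with_nongg([sys.executable, "-c", "import os; print(os.environ['AMD_DEBUG'])"])
```

The parent environment is left untouched, so other programs started from the same session are unaffected.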

Kai-Heng Feng (kaihengfeng) wrote :

Please test latest drm-tip kernel:
https://kernel.ubuntu.com/~kernel-ppa/mainline/drm-tip/current/

If the issue persists please file an upstream bug at:
https://gitlab.freedesktop.org/drm/amd/issues

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Liz Fong-Jones (lizthegrey) wrote :

This issue also seems to happen on Renoir with the 5.4.0 kernel series when DXVK/Vulkan is used. I will try drm-tip and report back.

Liz Fong-Jones (lizthegrey) wrote :

drm-tip is unusable (application is slow/doesn't even launch correctly)

5.9.0-rc2:

Aug 28 19:39:30 foxglove kernel: [ 77.392548] [drm:amdgpu_dm_commit_planes [amdgpu]] *ERROR* Waiting for fences timed out!
Aug 28 19:39:30 foxglove kernel: [ 82.181755] gmc_v9_0_process_interrupt: 2992 callbacks suppressed
Aug 28 19:39:30 foxglove kernel: [ 82.181760] amdgpu 0000:0c:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:6 pasid:32771, for process WoW.exe pid 4330 thread WoW.exe pid 4330)
Aug 28 19:39:30 foxglove kernel: [ 82.181763] amdgpu 0000:0c:00.0: amdgpu: in page starting at address 0x0000800080000000 from client 27
Aug 28 19:39:30 foxglove kernel: [ 82.181765] amdgpu 0000:0c:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00601431
Aug 28 19:39:30 foxglove kernel: [ 82.181766] amdgpu 0000:0c:00.0: amdgpu: Faulty UTCL2 client ID: 0xa
Aug 28 19:39:30 foxglove kernel: [ 82.181767] amdgpu 0000:0c:00.0: amdgpu: MORE_FAULTS: 0x1
Aug 28 19:39:30 foxglove kernel: [ 82.181767] amdgpu 0000:0c:00.0: amdgpu: WALKER_ERROR: 0x0
Aug 28 19:39:30 foxglove kernel: [ 82.181768] amdgpu 0000:0c:00.0: amdgpu: PERMISSION_FAULTS: 0x3
Aug 28 19:39:30 foxglove kernel: [ 82.181769] amdgpu 0000:0c:00.0: amdgpu: MAPPING_ERROR: 0x0
Aug 28 19:39:30 foxglove kernel: [ 82.181770] amdgpu 0000:0c:00.0: amdgpu: RW: 0x0
Aug 28 19:39:30 foxglove kernel: [ 82.183442] amdgpu 0000:0c:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:6 pasid:32771, for process WoW.exe pid 4330 thread WoW.exe pid 4330)
Aug 28 19:39:30 foxglove kernel: [ 82.183444] amdgpu 0000:0c:00.0: amdgpu: in page starting at address 0x0000800080000000 from client 27
Aug 28 19:39:30 foxglove kernel: [ 82.183446] amdgpu 0000:0c:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00601431
Aug 28 19:39:30 foxglove kernel: [ 82.183446] amdgpu 0000:0c:00.0: amdgpu: Faulty UTCL2 client ID: 0xa
Aug 28 19:39:30 foxglove kernel: [ 82.183447] amdgpu 0000:0c:00.0: amdgpu: MORE_FAULTS: 0x1
Aug 28 19:39:30 foxglove kernel: [ 82.183448] amdgpu 0000:0c:00.0: amdgpu: WALKER_ERROR: 0x0
Aug 28 19:39:30 foxglove kernel: [ 82.183449] amdgpu 0000:0c:00.0: amdgpu: PERMISSION_FAULTS: 0x3
Aug 28 19:39:30 foxglove kernel: [ 82.183450] amdgpu 0000:0c:00.0: amdgpu: MAPPING_ERROR: 0x0
Aug 28 19:39:30 foxglove kernel: [ 82.183450] amdgpu 0000:0c:00.0: amdgpu: RW: 0x0
Aug 28 19:39:30 foxglove kernel: [ 82.185128] amdgpu 0000:0c:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:6 pasid:32771, for process WoW.exe pid 4330 thread WoW.exe pid 4330)
Aug 28 19:39:30 foxglove kernel: [ 82.185131] amdgpu 0000:0c:00.0: amdgpu: in page starting at address 0x0000800080000000 from client 27
Aug 28 19:39:30 foxglove kernel: [ 82.185132] amdgpu 0000:0c:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00601431
Aug 28 19:39:30 foxglove kernel: [ 82.185133] amdgpu 0000:0c:00.0: amdgpu: Faulty UTCL2 client ID: 0xa
Aug 28 19:39:30 foxglove kernel: [ 82.185134] amdgpu 0000:0c:00.0: amdgpu: MORE_FAULTS: 0x1
Aug 28 19:39:30 foxglove kernel: [ 82.185135] amdgpu 0000:0c:00.0: amdgpu: WALKER_ERROR: 0x0
Aug 28 19:39:30 foxglo...
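
The fault records above repeat with the same process and VMID. A minimal sketch for tallying them from a saved log, assuming the "retry page fault" header keeps the format shown here; count_faults is a hypothetical helper and the sample lines are shortened copies of the headers above:

```python
import re
from collections import Counter

# Pattern for the "retry page fault" header line as it appears in the log above.
FAULT_RE = re.compile(
    r"retry page fault \(src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) "
    r"vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+), "
    r"for process (?P<process>\S+) pid (?P<pid>\d+)"
)


def count_faults(lines):
    """Count fault headers per (process, pid, vmid)."""
    counts = Counter()
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            key = (m.group("process"), int(m.group("pid")), int(m.group("vmid")))
            counts[key] += 1
    return counts


header = (
    "amdgpu 0000:0c:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:6 "
    "pasid:32771, for process WoW.exe pid 4330 thread WoW.exe pid 4330)"
)
sample = [header, "amdgpu 0000:0c:00.0: amdgpu: MORE_FAULTS: 0x1", header, header]
print(count_faults(sample))  # Counter({('WoW.exe', 4330, 6): 3})
```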


Kai-Heng Feng (kaihengfeng) wrote :

Liz,

Please file an upstream bug at
https://gitlab.freedesktop.org/drm/amd/issues

Peter Silva (peter-bsqt) wrote :

For people using stable 20.04:

Using apt-cache, I noticed that kernel 5.8 was in the repos (no special ones in use).
I don't get why it is there and not used, but I decided to try it.

I did:

sudo apt install linux-image-5.8.0-33-generic

then rebooted, and had no networking... then did:

sudo apt install linux-modules-extra-5.8.0-33-generic

and everything is working again, and I can use Google Maps without fear.
Not sure if I will get kernel updates, though...
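
Since the missing linux-modules-extra package was what left the machine without networking, both packages can be requested in one step. A minimal sketch that only builds the command line for a given kernel release; kernel_packages and apt_install_command are hypothetical helpers, and the two package name patterns are taken from the commands above:

```python
def kernel_packages(release):
    """Package names for a kernel release, following the two commands above."""
    return [f"linux-image-{release}", f"linux-modules-extra-{release}"]


def apt_install_command(release):
    """Build (but do not run) a single apt command that installs both packages."""
    return ["sudo", "apt", "install", *kernel_packages(release)]


print(" ".join(apt_install_command("5.8.0-33-generic")))
# sudo apt install linux-image-5.8.0-33-generic linux-modules-extra-5.8.0-33-generic
```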

fabtagon (fabtagon) wrote :

#7 (hopefully) solved it for me.

The latest Ubuntu 20.04.2 update supplied me with kernel 5.8.0-48, and the problem was present there.

Following #7's suggestion, I installed 5.12.0-051200rc6drmtip20210410-generic, and the problem has not occurred in 20+ hours of running the "dangerous" applications that previously were enough to trigger it (LibreOffice, generic Java (office) applications, Wine with simple games).

Thank you, @kaihengfeng!

fabtagon (fabtagon) wrote :

Update: unfortunately, the issue seems to have reappeared even with 5.12.0-051200rc6drmtip20210410-generic. I got a blank white screen while using LibreOffice and Firefox. The system was no longer reachable via the network and nothing was written to syslog, so the kernel probably died hard.

Liz Fong-Jones (lizthegrey) wrote :

Changing status belatedly to invalid. Upstream the issue was diagnosed as insufficient voltage supplied to the SoC, and was worked around by increasing SoC voltage in BIOS.

Changed in linux (Ubuntu):
status: Incomplete → Invalid