Some shaders can crash the amdgpu kernel module, tainting

Bug #1950205 reported by Jack Miller
This bug affects 1 person
Affects         Status     Importance  Assigned to  Milestone
linux (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

Hi there,

I am experiencing, at random, the AMD GPU "fences timed out!" bug previously reported as either a kernel issue (https://bugzilla.kernel.org/show_bug.cgi?id=213145) or a mesa issue (https://gitlab.freedesktop.org/mesa/mesa/-/issues/4866). The latter was apparently fixed in 21.2.4 -- but I have experienced it both with the stock mesa packages currently in 21.10 and with the "bleeding edge" git-based builds. It is difficult to reproduce deterministically. For me, at present, the roleplaying game Pathfinder: Wrath of the Righteous running under proton with Lutris (0.5.8.3/wine version lutris-fshack-6.14-3-x86_64 with DXVK v1.8.1L) triggers it far less often on the lowest graphics settings, but at certain points in the game it triggers every time (I have a save game that can do this in the highly unlikely event that anyone wants to take me up on the issue and has the GOG game -- email me at <email address hidden>).

This manifests as extreme screen corruption that takes a restart of lightdm at a minimum, and more likely of the whole computer, to fix. Memory issues can occur either on the card or on the host, namely:

[ 1208.225974] [drm:amdgpu_dm_commit_planes [amdgpu]] *ERROR* Waiting for fences timed out!
[ 1213.099991] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=112591, emitted seq=112593
[ 1213.100119] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Wrath.exe pid 18793 thread Wrath.exe pid 18984
[ 1213.100224] amdgpu 0000:08:00.0: amdgpu: GPU reset begin!
[ 1213.349988] [drm:amdgpu_dm_commit_planes [amdgpu]] *ERROR* Waiting for fences timed out!
[ 1213.350131] [drm:dce110_vblank_set [amdgpu]] *ERROR* Failed to get VBLANK!
[ 1213.509424] amdgpu: cp is busy, skip halt cp
[ 1213.777795] amdgpu: rlc is busy, skip halt rlc
[ 1213.778816] amdgpu 0000:08:00.0: amdgpu: BACO reset
[ 1214.425001] amdgpu 0000:08:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 1214.427319] [drm] PCIE GART of 256M enabled (table at 0x000000F4007E9000).
[ 1214.427335] [drm] VRAM is lost due to GPU reset!
[ 1214.600175] ------------[ cut here ]------------
[ 1214.600178] amdgpu 0000:08:00.0: drm_WARN_ON(atomic_read(&vblank->refcount) == 0)
[ 1214.600244] WARNING: CPU: 7 PID: 0 at drivers/gpu/drm/drm_vblank.c:1210 drm_vblank_put+0xef/0x100 [drm]
[ 1214.600274] Modules linked in: xfrm_user xfrm_algo xt_addrtype br_netfilter vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) rfcomm ashmem_linux(C) binder_linux iptable_mangle xt_comment iptable_nat iptable_filter bpfilter xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_reject_ipv4 nf_conntrack_netlink xt_conntrack nft_chain_nat nf_nat xt_NFQUEUE xt_tcpudp nf_conntrack nft_compat nf_defrag_ipv6 nf_defrag_ipv4 tcp_diag inet_diag nft_counter nfnetlink_queue nf_tables nfnetlink bridge stp llc cmac algif_hash algif_skcipher af_alg bnep overlay binfmt_misc nls_iso8859_1 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) snd_hda_codec_realtek intel_rapl_msr znvpair(PO) snd_hda_codec_generic intel_rapl_common spl(O) ledtrig_audio snd_hda_codec_hdmi edac_mce_amd snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec kvm_amd snd_hda_core btusb ath9k snd_hwdep btrtl kvm ath9k_common snd_pcm btbcm btintel ath9k_hw snd_seq_midi rapl bluetooth snd_seq_midi_event ath snd_rawmidi
[ 1214.600304] ecdh_generic joydev mac80211 snd_seq ecc input_leds snd_seq_device apple_mfi_fastcharge snd_timer snd eeepc_wmi soundcore ccp efi_pstore wmi_bmof cfg80211 k10temp libarc4 mac_hid sch_fq_codel nct6775 hwmon_vid msr nfsd parport_pc ppdev auth_rpcgss lp nfs_acl parport lockd grace sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress dm_crypt raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_logitech_dj hid_apple hid_logitech_hidpp hid_generic usbhid hid mfd_aaeon asus_wmi sparse_keymap amdgpu video iommu_v2 gpu_sched i2c_algo_bit drm_ttm_helper ttm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops cec crypto_simd rc_core cryptd drm r8169 ahci i2c_piix4 xhci_pci libahci xhci_pci_renesas realtek wmi
[ 1214.600336] CPU: 7 PID: 0 Comm: swapper/7 Tainted: P C OE 5.13.0-21-generic #21-Ubuntu
[ 1214.600337] Hardware name: System manufacturer System Product Name/TUF GAMING X570-PLUS, BIOS 1201 09/09/2019
[ 1214.600339] RIP: 0010:drm_vblank_put+0xef/0x100 [drm]
[ 1214.600353] Code: 8b 7f 08 4c 8b 67 50 4d 85 e4 74 22 e8 ea 55 ef d7 48 c7 c1 d8 22 4e c0 4c 89 e2 48 c7 c7 75 d5 4d c0 48 89 c6 e8 c0 75 31 d8 <0f> 0b eb 8a 4c 8b 27 eb d9 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[ 1214.600354] RSP: 0018:ffffa654803a8d78 EFLAGS: 00010086
[ 1214.600356] RAX: 0000000000000000 RBX: ffff94289d0a0000 RCX: 0000000000000027
[ 1214.600357] RDX: ffff942f7ebd89c8 RSI: 0000000000000001 RDI: ffff942f7ebd89c0
[ 1214.600358] RBP: ffffa654803a8d80 R08: 0000000000000000 R09: ffffa654803a8b68
[ 1214.600359] R10: ffffa654803a8b60 R11: ffffffff99b55268 R12: ffff9428818a9ef0
[ 1214.600360] R13: 0000000000000086 R14: ffff94289d0a0178 R15: ffff9428e4aa8300
[ 1214.600361] FS: 0000000000000000(0000) GS:ffff942f7ebc0000(0000) knlGS:0000000000000000
[ 1214.600362] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1214.600363] CR2: 00005604fb2cbd00 CR3: 0000000201ea4000 CR4: 0000000000350ee0
[ 1214.600364] Call Trace:
[ 1214.600365] <IRQ>
[ 1214.600368] drm_crtc_vblank_put+0x17/0x20 [drm]
[ 1214.600382] dm_pflip_high_irq+0xeb/0x2a0 [amdgpu]
[ 1214.600563] amdgpu_dm_irq_handler+0x8f/0x1d0 [amdgpu]
[ 1214.600691] amdgpu_irq_dispatch+0xbb/0x1d0 [amdgpu]
[ 1214.600788] ? try_to_wake_up+0x1f9/0x530
[ 1214.600792] amdgpu_ih_process+0x81/0x100 [amdgpu]
[ 1214.600886] amdgpu_irq_handler+0x26/0xa0 [amdgpu]
[ 1214.600978] __handle_irq_event_percpu+0x45/0x170
[ 1214.600982] handle_irq_event+0x59/0xc0
[ 1214.600983] handle_edge_irq+0x8c/0x220
[ 1214.600985] __common_interrupt+0x43/0xa0
[ 1214.600987] common_interrupt+0x88/0xa0
[ 1214.600991] </IRQ>
[ 1214.600991] asm_common_interrupt+0x1e/0x40
[ 1214.600993] RIP: 0010:cpuidle_enter_state+0xcc/0x360
[ 1214.600996] Code: 3d 71 76 a8 67 e8 64 d2 75 ff 49 89 c6 0f 1f 44 00 00 31 ff e8 05 de 75 ff 80 7d d7 00 0f 85 04 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 ff 0f 88 10 01 00 00 49 63 d7 4c 89 f1 48 2b 4d c8 48 8d 04
[ 1214.600997] RSP: 0018:ffffa65480177e50 EFLAGS: 00000246
[ 1214.600998] RAX: ffff942f7ebece80 RBX: 0000000000000002 RCX: 000000000000001f
[ 1214.600999] RDX: 0000000000000000 RSI: 000000001f9ade42 RDI: 0000000000000000
[ 1214.601000] RBP: ffffa65480177e88 R08: 0000011acbcee9b8 R09: 000000000001f100
[ 1214.601000] R10: 0000000000000001 R11: ffff942f7ebeb884 R12: ffff942885663000
[ 1214.601001] R13: ffffffff99c5dcc0 R14: 0000011acbcee9b8 R15: 0000000000000002
[ 1214.601003] cpuidle_enter+0x2e/0x40
[ 1214.601005] cpuidle_idle_call+0x132/0x1d0
[ 1214.601007] do_idle+0x83/0xf0
[ 1214.601008] cpu_startup_entry+0x20/0x30
[ 1214.601009] start_secondary+0x11f/0x160
[ 1214.601011] secondary_startup_64_no_verify+0xc2/0xcb
[ 1214.601014] ---[ end trace 8462b4e5fbcd0db4 ]---
[ 1214.627337] [drm] UVD and UVD ENC initialized successfully.
[ 1214.727404] [drm] VCE initialized successfully.
[ 1214.734437] amdgpu 0000:08:00.0: amdgpu: recover vram bo from shadow start
[ 1214.745196] amdgpu 0000:08:00.0: amdgpu: recover vram bo from shadow done
[ 1214.745200] [drm] Skip scheduling IBs!
[ 1214.745201] [drm] Skip scheduling IBs!
[ 1214.745235] amdgpu 0000:08:00.0: amdgpu: GPU reset(2) succeeded!
[ 1214.745249] [drm] Skip scheduling IBs!
<snip of many lines of similar content>
[ 1214.745295] [drm] Skip scheduling IBs!
[ 1214.745417] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 1214.747847] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 1214.749613] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 1214.781172] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

<--- after a restart --->

[ 350.292849] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 350.302003] amdgpu 0000:08:00.0: amdgpu: GPU fault detected: 146 0x06e0480c for process plasmashell pid 6872 thread plasmashel:cs0 pid 6920
[ 350.302011] amdgpu 0000:08:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001062DC
[ 350.302013] amdgpu 0000:08:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A04800C
[ 350.302014] amdgpu 0000:08:00.0: amdgpu: VM fault (0x0c, vmid 5, pasid 32772) at page 1073884, read from 'TC4' (0x54433400) (72)
[ 360.390588] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 370.634583] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 380.870588] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 391.110593] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 401.350601] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 403.698235] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 403.698716] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 403.699095] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 403.699410] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 403.699703] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 403.700029] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 403.700406] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 403.700694] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 403.701001] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 403.701370] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 411.590611] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 421.830634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
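
A log like this is easier to compare between runs once the repeated messages are counted. What follows is a minimal sketch, assuming the dmesg output has been saved as text. The `summarize` helper and the category names are hypothetical labels, and the patterns only cover the messages quoted above.

```python
import re
from collections import Counter

# Patterns copied from the messages quoted in this report.
# The keys are illustrative labels, not amdgpu terminology.
PATTERNS = {
    "fence_timeout": re.compile(r"Waiting for fences timed out!"),
    "ring_timeout": re.compile(r"ring \w+ timeout"),
    "gpu_reset": re.compile(r"GPU reset begin!"),
    "vram_lost": re.compile(r"VRAM is lost due to GPU reset!"),
    "cs_parser_error": re.compile(r"Failed to initialize parser -\d+!"),
    "vm_fault": re.compile(r"GPU fault detected"),
}


def summarize(log_text):
    """Count how many lines of a dmesg dump fall into each category."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break  # one category per line is enough for a summary
    return counts
```

Something like `summarize(open("dmesg.txt").read())` (the file name is only an example) would then show at a glance whether a given run ended in a full GPU reset or only in soft-recovered ring timeouts.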

For the record, these are the current, git-built versions of the mesa packages in question:

mesa-va-drivers/impish,now 22.0~git2111060600.c4d904~oibaf~i amd64 [installed,automatic]
  Mesa VA-API video acceleration drivers

mesa-vdpau-drivers/impish,now 22.0~git2111060600.c4d904~oibaf~i amd64 [installed,automatic]
  Mesa VDPAU video acceleration drivers

mesa-vulkan-drivers/impish,now 22.0~git2111060600.c4d904~oibaf~i amd64 [installed,automatic]
  Mesa Vulkan graphics drivers

Switching to these builds does not appear to have changed anything, which potentially rules out upstream's `if (count < 0 ...` to `if (count <= 0 ...` patch as the fix.

I'm posting this here because (a) I suspect others will see this issue, and (b) I don't know if I should post it to upstream directly -- I also don't know if upstream is kernel.org or mesa (who takes ownership of amdgpu?)

ProblemType: Bug
DistroRelease: Ubuntu 21.10
Package: linux-image-generic 5.13.0.21.32
ProcVersionSignature: Ubuntu 5.13.0-21.21-generic 5.13.18
Uname: Linux 5.13.0-21-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
ApportVersion: 2.20.11-0ubuntu71
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: jackmiller 6414 F.... pulseaudio
 /dev/snd/pcmC1D0p: jackmiller 6414 F...m pulseaudio
 /dev/snd/controlC0: jackmiller 6414 F.... pulseaudio
CasperMD5CheckResult: unknown
CurrentDesktop: KDE
Date: Mon Nov 8 21:35:48 2021
InstallationDate: Installed on 2020-06-26 (500 days ago)
InstallationMedia: Ubuntu 20.04 LTS "Focal Fossa" - Release amd64 (20200423)
MachineType: System manufacturer System Product Name
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.13.0-21-generic root=/dev/mapper/vgubuntu-root ro acpi_enforce_resources=lax splash amdgpu.vm_fragment_size=9 crashkernel=512M-:192M vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-5.13.0-21-generic N/A
 linux-backports-modules-5.13.0-21-generic N/A
 linux-firmware 1.201.1
SourcePackage: linux
StagingDrivers: ashmem_linux
UpgradeStatus: Upgraded to impish on 2021-10-24 (14 days ago)
dmi.bios.date: 09/09/2019
dmi.bios.release: 5.14
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 1201
dmi.board.asset.tag: Default string
dmi.board.name: TUF GAMING X570-PLUS
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: Rev X.0x
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1201:bd09/09/2019:br5.14:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnTUFGAMINGX570-PLUS:rvrRevX.0x:cvnDefaultstring:ct3:cvrDefaultstring:skuSKU:
dmi.product.family: To be filled by O.E.M.
dmi.product.name: System Product Name
dmi.product.sku: SKU
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

Jack Miller (landak) wrote :
description: updated
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Jack Miller (landak) wrote :

Edit -- it transpires that, despite what I said, the magic <0 to <=0 change hasn't been made on the main branch of mesa's git repo yet. I've successfully compiled a local copy (with meson & ninja) and am running it from a separate root, using LD_LIBRARY_PATH and friends, to avoid clobbering the system libraries; that gets glxgears working at least. Frankly I am a little out of my depth. What I'd really like is to build the packages from source, apply the two-character, one-line diff suggested upstream to see if it actually works, and then test further.
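
For anyone wanting to repeat the "separate root" approach, here is a minimal sketch of the environment involved, assuming the local build was installed under a prefix of its own. The `local_mesa_env` helper and the `/opt/mesa-test` prefix in the usage note are hypothetical. The library directory and the ICD file name are assumptions that depend on how the build was configured; `LD_LIBRARY_PATH`, `LIBGL_DRIVERS_PATH` and `VK_ICD_FILENAMES` are the variables Mesa and the Vulkan loader read.

```python
import os
from pathlib import Path


def local_mesa_env(prefix, base_env=None, libdir="lib/x86_64-linux-gnu"):
    """Return a copy of the environment that points at a Mesa install under `prefix`.

    `libdir` and the radeon ICD file name are assumptions; check what the
    local install actually produced before relying on them.
    """
    prefix = Path(prefix)
    env = dict(os.environ if base_env is None else base_env)
    lib = prefix / libdir

    # Put the locally built libraries ahead of the system ones.
    old = env.get("LD_LIBRARY_PATH")
    env["LD_LIBRARY_PATH"] = f"{lib}:{old}" if old else str(lib)

    # Where libGL looks for the DRI drivers.
    env["LIBGL_DRIVERS_PATH"] = str(lib / "dri")

    # Which Vulkan driver manifest the loader should use (RADV here).
    env["VK_ICD_FILENAMES"] = str(
        prefix / "share/vulkan/icd.d/radeon_icd.x86_64.json"
    )
    return env
```

A test run would then be along the lines of `subprocess.run(["glxgears"], env=local_mesa_env("/opt/mesa-test"))`, which leaves the system libraries untouched for every other process.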

Any advice very welcome ;-).

Jack Miller (landak) wrote :

Edit: Please close this bug -- it is a duplicate of an upstream issue in dxvk (https://gitlab.freedesktop.org/mesa/mesa/-/issues/4866) and not the kernel per se (although that makes life worse by clearing out vram and hard-resetting the gpu once the crash occurs, which X and DMs have no way to recover from).

Daniel van Vugt (vanvugt) wrote :

Ubuntu 21.10 is no longer supported but the same issue is being tracked in bug 1980831.
