Some shaders can crash the amdgpu kernel module, tainting
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Confirmed | Undecided | Unassigned |
Bug Description
Hi there,
I am experiencing, at random, the AMD-GPU "fences timed out!" bug reported previously as being either a kernel issue (https:/
This manifests as extreme screen corruption that requires restarting at least lightdm, and more likely the whole computer, to fix. Memory issues can occur either on the card or on the host, namely:
[ 1208.225974] [drm:amdgpu_
[ 1213.099991] [drm:amdgpu_
[ 1213.100119] [drm:amdgpu_
[ 1213.100224] amdgpu 0000:08:00.0: amdgpu: GPU reset begin!
[ 1213.349988] [drm:amdgpu_
[ 1213.350131] [drm:dce110_
[ 1213.509424] amdgpu: cp is busy, skip halt cp
[ 1213.777795] amdgpu: rlc is busy, skip halt rlc
[ 1213.778816] amdgpu 0000:08:00.0: amdgpu: BACO reset
[ 1214.425001] amdgpu 0000:08:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 1214.427319] [drm] PCIE GART of 256M enabled (table at 0x000000F4007E9
[ 1214.427335] [drm] VRAM is lost due to GPU reset!
[ 1214.600175] ------------[ cut here ]------------
[ 1214.600178] amdgpu 0000:08:00.0: drm_WARN_
[ 1214.600244] WARNING: CPU: 7 PID: 0 at drivers/
[ 1214.600274] Modules linked in: xfrm_user xfrm_algo xt_addrtype br_netfilter vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) rfcomm ashmem_linux(C) binder_linux iptable_mangle xt_comment iptable_nat iptable_filter bpfilter xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_reject_ipv4 nf_conntrack_
[ 1214.600304] ecdh_generic joydev mac80211 snd_seq ecc input_leds snd_seq_device apple_mfi_
[ 1214.600336] CPU: 7 PID: 0 Comm: swapper/7 Tainted: P C OE 5.13.0-21-generic #21-Ubuntu
[ 1214.600337] Hardware name: System manufacturer System Product Name/TUF GAMING X570-PLUS, BIOS 1201 09/09/2019
[ 1214.600339] RIP: 0010:drm_
[ 1214.600353] Code: 8b 7f 08 4c 8b 67 50 4d 85 e4 74 22 e8 ea 55 ef d7 48 c7 c1 d8 22 4e c0 4c 89 e2 48 c7 c7 75 d5 4d c0 48 89 c6 e8 c0 75 31 d8 <0f> 0b eb 8a 4c 8b 27 eb d9 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[ 1214.600354] RSP: 0018:ffffa65480
[ 1214.600356] RAX: 0000000000000000 RBX: ffff94289d0a0000 RCX: 0000000000000027
[ 1214.600357] RDX: ffff942f7ebd89c8 RSI: 0000000000000001 RDI: ffff942f7ebd89c0
[ 1214.600358] RBP: ffffa654803a8d80 R08: 0000000000000000 R09: ffffa654803a8b68
[ 1214.600359] R10: ffffa654803a8b60 R11: ffffffff99b55268 R12: ffff9428818a9ef0
[ 1214.600360] R13: 0000000000000086 R14: ffff94289d0a0178 R15: ffff9428e4aa8300
[ 1214.600361] FS: 000000000000000
[ 1214.600362] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1214.600363] CR2: 00005604fb2cbd00 CR3: 0000000201ea4000 CR4: 0000000000350ee0
[ 1214.600364] Call Trace:
[ 1214.600365] <IRQ>
[ 1214.600368] drm_crtc_
[ 1214.600382] dm_pflip_
[ 1214.600563] amdgpu_
[ 1214.600691] amdgpu_
[ 1214.600788] ? try_to_
[ 1214.600792] amdgpu_
[ 1214.600886] amdgpu_
[ 1214.600978] __handle_
[ 1214.600982] handle_
[ 1214.600983] handle_
[ 1214.600985] __common_
[ 1214.600987] common_
[ 1214.600991] </IRQ>
[ 1214.600991] asm_common_
[ 1214.600993] RIP: 0010:cpuidle_
[ 1214.600996] Code: 3d 71 76 a8 67 e8 64 d2 75 ff 49 89 c6 0f 1f 44 00 00 31 ff e8 05 de 75 ff 80 7d d7 00 0f 85 04 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 ff 0f 88 10 01 00 00 49 63 d7 4c 89 f1 48 2b 4d c8 48 8d 04
[ 1214.600997] RSP: 0018:ffffa65480
[ 1214.600998] RAX: ffff942f7ebece80 RBX: 0000000000000002 RCX: 000000000000001f
[ 1214.600999] RDX: 0000000000000000 RSI: 000000001f9ade42 RDI: 0000000000000000
[ 1214.601000] RBP: ffffa65480177e88 R08: 0000011acbcee9b8 R09: 000000000001f100
[ 1214.601000] R10: 0000000000000001 R11: ffff942f7ebeb884 R12: ffff942885663000
[ 1214.601001] R13: ffffffff99c5dcc0 R14: 0000011acbcee9b8 R15: 0000000000000002
[ 1214.601003] cpuidle_
[ 1214.601005] cpuidle_
[ 1214.601007] do_idle+0x83/0xf0
[ 1214.601008] cpu_startup_
[ 1214.601009] start_secondary
[ 1214.601011] secondary_
[ 1214.601014] ---[ end trace 8462b4e5fbcd0db4 ]---
[ 1214.627337] [drm] UVD and UVD ENC initialized successfully.
[ 1214.727404] [drm] VCE initialized successfully.
[ 1214.734437] amdgpu 0000:08:00.0: amdgpu: recover vram bo from shadow start
[ 1214.745196] amdgpu 0000:08:00.0: amdgpu: recover vram bo from shadow done
[ 1214.745200] [drm] Skip scheduling IBs!
[ 1214.745201] [drm] Skip scheduling IBs!
[ 1214.745235] amdgpu 0000:08:00.0: amdgpu: GPU reset(2) succeeded!
[ 1214.745249] [drm] Skip scheduling IBs!
<snip of many lines of similar content>
[ 1214.745295] [drm] Skip scheduling IBs!
[ 1214.745417] [drm:amdgpu_
[ 1214.747847] [drm:amdgpu_
[ 1214.749613] [drm:amdgpu_
[ 1214.781172] [drm:amdgpu_
<--- post a restart--->
[ 350.292849] [drm:amdgpu_
[ 350.302003] amdgpu 0000:08:00.0: amdgpu: GPU fault detected: 146 0x06e0480c for process plasmashell pid 6872 thread plasmashel:cs0 pid 6920
[ 350.302011] amdgpu 0000:08:00.0: amdgpu: VM_CONTEXT1_
[ 350.302013] amdgpu 0000:08:00.0: amdgpu: VM_CONTEXT1_
[ 350.302014] amdgpu 0000:08:00.0: amdgpu: VM fault (0x0c, vmid 5, pasid 32772) at page 1073884, read from 'TC4' (0x54433400) (72)
[ 360.390588] [drm:amdgpu_
[ 370.634583] [drm:amdgpu_
[ 380.870588] [drm:amdgpu_
[ 391.110593] [drm:amdgpu_
[ 401.350601] [drm:amdgpu_
[ 403.698235] [drm:amdgpu_
[ 403.698716] [drm:amdgpu_
[ 403.699095] [drm:amdgpu_
[ 403.699410] [drm:amdgpu_
[ 403.699703] [drm:amdgpu_
[ 403.700029] [drm:amdgpu_
[ 403.700406] [drm:amdgpu_
[ 403.700694] [drm:amdgpu_
[ 403.701001] [drm:amdgpu_
[ 403.701370] [drm:amdgpu_
[ 411.590611] [drm:amdgpu_
[ 421.830634] [drm:amdgpu_
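For anyone triaging similar logs, here is a minimal sketch of pulling the fields out of the "VM fault" line shown above. The regular expression, the `parse_vm_fault` helper and the 4 KiB page size are my own assumptions for illustration, not something taken from the driver source. The one thing the log line itself confirms is that the hex value in parentheses is just the ASCII tag of the memory client ('TC4' followed by a zero byte).

```python
import re

# Line copied verbatim from the log excerpt above.
LINE = ("[ 350.302014] amdgpu 0000:08:00.0: amdgpu: VM fault (0x0c, vmid 5, "
        "pasid 32772) at page 1073884, read from 'TC4' (0x54433400) (72)")

# Hypothetical pattern written against this one log line; other kernel
# versions may format the message differently.
VM_FAULT = re.compile(
    r"VM fault \((?P<status>0x[0-9a-fA-F]+), vmid (?P<vmid>\d+), "
    r"pasid (?P<pasid>\d+)\) at page (?P<page>\d+), "
    r"(?P<access>read|write) from '(?P<client>[^']*)' "
    r"\((?P<client_raw>0x[0-9a-fA-F]+)\)"
)

# Assumption: the reported page index counts 4 KiB GPU pages.
GPU_PAGE_SIZE = 4096


def parse_vm_fault(line):
    """Return a dict of fields from a 'VM fault' log line, or None if no match."""
    m = VM_FAULT.search(line)
    if m is None:
        return None
    # The raw client id is a big-endian 4-byte ASCII tag padded with NULs.
    raw = int(m["client_raw"], 16)
    tag = raw.to_bytes(4, "big").rstrip(b"\x00").decode("ascii", errors="replace")
    page = int(m["page"])
    return {
        "vmid": int(m["vmid"]),
        "pasid": int(m["pasid"]),
        "page": page,
        "address": page * GPU_PAGE_SIZE,  # byte address under the 4 KiB assumption
        "access": m["access"],
        "client": m["client"],
        "client_from_raw": tag,
    }


if __name__ == "__main__":
    info = parse_vm_fault(LINE)
    print(info)
    print("faulting address:", hex(info["address"]))
```

Running it over a saved dmesg dump, one line at a time, should make it easier to see whether the faulting page and client are the same across crashes, which may help whoever ends up owning this bug.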
For the record, these are the current (git-built) versions of the Mesa packages in question:
mesa-va-
Mesa VA-API video acceleration drivers
mesa-vdpau-
Mesa VDPAU video acceleration drivers
mesa-vulkan-
Mesa Vulkan graphics drivers
This does not appear to have changed anything, but it potentially rules out their `if (count < 0 ...` to `if (count <= 0 ...` patch as the fix.
I'm posting this here because (a) I suspect others will see this issue, and (b) I don't know if I should post it to upstream directly -- I also don't know if upstream is kernel.org or mesa (who takes ownership for amdgpu?)
ProblemType: Bug
DistroRelease: Ubuntu 21.10
Package: linux-image-generic 5.13.0.21.32
ProcVersionSign
Uname: Linux 5.13.0-21-generic x86_64
NonfreeKernelMo
ApportVersion: 2.20.11-0ubuntu71
Architecture: amd64
AudioDevicesInUse:
USER PID ACCESS COMMAND
/dev/snd/
/dev/snd/pcmC1D0p: jackmiller 6414 F...m pulseaudio
/dev/snd/
CasperMD5CheckR
CurrentDesktop: KDE
Date: Mon Nov 8 21:35:48 2021
InstallationDate: Installed on 2020-06-26 (500 days ago)
InstallationMedia: Ubuntu 20.04 LTS "Focal Fossa" - Release amd64 (20200423)
MachineType: System manufacturer System Product Name
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=
RelatedPackageV
linux-
linux-
linux-firmware 1.201.1
SourcePackage: linux
StagingDrivers: ashmem_linux
UpgradeStatus: Upgraded to impish on 2021-10-24 (14 days ago)
dmi.bios.date: 09/09/2019
dmi.bios.release: 5.14
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 1201
dmi.board.
dmi.board.name: TUF GAMING X570-PLUS
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: Rev X.0x
dmi.chassis.
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.
dmi.modalias: dmi:bvnAmerican
dmi.product.family: To be filled by O.E.M.
dmi.product.name: System Product Name
dmi.product.sku: SKU
dmi.product.
dmi.sys.vendor: System manufacturer