Graphics Crashes Frequently Making Machine Unusable

Bug #2051975 reported by Brett Holman
This bug affects 1 person
Affects         Status  Importance  Assigned to  Milestone
linux (Ubuntu)  New     Undecided   Unassigned

Bug Description

The issue is highly reproducible. It happened within a minute or two of using my machine after upgrading from Lunar to Mantic. I then upgraded to Noble to see whether a newer kernel had the fix; it did not. My machine is unusable, and I'm happy to reproduce the crash if anyone has a fix to test.

Sometimes (as in the logs included in this report) the GPU resets successfully and I only have to restart my GNOME session and log back in. Far more often, however, the reset fails and I have to hard reboot. In all cases I can still ssh into the system to debug.

The crash starts with a sudden freeze of the desktop environment for a few seconds, after which the screen becomes pixelated and multi-colored (a typical GPU crash in my experience).

Crashes typically happen under load (Firefox open, switching between applications) and are less likely when I'm only using gnome-terminal.

My system has a Vega 56 GPU using the amdgpu driver, with 3 monitors in active use.

[Thu Feb 1 11:22:17 2024] [drm:do_aquire_global_lock.isra.0 [amdgpu]] *ERROR* [CRTC:75:crtc-2] hw_done or flip_done timed out
[Thu Feb 1 11:22:17 2024] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_high timeout, signaled seq=175374, emitted seq=175375
[Thu Feb 1 11:22:17 2024] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 8117 thread gnome-shel:cs0 pid 8149
[Thu Feb 1 11:22:17 2024] amdgpu 0000:1f:00.0: amdgpu: GPU reset begin!
[Thu Feb 1 11:22:18 2024] [drm] psp gfx command UNLOAD_TA(0x2) failed and response status is (0x117)
[Thu Feb 1 11:22:18 2024] amdgpu 0000:1f:00.0: amdgpu: BACO reset
[Thu Feb 1 11:22:18 2024] amdgpu 0000:1f:00.0: amdgpu: GPU reset succeeded, trying to resume
[Thu Feb 1 11:22:18 2024] [drm] PCIE GART of 512M enabled.
[Thu Feb 1 11:22:18 2024] [drm] PTB located at 0x000000F400000000
[Thu Feb 1 11:22:18 2024] [drm] VRAM is lost due to GPU reset!
[Thu Feb 1 11:22:18 2024] [drm] PSP is resuming...
[Thu Feb 1 11:22:19 2024] [drm] reserve 0x400000 from 0xf5fec00000 for PSP TMR
[Thu Feb 1 11:22:19 2024] [drm] kiq ring mec 2 pipe 1 q 0
[Thu Feb 1 11:22:19 2024] [drm] UVD and UVD ENC initialized successfully.
[Thu Feb 1 11:22:19 2024] [drm] VCE initialized successfully.
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring gfx_low uses VM inv eng 1 on hub 0
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring gfx_high uses VM inv eng 4 on hub 0
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 5 on hub 0
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 6 on hub 0
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 7 on hub 0
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 8 on hub 0
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 9 on hub 0
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 10 on hub 0
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 11 on hub 0
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 12 on hub 0
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 13 on hub 0
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 8
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 8
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 8
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 8
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 8
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 8
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 8
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring vce0 uses VM inv eng 9 on hub 8
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring vce1 uses VM inv eng 10 on hub 8
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: ring vce2 uses VM inv eng 11 on hub 8
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: recover vram bo from shadow start
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: recover vram bo from shadow done
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: GPU reset(2) succeeded!
[Thu Feb 1 11:22:19 2024] [drm] Skip scheduling IBs!
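
For triage, the key facts in the log above (which ring timed out, how far the emitted sequence number is ahead of the signaled one, and whether the reset completed) can be extracted mechanically. The following is a minimal Python sketch and not part of the original report: the function name `summarize_gpu_hang`, the `seq_gap` field, and the regular expressions are illustrative assumptions based only on the message formats shown in this log, and they may not match other kernel versions.

```python
import re

# Patterns derived from the message formats in the log above (assumption:
# other kernel versions may word these messages differently).
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
RESET_OK_RE = re.compile(r"GPU reset\(\d+\) succeeded!")


def summarize_gpu_hang(log_text):
    """Return a dict describing the first ring timeout found, or None."""
    match = TIMEOUT_RE.search(log_text)
    if match is None:
        return None
    signaled = int(match["signaled"])
    emitted = int(match["emitted"])
    return {
        "ring": match["ring"],
        "signaled": signaled,
        "emitted": emitted,
        # Difference between emitted and signaled sequence numbers.
        "seq_gap": emitted - signaled,
        "reset_begun": "GPU reset begin!" in log_text,
        "reset_succeeded": RESET_OK_RE.search(log_text) is not None,
    }


# Three lines copied from the log above.
SAMPLE = """\
[Thu Feb 1 11:22:17 2024] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_high timeout, signaled seq=175374, emitted seq=175375
[Thu Feb 1 11:22:17 2024] amdgpu 0000:1f:00.0: amdgpu: GPU reset begin!
[Thu Feb 1 11:22:19 2024] amdgpu 0000:1f:00.0: amdgpu: GPU reset(2) succeeded!
"""

if __name__ == "__main__":
    print(summarize_gpu_hang(SAMPLE))
```

On the lines quoted above, this reports ring `gfx_high`, a sequence gap of 1, and a reset that both began and succeeded. Running it over logs from a failed reset would presumably show `reset_succeeded` as false, which makes the two failure modes described earlier easy to tell apart.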

The system also logs many UBSAN array-index-out-of-bounds warnings, though I believe these are probably harmless:

[Wed Jan 31 09:47:40 2024] ================================================================================
[Wed Jan 31 09:47:40 2024] UBSAN: array-index-out-of-bounds in /build/linux-xR9R4C/linux-6.6.0/drivers/gpu/drm/amd/amdgpu/../pm/powerplay/hwmgr/vega10_processpptables.c:1052:5
[Wed Jan 31 09:47:40 2024] index 1 is out of range for type 'ATOM_Vega10_Voltage_Lookup_Record [1]'
[Wed Jan 31 09:47:40 2024] CPU: 4 PID: 191 Comm: (udev-worker) Not tainted 6.6.0-14-generic #14-Ubuntu
[Wed Jan 31 09:47:40 2024] Hardware name: Micro-Star International Co., Ltd. MS-7B79/X470 GAMING PLUS (MS-7B79), BIOS A.40 06/28/2018
[Wed Jan 31 09:47:40 2024] Call Trace:
[Wed Jan 31 09:47:40 2024] <TASK>
[Wed Jan 31 09:47:40 2024] dump_stack_lvl+0x48/0x70
[Wed Jan 31 09:47:40 2024] dump_stack+0x10/0x20
[Wed Jan 31 09:47:40 2024] __ubsan_handle_out_of_bounds+0xc6/0x110
[Wed Jan 31 09:47:40 2024] get_vddc_lookup_table.isra.0+0xd5/0xf0 [amdgpu]
[Wed Jan 31 09:47:40 2024] vega10_pp_tables_initialize+0x460/0x510 [amdgpu]
[Wed Jan 31 09:47:40 2024] hwmgr_hw_init+0x7b/0x1e0 [amdgpu]
[Wed Jan 31 09:47:40 2024] pp_hw_init+0x16/0x50 [amdgpu]
[Wed Jan 31 09:47:40 2024] amdgpu_device_ip_init+0x48e/0x900 [amdgpu]
[Wed Jan 31 09:47:40 2024] amdgpu_device_init+0x975/0x1100 [amdgpu]
[Wed Jan 31 09:47:40 2024] amdgpu_driver_load_kms+0x1a/0x1c0 [amdgpu]
[Wed Jan 31 09:47:40 2024] amdgpu_pci_probe+0x175/0x490 [amdgpu]
[Wed Jan 31 09:47:40 2024] local_pci_probe+0x47/0xb0
[Wed Jan 31 09:47:40 2024] pci_call_probe+0x55/0x1a0
[Wed Jan 31 09:47:40 2024] pci_device_probe+0x84/0x120
[Wed Jan 31 09:47:40 2024] ? srso_return_thunk+0x5/0x10
[Wed Jan 31 09:47:40 2024] really_probe+0x1c7/0x410
[Wed Jan 31 09:47:40 2024] __driver_probe_device+0x8c/0x180
[Wed Jan 31 09:47:40 2024] driver_probe_device+0x24/0xd0
[Wed Jan 31 09:47:40 2024] __driver_attach+0x10b/0x210
[Wed Jan 31 09:47:40 2024] ? __pfx___driver_attach+0x10/0x10
[Wed Jan 31 09:47:40 2024] bus_for_each_dev+0x8d/0xf0
[Wed Jan 31 09:47:40 2024] driver_attach+0x1e/0x30
[Wed Jan 31 09:47:40 2024] bus_add_driver+0x156/0x260
[Wed Jan 31 09:47:40 2024] driver_register+0x5e/0x130
[Wed Jan 31 09:47:40 2024] ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
[Wed Jan 31 09:47:40 2024] __pci_register_driver+0x62/0x70
[Wed Jan 31 09:47:40 2024] amdgpu_init+0x69/0xff0 [amdgpu]
[Wed Jan 31 09:47:40 2024] ? srso_return_thunk+0x5/0x10
[Wed Jan 31 09:47:40 2024] do_one_initcall+0x5e/0x340
[Wed Jan 31 09:47:40 2024] do_init_module+0x97/0x290
[Wed Jan 31 09:47:40 2024] load_module+0xba1/0xcf0
[Wed Jan 31 09:47:40 2024] ? vfree.part.0+0xf0/0x280
[Wed Jan 31 09:47:40 2024] ? srso_return_thunk+0x5/0x10
[Wed Jan 31 09:47:40 2024] ? kfree+0x78/0x120
[Wed Jan 31 09:47:40 2024] init_module_from_file+0x96/0x100
[Wed Jan 31 09:47:40 2024] ? srso_return_thunk+0x5/0x10
[Wed Jan 31 09:47:40 2024] ? init_module_from_file+0x96/0x100
[Wed Jan 31 09:47:40 2024] idempotent_init_module+0x11c/0x2b0
[Wed Jan 31 09:47:40 2024] __x64_sys_finit_module+0x64/0xd0
[Wed Jan 31 09:47:40 2024] do_syscall_64+0x5c/0x90
[Wed Jan 31 09:47:40 2024] ? do_syscall_64+0x68/0x90
[Wed Jan 31 09:47:40 2024] ? srso_return_thunk+0x5/0x10
[Wed Jan 31 09:47:40 2024] ? ksys_read+0x73/0x100
[Wed Jan 31 09:47:40 2024] ? srso_return_thunk+0x5/0x10
[Wed Jan 31 09:47:40 2024] ? exit_to_user_mode_prepare+0x30/0xb0
[Wed Jan 31 09:47:40 2024] ? srso_return_thunk+0x5/0x10
[Wed Jan 31 09:47:40 2024] ? syscall_exit_to_user_mode+0x37/0x60
[Wed Jan 31 09:47:40 2024] ? srso_return_thunk+0x5/0x10
[Wed Jan 31 09:47:40 2024] ? do_syscall_64+0x68/0x90
[Wed Jan 31 09:47:40 2024] ? do_syscall_64+0x68/0x90
[Wed Jan 31 09:47:40 2024] ? do_syscall_64+0x68/0x90
[Wed Jan 31 09:47:40 2024] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[Wed Jan 31 09:47:40 2024] RIP: 0033:0x72d4f7e5ac7d
[Wed Jan 31 09:47:40 2024] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 81 0d 00 f7 d8 64 89 01 48
[Wed Jan 31 09:47:40 2024] RSP: 002b:00007fffd372b738 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[Wed Jan 31 09:47:40 2024] RAX: ffffffffffffffda RBX: 000064f9c0d57670 RCX: 000072d4f7e5ac7d
[Wed Jan 31 09:47:40 2024] RDX: 0000000000000004 RSI: 000072d4f7fd744a RDI: 0000000000000019
[Wed Jan 31 09:47:40 2024] RBP: 000072d4f7fd744a R08: 0000000000000040 R09: fffffffffffffde0
[Wed Jan 31 09:47:40 2024] R10: fffffffffffffe18 R11: 0000000000000246 R12: 0000000000020000
[Wed Jan 31 09:47:40 2024] R13: 000064f9c0d575c0 R14: 0000000000000000 R15: 000064f9c0dde630
[Wed Jan 31 09:47:40 2024] </TASK>
[Wed Jan 31 09:47:40 2024] ================================================================================
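
To judge whether the UBSAN warnings are many distinct problems or one location reported repeatedly, they can be grouped by kind and source location. The following is a minimal Python sketch and not part of the original report: the function name `count_ubsan_reports` and the regular expression are illustrative assumptions, and the sketch assumes each report's location sits on a single line, as in raw `dmesg` output.

```python
import re
from collections import Counter

# Matches the "UBSAN: <kind> in <file>:<line>:<col>" header line.
UBSAN_RE = re.compile(r"UBSAN: (?P<kind>[\w-]+) in (?P<location>\S+)")


def count_ubsan_reports(log_text):
    """Count UBSAN reports, keyed by (kind, source location)."""
    counts = Counter()
    for match in UBSAN_RE.finditer(log_text):
        counts[(match["kind"], match["location"])] += 1
    return counts


# Header line from the trace above, with the wrapped path joined.
SAMPLE = (
    "[Wed Jan 31 09:47:40 2024] UBSAN: array-index-out-of-bounds in "
    "/build/linux-xR9R4C/linux-6.6.0/drivers/gpu/drm/amd/amdgpu/../pm/"
    "powerplay/hwmgr/vega10_processpptables.c:1052:5\n"
)

if __name__ == "__main__":
    for (kind, location), n in count_ubsan_reports(SAMPLE).most_common():
        print(f"{n:4d}  {kind}  {location}")
```

A small number of distinct locations, all hit once during driver initialization, would be consistent with the warnings being unrelated to the runtime hangs, though that remains an assumption until checked against the full log.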

ProblemType: Bug
DistroRelease: Ubuntu 24.04
Package: linux-image-6.6.0-14-generic 6.6.0-14.14
ProcVersionSignature: Ubuntu 6.6.0-14.14-generic 6.6.3
Uname: Linux 6.6.0-14-generic x86_64
ApportVersion: 2.27.0-0ubuntu6
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/seq: holmanb 7815 F.... pipewire
 /dev/snd/controlC0: holmanb 7825 F.... wireplumber
 /dev/snd/controlC2: holmanb 7825 F.... wireplumber
 /dev/snd/controlC1: holmanb 7825 F.... wireplumber
CRDA: N/A
CasperMD5CheckResult: pass
CurrentDesktop: ubuntu:GNOME
Date: Thu Feb 1 11:34:21 2024
InstallationDate: Installed on 2021-10-20 (834 days ago)
InstallationMedia: Ubuntu 21.04 "Hirsute Hippo" - Release amd64 (20210420)
IwConfig:
 lo no wireless extensions.

 enp24s0 no wireless extensions.

 virbr0 no wireless extensions.
MachineType: Micro-Star International Co., Ltd. MS-7B79
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-6.6.0-14-generic root=/dev/mapper/vgubuntu-root ro quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-6.6.0-14-generic N/A
 linux-backports-modules-6.6.0-14-generic N/A
 linux-firmware 20230919.git3672ccab-0ubuntu2.5
RfKill:

SourcePackage: linux
UpgradeStatus: Upgraded to noble on 2024-01-30 (2 days ago)
dmi.bios.date: 06/28/2018
dmi.bios.release: 5.13
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: A.40
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: X470 GAMING PLUS (MS-7B79)
dmi.board.vendor: Micro-Star International Co., Ltd.
dmi.board.version: 2.0
dmi.chassis.asset.tag: To be filled by O.E.M.
dmi.chassis.type: 3
dmi.chassis.vendor: Micro-Star International Co., Ltd.
dmi.chassis.version: 2.0
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvrA.40:bd06/28/2018:br5.13:svnMicro-StarInternationalCo.,Ltd.:pnMS-7B79:pvr2.0:rvnMicro-StarInternationalCo.,Ltd.:rnX470GAMINGPLUS(MS-7B79):rvr2.0:cvnMicro-StarInternationalCo.,Ltd.:ct3:cvr2.0:skuTobefilledbyO.E.M.:
dmi.product.family: To be filled by O.E.M.
dmi.product.name: MS-7B79
dmi.product.sku: To be filled by O.E.M.
dmi.product.version: 2.0
dmi.sys.vendor: Micro-Star International Co., Ltd.
