amdgpu kernel crash

Bug #1940690 reported by pigs-on-the-swim
This bug report is a duplicate of: Bug #1956845: amdgpu: [gfxhub0] retry page fault.
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone:

Bug Description

[ 0.000000] Linux version 5.4.0-81-generic (buildd@lgw01-amd64-052) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #91-Ubuntu SMP Thu Jul 15 19:09:17 UTC 2021 (Ubuntu 5.4.0-81.91-generic 5.4.128)
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.4.0-81-generic root=UUID=2fac2ccc-b353-4ced-a8e5-7e5a7f0fe5f3 ro

[ 217.837643] amdgpu 0000:09:00.0: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32772, for process Xorg pid 7499 thread Xorg:cs0 pid 7500)
[ 217.837647] amdgpu 0000:09:00.0: in page starting at address 0x000080010797e000 from client 27
[ 217.837649] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
[ 217.837651] amdgpu 0000:09:00.0: MORE_FAULTS: 0x1
[ 217.837653] amdgpu 0000:09:00.0: WALKER_ERROR: 0x0
[ 217.837655] amdgpu 0000:09:00.0: PERMISSION_FAULTS: 0x3
[ 217.837657] amdgpu 0000:09:00.0: MAPPING_ERROR: 0x0
[ 217.837658] amdgpu 0000:09:00.0: RW: 0x0
[ 217.837668] amdgpu 0000:09:00.0: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32772, for process Xorg pid 7499 thread Xorg:cs0 pid 7500)
[ 217.837670] amdgpu 0000:09:00.0: in page starting at address 0x00008001079a6000 from client 27
[ 217.837672] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
[ 217.837674] amdgpu 0000:09:00.0: MORE_FAULTS: 0x1
[ 217.837675] amdgpu 0000:09:00.0: WALKER_ERROR: 0x0
[ 217.837677] amdgpu 0000:09:00.0: PERMISSION_FAULTS: 0x3
[ 217.837679] amdgpu 0000:09:00.0: MAPPING_ERROR: 0x0
[ 217.837681] amdgpu 0000:09:00.0: RW: 0x0
[ 217.837698] amdgpu 0000:09:00.0: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32772, for process Xorg pid 7499 thread Xorg:cs0 pid 7500)
[ 217.837703] amdgpu 0000:09:00.0: in page starting at address 0x00008001079ce000 from client 27
[ 217.837708] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
[ 217.837712] amdgpu 0000:09:00.0: MORE_FAULTS: 0x1
[ 217.837716] amdgpu 0000:09:00.0: WALKER_ERROR: 0x0
[ 217.837721] amdgpu 0000:09:00.0: PERMISSION_FAULTS: 0x3
[ 217.837725] amdgpu 0000:09:00.0: MAPPING_ERROR: 0x0
[ 217.837729] amdgpu 0000:09:00.0: RW: 0x0
[ 217.838669] amdgpu 0000:09:00.0: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32772, for process Xorg pid 7499 thread Xorg:cs0 pid 7500)
[ 217.838673] amdgpu 0000:09:00.0: in page starting at address 0x000080010797e000 from client 27
[ 217.838675] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
[ 217.838677] amdgpu 0000:09:00.0: MORE_FAULTS: 0x1
[ 217.838679] amdgpu 0000:09:00.0: WALKER_ERROR: 0x0
[ 217.838681] amdgpu 0000:09:00.0: PERMISSION_FAULTS: 0x3
[ 217.838683] amdgpu 0000:09:00.0: MAPPING_ERROR: 0x0
[ 217.838684] amdgpu 0000:09:00.0: RW: 0x0
[ 217.838695] amdgpu 0000:09:00.0: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32772, for process Xorg pid 7499 thread Xorg:cs0 pid 7500)
[ 217.838697] amdgpu 0000:09:00.0: in page starting at address 0x00008001079a6000 from client 27
[ 217.838699] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
[ 217.838701] amdgpu 0000:09:00.0: MORE_FAULTS: 0x1
[ 217.838703] amdgpu 0000:09:00.0: WALKER_ERROR: 0x0
[ 217.838704] amdgpu 0000:09:00.0: PERMISSION_FAULTS: 0x3
[ 217.838706] amdgpu 0000:09:00.0: MAPPING_ERROR: 0x0
[ 217.838708] amdgpu 0000:09:00.0: RW: 0x0
[ 217.838727] amdgpu 0000:09:00.0: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32772, for process Xorg pid 7499 thread Xorg:cs0 pid 7500)
[ 217.838732] amdgpu 0000:09:00.0: in page starting at address 0x00008001079ce000 from client 27
[ 217.838736] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
[ 217.838741] amdgpu 0000:09:00.0: MORE_FAULTS: 0x1
[ 217.838746] amdgpu 0000:09:00.0: WALKER_ERROR: 0x0
[ 217.838750] amdgpu 0000:09:00.0: PERMISSION_FAULTS: 0x3
[ 217.838755] amdgpu 0000:09:00.0: MAPPING_ERROR: 0x0
[ 217.838759] amdgpu 0000:09:00.0: RW: 0x0
[ 217.839694] amdgpu 0000:09:00.0: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32772, for process Xorg pid 7499 thread Xorg:cs0 pid 7500)
[ 217.839698] amdgpu 0000:09:00.0: in page starting at address 0x000080010797e000 from client 27
[ 217.839700] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
[ 217.839702] amdgpu 0000:09:00.0: MORE_FAULTS: 0x1
[ 217.839703] amdgpu 0000:09:00.0: WALKER_ERROR: 0x0
[ 217.839705] amdgpu 0000:09:00.0: PERMISSION_FAULTS: 0x3
[ 217.839707] amdgpu 0000:09:00.0: MAPPING_ERROR: 0x0
[ 217.839709] amdgpu 0000:09:00.0: RW: 0x0
[ 217.992553] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
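
The decoded fields that follow each VM_L2_PROTECTION_FAULT_STATUS line can be recomputed from the raw register value. The following is a minimal sketch, not the driver's code: the bit positions in `FIELDS` are an assumption based on the gfx9 GFXHUB register layout and are not stated anywhere in this report, although they do reproduce the MORE_FAULTS, WALKER_ERROR, PERMISSION_FAULTS, MAPPING_ERROR and RW values logged above as well as the vmid:3 from the fault line. The name `decode_fault_status` is hypothetical.

```python
# Assumed bit layout of VM_L2_PROTECTION_FAULT_STATUS on gfx9 (shift, width).
# This layout is an assumption for illustration; it is not given in the report.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),
    "RW": (18, 1),
    "ATOMIC": (19, 1),
    "VMID": (20, 4),
}


def decode_fault_status(value: int) -> dict:
    """Split a raw status value into named bit fields."""
    return {
        name: (value >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Raw value taken from the log above.
    for name, field in decode_fault_status(0x00301031).items():
        print(f"{name}: {field:#x}")
```

Under this layout, RW: 0x0 would indicate a read access and PERMISSION_FAULTS: 0x3 would mean the two lowest permission bits are set; the exact meaning of each permission bit is not covered here.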

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: xorg 1:7.7+19ubuntu14
ProcVersionSignature: Ubuntu 5.4.0-80.90-generic 5.4.124
Uname: Linux 5.4.0-80-generic x86_64
ApportVersion: 2.20.11-0ubuntu27.18
Architecture: amd64
BootLog:

CasperMD5CheckResult: skip
CompositorRunning: None
Date: Fri Aug 20 17:39:16 2021
DistUpgraded: 2020-11-08 09:01:34,443 ERROR got error from PostInstallScript ./xorg_fix_proprietary.py (g-exec-error-quark: Failed to execute child process "./xorg_fix_proprietary.py" (No such file or directory) (8))
DistroCodename: focal
DistroVariant: ubuntu
ExtraDebuggingInterest: Yes
GraphicsCard:
 Advanced Micro Devices, Inc. [AMD/ATI] Picasso [1002:15d8] (rev c8) (prog-if 00 [VGA controller])
   Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Picasso [1002:15d8]
InstallationDate: Installed on 2020-10-31 (293 days ago)
InstallationMedia: Ubuntu 14.04.4 LTS "Trusty Tahr" - Release amd64 (20160217.1)
MachineType: To Be Filled By O.E.M. To Be Filled By O.E.M.
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.4.0-80-generic root=UUID=2fac2ccc-b353-4ced-a8e5-7e5a7f0fe5f3 ro
Renderer: Software
SourcePackage: xorg
Symptom: display
Title: Xorg crash
UnitySupportTest: Error: command ['/usr/lib/nux/unity_support_test', '-p', '-f'] failed with exit code 5: Error: unable to open display
UpgradeStatus: Upgraded to focal on 2020-11-08 (285 days ago)
XorgConf:
 Section "InputClass"
     Identifier "middle button emulation class"
     MatchIsPointer "on"
     Option "Emulate3Buttons" "on"
 EndSection
dmi.bios.date: 06/18/2020
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: P4.20
dmi.board.name: B450 Pro4
dmi.board.vendor: ASRock
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 3
dmi.chassis.vendor: To Be Filled By O.E.M.
dmi.chassis.version: To Be Filled By O.E.M.
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvrP4.20:bd06/18/2020:svnToBeFilledByO.E.M.:pnToBeFilledByO.E.M.:pvrToBeFilledByO.E.M.:rvnASRock:rnB450Pro4:rvr:cvnToBeFilledByO.E.M.:ct3:cvrToBeFilledByO.E.M.:
dmi.product.family: To Be Filled By O.E.M.
dmi.product.name: To Be Filled By O.E.M.
dmi.product.sku: To Be Filled By O.E.M.
dmi.product.version: To Be Filled By O.E.M.
dmi.sys.vendor: To Be Filled By O.E.M.
mtime.conffile..etc.apport.crashdb.conf: 2021-08-13T08:27:44.287879
version.compiz: compiz 1:0.9.14.1+20.04.20200211-0ubuntu1
version.libdrm2: libdrm2 2.4.105-3~20.04.1
version.libgl1-mesa-dri: libgl1-mesa-dri 21.0.3-0ubuntu0.3~20.04.1
version.libgl1-mesa-glx: libgl1-mesa-glx 21.0.3-0ubuntu0.3~20.04.1
version.xserver-xorg-core: xserver-xorg-core 2:1.20.11-1ubuntu1~20.04.2
version.xserver-xorg-input-evdev: xserver-xorg-input-evdev 1:2.10.6-1
version.xserver-xorg-video-ati: xserver-xorg-video-ati 1:19.1.0-1
version.xserver-xorg-video-intel: xserver-xorg-video-intel 2:2.99.917+git20200226-1
version.xserver-xorg-video-nouveau: xserver-xorg-video-nouveau 1:1.0.16-1
xserver.bootTime: Sat Nov 7 23:42:22 2020
xserver.configfile: default
xserver.devices:
 input Power Button KEYBOARD, id 6
 input Video Bus KEYBOARD, id 7
 input Power Button KEYBOARD, id 8
 input Dell Dell USB Keyboard KEYBOARD, id 9
 input PixArt Cherry USB Optical Mouse MOUSE, id 10
xserver.errors:
 open /dev/dri/card0: No such file or directory
 open /dev/dri/card0: No such file or directory
 Screen 0 deleted because of no matching config section.
 AIGLX: reverting to software rendering
xserver.logfile: /var/log/Xorg.0.log
xserver.outputs:

xserver.version: 2:1.19.6-1ubuntu4.7

pigs-on-the-swim (dw1-4) wrote :

This is a duplicate of #1939417.

summary: - Xorg crash
+ amdgpu kernel crash
affects: xorg (Ubuntu) → linux (Ubuntu)
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu):
status: New → Confirmed
torel (torehl) wrote :

Hardware: DP EPYC Milan 7763 node with two AMD Instinct MI100 GPUs

Kernel: Ubuntu 18.04.6 LTS with linux-hwe 5.4.0-107-generic

Driver stack: ROCm 5.1.0 with AMDGPU driver version 5.13.20.5.1

Workload: homegrown software developed using ROCm 5.1.0.

Might this be related?

Logs:

[304726.475355] beegfs: enabling unsafe global rkey
[304734.912424] amdgpu 0000:23:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32769, for process hyprep pid 122284 thread hyprep pid 122284)
[304734.928526] amdgpu 0000:23:00.0: amdgpu: in page starting at address 0x0000000001753000 from IH client 0x1b (UTCL2)
[304734.939972] amdgpu 0000:23:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
[304734.948130] amdgpu 0000:23:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[304734.955858] amdgpu 0000:23:00.0: amdgpu: MORE_FAULTS: 0x1
[304734.962115] amdgpu 0000:23:00.0: amdgpu: WALKER_ERROR: 0x0
[304734.968441] amdgpu 0000:23:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[304734.975196] amdgpu 0000:23:00.0: amdgpu: MAPPING_ERROR: 0x0
[304734.981580] amdgpu 0000:23:00.0: amdgpu: RW: 0x0
[304735.568400] amdgpu 0000:23:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32769, for process pid 0 thread pid 0)
[304735.582318] amdgpu 0000:23:00.0: amdgpu: in page starting at address 0x0000000001753000 from IH client 0x1b (UTCL2)
[304735.593722] amdgpu 0000:23:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[304735.601851] amdgpu 0000:23:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[304735.609465] amdgpu 0000:23:00.0: amdgpu: MORE_FAULTS: 0x0
[304735.615686] amdgpu 0000:23:00.0: amdgpu: WALKER_ERROR: 0x0
[304735.621994] amdgpu 0000:23:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[304735.628737] amdgpu 0000:23:00.0: amdgpu: MAPPING_ERROR: 0x0
[304735.635104] amdgpu 0000:23:00.0: amdgpu: RW: 0x0
[321839.599489] beegfs: enabling unsafe global rkey
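
To compare fault logs like the ones in this report, it can help to pair each page-fault header with its address line and count repeats. The following is a rough sketch that assumes the two message formats seen here (with and without the extra `amdgpu:` prefix, `retry` and `no-retry`); `count_faults` and both regular expressions are hypothetical helpers, not part of any amdgpu tooling.

```python
import re
from collections import Counter

# Matches the fault header line, e.g.
# "[gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32769, for process hyprep pid 122284 ...)"
FAULT_RE = re.compile(
    r"\[(?P<hub>\w+)\] (?P<kind>retry|no-retry) page fault "
    r"\(.*?vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+), "
    r"for process\s+(?P<process>.*?)\s*pid (?P<pid>\d+)"
)
# Matches the address line that follows each header.
ADDR_RE = re.compile(r"in page starting at address (?P<addr>0x[0-9a-fA-F]+)")


def count_faults(lines):
    """Count faults keyed by (kind, process name, faulting page address)."""
    counts = Counter()
    current = None  # most recent fault header that has no address yet
    for line in lines:
        header = FAULT_RE.search(line)
        if header:
            current = header
            continue
        addr = ADDR_RE.search(line)
        if addr and current is not None:
            key = (current["kind"], current["process"], int(addr["addr"], 16))
            counts[key] += 1
            current = None
    return counts


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 count_faults.py
    for (kind, process, addr), n in count_faults(sys.stdin).most_common():
        print(f"{n:5d}  {kind:8s}  {process or '(none)':12s}  {addr:#018x}")
```

A fault logged with an empty process name (as in the second entry above, `pid 0`) is kept as a separate key with an empty string rather than merged with the named process.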

Driver:
Apr 02 22:19:59 n004 kernel: [drm] amdgpu kernel modesetting enabled.
Apr 02 22:19:59 n004 kernel: [drm] amdgpu version: 5.13.20.5.1
Apr 02 22:19:59 n004 kernel: amdgpu: Ignoring ACPI CRAT on non-APU system
Apr 02 22:19:59 n004 kernel: amdgpu: Virtual CRAT table created for CPU
Apr 02 22:19:59 n004 kernel: amdgpu: Topology: Add CPU node
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x67800000000 -> 0x67fffffffff
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x68000000000 -> 0x680001fffff
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xeb400000 -> 0xeb47ffff
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: enabling device (0000 -> 0003)
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: Fetched VBIOS from ROM BAR
Apr 02 22:19:59 n004 kernel: amdgpu: ATOM BIOS: 113-D3431401-100
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: MEM ECC is active.
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: SRAM ECC is active.
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: RAS INFO: ras initialized successfully, ha...
