ubuntu 22.04.4 lock up with nvidia driver, "NVRM: GPU 0000:01:00.0: GPU has fallen off the bus."
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
nvidia-graphics-drivers-535 (Ubuntu) | New | Undecided | Unassigned |
Bug Description
I have been using Nvidia's driver on Ubuntu 22.04.4 without major issues for about a year on this Dell Latitude 5531 notebook (01:00.0 3D controller: NVIDIA Corporation TU117M [GeForce MX550] (rev a1)). Yesterday an Ubuntu update moved me to driver 535; the previous one was 525.
Some hours later, when I came back from lunch, I found the notebook locked up: the keyboard and mouse did not respond, and the fan was almost "screaming". I don't remember a single event like this before the upgrade (with 525 or any older driver).
Now it has happened again after an hour or so, and it seems to be a regular occurrence from now on :( This is very problematic, since each time I have to hard power-off the system, losing all unsaved work and my workflow.
ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: nvidia-driver-535 535.161.
ProcVersionSign
Uname: Linux 6.5.0-26-generic x86_64
NonfreeKernelMo
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
CasperMD5CheckR
CurrentDesktop: ubuntu:GNOME
Date: Fri Apr 5 15:52:11 2024
InstallationDate: Installed on 2022-09-23 (560 days ago)
InstallationMedia: Ubuntu 22.04.1 LTS "Jammy Jellyfish" - Release amd64 (20220809.1)
SourcePackage: nvidia-
UpgradeStatus: No upgrade log present (probably fresh install)
I've encountered this issue with this one: https:/
Kernel log relevant part:
Apr 5 15:37:00 rygel kernel: [ 2925.004224] pcieport 0000:00:01.0: PME: Spurious native interrupt!
Apr 5 15:37:25 rygel kernel: [ 2949.648804] pcieport 0000:00:01.0: PME: Spurious native interrupt!
Apr 5 15:37:36 rygel kernel: [ 2960.804915] pcieport 0000:00:01.0: PME: Spurious native interrupt!
Apr 5 15:37:48 rygel kernel: [ 2973.253151] pcieport 0000:00:01.0: PME: Spurious native interrupt!
Apr 5 15:38:00 rygel kernel: [ 2984.989466] pcieport 0000:00:01.0: PME: Spurious native interrupt!
Apr 5 15:41:38 rygel kernel: [ 3202.837449] pcieport 0000:00:01.0: PME: Spurious native interrupt!
Apr 5 15:42:17 rygel kernel: [ 3241.782062] pcieport 0000:00:01.0: PME: Spurious native interrupt!
Apr 5 15:42:57 rygel kernel: [ 3282.354685] NVRM: GPU at PCI:0000:01:00: GPU-78091d7e-2007-c450-19a1-f764cae07b00
Apr 5 15:42:57 rygel kernel: [ 3282.354689] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Apr 5 15:42:57 rygel kernel: [ 3282.354691] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Apr 5 15:42:57 rygel kernel: [ 3282.354730] NVRM: A GPU crash dump has been created. If possible, please run
Apr 5 15:42:57 rygel kernel: [ 3282.354730] NVRM: nvidia-bug-report.sh as root to collect this data before
Apr 5 15:42:57 rygel kernel: [ 3282.354730] NVRM: the NVIDIA kernel module is unloaded.
Apr 5 15:43:02 rygel kernel: [ 3287.474709] NVRM: Error in service of callback
Apr 5 15:48:05 rygel kernel: [ 3590.071558] Asynchronous wait on fence NVIDIA:nvidia.prime:11cbb timed out (hint:intel_atomic_commit_ready [i915])
Please note that the "Spurious native interrupt" messages may or may not be related. I have seen them constantly since I first used this notebook, so they can probably be ignored. Likewise, this line:
Mar 3 10:13:07 rygel kernel: [ 634.685830] workqueue: pm_runtime_work hogged CPU for >11428us 16 times, consider switching to WQ_UNBOUND
is also a very frequent guest of mine :) and has been for ages. I have no idea what it means, or whether it is connected to the problem.
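In case it helps anyone triaging similar logs, here is a minimal sketch for pulling the NVRM Xid events out of kernel log lines so they are not buried under the "Spurious native interrupt" noise. It assumes the line format seen in the log above; the names `XID_RE` and `find_xid_events` are made up for this example and are not part of any NVIDIA tool.

```python
import re

# Assumed format, taken from the log above:
#   NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<code>\d+), (?P<rest>.*)"
)


def find_xid_events(lines):
    """Return (pci_address, xid_code, message) for every NVRM Xid line."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("pci"), int(match.group("code")), match.group("rest").strip())
            )
    return events


# Two lines copied from the kernel log in this report.
sample = [
    "Apr 5 15:42:17 rygel kernel: [ 3241.782062] pcieport 0000:00:01.0: PME: Spurious native interrupt!",
    "Apr 5 15:42:57 rygel kernel: [ 3282.354689] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.",
]

for pci, code, message in find_xid_events(sample):
    print(pci, code, message)
```

The same function could be fed the lines of a saved kernel log file instead of `sample`; which file holds the kernel log depends on the system's logging setup.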
lspci:
01:00.0 3D controller: NVIDIA Corporation TU117M [GeForce MX550] (rev a1)
Subsystem: Dell Device 0b0f
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 190
IOMMU group: 18
Region 0: Memory at 8e000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at 6000000000 (64-bit, prefetchable) [size=256M]
Region 3: Memory at 6010000000 (64-bit, prefetchable) [size=32M]
Region 5: I/O ports at 3000 [size=128]
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00b18 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- Pw...