ubuntu 22.04.4 lock up with nvidia driver, "NVRM: GPU 0000:01:00.0: GPU has fallen off the bus."

Bug #2060303 reported by LGB [Gábor Lénárt]
This bug affects 1 person
Affects: nvidia-graphics-drivers-535 (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

I have been using NVIDIA's driver on Ubuntu 22.04.4 without major issues for about a year on this Dell Latitude 5531 notebook (01:00.0 3D controller: NVIDIA Corporation TU117M [GeForce MX550] (rev a1)). Yesterday I installed an Ubuntu update, so I am now on driver 535; the previous one was 525.

Some hours later, after coming back from lunch, I found the notebook locked up: the keyboard and mouse did not respond, and the fan was almost "screaming". I don't remember a single event like this before the upgrade (with 525 or any older driver).

Now it has happened again after an hour or so, and it seems to be a regular occurrence from now on :( This is very problematic, since I have to hard power-off the system every time, losing all unsaved work and my workflow.

ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: nvidia-driver-535 535.161.07-0ubuntu0.22.04.1
ProcVersionSignature: Ubuntu 6.5.0-26.26~22.04.1-generic 6.5.13
Uname: Linux 6.5.0-26-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
CasperMD5CheckResult: pass
CurrentDesktop: ubuntu:GNOME
Date: Fri Apr 5 15:52:11 2024
InstallationDate: Installed on 2022-09-23 (560 days ago)
InstallationMedia: Ubuntu 22.04.1 LTS "Jammy Jellyfish" - Release amd64 (20220809.1)
SourcePackage: nvidia-graphics-drivers-535
UpgradeStatus: No upgrade log present (probably fresh install)

I encountered this issue alongside this one: https://bugs.launchpad.net/ubuntu/+source/mutter/+bug/2059847 Please see my comment there: https://bugs.launchpad.net/ubuntu/+source/mutter/+bug/2059847/comments/37

LGB [Gábor Lénárt] (lgb) wrote:

Kernel log relevant part:

Apr 5 15:37:00 rygel kernel: [ 2925.004224] pcieport 0000:00:01.0: PME: Spurious native interrupt!
Apr 5 15:37:25 rygel kernel: [ 2949.648804] pcieport 0000:00:01.0: PME: Spurious native interrupt!
Apr 5 15:37:36 rygel kernel: [ 2960.804915] pcieport 0000:00:01.0: PME: Spurious native interrupt!
Apr 5 15:37:48 rygel kernel: [ 2973.253151] pcieport 0000:00:01.0: PME: Spurious native interrupt!
Apr 5 15:38:00 rygel kernel: [ 2984.989466] pcieport 0000:00:01.0: PME: Spurious native interrupt!
Apr 5 15:41:38 rygel kernel: [ 3202.837449] pcieport 0000:00:01.0: PME: Spurious native interrupt!
Apr 5 15:42:17 rygel kernel: [ 3241.782062] pcieport 0000:00:01.0: PME: Spurious native interrupt!
Apr 5 15:42:57 rygel kernel: [ 3282.354685] NVRM: GPU at PCI:0000:01:00: GPU-78091d7e-2007-c450-19a1-f764cae07b00
Apr 5 15:42:57 rygel kernel: [ 3282.354689] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Apr 5 15:42:57 rygel kernel: [ 3282.354691] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Apr 5 15:42:57 rygel kernel: [ 3282.354730] NVRM: A GPU crash dump has been created. If possible, please run
Apr 5 15:42:57 rygel kernel: [ 3282.354730] NVRM: nvidia-bug-report.sh as root to collect this data before
Apr 5 15:42:57 rygel kernel: [ 3282.354730] NVRM: the NVIDIA kernel module is unloaded.
Apr 5 15:43:02 rygel kernel: [ 3287.474709] NVRM: Error in service of callback
Apr 5 15:48:05 rygel kernel: [ 3590.071558] Asynchronous wait on fence NVIDIA:nvidia.prime:11cbb timed out (hint:intel_atomic_commit_ready [i915])

Please note that the "Spurious native interrupt" messages may or may not be related. I have been seeing them constantly since I first used this notebook, so they can probably be ignored. Similarly, this line:

Mar 3 10:13:07 rygel kernel: [ 634.685830] workqueue: pm_runtime_work hogged CPU for >11428us 16 times, consider switching to WQ_UNBOUND

has also been a very frequent guest of mine :) for ages. I have no idea what it means, or whether it is connected to the problem.
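To separate the NVRM Xid report from the surrounding noise in a long kernel log, a small filter can help. The following is only a minimal sketch: the function name `find_xid_events` and the regular expression are illustrative assumptions based on the single Xid line quoted above, not an official NVIDIA log format definition.

```python
import re

# Pattern derived from the one Xid line in this report (an assumption,
# not a documented format):
#   NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<code>\d+),\s*(?P<rest>.*)"
)


def find_xid_events(lines):
    """Return (pci_address, xid_code, rest_of_message) for each Xid line."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("pci"), int(match.group("code")), match.group("rest").strip())
            )
    return events


# Lines copied from the kernel log in this report.
SAMPLE = [
    "Apr 5 15:42:17 rygel kernel: [ 3241.782062] pcieport 0000:00:01.0: PME: Spurious native interrupt!",
    "Apr 5 15:42:57 rygel kernel: [ 3282.354689] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.",
    "Apr 5 15:43:02 rygel kernel: [ 3287.474709] NVRM: Error in service of callback",
]

if __name__ == "__main__":
    for pci, code, rest in find_xid_events(SAMPLE):
        print(f"Xid {code} on {pci}: {rest}")
```

The same function could be fed the lines of a saved kernel log file to list every Xid event with its code.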

lspci:

01:00.0 3D controller: NVIDIA Corporation TU117M [GeForce MX550] (rev a1)
 Subsystem: Dell Device 0b0f
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0
 Interrupt: pin A routed to IRQ 190
 IOMMU group: 18
 Region 0: Memory at 8e000000 (32-bit, non-prefetchable) [size=16M]
 Region 1: Memory at 6000000000 (64-bit, prefetchable) [size=256M]
 Region 3: Memory at 6010000000 (64-bit, prefetchable) [size=32M]
 Region 5: I/O ports at 3000 [size=128]
 Capabilities: [60] Power Management version 3
  Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
  Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
 Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
  Address: 00000000fee00b18 Data: 0000
 Capabilities: [78] Express (v2) Endpoint, MSI 00
  DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
   ExtTag+ AttnBtn- AttnInd- Pw...


LGB [Gábor Lénárt] (lgb) wrote:
description: updated
LGB [Gábor Lénárt] (lgb) wrote:

I'm sorry for the flood of comments, but it has happened two more times already, and I think I see a pattern in when it occurs: if I use the computer to browse the web, watch some videos, write a C program in vim, or whatever, it's fine. However, if I leave it alone for more than 5-10 minutes or so (no need to lock the screen or close the lid of the notebook!), then it happens.

It seems that simply leaving a "top" command running in a terminal window is enough to avoid the idling problem (and the lock-up), or whatever it is...

Interestingly, since I have been careful not to leave the GPU idle, I haven't seen the following lines in the kernel log either. They were also present with the previous NVIDIA driver(s), though there were no lock-ups then, only the log messages:

pcieport 0000:00:01.0: PME: Spurious native interrupt!
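Because the lock-ups seem tied to the machine sitting idle, one thing that may be worth checking is whether the kernel is allowed to runtime-suspend the GPU. This is only a guess; nothing in this report establishes runtime power management as the cause. The sketch below reads the standard sysfs `power/control` and `power/runtime_status` attributes for a PCI device. The helper name `read_runtime_pm` is hypothetical, and the device path uses the 0000:01:00.0 address from the lspci output above.

```python
from pathlib import Path


def read_runtime_pm(device_dir):
    """Read runtime power management attributes of one device from sysfs.

    Returns a dict with the contents of power/control and
    power/runtime_status, or None for an attribute that cannot be read.
    """
    power_dir = Path(device_dir) / "power"
    result = {}
    for name in ("control", "runtime_status"):
        try:
            result[name] = (power_dir / name).read_text().strip()
        except OSError:
            # Attribute missing or unreadable (e.g. not running on Linux).
            result[name] = None
    return result


if __name__ == "__main__":
    # PCI address taken from the lspci output in this report.
    print(read_runtime_pm("/sys/bus/pci/devices/0000:01:00.0"))
```

A `control` value of `auto` means the kernel may suspend the device when it is idle, while `on` keeps it powered. Comparing the output while the machine is busy and shortly after it goes idle might show whether the GPU is being suspended before the lock-up.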
