'gpu has fallen off the bus' after return from suspend

Bug #1847937 reported by Wojciech Wieckowski
This bug affects 12 people
Affects          Status         Importance   Assigned to   Milestone
linux (Ubuntu)   Fix Released   Undecided    Unassigned

Bug Description

Once this message appears after resume, the only way to get back to work is to reboot the system with the power button.

With kernel linux-image-5.3.0-13-generic everything is OK.
The problem appeared after the upgrade to linux-image-5.3.0-17-generic and still happens with linux-image-5.3.0-18-generic.

ProblemType: Bug
DistroRelease: Ubuntu 19.10
Package: linux-image-5.3.0-18-generic 5.3.0-18.19+1
ProcVersionSignature: Ubuntu 5.3.0-13.14-generic 5.3.0
Uname: Linux 5.3.0-13-generic x86_64
ApportVersion: 2.20.11-0ubuntu8
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC2: player 4451 F.... pulseaudio
 /dev/snd/controlC1: player 4451 F.... pulseaudio
 /dev/snd/controlC0: player 4451 F.... pulseaudio
CurrentDesktop: KDE
Date: Mon Oct 14 01:27:04 2019
HibernationDevice: RESUME=UUID=a58bb700-5818-4ca6-8351-fd4e5e28ea9e
MachineType: LENOVO 20BG0016US
ProcFB: 0 i915drmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.3.0-13-generic root=UUID=d0c33b75-6f04-4d53-bc3b-91924e3bcf91 ro quiet splash vga=current loglevel=0 vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-5.3.0-13-generic N/A
 linux-backports-modules-5.3.0-13-generic N/A
 linux-firmware 1.183
SourcePackage: linux
UpgradeStatus: Upgraded to eoan on 2019-10-08 (5 days ago)
WifiSyslog:

dmi.bios.date: 05/30/2018
dmi.bios.vendor: LENOVO
dmi.bios.version: GNET88WW (2.36 )
dmi.board.asset.tag: Not Available
dmi.board.name: 20BG0016US
dmi.board.vendor: LENOVO
dmi.board.version: 0B98401 Pro
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Not Available
dmi.modalias: dmi:bvnLENOVO:bvrGNET88WW(2.36):bd05/30/2018:svnLENOVO:pn20BG0016US:pvrThinkPadW540:rvnLENOVO:rn20BG0016US:rvr0B98401Pro:cvnLENOVO:ct10:cvrNotAvailable:
dmi.product.family: ThinkPad W540
dmi.product.name: 20BG0016US
dmi.product.sku: LENOVO_MT_20BG
dmi.product.version: ThinkPad W540
dmi.sys.vendor: LENOVO

Wojciech Wieckowski (xplwowi) wrote :
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Wojciech Wieckowski (xplwowi) wrote :

Correction: when I boot 5.3.0-13, the proprietary Nvidia driver is not loaded. That's why the bug does not appear there.

Wojciech Wieckowski (xplwowi) wrote :

Wrong package.
Bug is caused by nvidia dkms driver patch created to solve another bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1836630

Changed in linux (Ubuntu):
status: Confirmed → Invalid
Kai-Heng Feng (kaihengfeng) wrote :

We've discussed this with Nvidia and they confirmed [1] and/or [2] can solve the issue. Can you please try:
- [PATCH] PCI: PM: Fix pci_power_up() [1]
- [PATCH v2 0/2] PCI: Add missing link delays [2]

Let me know if you need a kernel binary package.

[1] https://lore.kernel.org<email address hidden>/T/#mf60a04c0abefe088af45e706f5619e7fa584cc7e
[2] https://<email address hidden>/T/#m1eff5c97b885fcc655874be028074a48c0d1d21f

Changed in linux (Ubuntu):
status: Invalid → Triaged
Wojciech Wieckowski (xplwowi) wrote :

If this is not a problem for you, I would ask for binaries.

Both threads contain several versions of the patches, and I don't know whether I should take exactly the ones from the messages you linked to or the latest available.

Thanks in advance!

Kai-Heng Feng (kaihengfeng) wrote :
Wojciech Wieckowski (xplwowi) wrote :

Thank you. I just finished testing all three variants and none of them solves the problem.

This time I used a second computer to log in via SSH and grab the dmesg logs.

The problematic part always begins with this line:
nvidia 0000:01:00.0: Refused to change power state, currently in D3
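
To check whether a failed resume matches this report, the two messages quoted in this thread can be searched for in the kernel log. This is a minimal sketch, assuming a POSIX shell; the function name gpu_resume_errors is made up for illustration and is not part of any tool.

```shell
# Filter kernel log lines on stdin down to the two symptoms quoted in
# this thread. gpu_resume_errors is a hypothetical helper name.
gpu_resume_errors() {
    grep -iE 'Refused to change power state|fallen off the bus'
}

# Possible usage (not run here):
#   dmesg | gpu_resume_errors                 # current boot, e.g. over SSH
#   journalctl -k -b -1 | gpu_resume_errors   # previous boot, after a hard reboot;
#                                             # needs a persistent journal
```

The second form is useful when the machine had to be power-cycled and the log of the failed resume is only available from the previous boot.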

Wojciech Wieckowski (xplwowi) wrote :
Wojciech Wieckowski (xplwowi) wrote :
Kai-Heng Feng (kaihengfeng) wrote :

Can you please try:
https://people.canonical.com/~khfeng/lp1847937-3/

This has a slightly modified version of
https://<email address hidden>/

Wojciech Wieckowski (xplwowi) wrote :

Unfortunately, nothing has changed.
So far, only calling pci_save_state() makes the system usable on return from S3.

Kai-Heng Feng (kaihengfeng) wrote :

Please file an upstream bug at https://bugzilla.kernel.org/
Product: Drivers
Component: PCI

And subscribe me. We need to discuss this with PCI/PM folks.

rojer (rojer9) wrote :

getting the same error, Dell XPS 15, GeForce GTX 1650, kernel 5.3.0-24-generic, nvidia driver 435.21

rojer (rojer9) wrote :
Jonathon Padfield (jonpads) wrote :

I'm seeing the same problem: after suspend, the machine wakes to a black console with the message "GPU has fallen off the bus".

This is with a Dell Precision 5540, NVIDIA Corporation TU117GLM [Quadro T1000 Mobile]

Kubuntu, Linux 5.3.0-24-generic #26-Ubuntu SMP Thu Nov 14 01:33:18 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

The failure happens with both the nvidia-driver-435 and nvidia-driver-440 driver packages.

Kai-Heng Feng (kaihengfeng) wrote :

rojer and Jonathon,

Do those systems use s2idle by default?

Andrés (asmateus) wrote :

Kai-Heng Feng, your note solved it for me. I had the same PC as Jonathon (Dell Precision 5540, NVIDIA Corporation TU117GLM [Quadro T1000 Mobile]) with:

Arch Linux, Linux 5.3.8-arch1-1 x86_64 GNU/Linux

The solution is described at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1808957 and comes down to adding the kernel boot option:

mem_sleep_default=deep
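
Before adding the option, it can help to confirm which mode is currently active. The kernel lists the available modes in /sys/power/mem_sleep and puts the active one in square brackets. This is a minimal sketch, assuming a POSIX shell; the helper name active_sleep_mode is invented for illustration.

```shell
# Print the active suspend mode from a mem_sleep line such as
# "[s2idle] deep". The bracketed entry is the mode in effect.
# active_sleep_mode is a hypothetical helper name.
active_sleep_mode() {
    # $1: contents of /sys/power/mem_sleep
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Possible usage (not run here):
#   active_sleep_mode "$(cat /sys/power/mem_sleep)"
#
# To try deep for the current boot only, without editing the boot loader:
#   echo deep | sudo tee /sys/power/mem_sleep
```

On Ubuntu the boot option is typically made permanent by adding it to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and running sudo update-grub. Other distributions and boot loaders differ, so treat that as an assumption about a stock GRUB setup.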

Josh Yavor (schwascore) wrote :

Unfortunately, after looking into the fix Andrés mentioned in #18, I found that my system (Eoan, 5.3.0-24, Lenovo W540) was already configured for deep rather than s2idle, and it is still affected by the 'gpu has fallen off the bus' issue.

Jonathon Padfield (jonpads) wrote :

Sorry for the delay in replying. I guess I missed a notification. I guess it's using s2idle by default, yes.

jonpad@jonpad-laptop ~ $ uname -a
Linux jonpad-laptop 5.4.0-9-generic #12-Ubuntu SMP Mon Dec 16 22:34:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
jonpad@jonpad-laptop ~ $ sudo journalctl | grep "PM: suspend" | tail -2
Jan 16 08:51:52 jonpad-laptop kernel: PM: suspend entry (s2idle)
Jan 16 09:00:23 jonpad-laptop kernel: PM: suspend exit
jonpad@jonpad-laptop ~ $ cat /sys/power/mem_sleep
[s2idle] deep

Is it worth trying the latest set of nvidia drivers and seeing if that fixes it? At the moment, I've blacklisted all the nvidia/nouveau drivers, and the laptop just uses the Intel chip.
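
The journal check in the transcript above can be reduced to just the mode name, which makes it easier to compare against /sys/power/mem_sleep. This is a minimal sketch, assuming the "PM: suspend entry (...)" line format shown in this comment; last_suspend_mode is a made-up helper name.

```shell
# Read kernel log lines on stdin and print the mode of the most recent
# "PM: suspend entry (<mode>)" line, or nothing if none is found.
# last_suspend_mode is a hypothetical helper name.
last_suspend_mode() {
    sed -n 's/.*PM: suspend entry (\([^)]*\)).*/\1/p' | tail -n 1
}

# Possible usage (not run here):
#   sudo journalctl -k | last_suspend_mode
```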

Kai-Heng Feng (kaihengfeng) wrote :

AFAIK there are many fixes in the new nvidia driver.

Da Kuang (kuangda) wrote :

I still have the same issue which is frustrating... I am using XPS15, GeForce GTX 1650, kernel 5.3.0-7625-generic, Nvidia driver 440.44. My OS is Pop!_OS 19.10.

I think my Nvidia driver is already the latest version and mem_sleep is deep by itself.

$ cat /sys/power/mem_sleep
[s2idle] deep

Is there any other approach to try to fix the problem? Thanks!

Samuel Yvon (syvonum) wrote :

If I might add to this issue:

I also have an XPS 15 with a GTX 1650, and both Ubuntu and Pop!_OS trigger the issue (kernel 5.3.0-7625-generic). I also have deep as mem_sleep and use s2idle.

I have tried the following parameters without success:

$ sudo kernelstub -a "rcutree.rcu_idle_gp_delay=1 acpi_osi=! acpi_osi='Linux'"
$ sudo kernelstub -a "pcie_acpi=off"

(In combination and alone)

I have also tried forcing persistence mode in the driver, which also did not work.

Here is another thread on the Linux Mint forums:
https://forums.linuxmint.com/viewtopic.php?p=1728952&sid=d2f654dfa1082400eeea98c9fbf01918#p1728952

The added parameters seem to work for them, but I could not resolve the issue. The Nvidia driver is also at the latest version.

Jonathon Padfield (jonpads) wrote :

Recent updates (kernel I guess) seem to have fixed this for me. Drivers are loaded and being used, laptop goes to sleep and wakes up fine.

jonpad@jonpad-laptop ~ $ uname -a
Linux jonpad-laptop 5.4.0-12-generic #15-Ubuntu SMP Tue Jan 21 15:12:29 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
jonpad@jonpad-laptop ~ $ sudo journalctl | grep "PM: suspend" | tail -2
Feb 10 08:16:25 jonpad-laptop kernel: PM: suspend entry (s2idle)
Feb 10 09:39:04 jonpad-laptop kernel: PM: suspend exit
jonpad@jonpad-laptop ~ $ nvidia-detector
nvidia-driver-440
jonpad@jonpad-laptop ~ $ sudo lsmod | grep -i nvidia
nvidia_uvm 954368 0
nvidia_drm 45056 3
nvidia_modeset 1114112 2 nvidia_drm
nvidia 20422656 79 nvidia_uvm,nvidia_modeset
drm_kms_helper 184320 2 nvidia_drm,i915
drm 487424 20 drm_kms_helper,nvidia_drm,i915
ipmi_msghandler 106496 2 ipmi_devintf,nvidia

Samuel Yvon (syvonum) wrote :

I upgraded my kernel to 5.4.18 and the issue is still appearing for me.

$ uname -a
Linux pop-os 5.4.18-050418-generic #202002051737 SMP Wed Feb 5 22:42:14 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ nvidia-detector
nvidia-driver-440

Changed in linux (Ubuntu):
status: Triaged → Fix Released
Samuel Yvon (syvonum) wrote :

Hi,

Where was this fixed? If you are referring to the aforementioned kernel, I can confirm that the bug is still present.

Ronen Druker (rdrkr) wrote :

Same issue here

Kai-Heng Feng (kaihengfeng) wrote :

If the default suspend method is s2idle but deep was chosen instead, expect to see buggy behavior.

If this issue happens under s2idle, please file a new bug.
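
One way to tell whether deep was forced over the platform default is to look for mem_sleep_default= on the kernel command line. This is a minimal sketch, assuming a POSIX shell; mem_sleep_override is a hypothetical helper name. An empty result only shows that the boot option is absent. It does not rule out a script or tool writing to /sys/power/mem_sleep after boot.

```shell
# Print the value of mem_sleep_default= from a kernel command line,
# or nothing if the option is not present.
# mem_sleep_override is a hypothetical helper name.
mem_sleep_override() {
    # $1: kernel command line, e.g. the contents of /proc/cmdline
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^mem_sleep_default=//p'
}

# Possible usage (not run here):
#   mem_sleep_override "$(cat /proc/cmdline)"
```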
