No video after wake from S3 due to Nvidia driver crash

Bug #1946303 reported by Boris Gjenero
This bug affects 30 people
Affects                                Status     Importance  Assigned to  Milestone
nvidia-graphics-drivers-470 (Ubuntu)   Confirmed  Undecided   Unassigned
nvidia-graphics-drivers-510 (Ubuntu)   Confirmed  Undecided   Unassigned
nvidia-graphics-drivers-525 (Ubuntu)   Confirmed  Undecided   Unassigned

Bug Description

Since upgrading to Ubuntu 21.10, my computer sometimes fails to properly wake from suspend. It does start running again, but there is no video output. I'm attaching text for two crashes from kernel log output. First is:
/var/lib/dkms/nvidia/470.63.01/build/nvidia/nv.c:3967 nv_restore_user_channels+0xce/0xe0 [nvidia]
Second is:
/var/lib/dkms/nvidia/470.63.01/build/nvidia/nv.c:4162 nv_set_system_power_state+0x2c8/0x3d0 [nvidia]

Apparently I'm not the only one having this problem with 470 drivers.
https://forums.linuxmint.com/viewtopic.php?t=354445
https://forums.developer.nvidia.com/t/fixed-suspend-resume-issues-with-the-driver-version-470/187150

Driver 470 uses the new suspend mechanism via /usr/lib/systemd/system-sleep/nvidia. But I was using that mechanism with driver 460 in Ubuntu 21.04 and sleep was reliable then. Right now I'm going back to driver 460.

ProblemType: Bug
DistroRelease: Ubuntu 21.10
Package: nvidia-driver-470 470.63.01-0ubuntu4
ProcVersionSignature: Ubuntu 5.13.0-16.16-generic 5.13.13
Uname: Linux 5.13.0-16-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
ApportVersion: 2.20.11-0ubuntu70
Architecture: amd64
CasperMD5CheckResult: unknown
CurrentDesktop: KDE
Date: Wed Oct 6 23:24:02 2021
SourcePackage: nvidia-graphics-drivers-470
UpgradeStatus: Upgraded to impish on 2021-10-02 (4 days ago)

Boris Gjenero (boris-gjenero) wrote :
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nvidia-graphics-drivers-470 (Ubuntu):
status: New → Confirmed

Boris Gjenero (boris-gjenero) wrote :

I got upgraded back to 470 because new 460 transitional dummy packages were released, which caused the upgrade. I could have prevented that via apt-mark hold.

When I was getting these problems with 470 I had nvidia.NVreg_PreserveVideoMemoryAllocations=1 in my kernel command line, via /etc/default/grub. After being upgraded to 470 again, I changed that to nvidia.NVreg_PreserveVideoMemoryAllocations=0. So far, waking from sleep has been reliable. (Also, neither Firefox nor KDE Plasma have problems after sleep with this. I had set that option to 1 in the past because it seemed to help with earlier versions of Firefox and Plasma.)
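
For anyone checking which value is actually in effect, here is a minimal sketch that pulls the parameter out of a kernel command line string. The helper name pvma_from_cmdline is made up for illustration, and keeping the last occurrence assumes that later parameters override earlier ones.

```shell
# Hypothetical helper: print the value of
# nvidia.NVreg_PreserveVideoMemoryAllocations found in a kernel command line
# string, or "unset" if the parameter does not appear.
pvma_from_cmdline() {
    value=unset
    for word in $1; do
        case "$word" in
            nvidia.NVreg_PreserveVideoMemoryAllocations=*)
                value=${word#*=}   # keep the last occurrence seen (assumption)
                ;;
        esac
    done
    printf '%s\n' "$value"
}

# On a running system:
#   pvma_from_cmdline "$(cat /proc/cmdline)"
```

This only reports what the kernel was booted with; it does not change /etc/default/grub.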

Ciro Santilli 六四事件 法轮功 (cirosantilli) wrote :

I reproduced the same stack traces on 510 after some trial and error; details are at: https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-510/+bug/1953674

Key discovery there: it reproduces almost every time when my laptop is on battery power with the power cord unplugged, otherwise it generally only happens if I suspend for a certain amount of time.

Simply having the OpenCL/CUDA packages enabled also causes suspend problems, even if you switch back to nouveau.

I also see some ACPI error messages which only appear to show up with NVIDIA enabled:

ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20210331/nsarguments-61)
ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.LPCB.EC.HKEY.DEVT.PEGS], AE_NOT_FOUND (20210331/psargs-330)

No Local Variables are initialized for Method [DEVT]

Initialized Arguments for Method [DEVT]: (1 arguments defined for method invocation)
Arg0: 00000000e4c8db84 <Obj> Integer 00000000000000D3

ACPI Error: Aborting method \_SB.PCI0.LPCB.EC.HKEY.DEVT due to previous error (AE_NOT_FOUND) (20210331/psparse-529)
ACPI Error: Aborting method \_SB.PCI0.LPCB.EC._Q4F due to previous error (AE_NOT_FOUND) (20210331/psparse-529)
ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.LPCB.EC.HKEY.DEVT.PEGS], AE_NOT_FOUND (20210331/psargs-330)

No Local Variables are initialized for Method [DEVT]

Till Kleisli (tillkleisli) wrote :
Ciro Santilli 六四事件 法轮功 (cirosantilli) wrote :

BTW, the workaround from: https://forums.developer.nvidia.com/t/fixed-suspend-resume-issues-with-the-driver-version-470/187150/3?u=cirosantilli worked for me. I wouldn't expect that workaround to be affected by lightdm vs gdm3, but maybe it is:

sudo systemctl stop nvidia-suspend.service
sudo systemctl stop nvidia-hibernate.service
sudo systemctl stop nvidia-resume.service

sudo systemctl disable nvidia-suspend.service
sudo systemctl disable nvidia-hibernate.service
sudo systemctl disable nvidia-resume.service

sudo mv /lib/systemd/system-sleep/nvidia ~/nvidia.bak
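
In case the workaround needs to be reversed later, a rough sketch of the inverse steps follows. It only prints the commands so they can be reviewed first. The helper name print_undo_commands is hypothetical, and it assumes the hook was moved to ~/nvidia.bak exactly as above.

```shell
# Hypothetical helper: print, without running, the commands that would reverse
# the workaround above. The optional first argument is the backup location of
# the hook (default: ~/nvidia.bak, as used above).
print_undo_commands() {
    backup=${1:-"$HOME/nvidia.bak"}
    printf 'sudo mv %s /lib/systemd/system-sleep/nvidia\n' "$backup"
    for unit in nvidia-suspend nvidia-hibernate nvidia-resume; do
        printf 'sudo systemctl enable %s.service\n' "$unit"
    done
}

print_undo_commands
```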

Ciro Santilli 六四事件 法轮功 (cirosantilli) wrote :

That other issue could be the same. I just wish someone there had reproduced the exact stack trace attached by Boris here, so we could be more certain. I reproduce it as:

WARNING: CPU: 0 PID: 18016 at /var/lib/dkms/nvidia/510.47.03/build/nvidia/nv.c:3935 nv_restore_user_channels+0xce/0xe0 [nvidia]

Call Trace:
<TASK>
nv_set_system_power_state+0x22b/0x3e0 [nvidia]
nv_procfs_write_suspend+0xe9/0x140 [nvidia]
proc_reg_write+0x5a/0x90
? __cond_resched+0x1a/0x50
vfs_write+0xc3/0x250
ksys_write+0x67/0xe0
__x64_sys_write+0x19/0x20
do_syscall_64+0x61/0xb0
? exit_to_user_mode_prepare+0x37/0xb0
? syscall_exit_to_user_mode+0x27/0x50
? __x64_sys_newfstatat+0x1c/0x20
? do_syscall_64+0x6e/0xb0
? syscall_exit_to_user_mode+0x27/0x50
? do_syscall_64+0x6e/0xb0
? asm_exc_page_fault+0x8/0x30
entry_SYSCALL_64_after_hwframe+0x44/0xae

WARNING: CPU: 0 PID: 18016 at /var/lib/dkms/nvidia/510.47.03/build/nvidia/nv.c:4152 nv_set_system_power_state+0x2d0/0x3e0 [nvidia]

nv_procfs_write_suspend+0xe9/0x140 [nvidia]
proc_reg_write+0x5a/0x90
? __cond_resched+0x1a/0x50
vfs_write+0xc3/0x250
ksys_write+0x67/0xe0
__x64_sys_write+0x19/0x20
do_syscall_64+0x61/0xb0
? exit_to_user_mode_prepare+0x37/0xb0
? syscall_exit_to_user_mode+0x27/0x50
? __x64_sys_newfstatat+0x1c/0x20
? do_syscall_64+0x6e/0xb0
? syscall_exit_to_user_mode+0x27/0x50
? do_syscall_64+0x6e/0xb0
? asm_exc_page_fault+0x8/0x30
entry_SYSCALL_64_after_hwframe+0x44/0xae
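
To check whether a given machine hits the same two warnings, a small filter like the one below may help. It is a sketch only: the helper name nv_power_warnings is made up, and it simply matches the two function names from the traces above.

```shell
# Hypothetical helper: read kernel log text on stdin and keep only the lines
# that mention the two functions seen in the traces above.
nv_power_warnings() {
    grep -E 'nv_restore_user_channels|nv_set_system_power_state'
}

# Possible usage on a systemd system (kernel log of the previous boot):
#   journalctl -k -b -1 | nv_power_warnings
```

No output means neither function name appears in the log that was fed in, which is what one commenter further down reports.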

Till Kleisli (tillkleisli) wrote :

It wasn't fixed with switching to gdm3, sorry, I deleted my comment above.

I applied the workaround you referred to and, so far, it seems to work!

https://forums.developer.nvidia.com/t/fixed-suspend-resume-issues-with-the-driver-version-470/187150/3?u=cirosantilli

Tested on Ubuntu 21.10 with a GM107GLM Quadro M2000M and the current drivers (510.47)

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nvidia-graphics-drivers-510 (Ubuntu):
status: New → Confirmed

ChrisK5 (chrisk5) wrote :

The workaround from

https://forums.developer.nvidia.com/t/fixed-suspend-resume-issues-with-the-driver-version-470/187150/3?u=cirosantilli

works for me too.

Thanks for mentioning it, Mr. Santilli, and thanks for fixing it, Mr. Humblebee!^^

dhenry (tfc-duke) wrote :

This issue still exists with Ubuntu 22.04 and NVIDIA driver 510 (GeForce GTX 980M).

ManOnTheMoon (manonthemoon) wrote :

Just tested with the solution by Mr. Humblebee. It works on both X11 and Wayland on 22.05 with the NVIDIA 510 driver.

Christians (script-vollbio) wrote :

I confirm that the workaround suggested above works on x11, GeForce GTX 1050 Mobile, driver NVIDIA 510

systemctl disable nvidia-hibernate.service nvidia-resume.service nvidia-suspend.service

UPDATE 2022-06-02: Following system updates, resume using 510 fails. I moved back to 470 and resume works every time.
UPDATE 2022-06-17: Resume now fails every time for both 470 and 510.

David Nichols (nicholsman) wrote :

Resume fails with a black screen on Ubuntu 22.04, RTX 3070 Mobile, driver 510 with:
- Xorg, nvidia services enabled: resume fails
- Xorg, nvidia services disabled: resume fails
- Xwayland, nvidia services disabled: resume fails

However it does work with:
- Xwayland, nvidia services enabled: resume works

laptop: Juno Neptune 15 v3

Salvatore Gabriele La Greca (gabriele97) wrote :

I updated today to Pop!_OS 22.04 (based on Ubuntu 22.04) and I am pretty sure that the bug I am about to describe is related to the one in this thread.

I can't suspend my laptop anymore after the update. I tried with both NVIDIA drivers 470 and 515. When I try to suspend, the system shows some dmesg output and stays there forever. Using another TTY I got the full dmesg log: https://pastebin.com/1ki2pmFK.

Three things caught my attention:

[ 56.240577] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 56.240579] pcieport 0000:00:1c.0: device [8086:a112] error status/mask=00000001/00002000
[ 56.240581] pcieport 0000:00:1c.0: [ 0] RxErr

[ 60.021424] BUG: kernel NULL pointer dereference, address: 0000000000000000

[ 60.021663] <TASK>
[ 60.021671] ? _main_loop+0x140/0x140 [nvidia_uvm]
[ 60.021720] nv_kthread_q_flush+0x1a/0x70 [nvidia_uvm]
[ 60.021765] uvm_suspend+0xae/0x1f0 [nvidia_uvm]
[ 60.021817] uvm_suspend_entry.part.0+0x6e/0xa0 [nvidia_uvm]
[ 60.021872] uvm_suspend_entry+0x24/0x30 [nvidia_uvm]
[ 60.021920] nv_uvm_suspend+0x33/0x50 [nvidia]
[ 60.022523] nv_set_system_power_state+0x2e9/0x3d0 [nvidia]
[ 60.023035] nv_procfs_write_suspend+0xe9/0x180 [nvidia]
[ 60.023553] proc_reg_write+0x5b/0xa0
[ 60.023567] vfs_write+0xb8/0x280
[ 60.023580] ksys_write+0x67/0xe0
[ 60.023587] __x64_sys_write+0x19/0x20
[ 60.023595] do_syscall_64+0x5c/0x80
[ 60.023605] ? do_syscall_64+0x69/0x80
[ 60.023613] ? syscall_exit_to_user_mode+0x26/0x50
[ 60.023626] ? __x64_sys_close+0x11/0x50
[ 60.023635] ? do_syscall_64+0x69/0x80
[ 60.023643] ? ksys_dup3+0xab/0x110
[ 60.023654] ? exit_to_user_mode_prepare+0x37/0xb0
[ 60.023667] ? syscall_exit_to_user_mode+0x26/0x50
[ 60.023679] ? do_syscall_64+0x69/0x80
[ 60.023687] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 60.023697] RIP: 0033:0x7f9156b14a37

[ 60.101307] note: nvidia-sleep.sh[4753] exited with preempt_count 1

I tried disabling the three systemd services, but when I then try to suspend I get the following errors:

Stopping disk
NVRM: GPU 0000:05:00.0: PreserveVideoMemoryAllocations module parameter is set. System Power Management attempted without driver procfs suspend interface. Please refer to the 'Configuring Power Management Support' section >
PM: pci_pm_suspend(): nv_pmops_suspend+0x0/0x20 [nvidia] returns -5
PM: dpm_run_callback(): pci_pm_suspend+0x0/0x160 returns -5
nvidia 0000:05:00.0: PM: failed to suspend async: error -5
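
The NVRM message above suggests that the parameter and the suspend services have to agree. As a rough sketch under that assumption, the two hypothetical helpers below read the parameter from "Name: value" text and flag the combination that fails here (parameter set to 1 while nvidia-suspend.service is not enabled). Reading the value from /proc/driver/nvidia/params, and that file's "Name: value" layout, are also assumptions.

```shell
# Hypothetical helper: read "Name: value" lines on stdin and print the value
# of PreserveVideoMemoryAllocations, or nothing if it is absent.
pvma_from_params() {
    awk -F': *' '$1 == "PreserveVideoMemoryAllocations" { print $2 }'
}

# Hypothetical helper: succeed unless the parameter is 1 while the suspend
# service is not enabled (the combination that produced the errors above).
pm_config_consistent() {
    pvma=$1
    suspend_service=$2
    [ "$pvma" != "1" ] || [ "$suspend_service" = "enabled" ]
}

# Possible usage on an affected system:
#   pm_config_consistent "$(pvma_from_params < /proc/driver/nvidia/params)" \
#       "$(systemctl is-enabled nvidia-suspend.service)" || echo "mismatch"
```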

caine (cainethanatos) wrote :

This is also an issue with the new drivers:
- nvidia-graphics-drivers-515 (Ubuntu)

Gerhard Radatz (gerhard-radatz) wrote :

This also affects me after upgrading from 20.04 to 22.04.
Suspend/resume worked fine on 20.04 with the nvidia-470 driver.
Now it fails with every driver I tried (470, 510, 515) in an Xorg session.
The suggested workaround does not work for me either.

Gerhard Radatz (gerhard-radatz) wrote :

Update: After several failed attempts of reverting back to 20.04 (which all showed that resume from suspend with nvidia-470 driver got stuck, although it did work previously), I finally decided to do a fresh install of Ubuntu 22.04 (using the existing partitions with no formatting, so my /home and other data remained intact).

And voila: now resume from suspend works without problems (using driver 515.65.01)

So maybe this issue is somehow related to the upgrade process?

Boris Gjenero (boris-gjenero) wrote :

After putting nvidia.NVreg_PreserveVideoMemoryAllocations=0 in the kernel command line I don't remember any black screens after wake from suspend. They certainly haven't happened in Ubuntu 22.04. The computer is always usable immediately after waking.

However, I have gotten lockups some time later after waking. These seem to never happen if I only use my main monitor, connected to the DVI port of the GeForce GT 710. That is the way X is configured to start. However, if I switch to the second monitor at some point after booting (also DVI but connected to the HDMI port), then lockups can happen, even after I switch back to the primary monitor. I use xrandr to switch. This is with the 470 driver.

ojb (olejorgenb) wrote :

The suggested workaround (disabling the NVIDIA systemd suspend services) worked for me. Driver 470, GTX 660, old MSI motherboard (B75MA-P45). I had forced GDM to use Xorg first, so I've only tested on an Xorg-only system.

Upgraded system (from 21.10). Note: I could not find anything in the logs (no trace of nv_restore_user_channels, or similar)

Jorge Pérez Lara (jorgesgk) wrote :

I'm facing the exact same issues under 22.10, upgraded from 22.04. I will try to reinstall the drivers. I already tried this without success: https://forums.developer.nvidia.com/t/fixed-suspend-resume-issues-with-the-driver-version-470/187150/3?u=cirosantilli

Edit: Nope, it didn't work.

Manuel Corrales (mcr16986) wrote :

Solved: updating kernel from 5.15.0.52.58 to 5.18.0.

After six hours of trying all kinds of workarounds, driver versions, BIOS configurations, Wayland/Xorg, etc., the only thing that seems to have solved the black built-in screen after suspend or blank-screen lock is updating the kernel to 5.18.0.
Other related issues, such as external monitors not being detected and brightness adjustment, are solved too. For now...

Thinkpad P1 Gen5 (21DDS23R00)- i7-12800H - NVIDIA® RTX™ A1000 4GB
Ubuntu 22.04 (Fresh Install)
Nvidia Driver 520

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nvidia-graphics-drivers-525 (Ubuntu):
status: New → Confirmed

Mauro Amico (mauro-amico) wrote :

Same issue.
None of the workarounds tried solved the problem.

* Ubuntu 22.10 (fresh install)
* HP HP ZBook Power 15.6 inch G9 Mobile Workstation PC/89C0, BIOS U97 Ver. 01.03.00 08/24/2022
* nvidia-driver-525 (525.60.11-0ubuntu0.22.10.1)
* Linux kernel 5.19.0-26-generic

Amr Bekhit (amrbekhit) wrote :

Just wanted to confirm that this also affects me:

* Ubuntu 22.04.2 LTS
* HP Pavilion al001nt (HP HP Pavilion Notebook/820A, BIOS F.56 12/22/2020)
* nvidia-driver-525 (Driver Version: 525.89.02)
* Kernel 5.19.0-38-generic #39~22.04.1-Ubuntu

I've attached the log file to this message.

I'm happy to report that the workaround of disabling the NVIDIA systemd suspend services appears to work fine.

Mateusz Łącki (mateusz-lacki) wrote :

Just wanted to confirm that this also affects me:

* Ubuntu 23.10 and 23.04
* GTX 970 graphics card, under both Wayland and X11
* nvidia-driver 545 (545.29.02); it did not work with 535, 515 or 495 either
* Kernel 6.5.0-10-generic

The error messages are exactly the same

as per:
https://forums.developer.nvidia.com/t/not-coming-back-from-suspend/176446

using

/usr/bin/nvidia-sleep.sh suspend
/usr/bin/nvidia-sleep.sh resume

from a terminal on another machine turns the graphics off and back on (and reduces power drain according to my power meter)

but:

/usr/bin/nvidia-sleep.sh suspend
pm-suspend
/usr/bin/nvidia-sleep.sh resume

results in a black screen (the computer is still reachable over the network). The dmesg messages are identical to those above: the same functions fail in nv.c, but at different line numbers (different driver version).

The workaround does not work either: with these services disabled, the computer wakes up even less completely; it responds to ping but not to ssh.
I am happy to provide extra logs/cats of my settings, but I need guidance what is of value.
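
As a starting point for what to collect, here is a minimal sketch that prints the facts other comments in this report tend to quote: kernel version, kernel command line, session type, state of the three services, and whether the sleep hook exists. The helper name collect_suspend_info is made up, and each item falls back to "unavailable" if it cannot be read.

```shell
# Hypothetical helper: print a few facts relevant to this bug. Nothing here
# changes the system; it only reads state.
collect_suspend_info() {
    printf 'kernel: %s\n' "$(uname -r)"
    printf 'cmdline: %s\n' "$(cat /proc/cmdline 2>/dev/null || echo unavailable)"
    printf 'session: %s\n' "${XDG_SESSION_TYPE:-unavailable}"
    for unit in nvidia-suspend nvidia-hibernate nvidia-resume; do
        # is-enabled exits non-zero for "disabled", so ignore the exit status
        state=$(systemctl is-enabled "$unit.service" 2>/dev/null) || true
        printf '%s.service: %s\n' "$unit" "${state:-unavailable}"
    done
    if [ -e /usr/lib/systemd/system-sleep/nvidia ]; then
        echo 'sleep hook: present'
    else
        echo 'sleep hook: absent'
    fi
}

collect_suspend_info
```

The kernel log from the failed resume is still the most useful attachment; this output just gives it context.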

Mateusz Łącki (mateusz-lacki) wrote :

I noticed that if I switch to a driver from the *-server line, suspend simply works. I have tested the 525 and 535 server driver packages. In the non-server line, the above issues exist in all drivers installable through the GUI (470, 535), and in the newest 545.29.02 as well.
