[nvidia] Corrupted/missing shell textures when switching users or resuming from suspend

Bug #1876632 reported by Tom D.
This bug affects 29 people
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| GNOME Shell | New | Unknown | | |
| Mutter | Fix Released | Unknown | | |
| OEM Priority Project | Confirmed | Critical | Unassigned | |
| gnome-shell (Ubuntu) | Triaged | High | Unassigned | |
| mutter (Ubuntu) | Triaged | High | Unassigned | |
| nvidia-graphics-drivers-440 (Ubuntu) | Won't Fix | High | Unassigned | |
| nvidia-graphics-drivers-460 (Ubuntu) | Won't Fix | High | Unassigned | |
| nvidia-graphics-drivers-470 (Ubuntu) | Triaged | High | Unassigned | |
| nvidia-graphics-drivers-510 (Ubuntu) | Won't Fix | High | Unassigned | |
| nvidia-graphics-drivers-525 (Ubuntu) | Triaged | High | Unassigned | |
| nvidia-graphics-drivers-535 (Ubuntu) | Triaged | High | Unassigned | |

Bug Description

[Impact]

The Nvidia driver corrupts and/or forgets its textures when resuming from suspend, by design. Documented here:

https://www.khronos.org/registry/OpenGL/extensions/NV/NV_robustness_video_memory_purge.txt

However, that extension is so awkward to implement everywhere that compositors will realistically never support it fully. Instead we're waiting for an Nvidia driver fix.

*NOTE* that this is actually not a common problem because the system must be using Nvidia as the primary GPU to be affected. So generally only desktop users will encounter the bug, not laptop users. And even then, only desktops that use suspend/resume or VT switching may trigger it, if ever. Even in development the bug cannot be reproduced reliably.

[Workarounds]

* Always log into a Xorg session and if corruption occurs then type: Alt+F2, R, Enter

* https://download.nvidia.com/XFree86/Linux-x86_64/510.54/README/powermanagement.html#PreserveAllVide719f0

[Test Plan]

[Where problems could occur]

[Original Bug Report]

I recently installed ubuntu 20.04 on my computer, and I am running into an issue when I do the following:

* Login with a user on desktop
* Select switch user, and login as second user
* Switch user again, and return to original user

At this point, text and icons in the menubar / sidebar are corrupted. Text and icons in normal windows appear correctly. I have attached a screenshot of what this looks like.

Screenshots: https://imgur.com/a/3ZFDLMc

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: xorg 1:7.7+19ubuntu14
ProcVersionSignature: Ubuntu 5.4.0-28.32-generic 5.4.30
Uname: Linux 5.4.0-28-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
.proc.driver.nvidia.gpus.0000.09.00.0: Error: [Errno 21] Is a directory: '/proc/driver/nvidia/gpus/0000:09:00.0'
.proc.driver.nvidia.registry: Binary: ""
.proc.driver.nvidia.suspend: suspend hibernate resume
.proc.driver.nvidia.suspend_depth: default modeset uvm
.proc.driver.nvidia.version:
 NVRM version: NVIDIA UNIX x86_64 Kernel Module 440.64 Fri Feb 21 01:17:26 UTC 2020
 GCC version:
ApportVersion: 2.20.11-0ubuntu27
Architecture: amd64
BootLog: Error: [Errno 13] Permission denied: '/var/log/boot.log'
CasperMD5CheckResult: skip
CompositorRunning: None
CurrentDesktop: ubuntu:GNOME
Date: Sun May 3 18:12:45 2020
DistUpgraded: Fresh install
DistroCodename: focal
DistroVariant: ubuntu
ExtraDebuggingInterest: Yes
GraphicsCard:
 NVIDIA Corporation GP104 [GeForce GTX 1070] [10de:1b81] (rev a1) (prog-if 00 [VGA controller])
   Subsystem: ASUSTeK Computer Inc. GP104 [GeForce GTX 1070] [1043:85a0]
InstallationDate: Installed on 2020-05-03 (0 days ago)
InstallationMedia: Ubuntu 20.04 LTS "Focal Fossa" - Release amd64 (20200423)
MachineType: Gigabyte Technology Co., Ltd. AX370-Gaming
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.4.0-28-generic root=/dev/mapper/vgubuntu-root ro quiet splash vt.handoff=7
SourcePackage: xorg
Symptom: display
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 06/19/2017
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: F3
dmi.board.asset.tag: Default string
dmi.board.name: AX370-Gaming-CF
dmi.board.vendor: Gigabyte Technology Co., Ltd.
dmi.board.version: se1
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvrF3:bd06/19/2017:svnGigabyteTechnologyCo.,Ltd.:pnAX370-Gaming:pvrDefaultstring:rvnGigabyteTechnologyCo.,Ltd.:rnAX370-Gaming-CF:rvrse1:cvnDefaultstring:ct3:cvrDefaultstring:
dmi.product.family: Default string
dmi.product.name: AX370-Gaming
dmi.product.sku: Default string
dmi.product.version: Default string
dmi.sys.vendor: Gigabyte Technology Co., Ltd.
version.compiz: compiz N/A
version.libdrm2: libdrm2 2.4.101-2
version.libgl1-mesa-dri: libgl1-mesa-dri 20.0.4-2ubuntu1
version.libgl1-mesa-glx: libgl1-mesa-glx 20.0.4-2ubuntu1
version.nvidia-graphics-drivers: nvidia-graphics-drivers-* N/A
version.xserver-xorg-core: xserver-xorg-core 2:1.20.8-2ubuntu2
version.xserver-xorg-input-evdev: xserver-xorg-input-evdev N/A
version.xserver-xorg-video-ati: xserver-xorg-video-ati 1:19.1.0-1
version.xserver-xorg-video-intel: xserver-xorg-video-intel 2:2.99.917+git20200226-1
version.xserver-xorg-video-nouveau: xserver-xorg-video-nouveau 1:1.0.16-1

Tom D. (dickmant) wrote :
summary: - Corrupted fonts in display output when using multiple users
+ [nvidia] Corrupted fonts in display output when using multiple users
affects: xorg (Ubuntu) → mutter (Ubuntu)
tags: added: nvidia
summary: - [nvidia] Corrupted fonts in display output when using multiple users
+ [nvidia] Corrupted shell textures when using multiple users
summary: - [nvidia] Corrupted shell textures when using multiple users
+ [nvidia] Corrupted shell textures when switching users or resuming from
+ suspend
Changed in gnome-shell (Ubuntu):
status: New → Confirmed
Changed in mutter (Ubuntu):
status: New → Confirmed
Changed in gnome-shell (Ubuntu):
status: Confirmed → Invalid
Changed in mutter (Ubuntu):
importance: Undecided → High
status: Confirmed → Triaged
Changed in gnome-shell (Ubuntu):
status: Invalid → Triaged
Changed in gnome-shell (Ubuntu):
importance: Undecided → High
Jack Howarth (jwhowarth) wrote : Re: [nvidia] Corrupted shell textures when switching users or resuming from suspend

I've seen the same issue with focal and System76 Stable PPA's 460.32.03-1pop0~1611601564~20.04~11a4029~dev nvidia packaging. All text was corrupted after waking from suspend but it has only occurred once so far.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nvidia-graphics-drivers-440 (Ubuntu):
status: New → Confirmed
Changed in nvidia-graphics-drivers-460 (Ubuntu):
status: New → Confirmed
Daniel van Vugt (vanvugt) wrote :

If you are using a Xorg session (and you probably are with Nvidia) then restarting gnome-shell might work around the problem:

1. Alt+F2

2. Type: r

3. Hit Enter

Alan AZZERA (azzera-alan) wrote :

The same problem occurs for me with nvidia-driver-470. The graphical effect is a bit different, though. See attached pic... My Ubuntu 20.04 is as up-to-date as possible.

The last suggested solution (restarting gnome-shell) works for me, but this is still annoying... I've just read this: https://www.reddit.com/r/gnome/comments/k5ea3g/text_and_icon_glitching_out_after_resuming_from/ so it seems we won't have a solution for quite a long time... Fed up with nvidia!

Changed in nvidia-graphics-drivers-470 (Ubuntu):
status: New → Confirmed
summary: - [nvidia] Corrupted shell textures when switching users or resuming from
- suspend
+ [nvidia] Corrupted/missing shell textures when switching users or
+ resuming from suspend
Changed in nvidia-graphics-drivers-510 (Ubuntu):
status: New → Confirmed
Daniel van Vugt (vanvugt) wrote :

I wonder if Nvidia has made progress on this in 510.60 since they say:

"Fixed a regression that could cause OpenGL applications to hang or render incorrectly after suspend/resume cycles or VT-switches"

[https://www.nvidia.com/download/driverResults.aspx/187162/en-us]

Daniel van Vugt (vanvugt) wrote :

See also bug 1855757.

tags: added: resume suspend-resume
description: updated
description: updated
Changed in nvidia-graphics-drivers-510 (Ubuntu):
importance: Undecided → High
status: Confirmed → Triaged
Changed in nvidia-graphics-drivers-470 (Ubuntu):
importance: Undecided → High
status: Confirmed → Triaged
Changed in nvidia-graphics-drivers-440 (Ubuntu):
status: Confirmed → Triaged
importance: Undecided → High
Changed in nvidia-graphics-drivers-460 (Ubuntu):
importance: Undecided → High
status: Confirmed → Triaged
description: updated
description: updated
description: updated
description: updated
description: updated
tags: added: jammy
description: updated
jeremyszu (os369510)
Changed in oem-priority:
importance: Undecided → Critical
status: New → Confirmed
tags: added: oem-priority
jeremyszu (os369510) wrote :

Copying the findings from https://bugs.launchpad.net/ubuntu/+source/gdm3/+bug/1969142 here, since this bug was created a long time ago.

Log excerpt:
```
 Apr 14 22:55:10 kernel: NVRM: GPU at PCI:0000:01:00: GPU-581d669f-ec00-4588-ca5f-15ab24136974
 Apr 14 22:55:10 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=798, Graphics Exception: Shader Program Header 18 Error
 Apr 14 22:55:10 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=798, Graphics Exception: ESR 0x405840=0x82040000
 Apr 14 22:55:10 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=798, Graphics Exception: ESR 0x405848=0x80000000
 Apr 14 22:55:10 kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=1270, Graphics Exception: ChID 0010, Class 0000c597, Offset 00000000, Data 00000000
```
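
Editor's sketch: when comparing reports like this one, it can help to pull the Xid number, PCI address and pid out of the `NVRM: Xid` lines. The following is a minimal Python sketch that assumes the line format shown in the excerpt above; other driver versions may print these lines differently, and the names `XidEvent` and `parse_xid` are made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class XidEvent(NamedTuple):
    pci: str      # PCI address as printed by the driver
    xid: int      # Xid error number
    pid: int      # process id reported by the driver
    message: str  # remaining free-form text


# Pattern for lines shaped like the excerpt above (an assumption, not a spec).
_XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9A-Fa-f:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>\d+), (?P<msg>.*)"
)


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the parsed Xid event, or None if the line is not an Xid line."""
    m = _XID_RE.search(line)
    if m is None:
        return None
    return XidEvent(m["pci"], int(m["xid"]), int(m["pid"]), m["msg"].strip())


sample = (
    "kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=798, "
    "Graphics Exception: Shader Program Header 18 Error"
)
event = parse_xid(sample)
print(event)
```

On a real system the input lines would typically come from `journalctl -k` or `dmesg`.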

For the nvidia driver (ubuntu-drivers-common), we need to follow
https://download.nvidia.com/XFree86/Linux-x86_64/510.54/README/powermanagement.html
to set up the parameters after some checks.

I appended the parameters:
options nvidia NVreg_TemporaryFilePath=/run
options nvidia NVreg_PreserveVideoMemoryAllocations=1

and then the problem was solved.

I also tried 510.60 and the problem persists there too.

tags: added: jiayi wayland
Henrik Harmsen (henrik-harmsen) wrote :

I can confirm that adding this file fixes the problem on Ubuntu 22.04 (using Nvidia 3080 card):

$ cat /etc/modprobe.d/nvidia-power-management.conf
options nvidia NVreg_PreserveVideoMemoryAllocations=1

I did not have to set NVreg_TemporaryFilePath

With this set I can also use Wayland but that brings other problems, such as really stuttery video playback in at least firefox + youtube. So I stayed in X11 for now. Just note that 22.04 will default to Wayland with the above file present so I had to manually choose Ubuntu on X.org in the login screen.

I used a fresh install of 22.04.
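
Editor's sketch: the file shown in this comment only contains `options nvidia ...` lines, so its content is easy to generate. The Python below is a minimal illustration that only renders the text; the function name `render_nvidia_power_conf` is hypothetical, and the two parameter names are the ones quoted in this thread.

```python
from typing import Optional


def render_nvidia_power_conf(preserve: bool = True,
                             temporary_file_path: Optional[str] = None) -> str:
    """Render modprobe.d lines for the options discussed in this thread."""
    lines = []
    if temporary_file_path is not None:
        # Optional: where the driver saves video memory during suspend.
        lines.append(
            f"options nvidia NVreg_TemporaryFilePath={temporary_file_path}"
        )
    lines.append(
        "options nvidia NVreg_PreserveVideoMemoryAllocations="
        f"{1 if preserve else 0}"
    )
    return "\n".join(lines) + "\n"


# Render the single-option variant from the comment above.
print(render_nvidia_power_conf(), end="")
```

Writing the result to /etc/modprobe.d/nvidia-power-management.conf needs root, and the module options only take effect once the nvidia module is loaded again, which in practice usually means a reboot.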

Stephen (belrik) wrote :

Blank screen on 22.04 after upgrade. Re-installed nvidia 510 driver.

Even adding the following to /etc/modprobe.d/nvidia-power-management.conf does not fix the problem.

options nvidia NVreg_TemporaryFilePath=/run
options nvidia NVreg_PreserveVideoMemoryAllocations=1

Tried both gdm3 and lightdm, always get a black screen on resume. Using both MATE and default Ubuntu environments.

jeremyszu (os369510) wrote :

For comment#10:

NVreg_TemporaryFilePath tells the driver to use a specific path. If you don't specify it, it defaults to /tmp, which is fine as long as your /tmp has enough free space.
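
Editor's sketch: the free-space condition mentioned here can be checked roughly from Python. This assumes the caller already knows the total video memory size (for example from `nvidia-smi`), and the 5% margin is an assumption based on my reading of the NVIDIA README rather than something stated in this thread. The function name is hypothetical.

```python
import shutil
import tempfile


def has_room_for_video_memory(path: str, vram_bytes: int,
                              margin: float = 1.05) -> bool:
    """Return True if the file system holding `path` has enough free space.

    `margin` adds headroom on top of the video memory size (assumed 5%).
    """
    free = shutil.disk_usage(path).free
    return free >= vram_bytes * margin


# Hypothetical example: an 8 GiB card and the default temporary directory.
vram = 8 * 1024**3
print(has_room_for_video_memory(tempfile.gettempdir(), vram))
```

Note that this only looks at free space. It does not verify that the file system supports unnamed temporary files, which the driver documentation also requires.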

As far as I know, Firefox now comes from a snap and the stable Firefox has Wayland disabled, so that could be a separate problem.

---

For comment#11:

That looks like a different issue. Could you attach the log to a new bug report?

Marcos Alano (mhalano) wrote :

I tried setting both options mentioned in comment #11, and with X11 I got a corrupted lock screen and a corrupted shell after resume. I had to restart the shell with Alt+F2 and typing r. I think this is the kind of problem this bug is about.
I have a laptop with a BIOS option to always use the NVIDIA card instead of the hybrid approach, so in some cases laptops will be affected too, because they act as desktops in this respect.

jeremyszu (os369510) wrote :

Hi Marcos,

For bug description:
---
[Workarounds]

* Always log into a Xorg session and if corruption occurs then type: Alt+F2, R, Enter

* https://download.nvidia.com/XFree86/Linux-x86_64/510.54/README/powermanagement.html#PreserveAllVide719f0
---

It should be possible to work around it by adding "NVreg_PreserveVideoMemoryAllocations".
Would you please share the output of `cat /proc/driver/nvidia/params`?

Marcos Alano (mhalano) wrote :

@os369510,

Here is the content of the file as requested.

```
ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
RegisterForACPIEvents: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
TCEBypassMode: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 1
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableGpuFirmware: 18
EnableDbgBreakpoint: 0
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: "/run"
ExcludedGpus: ""
```
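
Editor's sketch: a dump like the one above can be checked mechanically to confirm that the module options were actually applied. The Python below is a minimal parser that assumes the `Key: value` layout shown in the dump; the function names are hypothetical.

```python
from typing import Dict


def parse_nvidia_params(text: str) -> Dict[str, str]:
    """Parse 'Key: value' lines as printed by /proc/driver/nvidia/params."""
    params: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a key/value separator
        # String values are quoted in the dump; drop the quotes.
        params[key.strip()] = value.strip().strip('"')
    return params


def preserve_enabled(params: Dict[str, str]) -> bool:
    """True if PreserveVideoMemoryAllocations is reported as 1."""
    return params.get("PreserveVideoMemoryAllocations") == "1"


sample = '''PreserveVideoMemoryAllocations: 1
EnableS0ixPowerManagement: 0
TemporaryFilePath: "/run"
'''
params = parse_nvidia_params(sample)
print(preserve_enabled(params), params["TemporaryFilePath"])
```

On a real system the text would be read from /proc/driver/nvidia/params instead of the inline sample.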

jeremyszu (os369510) wrote :

Hi Marcos,

Would you please share the output of `journalctl -b` from when the issue happens?

Marcos Alano (mhalano) wrote :

I executed 'journalctl -b -f' to get the logs (they are in the file) and executed systemctl suspend.

Daniel van Vugt (vanvugt) wrote :

I've seen most people talking about the first solution that Nvidia documents, which is probably slow:

  "Save allocations in an unnamed temporary file"

The faster solution would be:

  "S0ix-based power management"

Both are documented in https://download.nvidia.com/XFree86/Linux-x86_64/510.54/README/powermanagement.html#PreserveAllVide719f0

I think this highlights why the bug exists in the first place -- keeping Nvidia GPU memory alive during sleep requires more power than the GPU would usually receive (or that you would like it to use) in a sleep state.

Henrik Harmsen (henrik-harmsen) wrote :

Daniel: Desktop motherboards usually do not support "S0ix-based power management" so the other option is the only viable one for desktops.

Sebastian Heiden (seb-heiden) wrote :

I'm using a RTX 3080 Ti on a Z690 Desktop motherboard and enabling the S0ix option fixed the problem for me.

I've added the following option "nvidia.NVreg_EnableS0ixPowerManagement=1" to GRUB_CMDLINE_LINUX_DEFAULT.
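
Editor's sketch: to confirm that an option passed this way actually reached the kernel, the `nvidia.NVreg_*` tokens can be extracted from the kernel command line. The Python below is a minimal illustration using the example string from this thread; the function name is hypothetical.

```python
from typing import Dict


def nvreg_params_from_cmdline(cmdline: str) -> Dict[str, str]:
    """Collect nvidia.NVreg_* options from a kernel command line string."""
    found: Dict[str, str] = {}
    prefix = "nvidia."
    for token in cmdline.split():
        if not token.startswith(prefix + "NVreg_"):
            continue
        # Drop the "nvidia." module prefix and split name from value.
        name, _, value = token[len(prefix):].partition("=")
        found[name] = value
    return found


cmdline = "quiet splash nvidia.NVreg_EnableS0ixPowerManagement=1"
print(nvreg_params_from_cmdline(cmdline))
```

On a running system the string would come from /proc/cmdline. On Ubuntu, changes to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub only take effect after running `sudo update-grub` and rebooting.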

Ari (ari-reads) wrote :

Unfortunately nvidia.NVreg_EnableS0ixPowerManagement=1 did not work on an old Asus Rog Maximus VIII mobo (Kabylake), nvidia 1650.

I also tried this:
/etc/modprobe.d/nvidia-power-management.conf
options nvidia NVreg_PreserveVideoMemoryAllocations=1

and it didn't work either. Missing textures after suspend. Any other hint?

Sebastian Heiden (seb-heiden) wrote :

Try nvidia.NVreg_PreserveVideoMemoryAllocations=1 as suggested by @vanvugt

You may want to change the path of the temporary file, though, for better performance.

Ari (ari-reads) wrote :

@Sebastian adding both the kernel parameter and the two lines in the nvidia module conf file works for "systemctl suspend" - the machine can now resume OK when on wayland.

Curiously though, when I try systemctl suspend-then-hibernate (my default way of suspending), I get the following in dmesg:

NVRM: GPU 0000:02:00.0: PreserveVideoMemoryAllocations module parameter is set. System Power Management attempted without driver procfs suspend interface. Please refer to the 'Configuring Power Management Support' section in the driver README.

and the machine does not suspend (I get a login prompt after 30 seconds or so). After this issue is triggered once, "systemctl suspend" also seems to break until the next reboot.

Note that "systemctl hibernate" works fine on wayland and Xorg. Also, on xorg, systemctl suspend-then-hibernate also works fine.
Could this be a bug in systemd?

Changed in mutter:
status: Unknown → New
Robert Carroll (robswc) wrote :

>adding both the kernel parameter and the two lines in the nvidia module conf file works for "systemctl suspend" - the machine can now resume OK when on wayland.

I've tried this solution. There's a chance I've done it wrong though.

cat /etc/modprobe.d/nvidia-power-management.conf
options nvidia NVreg_TemporaryFilePath=/run
options nvidia NVreg_PreserveVideoMemoryAllocations=1

AND:

cat /etc/default/grub | grep "CMD"
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvidia.NVreg_EnableS0ixPowerManagement=1"
GRUB_CMDLINE_LINUX=""

Upon systemctl suspend the monitor loses its connection, then gets it back after about 30 seconds and shows the login screen.

Marcos Alano (mhalano) wrote :

@robswc

I understand what you mean. I got the solution from this page:
https://forums.developer.nvidia.com/t/trouble-suspending-with-510-39-01-linux-5-16-0-freezing-of-tasks-failed-after-20-009-seconds/200933/12?u=user126177
I don't even need any of these configurations, just the method indicated in the post.

Robert Carroll (robswc) wrote :

@mhalano

I actually just found that too!

It's amazing. I've since put it here:

https://github.com/robswc/ubuntu-22-nvidia-suspend-fix-script/blob/main/README.md

To quickly run through the steps for my other PCs.

tags: added: jellyfish-edge-staging
Changed in gnome-shell:
status: Unknown → New
Changed in mutter:
status: New → Fix Released
Henrik Harmsen (henrik-harmsen) wrote :

Dear Mr Bug Watch Updater: If you are referring to https://gitlab.gnome.org/GNOME/mutter/-/issues/1942 then I would like to inform you that it was closed without fixing and the problem still remains so please explain in what way the fix is released.

Thanks.

Daniel van Vugt (vanvugt) wrote :

'Fix Released' is the only supported closed state for upstream bug tasks, even when an upstream bug is closed without a fix.

jeremyszu (os369510)
tags: added: nvidia-wayland
Daniel van Vugt (vanvugt) wrote :

'nvidia-wayland' means a bug only happens on Wayland, which this does not.

We use the 'wayland' tag to indicate that a bug can occur on Wayland, which was already done in comment #9.

tags: removed: nvidia-wayland
tags: added: kinetic
Changed in nvidia-graphics-drivers-525 (Ubuntu):
importance: Undecided → High
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nvidia-graphics-drivers-525 (Ubuntu):
status: New → Confirmed
tags: added: nvidia-wayland
tags: removed: nvidia-wayland
Changed in nvidia-graphics-drivers-535 (Ubuntu):
importance: Undecided → High
status: New → Triaged
Changed in nvidia-graphics-drivers-525 (Ubuntu):
status: Confirmed → Triaged
Timo Aaltonen (tjaalton) wrote :

nvidia 440 etc are EOL

Changed in nvidia-graphics-drivers-440 (Ubuntu):
status: Triaged → Won't Fix
Changed in nvidia-graphics-drivers-460 (Ubuntu):
status: Triaged → Won't Fix
Changed in nvidia-graphics-drivers-510 (Ubuntu):
status: Triaged → Won't Fix
Daniel van Vugt (vanvugt) wrote :

This sounds relevant:

> This threshold can be set to 0 in order to prevent the video memory from being turned off.

https://download.nvidia.com/XFree86/Linux-x86_64/535.129.03/README/dynamicpowermanagement.html#VidMemThreshold

Henrik Harmsen (henrik-harmsen) wrote :

Daniel, I tried setting that threshold to zero but it made the desktop apps unstable after coming back from resume (Nvidia 3080Ti). Maybe it doesn't work on desktop GPUs. So I went back to the solution I wrote about in comment 10. Maybe that solution should be used to solve this problem? It looks like it is already used in e.g. Tumbleweed.
Thanks.

Daniel van Vugt (vanvugt) wrote :

Yes the solution in comment 10 is indirectly mentioned above in [Workarounds], but it's not very obvious:

https://download.nvidia.com/XFree86/Linux-x86_64/510.54/README/powermanagement.html#PreserveAllVide719f0

Moritz Strübe (morty-mt) wrote :

I think we have some options to fix this. Yet, which would be a solution that could be packaged? If I understand https://download.nvidia.com/XFree86/Linux-x86_64/510.54/README/powermanagement.html correctly there are the following options:

Keep the GPU-Memory Running
===========================

Pro:
* Only need to set a kernel module parameter
Con:
* Only works in S2
* Uses more power than necessary (at least if more than 256 MB graphic memory are used)
* Does not work with suspend-to-disk (at least if I understood things right)

Copy Memory to tmpfs
====================

Dump the graphic-memory into tmpfs. I wouldn't suggest using /run, but creating a driver-specific tmpfs that is large enough to hold the graphic memory.

Pro:
* Just works, as the kernel takes care of paging the tmpfs to disk in case of stand-by or suspend-to-disk.
Con:
* Delayed suspend and wakeup
* Unexpected swap-usage -> OOM-Killer might kick in on suspend if there is not enough or no swap.
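
Editor's sketch: the swap/OOM concern above can be estimated before suspending by comparing the video memory size with available RAM plus free swap. The Python below is a rough heuristic, not how the kernel or the driver decides anything; the function names are hypothetical, the sample numbers are invented, and it assumes the usual /proc/meminfo layout with values in kB.

```python
import re


def meminfo_kib(text: str, field: str) -> int:
    """Return the value of a /proc/meminfo field in KiB."""
    m = re.search(rf"^{re.escape(field)}:\s+(\d+)\s+kB", text, re.MULTILINE)
    if m is None:
        raise KeyError(field)
    return int(m.group(1))


def can_absorb_video_memory(meminfo_text: str, vram_kib: int) -> bool:
    """Heuristic: is MemAvailable + SwapFree at least the video memory size?"""
    available = meminfo_kib(meminfo_text, "MemAvailable")
    swap_free = meminfo_kib(meminfo_text, "SwapFree")
    return available + swap_free >= vram_kib


# Invented sample: about 20 GB available, no swap configured.
sample = """MemTotal:       32780000 kB
MemAvailable:   20000000 kB
SwapTotal:             0 kB
SwapFree:              0 kB
"""
# Hypothetical 12 GiB card.
print(can_absorb_video_memory(sample, 12 * 1024 * 1024))
```

On a real system the text would be read from /proc/meminfo shortly before suspending, since the numbers change constantly.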

Get everyone to work around it
==============================

Get all the affected applications to update their video-memory after suspend-to-X.

Pro:
* Does not have any of the Cons of the other solutions
Con:
* Probably very tedious.

Keep it as it is
================

Pro:
* Minimal effort
* Kind of works most of the time.

Con:
* Nobody is really happy.

Comments?
