[nvidia] Frequent GPU Resets (and GUI Freeze) by gsd-media-keys

Bug #1970072 reported by ebsf
This bug affects 1 person
Affects                                Status  Importance  Assigned to  Milestone
nvidia-graphics-drivers-470 (Ubuntu)   New     Undecided   Unassigned

Bug Description

On a fresh install of Ubuntu 20.04.4, the GUI freezes frequently (about a dozen times daily) at inconsistent intervals and times, and not obviously in response to any particular user input. The system is then unresponsive to user input, although the mouse cursor continues to move.

The syslog shows this message from gsd-media-keys: "[GFX1]: Device reset due to WR context"

The freeze is also often preceded by hundreds of pulseaudio messages related to latency issues, even though no audio is playing. Often, gdm-x-session messages appear referencing the NVIDIA GPU and "WAIT". The pulseaudio message "setting avail_min=87496" is often the last message, or at least one of the last, before the reboot.

Also common before a freeze is the gnome-shell message "Ignored exception from dbus method: Gio.DBusError: GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name com.gonzaarcr.appmenu was not provided by any .service files", which precedes the pulseaudio messages.

Happy to provide any other information but would welcome some guidance because I'm on day 11 or 12 of this installation and wearing out my welcome at Google.

Ubuntu 20.04.4 LTS

apt-cache returns:
  Installed: 3.36.9-0ubuntu0.20.04.2
  Candidate: 3.36.9-0ubuntu0.20.04.2
  Version table:
 *** 3.36.9-0ubuntu0.20.04.2 500
        500 http://us.archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     3.36.4-1ubuntu1~20.04.2 500
        500 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages
     3.36.1-5ubuntu1 500
        500 http://us.archive.ubuntu.com/ubuntu focal/main amd64 Packages

Expected behavior: Stability, especially from the fourth release of an LTS.

Behavior: Persistent GUI freezes

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: gnome-shell 3.36.9-0ubuntu0.20.04.2
ProcVersionSignature: Ubuntu 5.13.0-40.45~20.04.1-generic 5.13.19
Uname: Linux 5.13.0-40-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
ApportVersion: 2.20.11-0ubuntu27.23
Architecture: amd64
CasperMD5CheckResult: skip
Date: Sat Apr 23 16:26:45 2022
DisplayManager: gdm3
GsettingsChanges:

InstallationDate: Installed on 2022-04-09 (14 days ago)
InstallationMedia: Ubuntu 20.04.3 LTS "Focal Fossa" - Release amd64 (20210819)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
RelatedPackageVersions: mutter-common 3.36.9-0ubuntu0.20.04.2
SourcePackage: gnome-shell
UpgradeStatus: No upgrade log present (probably fresh install)

ebsf (eb-9) wrote :
Daniel van Vugt (vanvugt) wrote :

If the GUI freezes but the cursor still moves, then it's the gnome-shell process that's frozen while Xorg is not. However, gnome-shell may become frozen by kernel or hardware problems, not just by its own bugs.

Next time the problem happens, please:

1. Wait 10 seconds.

2. Reboot.

3. Run:

   journalctl -b-1 > prevboot.txt
   lspci -k > lspci.txt

4. Attach the resulting text files here.

5. Check for crashes using these instructions: https://wiki.ubuntu.com/Bugs/Responses#Missing_a_crash_report_or_having_a_.crash_attachment
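
Once prevboot.txt exists, a quick filter can pull out the lines this report is about before the full file is attached. The following is only a sketch: the function name gpu_lines and the pattern list are assumptions, drawn from the messages quoted in the bug description plus NVRM, the prefix the Nvidia kernel module uses in kernel logs. It is not an official triage tool.

```shell
#!/bin/sh
# Hypothetical helper: print, with line numbers, the log lines that mention
# a GPU device reset or the Nvidia driver.
# The pattern list is an assumption based on messages quoted in this report.
gpu_lines() {
    # $1: path to a text log, e.g. prevboot.txt from "journalctl -b-1"
    grep -n -i -E 'device reset|nvrm|nvidia' "$1"
}
```

A possible use is `gpu_lines prevboot.txt | tail -n 40`, which shows the last matching lines before the hang. The complete file is still what should be attached.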

Changed in gnome-shell (Ubuntu):
status: New → Incomplete
Daniel van Vugt (vanvugt) wrote :

Since Ubuntu 22.04 was released a few days ago it may be a good idea to start on that version:

https://ubuntu.com/download/desktop

ebsf (eb-9) wrote :

Regarding the file request, please see attached.

ebsf (eb-9) wrote :

Trying the attachment again.

The interface won't allow designation of more than one.

ebsf (eb-9) wrote :

Regarding 22.04, please advise what reason exists to think it might be better.

Also, please advise why I should spend days reinstalling because of someone else's failure instead of the failure being addressed.

Besides, note https://bugs.launchpad.net/subiquity/+bug/1970140. The 22.04 server installer fails. Some idiot thought to implement kernel video drivers in the live iso, and further, to switch to nouveau without disabling kernel mode setting, and further, not to provide any boot alternatives, and further, to provide no documentation (the server reference is for 20.04). One cannot imagine a more sublime or humiliating failure, for all to see. I'll need a reason to take the time even to download the ISO.

Daniel van Vugt (vanvugt) wrote :

Thanks for the attachments. It looks like the Nvidia kernel driver encountered problems, which then caused problems in the Nvidia Xorg driver, and then Xorg crashed.

I would suggest the first thing to do is to remove the kernel parameter 'nvidia-drm.modeset=1' because that wasn't a mature supported option back in Ubuntu 20.04 (but it is in 22.04).
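
As a sketch of that change, the helper below prints a grub defaults file with the parameter removed so the result can be reviewed before anything is edited. The function name strip_modeset is hypothetical, and the sketch assumes the parameter was added on a GRUB_CMDLINE_LINUX or GRUB_CMDLINE_LINUX_DEFAULT line. If it was set elsewhere (for example as a module option under /etc/modprobe.d), this will not find it.

```shell
#!/bin/sh
# Hypothetical helper: print a grub defaults file with nvidia-drm.modeset=1
# removed from the GRUB_CMDLINE_LINUX* lines. It writes to stdout only and
# never edits the file in place.
strip_modeset() {
    # $1: path to a grub defaults file, normally /etc/default/grub
    sed -E '/^GRUB_CMDLINE_LINUX(_DEFAULT)?=/ s/ *nvidia-drm\.modeset=1//g' "$1"
}
```

Running `strip_modeset /etc/default/grub | diff /etc/default/grub -` shows what would change. After applying the edit by hand, run `sudo update-grub` and reboot. `cat /proc/cmdline` should then no longer list the parameter.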

tags: added: nvidia
no longer affects: gnome-shell (Ubuntu)
summary: - Frequent GPU Resets (and GUI Freeze) by gsd-media-keys
+ [nvidia] Frequent GPU Resets (and GUI Freeze) by gsd-media-keys
tags: added: nvidia-drm.modeset
ebsf (eb-9) wrote :

Thanks. I'm glad to learn that the problems could be discerned.

I made the change you suggested to /etc/default/grub and did <# update-grub>.

I also removed the Fildem menu GNOME extension, messages about which preceded every crash.

Attached are the logs from the latest hang, which occurred just now. The trigger was nothing but using Firefox and opening windows as I was scrolling through a feed.

I'll be interested to learn your diagnosis this time, also.

It also would be good to know what is causing the Nvidia kernel driver to fail, so I can get in front of that, if possible.

It also would be good to rule out hardware failure, if possible or, on the other hand, to find evidence of such.

Thanks again.

ebsf (eb-9) wrote :

Here's the other file ...

Daniel van Vugt (vanvugt) wrote :

The log in comment #9 still shows kernel problems in nvidia-modeset interleaved with pulseaudio underrun errors. It's hard to tell which is the real problem, if either, because any buggy kernel driver could cause such issues for everything else. It's just that audio and nvidia are the things logging errors.
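
To see at a glance which sources are doing the logging, the lines in a saved journal can be tallied per source. This is a rough sketch only: the function name count_by_source is hypothetical, and it assumes the default journalctl text layout ("Mon DD HH:MM:SS host source[pid]: message"). A high count shows who is logging the most, not who is at fault.

```shell
#!/bin/sh
# Hypothetical helper: count log lines per source in a syslog-style file,
# busiest source first. Lines whose fifth field does not end in ":" (such
# as journalctl header lines) are skipped.
count_by_source() {
    # $1: path to a text log such as prevboot.txt
    awk '$5 ~ /:$/ {
        src = $5                       # e.g. "pulseaudio[1234]:" or "kernel:"
        sub(/\[[0-9]+\]:$/, "", src)   # drop a trailing "[pid]:"
        sub(/:$/, "", src)             # drop a bare trailing ":"
        n[src]++
    }
    END { for (s in n) print n[s], s }' "$1" | sort -k1,1nr -k2,2
}
```

Comparing the output of `count_by_source prevboot.txt` for a session with audio and one without may help show whether the nvidia-modeset errors persist once the sound stack is idle.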

Perhaps the first thing to try is to close Firefox, stop all audio output, and see if graphics problems occur during that time.

I know of no way to rule out hardware failure, but I think software failure is more common. So maybe also try a newer kernel like https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17.5/amd64/ ... which the driver should still support (http://us.download.nvidia.com/XFree86/Linux-x86_64/470.103.01/README/minimumrequirements.html) but I'm not sure what the latest kernel version is that nvidia-470 has been tested with.

Daniel van Vugt (vanvugt) wrote :

Another suggestion, which costs money but might turn out to be the only solution, would be to change to a different graphics card that would allow you to use a different driver. If not AMD then an Nvidia generation that can use the latest driver series (version 510), such as GT 1030 or GTX 1650 (which I mention because you can run those particular cards without needing a power supply cable).

You can check the supported driver series at https://www.nvidia.com/Download/index.aspx

ebsf (eb-9) wrote :

Thanks for the thoughts.

I've got enough clutter around not to need to spend several hundred dollars for a test card.

I also have no idea how to compile a new kernel, but do know that my being asked that confirms someone goofed.

Anyway, maybe something else is afoot. Some noodles, please let me know what you think:

For starters: I've noticed that sox fails periodically. I've got a couple of scripts that play a brief OGG system sound every half hour, one on the hour, one five minutes before, to pace my day. Actually, the first is set for 3 seconds later than 00, to avoid conflicts. A .timer triggers a .service, which runs a one-line shell script that calls play (a sox command).

This works until it doesn't, whereupon there are error logs, as follows (others that differ are further down ...). (What exactly starts tripping it up?):

Apr 27 19:00:03 greystone systemd[1]: Starting Play sounds | pomo-in.service...
Apr 27 19:00:03 greystone bash[31026]: DIRFILE is /srv/greystone-data/users/eric/data/computer/media/sounds/chirps-notes/service-login.oga.
Apr 27 19:00:03 greystone bash[31027]: ALSA lib pcm_dmix.c:1089:(snd_pcm_dmix_open) unable to open slave
Apr 27 19:00:03 greystone kernel: traps: play[31027] trap divide error ip:7ff27953a37a sp:7ffc76cb7fa0 error:0 in libsox_fmt_ao.so[7ff27953a000+1000]
Apr 27 19:00:03 greystone bash[31026]: /home/locsh/pomo-in.sh: line 38: 31027 Floating point exception(core dumped) play $DIR$FILE
Apr 27 19:00:03 greystone systemd[1]: pomo-in.service: Main process exited, code=exited, status=136/n/a
Apr 27 19:00:03 greystone systemd[1]: pomo-in.service: Failed with result 'exit-code'.
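
The status=136 in the log above follows the usual shell convention: an exit status above 128 means the process was terminated by signal number (status - 128). Here 136 - 128 = 8, which is SIGFPE, consistent with the "trap divide error" and "Floating point exception" lines. A minimal sketch of that arithmetic, with a hypothetical function name:

```shell
#!/bin/sh
# Hypothetical helper: explain a shell exit status.
# By shell convention, a status above 128 means the process was terminated
# by signal number (status - 128).
describe_status() {
    # $1: numeric exit status, e.g. 136 from the pomo-in.service log
    if [ "$1" -gt 128 ]; then
        sig=$(( $1 - 128 ))
        echo "terminated by signal $sig ($(kill -l "$sig"))"
    else
        echo "exited with status $1"
    fi
}
```

`describe_status 136` reports signal 8. This only decodes the status. It says nothing about why libsox_fmt_ao.so divided by zero.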

Perhaps these offer some clues.

Separately, xrandr-related message storms crop up (revise xorg.conf?):

Apr 27 19:00:35 greystone gsd-color[5002]: no xrandr-DVI-I-0 device found: Failed to find output xrandr-DVI-I-0

Another weird one (no such device exists):

Apr 27 19:15:32 greystone dbus-daemon[1033]: [system] Activating via systemd: service name='net.reactivated.Fprint' unit='fprintd.service' requested by ':1.88' (uid=1000 pid=4878 comm="/usr/bin/gnome-shell " label="unconfined")
Apr 27 19:15:33 greystone systemd[1]: Starting Fingerprint Authentication Daemon...
Apr 27 19:15:33 greystone dbus-daemon[1033]: [system] Successfully activated service 'net.reactivated.Fprint'
Apr 27 19:15:33 greystone systemd[1]: Started Fingerprint Authentication Daemon.

Then (who/what is doing this, and why?):

Apr 27 19:15:35 greystone gnome-shell[4878]: Window manager warning: Overwriting existing binding of keysym 31 with keysym 31 (keycode a).
Apr 27 19:15:35 greystone gnome-shell[4878]: Window manager warning: Overwriting existing binding of keysym 32 with keysym 32 (keycode b).
Apr 27 19:15:35 greystone gnome-shell[4878]: Window manager warning: Overwriting existing binding of keysym 33 with keysym 33 (keycode c).
Apr 27 19:15:35 greystone gnome-shell[4878]: Window manager warning: Overwriting existing binding of keysym 34 with keysym 34 (keycode d).
Apr 27 19:15:35 greystone gnome-shell[4878]: Window manager warning: Overwriting existing binding of keysym 35 with keysym 35 (keycode e).
Apr 27 19:15:35 greystone gnome-shell[4878]: W...

Daniel van Vugt (vanvugt) wrote :

Lots of points there:

* The Nvidia GT 1030 is only around $100 if you shop around, but certainly don't spend any money unless you can afford it and really need to. I only suggested it because that's the cheapest way to get to Nvidia's latest 510 driver series, which may solve some bugs.

* You don't need to compile any kernels. Just download the pre-compiled packages from https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17.5/amd64/

* sox, pulseaudio, or any other high level sound issues are not directly relevant. What is relevant is to try with *nothing* playing sound so as to eliminate the possibility of the kernel sound modules being busy.

* gsd-color[5002]: no xrandr-DVI-I-0 device found: Failed to find output xrandr-DVI-I-0 does not appear to be relevant.

* Fprint / Fingerprint log messages are normal and not relevant.

* Window manager warning: Overwriting existing binding is bug 1857392, also not relevant here.

I think the next steps are:

 (a) test without any sound playing, particularly no web browser open; and

 (b) try a newer kernel like https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17.5/amd64/
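
For step (b), the packages are installed with dpkg rather than compiled. The sketch below checks a download directory and prints the install command without running it. The function name plan_kernel_install is hypothetical, and the file name patterns are an assumption about how the mainline builds are packaged (linux-image, linux-modules, and linux-headers .deb files). Check the actual listing at the URL above.

```shell
#!/bin/sh
# Hypothetical helper: check a directory of downloaded mainline kernel
# packages and print (not run) the install command.
# Assumption: the directory holds linux-image, linux-modules and
# linux-headers .deb files.
plan_kernel_install() {
    # $1: directory containing the downloaded .deb files
    for kind in image modules headers; do
        if ! ls "$1"/linux-"$kind"-*.deb >/dev/null 2>&1; then
            echo "missing linux-$kind package in $1" >&2
            return 1
        fi
    done
    echo "sudo dpkg -i $1/linux-*.deb"
}
```

The headers are included on the assumption that the Nvidia module is built through DKMS, which needs them to build against the new kernel. The existing 5.13 kernel stays installed and should remain selectable from the GRUB menu if the new one does not boot.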

ebsf (eb-9) wrote :

I've already disabled the .timer units for the Pomodoro sounds.

I'll investigate the kernel angle.

Here's an odd angle: Could this be a Vulkan issue? I picked up on a thread that pre-Haswell GPUs and integrated CPUs may have trouble implementing Vulkan. Graphics are discrete but my CPU is a Xeon E5-2609v2 (Ivy Bridge). The original manifestation of the GUI hang was xorg locking up one of my CPU cores. That was with driver 390, and the original thread was being addressed at VMware, with the fix being to disable or reconfigure Vulkan or its timeout. See https://communities.vmware.com/t5/VMware-Workstation-Pro/VM-Crash-16-2-1-build-18811642/m-p/2906574/emcs_t/S2h8ZW1haWx8dG9waWNfc3Vic2NyaXB0aW9ufEwySlVMWUVLSzRSU0VNfDI5MDY1NzR8U1VCU0NSSVBUSU9OU3xoSw#M175498

The CPU lockup was seen in the Nvidia forums as a matter of CPU offload that shouldn't have happened (overaggressive power management couldn't up-cycle the GPU before a rendering timeout offloaded to the CPU). Maybe the CPU lockup in my case wasn't because of the offload per se, but that the CPU's architecture didn't fully support the Vulkan-mode rendering being offloaded to it.

While this was being run down, my PSU failed, taking my HDDs with it (yay backups!). Whereas driver 390 was all that could get 20.04 running in 2020, driver 470 worked better with 20.04.4 last month and the CPU lockups ceased, if not the GUI freezes. Not a surprise that CPU offloading was better handled in 470 two years later, but perhaps Vulkan is still baked into the pie and my CPU's response to it is still throwing things off.

I'll note there are null characters being thrown about, and random division errors, not to mention occasional invalid packets. These suggest a wrench still in the works. Thoughts?

Daniel van Vugt (vanvugt) wrote :

I don't think Vulkan or your CPU is relevant. But graphics drivers do rely on CPU features so it's not impossible. Doesn't really matter though because the conclusions would still be the same per below...

This seems to be a kernel or hardware issue. If changing the kernel version does not improve the situation then I would only be left to suggest a different graphics card that would allow you to avoid the nvidia-470 driver, and avoid hardware issues (if any) in the existing Nvidia card.
