GpuWatchdog segfault in libcef.so

Bug #2045951 reported by Ken Sharp
This bug affects 5 people
Affects: linux-signed-hwe-6.2 (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (not set)

Bug Description

The GpuWatchdog thread can segfault in apps built on Chromium, whether through Electron (https://www.electronjs.org/) or the Chromium Embedded Framework (CEF) [Steam, Code, Spotify, Teams, Discord...]. The same crash can also occur in Chrome and Brave.

In my specific case this occurs when I have switched off my monitor and let the computer go idle for a while (I don't know the timings exactly and I don't know how to force the situation). The computer remains responsive until I log back in from the MATE lock screen [note: if I lock my screen I get the light-locker log-in, so I do not know what causes the machine to choose the MATE lock screen].

Upon logging back in the system becomes unresponsive. I can access the machine over SSH and force a reboot. The segfault in GpuWatchdog appears immediately on logging back in.

For me, this appears to have first occurred on Dec 2.

Dec 2 22:49:52 ken kernel: [191969.402923] GpuWatchdog[9387]: segfault at 0 ip 00007efc77192336 sp 00007efc6b9fd370 error 6 in libcef.so[7efc72cef000+776f000] likely on CPU 3 (core 3, socket 0)
Dec 2 22:49:52 ken kernel: [191969.402960] Code: 89 de e8 3d ef 6e ff 80 7d cf 00 79 09 48 8b 7d b8 e8 be 65 2c 03 41 8b 84 24 e0 00 00 00 89 45 b8 48 8d 7d b8 e8 ca d7 b5 fb <c7> 04 25 00 00 00 00 37 13 00 00 48 83 c4 38 5b 41 5c 41 5d 41 5e
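
For anyone comparing reports, the segfault line can be broken down mechanically. The following is a minimal Python sketch, not part of the original report: it assumes the kernel log format shown above, treats the bracketed pair after the library name as mapping start and size, and interprets the error code using the standard x86 page-fault bits (bit 0 = protection fault, bit 1 = write, bit 2 = user mode). The function name `parse_segfault` is made up for illustration.

```python
import re

# Matches the kernel's user-space segfault line in the form quoted in this report.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)  # the kernel prints the error code in hex
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "fault_address": int(m["addr"], 16),
        "library": m["lib"],
        # Offset of the faulting instruction relative to the start of the mapping.
        "offset": ip - base,
        "protection_fault": bool(err & 1),  # 0 means the page was not present
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }
```

Applied to the first line above, this gives a user-mode write to address 0 on a page that is not mapped, at offset 0x44a3336 into the libcef.so mapping. The offset is only comparable between machines running the same libcef.so build.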

I can find no earlier entry in the syslog. My apt history shows nothing obvious but I'll attach it anyway.

This GpuWatchdog segfault in libcef.so is reported surprisingly often around the Internet, but the trigger seems to vary. Nonetheless, CEF shouldn't be causing systems to become unresponsive. It is possible (probable) that there is a bug in CEF, but equally it should not be able to make a system unresponsive unless there is also a bug in the kernel or X. This occurs on all kinds of systems with all kinds of graphics hardware, so the graphics driver seems an unlikely cause. It is not unique to Ubuntu, but I don't know where to start given all the possible components.

The only upstream reported kernel bug was closed as not enough information was supplied:
https://bugzilla.kernel.org/show_bug.cgi?id=209129

This has been reported for Ubuntu a number of times but I thought it best to start fresh:
https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-450/+bug/1896560
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1861294
https://bugs.launchpad.net/ubuntu/+source/syslog-ng/+bug/1903203
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-amdgpu/+bug/1921286

From around the Interwebs:
https://bbs.archlinux.org/viewtopic.php?id=263124
https://askubuntu.com/q/1490916/170177
https://github.com/ValveSoftware/steam-for-linux/issues/7370
https://github.com/ValveSoftware/steam-for-linux/issues/9793
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1040399
https://unix.stackexchange.com/q/658684/45386
https://www.linuxquestions.org/questions/linux-software-2/kde-plasma-on-wayland-hardware-acceleration-lock-ups-4175718801/

For testing I tried the following but couldn't trigger this manually:
Windows+L to lock screen, switch monitor off, switch back on and log back in.
Windows+L to lock screen, switch monitor off, switch back on, choose "Switch user" and log back in.
xset dpms force off then wake the screen up.
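
Since the trigger may depend on how long the display stays off, the manual attempts above could be automated. The following is a rough Python sketch under that assumption, not something the reporter ran: it forces DPMS off with `xset`, waits, forces it back on, and checks whether a new GpuWatchdog segfault line has appeared in the log. The helper names and the `read_log` callback are hypothetical, and it does not reproduce the lock-screen login step.

```python
import subprocess
import time


def dpms_cycle(off_seconds, run=subprocess.run, sleep=time.sleep):
    """Force the display off with xset, wait, then force it back on."""
    run(["xset", "dpms", "force", "off"], check=True)
    sleep(off_seconds)
    run(["xset", "dpms", "force", "on"], check=True)


def count_watchdog_segfaults(log_text):
    """Count log lines that mention both GpuWatchdog and segfault."""
    return sum(
        1
        for line in log_text.splitlines()
        if "GpuWatchdog" in line and "segfault" in line
    )


def run_experiment(durations, read_log, run=subprocess.run, sleep=time.sleep):
    """Cycle DPMS once per duration; return durations followed by a new segfault."""
    hits = []
    seen = count_watchdog_segfaults(read_log())
    for seconds in durations:
        dpms_cycle(seconds, run=run, sleep=sleep)
        now = count_watchdog_segfaults(read_log())
        if now > seen:
            hits.append(seconds)
        seen = now
    return hits
```

A call such as `run_experiment([60, 600, 3600], read_log=lambda: pathlib.Path("/var/log/syslog").read_text(errors="replace"))` would try one minute, ten minutes and one hour in turn. Reading /var/log/syslog is an assumption about the logging setup and may require membership of the adm group.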

Any ideas how to continue would be greatly appreciated.
I'll test the upstream kernel and report back. Might take a day or two. I'm not 100% convinced it's a kernel bug at this point though.
I also note the --disable-gpu-* options for CEF which I'll also test.
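
One way to test the GPU options systematically is to try one switch per run. The Python sketch below only builds the command lines; `--disable-gpu`, `--disable-gpu-compositing` and `--disable-gpu-watchdog` are Chromium switches, but whether a particular CEF or Electron application passes them through to Chromium is an assumption to check per application. The function name and the example path are hypothetical.

```python
# Chromium switches that disable parts of the GPU path. Whether a given CEF or
# Electron application forwards them to Chromium must be verified per app.
GPU_FLAGS = ["--disable-gpu", "--disable-gpu-compositing", "--disable-gpu-watchdog"]


def candidate_commands(binary, extra_args=()):
    """Return one argv per flag so that each flag can be tested in isolation."""
    return [[binary, *extra_args, flag] for flag in GPU_FLAGS]


for argv in candidate_commands("/path/to/app"):
    print(" ".join(argv))
```

Testing one flag at a time keeps the results attributable: if the freeze disappears only with the watchdog disabled, that points at the watchdog's reaction rather than at GPU rendering itself.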

ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: linux-image-6.2.0-37-generic 6.2.0-37.38~22.04.1
ProcVersionSignature: Ubuntu 6.2.0-37.38~22.04.1-generic 6.2.16
Uname: Linux 6.2.0-37-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
CasperMD5CheckResult: unknown
CurrentDesktop: MATE
Date: Fri Dec 8 07:09:25 2023
SourcePackage: linux-signed-hwe-6.2
UpgradeStatus: No upgrade log present (probably fresh install)
modified.conffile..etc.apport.crashdb.conf: [modified]
mtime.conffile..etc.apport.crashdb.conf: 2019-08-06T11:56:22.315382

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-signed-hwe-6.2 (Ubuntu):
status: New → Confirmed
Jack (jack07) wrote :

Yeah, I have had this bug since forever. It occurs randomly when opening Steam: the Steam window loads, but not all of the content inside, and then it freezes.

Ubuntu 22.04
Kernel: 6.2.0-37-generic
Amdgpu Mesa 23.3.0 - kisak-mesa PPA (LLVM 15.0.7)

A crash from today

Dec 8 15:31:33 H-Linux steam.desktop[20716]: RegisterForAppOverview 2: 325ms
Dec 8 15:31:34 H-Linux <email address hidden>[2663]: unable to update icon for steam
Dec 8 15:32:20 H-Linux kernel: [ 3518.714561] GpuWatchdog[20825]: segfault at 0 ip 00007fb7cbd92bc6 sp 00007fb7c09fd370 error 6 in libcef.so[7fb7c78ef000+7770000] likely on CPU 0 (core 0, socket 0)
Dec 8 15:32:20 H-Linux kernel: [ 3518.714585] Code: 89 de e8 4d ee 6e ff 80 7d cf 00 79 09 48 8b 7d b8 e8 2e 66 2c 03 41 8b 84 24 e0 00 00 00 89 45 b8 48 8d 7d b8 e8 3a d1 b5 fb <c7> 04 25 00 00 00 00 37 13 00 00 48 83 c4 38 5b 41 5c 41 5d 41 5e

Ken Sharp (kennybobs) wrote :

I left the computer to lock itself and killed all steamwebhelper processes before attempting to log in. The system once again became unusable but no segfault occurred.

It's possible the segfault is a symptom rather than the cause. Nonetheless, there are no logs to help debug any further.
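
To make the "kill all steamwebhelper processes before logging in" test repeatable, for example over SSH, the processes can be located through /proc. This is a minimal Python sketch with hypothetical helper names. It matches on /proc/<pid>/comm, which the kernel truncates to 15 characters; "steamwebhelper" fits within that limit.

```python
import os
import signal
from pathlib import Path


def pids_by_name(name, proc_root="/proc"):
    """Return the sorted PIDs whose /proc/<pid>/comm equals `name`."""
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:
            continue  # the process exited or is not readable
        if comm == name:
            pids.append(int(entry.name))
    return sorted(pids)


def terminate_all(name, sig=signal.SIGTERM, proc_root="/proc", kill=os.kill):
    """Send `sig` to every process named `name`; return the PIDs signalled."""
    pids = pids_by_name(name, proc_root)
    for pid in pids:
        kill(pid, sig)
    return pids
```

Calling `terminate_all("steamwebhelper")` sends SIGTERM to each match. Steam may respawn its helpers, so checking `pids_by_name` again just before logging in would confirm that none are running.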

Ken Sharp (kennybobs) wrote :

The mainline kernel mapping has no entry for 6.2 (https://people.canonical.com/~kernel/info/kernel-version-map.html) so I took a guess and installed https://kernel.ubuntu.com/mainline/v6.2/ (AMD64). Testing this.

I also tried https://kernel.ubuntu.com/mainline/v6.7-rc5/ but the headers package requires a newer libc and so failed to install. As a result my Nvidia drivers wouldn't build, so that wasn't an option.

ku4eto (ku4eto) wrote :

Something similar also seems to be present when using Intel Iris Xe Graphics. While browsing, GNOME hangs when scrolling down. It stays responsive for a bit to right-clicks or power button prompts, but does not accept anything else, and it freezes completely in about a minute. It usually happens when Discord is open, with no specific timing or actions. Nothing shows in the syslogs, so I'm not sure if it's related.
When launching Steam, most of the time it hangs for about 30s, after which it opens normally. Not sure if split_lock is part of the issue.

Balrog (balrogx) wrote :

Hi, I'm facing the same issue, but on Gentoo.

However I managed to reproduce the bug, hope this helps!
Steps I took to reproduce:

#1 Start Signal-Desktop and Steam (not using steam runtime)
#2 Suspend, close laptop lid and turn off external monitor
#3 Open lid and turn on monitor (laptop resumes)
#4 Freeze

I used SysRq (r+e+i) to "unfreeze" the system. Attaching hw-probe upload (includes dmesg)
https://linux-hardware.org/?probe=1ecec883dd
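
When the keyboard is not usable but SSH still is, the same SysRq sequence can be sent through /proc/sysrq-trigger. The following is a hedged Python sketch, not something from this comment: it must run as root, the function name and the two-second delay are arbitrary choices, and because `e` and `i` signal every process except init, the SSH session itself is expected to be killed partway through.

```python
import time

# SysRq command keys: r = take the keyboard out of raw mode,
# e = send SIGTERM to all processes except init,
# i = send SIGKILL to all processes except init.
SEQUENCE = ("r", "e", "i")


def send_sysrq(keys=SEQUENCE, trigger="/proc/sysrq-trigger", delay=2.0, sleep=time.sleep):
    """Write each SysRq key to the trigger file, pausing between keys."""
    for key in keys:
        with open(trigger, "w") as handle:
            handle.write(key)
        sleep(delay)
```

Running it detached from the terminal (for example under `nohup`) gives the later keys a better chance of being written, although SIGKILL from `i` will still end the script itself.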

If you find a fix, please tell me where to forward the bug report so that it gets fixed on my system as well.

edit1: Suspend works fine without Steam and Signal started; I'm not sure whether it's related to Steam or Signal.
edit2: libcef.so does not appear to be used by Signal for me; its only occurrence on my system is within ~/.local/share/Steam/ubuntu12_64
I was not able to reproduce the freeze a second time *after* using SysRq (r+e+i, killing all user processes but init) without rebooting. This is a "bad" workaround: suspend then works without errors (even with Steam and Signal running).
edit3: Found a workaround: sandboxing Steam/CEF. Not sure what CEF tried to access. https://bpa.st/NJGA works for me. I reproduced the bug again after rebooting using the method above. With my bwrap sandbox script, suspend/resume works without issues after a reboot.
