NVIDIA quad-GPU system crashes starting Wayland session

Bug #2015386 reported by dann frazier
This bug affects 1 person
Affects          Status   Importance  Assigned to  Milestone
mutter (Ubuntu)  Expired  Low         Unassigned

Bug Description

When logged into a desktop session, I created a new user account and then tried to "switch to" it. A new gnome session started for that user (test1), but after a few seconds, gnome-shell appears to have crashed. This only seems to happen on the first attempt to "switch to" a new user.

ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: xorg 1:7.7+23ubuntu2
ProcVersionSignature: Ubuntu 5.15.0-69.76-generic 5.15.87
Uname: Linux 5.15.0-69-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
.proc.driver.nvidia.capabilities.gpu0: Error: path was not a regular file.
.proc.driver.nvidia.capabilities.gpu1: Error: path was not a regular file.
.proc.driver.nvidia.capabilities.gpu2: Error: path was not a regular file.
.proc.driver.nvidia.capabilities.gpu3: Error: path was not a regular file.
.proc.driver.nvidia.capabilities.mig: Error: path was not a regular file.
.proc.driver.nvidia.gpus.0000.07.00.0: Error: path was not a regular file.
.proc.driver.nvidia.gpus.0000.08.00.0: Error: path was not a regular file.
.proc.driver.nvidia.gpus.0000.0e.00.0: Error: path was not a regular file.
.proc.driver.nvidia.gpus.0000.0f.00.0: Error: path was not a regular file.
.proc.driver.nvidia.registry: Binary: ""
.proc.driver.nvidia.suspend: suspend hibernate resume
.proc.driver.nvidia.suspend_depth: default modeset uvm
.proc.driver.nvidia.version:
 NVRM version: NVIDIA UNIX x86_64 Kernel Module 525.89.02 Wed Feb 1 23:23:25 UTC 2023
 GCC version:
ApportVersion: 2.20.11-0ubuntu82.3
Architecture: amd64
CasperMD5CheckResult: unknown
Date: Wed Apr 5 17:23:28 2023
DistUpgraded: Fresh install
DistroCodename: jammy
DistroVariant: ubuntu
ExtraDebuggingInterest: Yes, including running git bisection searches
MachineType: NVIDIA DGX Station
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-69-generic root=UUID=88df95a6-4fd9-475a-8b59-ad14df1ada5a ro
SourcePackage: xorg
Symptom: display
Title: Xorg crash
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 08/27/2018
dmi.bios.release: 5.11
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 0406
dmi.board.asset.tag: Default string
dmi.board.name: X99-E-10G WS
dmi.board.vendor: EMPTY
dmi.board.version: Rev 1.xx
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: EMPTY
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr0406:bd08/27/2018:br5.11:svnNVIDIA:pnDGXStation:pvrSystemVersion:rvnEMPTY:rnX99-E-10GWS:rvrRev1.xx:cvnEMPTY:ct3:cvrDefaultstring:sku920-22587-2510-000:
dmi.product.family: DGX
dmi.product.name: DGX Station
dmi.product.sku: 920-22587-2510-000
dmi.product.version: System Version
dmi.sys.vendor: NVIDIA
version.compiz: compiz N/A
version.libdrm2: libdrm2 2.4.113-2~ubuntu0.22.04.1
version.libgl1-mesa-dri: libgl1-mesa-dri 22.2.5-0ubuntu0.1~22.04.1
version.libgl1-mesa-glx: libgl1-mesa-glx N/A
version.nvidia-graphics-drivers: nvidia-graphics-drivers-* N/A
version.xserver-xorg-core: xserver-xorg-core 2:21.1.3-2ubuntu2.9
version.xserver-xorg-input-evdev: xserver-xorg-input-evdev N/A
version.xserver-xorg-video-ati: xserver-xorg-video-ati 1:19.1.0-2ubuntu1
version.xserver-xorg-video-intel: xserver-xorg-video-intel 2:2.99.917+git20210115-1
version.xserver-xorg-video-nouveau: xserver-xorg-video-nouveau 1:1.0.17-2build1

dann frazier (dannf) wrote :
dann frazier (dannf) wrote :
Daniel van Vugt (vanvugt) wrote :

We need the robots to retrace that properly, but what I can figure so far is:

#0 __pthread_kill_implementation (no_tid=0, signo=11,
    threadid=139744913655232) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7f18e5f005c0 (LWP 7141))]
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=11,
    threadid=139744913655232) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=11, threadid=139744913655232)
    at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=139744913655232, signo=signo@entry=11)
    at ./nptl/pthread_kill.c:89
#3 0x00007f18eb1c9476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#4 0x0000561d62d507aa in ?? ()
#5 <signal handler called>
#6 __strlen_avx2_rtm () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:74
#7 0x00007f18d5f2192a in ?? ()
   from /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
#8 0x00007f18d5f21567 in ?? ()
   from /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
#9 0x00007f18d62afed0 in ?? ()
   from /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
#10 0x00007f18d62ab825 in ?? ()
   from /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
#11 0x00007f18d62ad4c7 in ?? ()
   from /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
#12 0x00007f18d61940da in ?? ()
   from /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
#13 0x00007f18d617523f in ?? ()
   from /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
#14 0x00007f18eae25cf2 in ?? ()
   from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#15 0x00007f18eae22b3c in ?? ()
   from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#16 0x00007f18eae22fee in ?? ()
   from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#17 0x00007f18eae33c71 in ?? ()
   from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#18 0x00007f18eae1bc76 in ?? ()
   from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#19 0x00007f18eae4aeb2 in ?? ()
   from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#20 0x00007f18eae4b233 in ?? ()
   from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#21 0x00007f18eae4b4c6 in ?? ()
   from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#22 0x00007f18eae4bf4f in ?? ()
   from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#23 0x00007f18eae53343 in ?? ()
   from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#24 0x00007f18ec109f80 in g_list_foreach ()
   from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#25 0x00007f18eae52e7b in ?? ()
   from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#26 0x00007f18eae55e30 in cogl_blit_framebuffer ()
   from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#27 0x00007f18eb695b4d in clutter_stage_view_before_swap_buffer ()
   from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-clutter-10.so.0
#28 0x00007f18eb434803 in ?? () from /lib/x86_64-linux-gnu/libmutter-10.so.0
#29 0x00007f18eb438f71 in ?? () from /lib/x86_64-linux-gnu/libmutter-10.so.0
#30 0x00007f18eb52482b in ?? () from /lib/x86_64-linux-gnu/libmutter-10.so.0
#31 0x00007f18eb696258 in ?? ()
   from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-clutter-10.so.0...
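
With so many unresolved "??" frames, it can help to tally which shared library each frame belongs to. The following is a minimal sketch, not part of any Ubuntu tooling: the function name frames_per_library is hypothetical, and it assumes gdb's usual "#N ... from /path/to/lib" layout, whether the "from" part is on the same line or wrapped onto the next one.

```python
import re
from collections import Counter

FRAME_RE = re.compile(r"^#(\d+)\s")      # start of a gdb frame, e.g. "#7 0x... in ?? ()"
FROM_RE = re.compile(r"\bfrom\s+(\S+)")  # "from /path/to/lib.so", possibly on a wrapped line


def frames_per_library(backtrace: str) -> Counter:
    """Count gdb frames per shared library basename (hypothetical helper)."""
    counts = Counter()
    counted = set()   # frame numbers already attributed to a library
    current = None    # number of the frame being read
    for line in backtrace.splitlines():
        frame = FRAME_RE.match(line)
        if frame:
            current = int(frame.group(1))
        lib = FROM_RE.search(line)
        if lib and current is not None and current not in counted:
            counted.add(current)
            counts[lib.group(1).rsplit("/", 1)[-1]] += 1
    return counts
```

Applied to the frames shown above, such a tally would place the frames directly beneath the strlen call in nouveau_dri.so, with libmutter-cogl-10.so.0 calling into it.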


Daniel van Vugt (vanvugt) wrote :

Three observations:

- It's crashing in nouveau_dri.so, which shouldn't be used at all when the proper Nvidia driver is installed.

- Unexplained crashes in cogl_blit_framebuffer() are something I've seen before when working on Nvidia multi-GPU support, so that sounds relevant: https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/2749

- The #1 gnome-shell crash in jammy (bug 1960590) is also Nvidia-specific but is not provably related yet...
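
One way to check the first observation on a live system is to look at which Mesa DRI driver a running gnome-shell process actually has mapped. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the usual six-column layout of /proc/<pid>/maps on Linux. The only detail taken from the trace is the nouveau_dri.so file name.

```python
from pathlib import Path


def mapped_dri_drivers(maps_text: str) -> set:
    """Return basenames of *_dri.so libraries named in a /proc/<pid>/maps dump."""
    drivers = set()
    for line in maps_text.splitlines():
        # Columns: address, perms, offset, dev, inode, pathname (pathname may be absent).
        fields = line.split(maxsplit=5)
        if len(fields) == 6 and fields[5].endswith("_dri.so"):
            drivers.add(Path(fields[5]).name)
    return drivers


def dri_drivers_for_pid(pid: int) -> set:
    """Read /proc/<pid>/maps for a live process (needs permission to read it)."""
    return mapped_dri_drivers(Path(f"/proc/{pid}/maps").read_text())
```

If dri_drivers_for_pid() for the gnome-shell PID reports nouveau_dri.so while the proprietary driver is supposedly in use, that would be consistent with the backtrace in comment #3.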

affects: xorg (Ubuntu) → nvidia-graphics-drivers-525 (Ubuntu)
tags: added: need-amd64-retrace
tags: added: multigpu nvidia wayland wayland-session
tags: removed: need-amd64-retrace
Daniel van Vugt (vanvugt) wrote :

OK that didn't work. We can't usually handle crash files as attachments... Please report the crash file again by running:

  ubuntu-bug _usr_bin_gnome-shell.1002.crash

or

  apport-cli _usr_bin_gnome-shell.1002.crash

from the machine.

Changed in gnome-shell (Ubuntu):
status: New → Incomplete
no longer affects: mesa (Ubuntu)
no longer affects: nvidia-graphics-drivers-525 (Ubuntu)
Daniel van Vugt (vanvugt) wrote :
tags: added: need-amd64-retrace
tags: removed: need-amd64-retrace
dann frazier (dannf) wrote :

I need to deploy the machine from scratch each time I access it, so I won't be able to run that command in the same install. I tried reproducing again from scratch, but this time "switch user" immediately aborted w/o generating a gnome-shell crash. I'm not sure what was different - I do know that the nvidia driver I was using has been updated in the distro, so at least that had changed.

Re: nouveau_dri.so - I realized that I'd only installed the nvidia driver packages and not the "full stack" - i.e. nvidia-driver-525 and dependencies. Perhaps I was running with an unsupported config the first time. I tried installing that stack and reproducing again, but this time I saw no problem doing "switch user" at all. At that point my reservation expired and I was unable to collect logs from this session.

Next I'll try to do this all over again, making sure to install both nvidia-driver-525 and linux-modules-nvidia-525-generic, and see if I can reproduce.

Daniel van Vugt (vanvugt) wrote :

I've heard that our server/datacenter nvidia drivers intentionally don't support modesetting so that might lead to different behaviour again. The best thing would be to find out what driver packages the customer has installed.

dann frazier (dannf) wrote :

I reset all the relevant packages back to the versions used when I originally filed this bug (luckily I stashed an sosreport), and I was able to reproduce.

https://errors.ubuntu.com/user/b18128bc669a4a4d1f371a58f4f5d719833ab52c5ab794dbbc9d9e55b45d12516efcc5085f9d93e4d219bc6948f530cdaa31ddf8e34b7e9242e050c269d729c3

Let me know if that is what you needed.

This is using the UDA (non-server) drivers, but the user who originally reported the bug to me appears to have been using the -server drivers as you surmised. I tried switching to the 525-server driver and I still experienced an issue with switch user, but the symptoms are different. Instead of entering a new login for a few seconds followed by a gnome-shell crash, this seemed to immediately kick me out of the switched-to session with no gnome-shell crash. Since this is likely a different issue, I reported it as bug 2016193 to keep the threads separate.

Daniel van Vugt (vanvugt) wrote :

Seems there is no stack trace in https://errors.ubuntu.com/oops/7a01bd23-da4e-11ed-ba6b-fa163ef35206

Can you find the original /var/crash/ file?

dann frazier (dannf) wrote :

Attached

summary: - gnome-shell crashes after "switching to" a new user from an existing
- login
+ Multi-NVIDIA-GPU system crashes starting Wayland session
affects: gnome-shell (Ubuntu) → mutter (Ubuntu)
Changed in mutter (Ubuntu):
status: Incomplete → New
Daniel van Vugt (vanvugt) wrote :

Comment #11 is still the same as comment #3:

#0 __pthread_kill_implementation (no_tid=0, signo=11, threadid=140206817158592) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=11, threadid=140206817158592) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140206817158592, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#3 0x00007f8476b4b476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#4 0x000055b6f0c1b7aa in ?? ()
#5 <signal handler called>
#6 __strlen_avx2_rtm () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:74
#7 0x00007f846972092a in ?? () from /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
#8 0x00007f8469720567 in ?? () from /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
#9 0x00007f8469aaeed0 in ?? () from /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
#10 0x00007f8469aaa825 in ?? () from /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
#11 0x00007f8469aac4c7 in ?? () from /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
#12 0x00007f84699930da in ?? () from /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
#13 0x00007f846997423f in ?? () from /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
#14 0x00007f84767a7cf2 in ?? () from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#15 0x00007f84767a4b3c in ?? () from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#16 0x00007f84767a4fee in ?? () from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#17 0x00007f84767b5c71 in ?? () from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#18 0x00007f847679dc76 in ?? () from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#19 0x00007f84767cceb2 in ?? () from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#20 0x00007f84767cd233 in ?? () from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#21 0x00007f84767cd4c6 in ?? () from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#22 0x00007f84767cdf4f in ?? () from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#23 0x00007f84767d5343 in ?? () from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#24 0x00007f8477a8bf80 in g_list_foreach () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#25 0x00007f84767d4e7b in ?? () from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#26 0x00007f84767d7e30 in cogl_blit_framebuffer () from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-cogl-10.so.0
#27 0x00007f8477017b4d in clutter_stage_view_before_swap_buffer () from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-clutter-10.so.0
#28 0x00007f8476db6803 in ?? () from /lib/x86_64-linux-gnu/libmutter-10.so.0
#29 0x00007f8476dbaf71 in ?? () from /lib/x86_64-linux-gnu/libmutter-10.so.0
#30 0x00007f8476ea682b in ?? () from /lib/x86_64-linux-gnu/libmutter-10.so.0
#31 0x00007f8477018258 in ?? () from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-clutter-10.so.0
#32 0x00007f8476fe0da9 in ?? () from /usr/lib/x86_64-linux-gnu/mutter-10/libmutter-clutter-10.so.0
#33 0x00007f8477a91d3b in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#34 0x00007f8477ae66c8 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#35 0x00007f8477a912b3 in g_main_loop_run () from /lib/x86_64-linux-gnu/...


summary: - Multi-NVIDIA-GPU system crashes starting Wayland session
+ NVIDIA quad-GPU system crashes starting Wayland session
dann frazier (dannf) wrote :

The .crash file from comment 11 has been submitted via ubuntu-bug in bug 2016299 - I'll leave it to the developers to determine which one should be the dupe.

tags: added: nvidia-wayland
Daniel van Vugt (vanvugt) wrote :

Bug 2016299 is a little helpful, but it's strange that it couldn't retrace beyond:

StacktraceTop:
 __strlen_avx2_rtm () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:74
 _mesa_hash_string (_key=0x0) at ../src/util/hash_table.c:662
 _mesa_hash_table_search (ht=0x55b6f40fefe0, key=0x0) at ../src/util/hash_table.c:348
 (anonymous namespace)::interface_block_definitions::lookup (var=0x55b6f6804240, this=0x7ffebdd6ecb0) at ../src/compiler/glsl_types.h:1065
 validate_intrastage_interface_blocks (prog=prog@entry=0x55b6f102f7f0, shader_list=shader_list@entry=0x55b6f5d1df50, num_shaders=num_shaders@entry=1) at ../src/compiler/glsl/link_interface_blocks.cpp:353
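
The frames above show _key=0x0 and key=0x0, meaning a null pointer was handed to the hash lookup and then to strlen, which is where the SIGSEGV fires. As a rough sketch (the helper name null_pointer_args and the regular expressions are assumptions, not part of apport or gdb), null arguments can be picked out of a retraced stack like this:

```python
import re

# Matches "<function> (<args>) at <file>:<line>", the layout used in StacktraceTop.
FRAME_RE = re.compile(r"^(?P<func>.*?)\s\((?P<args>[^()]*)\)\s+at\s")
# Matches "name=0x0" and "name=name@entry=0x0", but not "name=0x0123" or "name=0".
NULL_ARG_RE = re.compile(r"(\w+)=(?:\w+@entry=)?0x0\b")


def null_pointer_args(stacktrace: str) -> list:
    """Return (function, argument) pairs whose argument value is a null pointer."""
    hits = []
    for line in stacktrace.splitlines():
        frame = FRAME_RE.match(line.strip())
        if frame is None:
            continue  # not a "func (args) at file" line
        for name in NULL_ARG_RE.findall(frame.group("args")):
            hits.append((frame.group("func"), name))
    return hits
```

Run against the StacktraceTop lines quoted here, this would flag the key argument of _mesa_hash_string and _mesa_hash_table_search and nothing else.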

Daniel van Vugt (vanvugt) wrote :

Private bug 2016299 now has more complete stack traces.

Daniel van Vugt (vanvugt) wrote :

Setting to low because high-end Nvidia datacenter hardware should be using Xorg instead for the foreseeable future.

Changed in mutter (Ubuntu):
importance: Undecided → Low
Daniel van Vugt (vanvugt) wrote :

I've set up a dual Nvidia system and cannot reproduce any crashes.

The second Nvidia card fails to display anything if a monitor is attached but that's basically the same as bug 1978905, and the same experimental patch (https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/3416) mostly fixes that.

Do we care about this bug anymore? Is it reproducible by anyone?

Changed in mutter (Ubuntu):
status: New → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for mutter (Ubuntu) because there has been no activity for 60 days.]

Changed in mutter (Ubuntu):
status: Incomplete → Expired