Crash

Bug #2058838 reported by Byron
This bug affects 1 person
Affects                               Status  Importance  Assigned to        Milestone
linux (Ubuntu)                        New     Undecided   Jose Ogando Justo
nvidia-graphics-drivers-535 (Ubuntu)  New     Undecided   Unassigned

Bug Description

When I switch the KVM, my X session crashes. For some reason, syslog gets cleared as well.

ProblemType: Bug
DistroRelease: Ubuntu 24.04
Package: xorg 1:7.7+23ubuntu2
ProcVersionSignature: Ubuntu 6.8.0-11.11-generic 6.8.0-rc4
Uname: Linux 6.8.0-11-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
.proc.driver.nvidia.capabilities.gpu0: Error: path was not a regular file.
.proc.driver.nvidia.capabilities.mig: Error: path was not a regular file.
.proc.driver.nvidia.gpus.0000.01.00.0: Error: path was not a regular file.
.proc.driver.nvidia.registry: Binary: ""
.proc.driver.nvidia.suspend: suspend hibernate resume
.proc.driver.nvidia.suspend_depth: default modeset uvm
.proc.driver.nvidia.version:
 NVRM version: NVIDIA UNIX x86_64 Kernel Module 535.161.07 Sat Feb 17 22:55:48 UTC 2024
 GCC version:
ApportVersion: 2.28.0-0ubuntu1
Architecture: amd64
BootLog: Error: [Errno 13] Permission denied: '/var/log/boot.log'
CasperMD5CheckResult: pass
CompositorRunning: None
CurrentDesktop: ubuntu:GNOME
Date: Sat Mar 23 23:58:33 2024
DistUpgraded: Fresh install
DistroCodename: noble
DistroVariant: ubuntu
ExtraDebuggingInterest: Yes, including running git bisection searches
GraphicsCard:
 NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1) (prog-if 00 [VGA controller])
   Subsystem: NVIDIA Corporation GA102GL [RTX A5000] [10de:147e]
InstallationDate: Installed on 2022-04-07 (717 days ago)
InstallationMedia: Ubuntu 22.04 LTS "Jammy Jellyfish" - Daily amd64 (20220405)
MachineType: ASUS System Product Name
ProcEnviron:
 LANG=en_US.UTF-8
 PATH=(custom, no user)
 SHELL=/bin/bash
 TERM=xterm-256color
 XDG_RUNTIME_DIR=<set>
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-6.8.0-11-generic root=UUID=c56061b9-1478-4914-a919-96eb077fff48 ro quiet splash nvidia-drm.modeset=1 initcall_blacklist=acpi_cpufreq_init amd_pstate=active amd_pstate.shared_mem=1 acpi_enforce_resources=lax vt.handoff=7
SourcePackage: xorg
Symptom: display
Title: Xorg crash
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 09/26/2023
dmi.bios.release: 18.7
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 1807
dmi.board.asset.tag: Default string
dmi.board.name: PRIME B650-PLUS
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: Rev 1.xx
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1807:bd09/26/2023:br18.7:svnASUS:pnSystemProductName:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnPRIMEB650-PLUS:rvrRev1.xx:cvnDefaultstring:ct3:cvrDefaultstring:skuSKU:
dmi.product.family: To be filled by O.E.M.
dmi.product.name: System Product Name
dmi.product.sku: SKU
dmi.product.version: System Version
dmi.sys.vendor: ASUS
mtime.conffile..etc.init.d.apport: 2024-02-22T08:20:00
version.compiz: compiz N/A
version.libdrm2: libdrm2 2.4.120-2
version.libgl1-mesa-dri: libgl1-mesa-dri 24.0.1-1ubuntu1
version.libgl1-mesa-glx: libgl1-mesa-glx N/A
version.nvidia-graphics-drivers: nvidia-graphics-drivers-* N/A
version.xserver-xorg-core: xserver-xorg-core 2:21.1.11-2ubuntu1
version.xserver-xorg-input-evdev: xserver-xorg-input-evdev N/A
version.xserver-xorg-video-ati: xserver-xorg-video-ati 1:22.0.0-1
version.xserver-xorg-video-intel: xserver-xorg-video-intel 2:2.99.917+git20210115-1
version.xserver-xorg-video-nouveau: xserver-xorg-video-nouveau 1:1.0.17-2build1

Byron (byron-goodman) wrote :
Daniel van Vugt (vanvugt) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It sounds like some part of the system has crashed. To help us find the cause of the crash please follow these steps:

1. Run these commands:
    journalctl -b0 > journal.txt
    journalctl -b-1 > prevjournal.txt
and attach the resulting text files here.

2. Look in /var/crash for crash files and if found run:
    ubuntu-bug YOURFILE.crash
Then tell us the ID of the newly-created bug.

3. If step 2 failed then look at https://errors.ubuntu.com/user/ID where ID is the content of file /var/lib/whoopsie/whoopsie-id on the machine. Do you find any links to recent problems on that page? If so then please send the links to us.

Please take care to avoid attaching .crash files to bugs as we are unable to process them as file attachments. It would also be a security risk for yourself.
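Steps 2 and 3 above can be sketched in Python as follows. This is only a minimal illustration, not an official Ubuntu tool: the helper names `find_crash_files` and `errors_url` are hypothetical, and the sketch assumes the default locations named in the steps (`/var/crash` and `/var/lib/whoopsie/whoopsie-id`). Reading the whoopsie ID may require root.

```python
from pathlib import Path


def find_crash_files(crash_dir="/var/crash"):
    """Return the .crash files in crash_dir, sorted by name (empty if none)."""
    directory = Path(crash_dir)
    if not directory.is_dir():
        return []
    return sorted(directory.glob("*.crash"))


def errors_url(id_file="/var/lib/whoopsie/whoopsie-id"):
    """Build the errors.ubuntu.com URL from the whoopsie ID, or return None."""
    try:
        machine_id = Path(id_file).read_text().strip()
    except OSError:
        # The file may not exist or may be readable only by root.
        return None
    if not machine_id:
        return None
    return f"https://errors.ubuntu.com/user/{machine_id}"
```

Each path returned by `find_crash_files()` would then be passed to `ubuntu-bug`; if the list is empty, `errors_url()` gives the page to check instead.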

affects: xorg (Ubuntu) → ubuntu
Changed in ubuntu:
status: New → Incomplete
summary: - Xorg crash
+ Crash
Byron (byron-goodman) wrote :

I found no .crash files related to this in /var/crash. See the attached prevjournal.txt file; I'll post journal.txt next.

Byron (byron-goodman) wrote :

See attached file.

Byron (byron-goodman) wrote :

I also don't see any crashes related to this under the machine's whoopsie ID. I do see several other crashes, though. Let me know if you want any of these. See attached image.

Byron (byron-goodman) wrote :

When this occurs, I see the following in syslog.

2024-03-24T11:57:21.284582-05:00 jove kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to grab modeset ownership

2024-03-24T13:17:39.284647-05:00 jove kernel: UBSAN: array-index-out-of-bounds in build/nvidia/535.161.07/build/nvidia-uvm/uvm_pmm_gpu.c:2364:28
2024-03-24T13:17:39.284648-05:00 jove kernel: index 0 is out of range for type 'uvm_gpu_chunk_t *[*]'
2024-03-24T13:17:39.284648-05:00 jove kernel: CPU: 13 PID: 123928 Comm: ghb Tainted: P W O 6.8.0-11-generic #11-Ubuntu
2024-03-24T13:17:39.284649-05:00 jove kernel: Hardware name: ASUS System Product Name/PRIME B650-PLUS, BIOS 1807 09/26/2023
2024-03-24T13:17:39.284649-05:00 jove kernel: Call Trace:
2024-03-24T13:17:39.284650-05:00 jove kernel: <TASK>
2024-03-24T13:17:39.284651-05:00 jove kernel: dump_stack_lvl+0x48/0x70
2024-03-24T13:17:39.284651-05:00 jove kernel: dump_stack+0x10/0x20
2024-03-24T13:17:39.284651-05:00 jove kernel: __ubsan_handle_out_of_bounds+0xc6/0x110
2024-03-24T13:17:39.284658-05:00 jove kernel: split_gpu_chunk+0x13f/0x410 [nvidia_uvm]
2024-03-24T13:17:39.284658-05:00 jove kernel: uvm_pmm_gpu_alloc+0x2da/0x6d0 [nvidia_uvm]
2024-03-24T13:17:39.284658-05:00 jove kernel: phys_mem_allocate+0xac/0x230 [nvidia_uvm]
2024-03-24T13:17:39.284659-05:00 jove kernel: allocate_directory+0xb4/0x130 [nvidia_uvm]
2024-03-24T13:17:39.284659-05:00 jove kernel: ? allocate_directory+0xb4/0x130 [nvidia_uvm]
2024-03-24T13:17:39.284659-05:00 jove kernel: uvm_page_tree_init+0x133/0x450 [nvidia_uvm]
2024-03-24T13:17:39.284660-05:00 jove kernel: uvm_gpu_retain_by_uuid+0x19df/0x2b80 [nvidia_uvm]
2024-03-24T13:17:39.284660-05:00 jove kernel: uvm_va_space_register_gpu+0x47/0x740 [nvidia_uvm]
2024-03-24T13:17:39.284660-05:00 jove kernel: ? srso_alias_return_thunk+0x5/0xfbef5
2024-03-24T13:17:39.284661-05:00 jove kernel: ? rmqueue+0x825/0xee0
2024-03-24T13:17:39.284661-05:00 jove kernel: uvm_api_register_gpu+0x5a/0x90 [nvidia_uvm]
2024-03-24T13:17:39.284661-05:00 jove kernel: uvm_ioctl+0x1a26/0x1cd0 [nvidia_uvm]
2024-03-24T13:17:39.284662-05:00 jove kernel: ? srso_alias_return_thunk+0x5/0xfbef5
2024-03-24T13:17:39.284662-05:00 jove kernel: ? __mod_lruvec_state+0x36/0x50
2024-03-24T13:17:39.284662-05:00 jove kernel: ? srso_alias_return_thunk+0x5/0xfbef5
2024-03-24T13:17:39.284663-05:00 jove kernel: ? srso_alias_return_thunk+0x5/0xfbef5
2024-03-24T13:17:39.284664-05:00 jove kernel: ? next_uptodate_folio+0xa9/0x320
2024-03-24T13:17:39.284664-05:00 jove kernel: ? srso_alias_return_thunk+0x5/0xfbef5
2024-03-24T13:17:39.284665-05:00 jove kernel: ? filemap_map_pages+0x2fe/0x4c0
2024-03-24T13:17:39.284665-05:00 jove kernel: ? srso_alias_return_thunk+0x5/0xfbef5
2024-03-24T13:17:39.284666-05:00 jove kernel: ? _raw_spin_lock_irqsave+0xe/0x20
2024-03-24T13:17:39.284666-05:00 jove kernel: ? srso_alias_return_thunk+0x5/0xfbef5
2024-03-24T13:17:39.284666-05:00 jove kernel: ? thread_context_non_interrupt_add+0x13a/0x250 [nvidia_uvm]
2024-03-24T13:17:39.284666-05:00 jove kernel: uvm_unlocked_ioctl_entry.part.0+0x7b/0xf0 [nvidia_uvm]
2024-03-24T13:17:39.284667-05:...
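
A trace like the one above can be condensed by keeping only the UBSAN headline and the call-trace frames that belong to a kernel module. The following is a rough sketch, assuming log lines in the same format as this excerpt; the function name `summarize_ubsan` and both regular expressions are illustrative and are not part of any existing tool. Frames prefixed with `? ` are skipped because the kernel marks those as unreliable.

```python
import re

# Matches frames such as "kernel: split_gpu_chunk+0x13f/0x410 [nvidia_uvm]".
# Group 1 is the optional "? " marker, group 2 the symbol, group 3 the module.
FRAME_RE = re.compile(
    r"kernel:\s+(\? )?(\w[\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?\s*$"
)
UBSAN_RE = re.compile(r"kernel:\s+UBSAN: (.+)$")


def summarize_ubsan(lines):
    """Return a list of (ubsan_message, [reliable module frames]) tuples."""
    reports = []
    for line in lines:
        line = line.rstrip("\n")
        ubsan = UBSAN_RE.search(line)
        if ubsan:
            # Start a new report; later frames are attached to it.
            reports.append((ubsan.group(1), []))
            continue
        frame = FRAME_RE.search(line)
        # Keep only reliable frames (no "? ") that name a module.
        if frame and reports and not frame.group(1) and frame.group(3):
            reports[-1][1].append(f"{frame.group(2)} [{frame.group(3)}]")
    return reports
```

It could be run over the attached journal with `summarize_ubsan(open("journal.txt"))`, which makes it easier to compare whether the same nvidia_uvm frames appear on every occurrence.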

Byron (byron-goodman) wrote :

Something is erasing syslog. It was there, and now it is gone.

Byron (byron-goodman) wrote :

The syslog problem is a separate issue. I see the following, and then the syslog history files are wiped:

2024-03-17T00:00:02.188526-05:00 jove systemd[1]: rsyslog.service: Sent signal SIGHUP to main process 2496 (rsyslogd) on client request.
2024-03-17T00:00:02.188713-05:00 jove rsyslogd: [origin software="rsyslogd" swVersion="8.2312.0" x-pid="2496" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
2024-03-17T00:00:02.193749-05:00 jove systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
2024-03-17T00:00:02.193826-05:00 jove systemd[1]: logrotate.service: Failed with result 'exit-code'.
2024-03-17T00:00:02.193964-05:00 jove systemd[1]: Failed to start logrotate.service - Rotate log files.
2024-03-17T00:00:03.396117-05:00 jove crontab[151217]: (root) LIST (root)
2024-03-17T00:00:03.397263-05:00 jove crontab[151218]: (root) LIST (root)

Daniel van Vugt (vanvugt) wrote :

Did you run out of disk space causing log rotation to fail?

Also, are you sure the same nvidia kernel messages appear every time with this bug?
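
The disk-space question can be answered quickly with a check like the one below. It is a small sketch that assumes the logs live on the filesystem holding `/var/log`; the helper name `free_fraction` is made up for illustration.

```python
import shutil


def free_fraction(path="/var/log"):
    """Return the fraction (0.0 to 1.0) of free space on path's filesystem."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a zero total size.
        return 0.0
    return usage.free / usage.total
```

A value close to zero would point toward a full disk as the reason for the logrotate failure; a comfortable margin rules that explanation out.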

Byron (byron-goodman) wrote :

I am not close to running out of space. I see a DRM GPU error when I switch back. It blinks and is gone, and then gdm prompts me to log in again; my entire X session is gone. Wayland has a similar problem, except that gdm never comes back and everything is frozen.

I don't think the syslog issue is related, but it might be. Let me know if there is something you want me to do. I could also give you access to the machine via VPN. I haven't done it in a while, but I could attach gdb to my session and see where it's crashing if this is difficult to find. It is reproducible every time. Something tells me the problem is related to the driver: when the display goes away, the DRM context is released and then accessed again. But take that with a grain of salt; I haven't done any low-level GPU development in 15 years.

affects: ubuntu → nvidia-graphics-drivers-535 (Ubuntu)
Changed in nvidia-graphics-drivers-535 (Ubuntu):
status: Incomplete → New
tags: added: nvidia
Changed in linux (Ubuntu):
assignee: nobody → Jose Ogando Justo (joseogando)