Comment 7 for bug 2031367

dann frazier (dannf) wrote: Re: No video output from discrete GPU (DGX Station A100)

Thank you for the sosreport... let's see here...

# dmesg shows you are booted on the correct kernel:
$ head -1 ./sos_commands/kernel/dmesg
[ 0.000000] Linux version 5.15.0-1030-nvidia (buildd@bos03-amd64-047) (gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #30-Ubuntu SMP Tue Jul 18 19:22:42 UTC 2023 (Ubuntu 5.15.0-1030.30-nvidia 5.15.111)
$

# great.

# But there's no nvidia module loaded:
$ grep nvidia lsmod
$

# weird.

# You do have the correct metapackage for the nvidia modules installed:
$ grep linux-modules-nvidia-535-server-nvidia ./sos_commands/dpkg/dpkg_-l
ii linux-modules-nvidia-535-server-nvidia 5.15.0-1029.29+1 amd64 Extra drivers for nvidia-535-server for the nvidia flavour

# But why is it version 5.15.0-*1029*.29+1, and not 1030? That means the nvidia modules you have installed were built for the previous kernel (1029), not the kernel you are running (1030), which explains why no nvidia module is loaded. What could have caused that? Was the archive out of sync?
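
# The mismatch above can be checked mechanically. This is only a minimal sketch: the helper names abi_of and abi_matches are made up for illustration, and it assumes the ABI number is the first numeric field after the first "-" in both the kernel release and the package version, as it is in the strings quoted in this comment.

```shell
#!/bin/sh
# Extract the ABI number (e.g. 1030) from a kernel release such as
# "5.15.0-1030-nvidia" or a package version such as "5.15.0-1029.29+1".
# (abi_of is a hypothetical helper, not part of any Ubuntu tooling.)
abi_of() {
    printf '%s\n' "$1" | sed -n 's/^[0-9.]*-\([0-9][0-9]*\).*$/\1/p'
}

# Succeed only when both strings carry the same, non-empty ABI number.
abi_matches() {
    [ -n "$(abi_of "$1")" ] && [ "$(abi_of "$1")" = "$(abi_of "$2")" ]
}

# Possible use on a live system (package name taken from this bug):
#   abi_matches "$(uname -r)" \
#       "$(dpkg-query -W -f='${Version}' linux-modules-nvidia-535-server-nvidia)" \
#       || echo "nvidia modules metapackage does not match the running kernel"
```

# With the values from this sosreport, abi_of gives 1030 for the kernel and 1029 for the metapackage, so abi_matches fails.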

# According to Launchpad, the 1030 nvidia modules were released to jammy-updates on 2023-08-14 07:39:06 UTC:
# https://launchpad.net/ubuntu/+source/linux-restricted-modules-nvidia/+publishinghistory

# The 1030 kernel was released at the same time:
# 2023-08-14 07:39:06 UTC

# So if one was available, both should've been available.

# And indeed, your system sees that the modules for 1030 are currently available.
# From ./sos_commands/apt/apt-cache_policy_details :

linux-modules-nvidia-535-server-nvidia:
  Installed: 5.15.0-1029.29+1
  Candidate: 5.15.0-1030.30+1
  Version table:
     5.15.0-1030.30+1 500
        500 http://us.archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 Packages
 *** 5.15.0-1029.29+1 500
        500 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages
        100 /var/lib/dpkg/status

# But the sosreport was collected after the install, so the 1030 modules may not have been visible to apt yet at the moment you ran the install.
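
# To spot this state without reading the policy output by eye, something like the following could be used. It is a rough sketch: policy_versions is a hypothetical helper name, and it assumes the "Installed:" / "Candidate:" layout that apt-cache policy printed above.

```shell
#!/bin/sh
# Read `apt-cache policy <pkg>` output on stdin and print
# "<installed> <candidate>" on one line.
# (policy_versions is a hypothetical helper for illustration.)
policy_versions() {
    awk '$1 == "Installed:" { inst = $2 }
         $1 == "Candidate:" { cand = $2 }
         END { print inst, cand }'
}

# Possible use on a live system (package name taken from this bug):
#   apt-cache policy linux-modules-nvidia-535-server-nvidia | policy_versions
# or against the file saved in the sosreport:
#   policy_versions < ./sos_commands/apt/apt-cache_policy_details
# Note: that file covers many packages, so the last stanza would win;
# filter it down to the package of interest first.
```

# Fed the stanza quoted above, this prints "5.15.0-1029.29+1 5.15.0-1030.30+1"; any difference between the two fields means an upgrade is pending.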

# The command you ran looks correct (from var/log/auth.log):
Aug 14 03:45:39 u-DGX-Station-A100-920-23487-2531-000 sudo: u : TTY=pts/2 ; PWD=/home/u ; USER=root ; COMMAND=/usr/bin/apt install nvidia-utils-535-server nvidia-kernel-source-535-server linux-modules-nvidia-535-server-nvidia nvidia-fabricmanager-535 nvidia-driver-535-server

# And if I run it now in a fresh VM, it does the right thing:

$ sudo apt install nvidia-utils-535-server nvidia-kernel-source-535-server linux-modules-nvidia-535-server-nvidia nvidia-fabricmanager-535 nvidia-driver-535-server --dry-run | grep linux-modules-nvidia

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

  linux-modules-nvidia-535-server-5.15.0-1030-nvidia
  linux-modules-nvidia-535-server-5.15.0-1030-nvidia
  linux-modules-nvidia-535-server-nvidia
Inst linux-modules-nvidia-535-server-5.15.0-1030-nvidia (5.15.0-1030.30+1 Ubuntu:22.04/jammy-updates [amd64])
Inst linux-modules-nvidia-535-server-nvidia (5.15.0-1030.30+1 Ubuntu:22.04/jammy-updates [amd64])
Conf linux-modules-nvidia-535-server-5.15.0-1030-nvidia (5.15.0-1030.30+1 Ubuntu:22.04/jammy-updates [amd64])
Conf linux-modules-nvidia-535-server-nvidia (5.15.0-1030.30+1 Ubuntu:22.04/jammy-updates [amd64])

# My only theory is that you happened to run `apt update` at a moment when, on your mirror, the Packages file for main (where linux-nvidia lives) was ahead of the Packages file for restricted (where linux-modules-nvidia-535-server-nvidia lives). I suspect that if you run `sudo apt update && sudo apt dist-upgrade` and reboot, your discrete video will recover.
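
# For completeness, here is how the recovery suggested above could be verified after the reboot. It is a sketch only: nvidia_loaded is a hypothetical helper name, and it simply looks for any module whose name starts with "nvidia" in lsmod-style output.

```shell
#!/bin/sh
# Succeed when lsmod-style text on stdin lists at least one module whose
# name (first column) starts with "nvidia".
# (nvidia_loaded is a hypothetical helper for illustration.)
nvidia_loaded() {
    awk '$1 ~ /^nvidia/ { found = 1 } END { exit !found }'
}

# Suggested sequence on the affected machine (nothing here runs it for you):
#   sudo apt update && sudo apt dist-upgrade
#   sudo reboot
# After the reboot:
#   lsmod | nvidia_loaded && echo "nvidia module loaded" \
#                         || echo "nvidia module still missing"
```

# If the module is still missing after that, a fresh sosreport would help to see what apt installed the second time.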