Archive races can cause nvidia driver / kernel version ABI mismatch

Bug #2031367 reported by Weichen Wu
This bug affects 1 person
Affects                                      Status       Importance  Assigned to      Milestone
Ubuntu                                       Confirmed    Undecided   Unassigned
linux-restricted-modules (Ubuntu)            In Progress  High        Andy Whitcroft
nvidia-graphics-drivers-535-server (Ubuntu)  In Progress  High        Alberto Milone

Bug Description

[Summary]
After selecting Ubuntu from the BIOS boot menu, video output from the discrete GPU is lost.

* The built-in VGA port works fine: it drives the monitor and the system boots into the OS.
Only the discrete GPU has the issue.

[Steps to reproduce]
It is not currently reproducible but, at some point in the past, these steps resulted in the problem:
1. Install Ubuntu 22.04
2. Install linux-nvidia
3. Install linux-modules-nvidia-535-server-nvidia

[Actual result]
The linux-nvidia kernel with ABI 1030 was installed alongside a linux-modules-nvidia-535-server-nvidia package built for ABI 1029.

[Additional information]
CID: 202307-31886
SKU: DGX station A100
system-manufacturer: NVIDIA
system-product-name: DGX Station A100 920-23487-2531-000
bios-version: L9.28C
CPU: AMD EPYC 7742 64-Core Processor (128x)
GPU: 01:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 SXM4 80GB] [10de:20b2] (rev a1)
46:00.0 VGA compatible controller [0300]: ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000] (rev 41)
47:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 SXM4 80GB] [10de:20b2] (rev a1)
81:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 SXM4 80GB] [10de:20b2] (rev a1)
c1:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU117GLM [Quadro T1000 Mobile] [10de:1fb0] (rev a1)
c2:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 SXM4 80GB] [10de:20b2] (rev a1)
Nvidia Driver: 535.54.03
kernel-version: 5.15.0-1030-nvidia

[Stage]
Issue reported and logs collected at a later stage

Weichen Wu (weichenwu) wrote :

Automatically attached

summary: - No video output from discrete GPU
+ No video output from discrete GPU (DGX A100)
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Libera.chat.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/2031367/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
dann frazier (dannf)
summary: - No video output from discrete GPU (DGX A100)
+ No video output from discrete GPU (DGX Station A100)
dann frazier (dannf) wrote : Re: No video output from discrete GPU (DGX Station A100)

Thank you for the sosreport... let's see here...

# dmesg shows you are booted on the correct kernel:
$ head -1 ./sos_commands/kernel/dmesg
[ 0.000000] Linux version 5.15.0-1030-nvidia (buildd@bos03-amd64-047) (gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #30-Ubuntu SMP Tue Jul 18 19:22:42 UTC 2023 (Ubuntu 5.15.0-1030.30-nvidia 5.15.111)
$

# great.

# But there's no nvidia module loaded:
$ grep nvidia lsmod
$

# weird.

# You do have the correct metapackage for the nvidia modules installed:
$ grep linux-modules-nvidia-535-server-nvidia ./sos_commands/dpkg/dpkg_-l
ii linux-modules-nvidia-535-server-nvidia 5.15.0-1029.29+1 amd64 Extra drivers for nvidia-535-server for the nvidia flavour

# But.. why is it version 5.15.0-*1029*.29+1, and not 1030? That means you have an nvidia driver for the wrong kernel version, which explains this result. What could have caused that? Was the archive out of sync?
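The comparison above can be sketched as a small shell check. This is only an illustration: the function names `kernel_abi`, `package_abi` and `abi_matches` are made up for this example, and the patterns assume the version formats seen in this report (`5.15.0-1030-nvidia` for the kernel release, `5.15.0-1029.29+1` for the modules package).

```shell
#!/bin/sh
# Hypothetical helpers, not part of any Ubuntu tooling.

# Extract the ABI number from a kernel release string such as
# "5.15.0-1030-nvidia" (the format printed by `uname -r`).
kernel_abi() {
    printf '%s\n' "$1" | sed -n 's/^[0-9.]*-\([0-9][0-9]*\)-.*$/\1/p'
}

# Extract the ABI number from a modules package version such as
# "5.15.0-1029.29+1".
package_abi() {
    printf '%s\n' "$1" | sed -n 's/^[0-9.]*-\([0-9][0-9]*\)\..*$/\1/p'
}

# Succeed only when both strings carry the same, non-empty ABI number.
abi_matches() {
    k=$(kernel_abi "$1")
    p=$(package_abi "$2")
    [ -n "$k" ] && [ "$k" = "$p" ]
}

# Example with the values from this report:
#   abi_matches "5.15.0-1030-nvidia" "5.15.0-1029.29+1" || echo "ABI mismatch"
```

On a live system the two inputs could come from `uname -r` and `dpkg-query -W -f='${Version}' linux-modules-nvidia-535-server-nvidia`.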

# According to Launchpad, the 1030 nvidia modules were released to jammy-updates on 2023-08-14 07:39:06 UTC :
# https://launchpad.net/ubuntu/+source/linux-restricted-modules-nvidia/+publishinghistory

# The 1030 kernel was released at the same time:
# 2023-08-14 07:39:06 UTC

# So if one was available, both should've been available.

# And indeed, your system sees that the modules for 1030 are currently available.
# From ./sos_commands/apt/apt-cache_policy_details :

linux-modules-nvidia-535-server-nvidia:
  Installed: 5.15.0-1029.29+1
  Candidate: 5.15.0-1030.30+1
  Version table:
     5.15.0-1030.30+1 500
        500 http://us.archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 Packages
 *** 5.15.0-1029.29+1 500
        500 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages
        100 /var/lib/dpkg/status

# But the sosreport was collected after the install, so maybe it did not become available until later.
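For anyone triaging a similar sosreport, the Installed/Candidate comparison can be automated. The sketch below assumes the `apt-cache policy` layout shown above; `pending_upgrade` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `apt-cache policy <package>` output on stdin
# and print "<installed> <candidate>" when the two versions differ.
# Prints nothing when they are equal or when no Installed line is found.
pending_upgrade() {
    awk '
        $1 == "Installed:" { installed = $2 }
        $1 == "Candidate:" { candidate = $2 }
        END {
            if (installed != "" && installed != candidate)
                print installed, candidate
        }
    '
}

# Possible usage on a live system:
#   apt-cache policy linux-modules-nvidia-535-server-nvidia | pending_upgrade
```

As noted above, this only tells you what the archive offers at the time of the check, not what it offered when the install ran.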

# The command you ran looks correct (from var/log/auth.log):
Aug 14 03:45:39 u-DGX-Station-A100-920-23487-2531-000 sudo: u : TTY=pts/2 ; PWD=/home/u ; USER=root ; COMMAND=/usr/bin/apt install nvidia-utils-535-server nvidia-kernel-source-535-server linux-modules-nvidia-535-server-nvidia nvidia-fabricmanager-535 nvidia-driver-535-server

# And if I run it now in a fresh VM, it does the right thing:

$ sudo apt install nvidia-utils-535-server nvidia-kernel-source-535-server linux-modules-nvidia-535-server-nvidia nvidia-fabricmanager-535 nvidia-driver-535-server --dry-run | grep linux-modules-nvidia

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

  linux-modules-nvidia-535-server-5.15.0-1030-nvidia
  linux-modules-nvidia-535-server-5.15.0-1030-nvidia
  linux-modules-nvidia-535-server-nvidia
Inst linux-modules-nvidia-535-server-5.15.0-1030-nvidia (5.15.0-1030.30+1 Ubuntu:22.04/jammy-updates [amd64])
Inst linux-modules-nvidia-535-server-nvidia (5.15.0-1030.30+1 Ubuntu:22.04/jammy-updates [amd64])
Conf linux-modules-nvidia-535-server-5.15.0-1030-nvidia (5.15.0-1030.30+1 Ubuntu:22.04/jammy-updates [amd64])
Conf linux-modules-nvidia-535-server-nvidia (5.15.0-1030.30+1 Ubuntu:22.04/jammy-updates [a...


dann frazier (dannf)
description: updated
Dimitri John Ledkov (xnox) wrote :

# My only theory is that, maybe, you happened to run an `apt update` in a weird situation where the Packages file for main (where linux-nvidia lives) was ahead of the Packages file for restricted (where linux-modules-nvidia-535-server-nvidia lives) on your mirror. I suspect that if you ran `sudo apt update; sudo apt dist-upgrade` and rebooted, your discrete video would recover.

That, or the mirror in use does staged updates and publishes main before restricted.

Strictly speaking, though, nothing at the packaging level today enforces that the right kernel modules are installed for the currently booted kernel, or that modules are installed for every installed kernel ABI. Installing a new kernel flavour or a given nvidia driver does not force an upgrade of the other, nor does it request or indicate that a system reboot is required.
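Since the packaging does not enforce this, a manual check is one way to spot the gap. A minimal sketch, assuming you have already collected the ABI numbers of the installed kernels and of the installed nvidia modules packages; `missing_module_abis` is a hypothetical name, and both arguments are space-separated lists.

```shell
#!/bin/sh
# Hypothetical helper: print each ABI from the first list that is absent
# from the second. Both arguments are space-separated lists of ABI numbers.
missing_module_abis() {
    for abi in $1; do
        case " $2 " in
            *" $abi "*) ;;                  # modules exist for this kernel ABI
            *) printf '%s\n' "$abi" ;;      # no matching modules package
        esac
    done
}

# Example with the situation from this report (kernels 1029 and 1030
# installed, modules only for 1029):
#   missing_module_abis "1029 1030" "1029"
```

Any ABI it prints is a kernel that would boot without a matching nvidia module.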

Ideally one should not use apt to install nvidia modules, as that doesn't ensure that matching userspace (if needed) is installed.

Please use `sudo ubuntu-drivers --gpgpu install` instead, as it is more likely to do the right thing than hand invocations of apt. There is also ubuntu-drivers integration to do something similar via cloud-init and autoinstall.yaml.

dann frazier (dannf) wrote :

While ubuntu-drivers is certainly the recommended approach, is it immune from these archive-side races?

Could the linux-modules-nvidia-*-$flavor metapackages have tight versioned deps on the corresponding linux-image-$flavor metapackage, presumably causing the new kernel to be held back until the corresponding modules are also available?

Weichen Wu (weichenwu) wrote :

I tried `apt update` and `apt dist-upgrade`, but the discrete GPU didn't recover.

Attached new sosreport collected after reboot.

dann frazier (dannf) wrote :

Thank you @weichenwu. This time the nvidia driver is available and loaded, so I think this is a different issue. I've opened bug 2033418 to track it.

dann frazier (dannf)
summary: - No video output from discrete GPU (DGX Station A100)
+ Archive races can cause nvidia driver / kernel version ABI mismatch
Changed in linux-restricted-modules (Ubuntu):
importance: Undecided → High
status: New → In Progress
assignee: nobody → Andy Whitcroft (apw)
Changed in nvidia-graphics-drivers-535-server (Ubuntu):
status: New → In Progress
importance: Undecided → High
assignee: nobody → Alberto Milone (albertomilone)