Loaded previous kernel module breaks during upgrade

Bug #2031178 reported by Benjamin Fischer
This bug affects 1 person
Affects: nvidia-graphics-drivers-535 (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Issue:
During the upgrade of the packages for driver version 535, the previous driver, which is still loaded, breaks in such a way that the GPU(s) become unusable until reboot.

Symptoms:
1. all currently running and newly started processes interacting with the GPU(s) break:
  - this affects both of the following APIs individually: CUDA, NVML
  - the processes become stuck at 100% (single thread) system CPU load, i.e. they are stuck in an (interruptible) syscall - they can be stopped (via SIGINT/-TERM/-KILL)
  - some NVML executables show an erroneous total user+system time of millions of hours (far beyond the possible "uptime times CPU threads") - this may hint at bad memory accesses/writes within the kernel
2. once no processes use the GPU anymore (i.e. they have been manually stopped), the kernel reports hung tasks in the `nvidia` and `nvidia_uvm` modules (see attachment)
3. the `nvidia_uvm` kernel module cannot be unloaded: `rmmod` becomes stuck until reboot
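
For anyone who wants to check the "uptime times CPU threads" bound from symptom 1 on their own node, here is a minimal sketch. It assumes a Linux `/proc` layout; the function names are made up for illustration and are not part of any NVIDIA tool.

```python
import os


def max_plausible_cpu_seconds(uptime_seconds, cpu_threads):
    """Upper bound on the CPU time any process can have accumulated since boot."""
    return uptime_seconds * cpu_threads


def is_cpu_time_plausible(cpu_seconds, uptime_seconds, cpu_threads):
    """True if a reported user+system time fits within uptime * CPU threads."""
    return 0 <= cpu_seconds <= max_plausible_cpu_seconds(uptime_seconds, cpu_threads)


def read_uptime_seconds(path="/proc/uptime"):
    """First field of /proc/uptime is the time since boot in seconds."""
    with open(path) as f:
        return float(f.read().split()[0])


def read_process_cpu_seconds(pid):
    """user + system time of a process, from fields 14 and 15 of /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The command name (field 2) may contain spaces, so split after the last ')'.
    fields = stat.rsplit(")", 1)[1].split()
    utime_ticks, stime_ticks = int(fields[11]), int(fields[12])
    return (utime_ticks + stime_ticks) / os.sysconf("SC_CLK_TCK")
```

A process for which `is_cpu_time_plausible(read_process_cpu_seconds(pid), read_uptime_seconds(), os.cpu_count())` is false shows the kind of nonsensical runtime described above.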

Expected behavior (has been established through previous ~10 driver package upgrades):
1. all current processes can continue to use the GPU(s) without issue
2. once all processes have stopped using the GPU(s), i.e. none of the `/dev/nvidia*` device files is open, all the nvidia kernel modules can be unloaded (in the appropriate order according to their dependencies) via `modprobe -r` or `rmmod` - after this the new driver can be loaded, i.e. by (re)starting nvidia-persistenced
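
The two preconditions in the expected behavior can be checked roughly as follows. This is only a sketch under assumptions: it relies on the Linux `/proc/<pid>/fd` symlinks and on the "used by" column of `/proc/modules`, needs root to see other users' file descriptors, and the helper names are hypothetical. It does not unload anything itself.

```python
import glob
import os


def pids_with_open_nvidia_devices(proc_root="/proc"):
    """Return PIDs that hold an open file descriptor on any /dev/nvidia* file."""
    pids = set()
    for fd_path in glob.glob(os.path.join(proc_root, "[0-9]*", "fd", "*")):
        try:
            target = os.readlink(fd_path)
        except OSError:
            continue  # process exited or permission denied
        if target.startswith("/dev/nvidia"):
            pids.add(int(fd_path.split(os.sep)[-3]))
    return sorted(pids)


def parse_used_by(modules_text):
    """Map each module to the modules that use it (4th column of /proc/modules)."""
    used_by = {}
    for line in modules_text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue
        name, users = parts[0], parts[3]
        used_by[name] = [] if users == "-" else [u for u in users.split(",") if u]
    return used_by


def nvidia_unload_order(modules_text):
    """Order the nvidia* modules so that every module comes after its users."""
    used_by = parse_used_by(modules_text)
    order, seen = [], set()

    def visit(name):
        if name in seen or name not in used_by:
            return
        seen.add(name)
        for user in used_by[name]:
            visit(user)  # users must be unloaded first
        order.append(name)

    for name in sorted(used_by):
        if name.startswith("nvidia"):
            visit(name)
    return order
```

If `pids_with_open_nvidia_devices()` returns an empty list, the modules returned by `nvidia_unload_order(open("/proc/modules").read())` could be passed to `rmmod` one by one in that order.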

Partially retained expected behavior:
1. new processes report errors due to version incompatibilities between installed libraries and loaded kernel module
  - e.g. `nvidia-smi` reports an error along the lines of "Driver/library version mismatch"
  - the following kernel message is shown:
    NVRM: API mismatch: the client has the version 535.86.05, but
    NVRM: this kernel module has the version 535.54.03. Please
    NVRM: make sure that this kernel module and all NVIDIA driver
    NVRM: components have the same version.
  - this behavior is retained in the affected versions until the kernel hung-task messages appear
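
To spot this state on many nodes, the two versions can be pulled out of the kernel message quoted above. A minimal sketch, assuming the message keeps the wording shown in this report; the function name is made up for illustration.

```python
import re

# Patterns follow the wording of the NVRM message quoted in this report.
_CLIENT_RE = re.compile(r"client has the version (\d+(?:\.\d+)+)")
_MODULE_RE = re.compile(r"kernel module has the version (\d+(?:\.\d+)+)")


def parse_api_mismatch(log_text):
    """Return (client_version, module_version) from an NVRM API mismatch message.

    Returns None if the text does not contain both versions.
    """
    # Drop everything up to and including the "NVRM:" prefix on each line
    # (this also removes dmesg timestamps), then join the wrapped message.
    joined = " ".join(
        line.split("NVRM:", 1)[-1].strip() for line in log_text.splitlines()
    )
    client = _CLIENT_RE.search(joined)
    module = _MODULE_RE.search(joined)
    if not (client and module):
        return None
    return client.group(1), module.group(1)
```

Feeding it the message above would give the client (library) version 535.86.05 and the loaded kernel module version 535.54.03.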

Affected versions:
 - 535.86.05-0ubuntu0.20.04.2 (previous was 535.54.03-0ubuntu0.20.04.4)
 - 535.54.03-0ubuntu0.20.04.3 (previous was 530.41.03-0ubuntu0.20.04.2)

Environment:
- Ubuntu 20.04.6 LTS (`lsb_release -d`)
- all affected upgrades were automatically installed via unattended-upgrades
- the issue occurred on 15 different nodes with 5 different hardware configurations (Mainboard, CPU, RAM, GPU, etc.) - so it's unlikely to be a hardware issue
- all nodes are operated headless (GPUs not used for graphics output, no Xserver/whatnot installed, access was through SSH)

Related:
The following bugs may be related, since I expect this issue to manifest with the same signature: GPU entirely unusable, and thus a black screen, until reboot
- https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-535/+bug/2025640
- https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-535/+bug/2027614

Benjamin Fischer (benjamin-fischer) wrote :

FYI, screenshot of the `nvidia-smi` with nonsensical runtime.

summary: - Loaded previous kernel breaks during upgrade
+ Loaded previous kernel module breaks during upgrade
description: updated