Comment 0 for bug 2031178

Benjamin Fischer (benjamin-fischer) wrote : Loaded previous kernel breaks during upgrade

Issue:
While the packages for driver version 535 are being upgraded, the previous driver, which is still loaded, breaks in such a way that the GPU(s) become unusable until reboot.

Symptoms:
1. all currently running and newly started processes interacting with the GPU(s) break:
  - this affects both of the following APIs individually: CUDA, NVML
  - the processes become stuck at 100% system CPU load (single thread), i.e. they are stuck in an (interruptible) syscall; they can still be stopped (via SIGINT/SIGTERM/SIGKILL)
  - some NVML executables show an erroneous total user+system time of millions of hours (far beyond the possible "uptime times CPU threads"); this may hint at bad memory accesses/writes within the kernel
2. once no processes use the GPU anymore (i.e. after they were manually stopped), the kernel reports hung tasks in the `nvidia` and `nvidia_uvm` modules (see attachment)
3. the `nvidia_uvm` kernel module cannot be unloaded: `rmmod` becomes stuck until reboot
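
The "uptime times CPU threads" bound mentioned in symptom 1 can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the example numbers (30 days of uptime, 64 CPU threads) are hypothetical, not taken from the affected nodes.

```python
def max_cpu_seconds(uptime_seconds: float, cpu_threads: int) -> float:
    """Upper bound on the user+system CPU time any process can have accumulated."""
    return uptime_seconds * cpu_threads


def is_plausible_cpu_time(cpu_seconds: float, uptime_seconds: float, cpu_threads: int) -> bool:
    """True if the reported CPU time fits within uptime times CPU threads."""
    return 0 <= cpu_seconds <= max_cpu_seconds(uptime_seconds, cpu_threads)


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Read the system uptime in seconds (first field of /proc/uptime on Linux)."""
    with open(path) as f:
        return float(f.read().split()[0])


# Hypothetical node: 30 days of uptime, 64 CPU threads.
uptime = 30 * 86400
threads = 64
reported = 1_000_000 * 3600  # "one million hours" of CPU time, in seconds

bound_hours = max_cpu_seconds(uptime, threads) / 3600
print(f"upper bound: {bound_hours:.0f} h, plausible: "
      f"{is_plausible_cpu_time(reported, uptime, threads)}")
```

On a real node, `read_uptime_seconds()` and `os.cpu_count()` could supply the two inputs instead of the hypothetical values.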

Expected behavior (as established through ~10 previous driver package upgrades):
1. all current processes can continue to use the GPU(s) without issue
2. once all processes have stopped using the GPU(s), i.e. none of the `/dev/nvidia*` device files is open, all the nvidia kernel modules can be unloaded (in appropriate order according to dependencies) via `modprobe -r` or `rmmod` - after this the new driver can be loaded, i.e. by (re)starting nvidia-persistenced
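
The "none of the `/dev/nvidia*` is open" condition can be checked before attempting to unload the modules. The following is a rough sketch, assuming a Linux `/proc` layout; the function names are hypothetical, and it only inspects open file descriptors, so it normally has to run as root to see the processes of other users.

```python
import os
from fnmatch import fnmatch


def is_nvidia_device(path: str) -> bool:
    """True for paths such as /dev/nvidia0, /dev/nvidiactl or /dev/nvidia-uvm."""
    return fnmatch(path, "/dev/nvidia*")


def pids_with_open_nvidia_devices(proc_root: str = "/proc") -> dict:
    """Map each PID to the set of /dev/nvidia* paths it holds open."""
    holders = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed in the meantime
            if is_nvidia_device(target):
                holders.setdefault(int(entry), set()).add(target)
    return holders


if __name__ == "__main__":
    holders = pids_with_open_nvidia_devices()
    if holders:
        for pid, paths in sorted(holders.items()):
            print(pid, ", ".join(sorted(paths)))
    else:
        print("no process has a /dev/nvidia* device open")
```

An empty result is the precondition described above for unloading the modules; it does not by itself guarantee that `rmmod` succeeds.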

Partially retained expected behavior:
1. new processes report errors due to version incompatibilities between installed libraries and loaded kernel module
  - e.g. `nvidia-smi` reports a "Driver/library version mismatch"
  - the following kernel message is shown (split across 4 lines), e.g.:
    NVRM: API mismatch: the client has the version 535.86.05, but
    NVRM: this kernel module has the version 535.54.03. Please
    NVRM: make sure that this kernel module and all NVIDIA driver
    NVRM: components have the same version.
  - this behavior is retained in the affected versions until the kernel hung-task messages appear
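
To spot this state on many nodes, the two versions can be extracted from the kernel log. The sketch below assumes the message keeps the exact wording quoted above; the function name is hypothetical, and the log text would have to be supplied by the caller (for example from `dmesg` output).

```python
import re

# Matches the message once the four "NVRM:" lines have been joined into one.
_MISMATCH = re.compile(
    r"client has the version (\d+(?:\.\d+)*), "
    r"but this kernel module has the version (\d+(?:\.\d+)*)"
)


def parse_api_mismatch(log_text: str):
    """Return (client_version, module_version), or None if no mismatch is logged."""
    parts = []
    for line in log_text.splitlines():
        # Drop everything up to and including the "NVRM:" prefix (e.g. timestamps).
        _, sep, rest = line.partition("NVRM:")
        if sep:
            parts.append(rest.strip())
    match = _MISMATCH.search(" ".join(parts))
    return (match.group(1), match.group(2)) if match else None


sample = """\
NVRM: API mismatch: the client has the version 535.86.05, but
NVRM: this kernel module has the version 535.54.03. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
"""
print(parse_api_mismatch(sample))
```

A client version newer than the module version, as in the sample, indicates that the libraries were upgraded while the old kernel module was still loaded.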

Affected versions:
 - 535.86.05-0ubuntu0.20.04.2 (previous was 535.54.03-0ubuntu0.20.04.4)
 - 535.54.03-0ubuntu0.20.04.3 (previous was 530.41.03-0ubuntu0.20.04.2)

Environment:
- Ubuntu 20.04.6 LTS (`lsb_release -d`)
- all affected upgrades were automatically installed via unattended-upgrades
- the issue occurred on 15 different nodes with 5 different hardware configurations (mainboard, CPU, RAM, GPU, etc.) - so it's unlikely to be a hardware issue
- all nodes are operated headless (GPUs not used for graphics output, no X server or similar installed, access was through SSH)

Related:
The following bugs may be related, since I expect this issue to manifest with the same signature there: GPU entirely unusable, thus black screen, until reboot
- https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-535/+bug/2025640
- https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-535/+bug/2027614