Activity log for bug #2031178

Date Who What changed Old value New value Message
2023-08-11 17:24:47 Benjamin Fischer bug added bug
2023-08-11 17:24:47 Benjamin Fischer attachment added nvidia_hung_kern.log https://bugs.launchpad.net/bugs/2031178/+attachment/5691747/+files/nvidia_hung_kern.log
2023-08-11 17:27:26 Benjamin Fischer attachment added nvidia_broken_runtime.png https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-535/+bug/2031178/+attachment/5691748/+files/nvidia_broken_runtime.png
2023-08-11 17:37:12 Benjamin Fischer summary
Old value: Loaded previous kernel breaks during upgrade
New value: Loaded previous kernel module breaks during upgrade
2023-08-11 17:44:12 Benjamin Fischer description

Old value:

Issue:
During the upgrade of the packages for driver version 535, the previously loaded driver breaks in such a way that the GPU(s) become unusable until reboot.

Symptoms:
1. all currently running & newly started processes interacting with the GPU(s) break:
   - this affects both of the following APIs individually: CUDA, NVML
   - the processes become stuck at 100% (single-thread) system CPU load, i.e. they are stuck in an (interruptible) syscall - they can be stopped (via SIGINT/-TERM/-KILL)
   - some NVML executables show an erroneous total user+system time of millions of hours (far beyond the possible "uptime times CPU threads") - this may hint at bad memory accesses/writes within the kernel
2. once no processes use the GPU anymore (i.e. they were manually stopped), the kernel reports hung tasks in the `nvidia` and `nvidia_uvm` modules (see attachment)
3. the `nvidia_uvm` kernel module cannot be unloaded: `rmmod` becomes stuck until reboot

Expected behavior (established through the previous ~10 driver package upgrades):
1. all current processes can continue to use the GPU(s) without issue
2. once all processes have stopped using the GPU(s), i.e. none of the `/dev/nvidia*` devices is open, all the nvidia kernel modules can be unloaded (in the appropriate order according to their dependencies) via `modprobe -r` or `rmmod`
   - after this the new driver can be loaded, e.g. by (re)starting nvidia-persistenced

Partially retained expected behavior:
1. new processes report errors due to version incompatibilities between the installed libraries and the loaded kernel module
   - e.g. `nvidia-smi` reports "Driver/library version mismatch"
   - the following kernel message is shown (split across 4 lines): "NVRM: API mismatch", e.g.:
     NVRM: API mismatch: the client has the version 535.86.05, but
     NVRM: this kernel module has the version 535.54.03. Please
     NVRM: make sure that this kernel module and all NVIDIA driver
     NVRM: components have the same version.
   - this behavior is retained in the affected versions until the kernel hung-task messages appear

Affected versions:
- 535.86.05-0ubuntu0.20.04.2 (previous was 535.54.03-0ubuntu0.20.04.4)
- 535.54.03-0ubuntu0.20.04.3 (previous was 530.41.03-0ubuntu0.20.04.2)

Environment:
- Ubuntu 20.04.6 LTS (`lsb_release -d`)
- all affected upgrades were installed automatically via unattended-upgrades
- the issue occurred on 15 different nodes with 5 different hardware configurations (mainboard, CPU, RAM, GPU, etc.), so it is unlikely to be a hardware issue
- all nodes are operated headless (GPUs not used for graphics output, no X server or similar installed; access was via SSH)

Related:
The following bugs may be related, since I expect this issue to manifest with the same signature - the GPU is entirely unusable (thus a black screen) until reboot:
- https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-535/+bug/2025640
- https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-535/+bug/2027614

New value:
Identical to the old value, except that the kernel-message bullet under "Partially retained expected behavior" now reads simply "the following kernel message is shown:", followed directly by the same four NVRM lines.
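
As a quick way to observe the behavior described under "Symptoms" above, the sketch below lists the processes that currently hold a `/dev/nvidia*` device node open, together with their state, accumulated CPU time and the kernel symbol they are blocked in, and computes a rough upper bound on plausible per-process CPU time (uptime times CPU threads). This is illustrative only; it assumes the `fuser` (psmisc) and `ps` tools shipped with Ubuntu 20.04 and should be run as root to see processes of all users.

    # Processes holding the NVIDIA device nodes open: state, CPU time and the
    # kernel symbol they are waiting in (fuser prints PIDs on stdout and the
    # file names on stderr).
    for pid in $(fuser /dev/nvidia* 2>/dev/null); do
        ps -o pid,stat,time,wchan:30,comm -p "$pid"
    done

    # Rough sanity bound: total per-process CPU time cannot exceed
    # uptime x number of CPU threads; the values reported above (millions of
    # hours) are far beyond this bound.
    awk -v t="$(nproc)" '{printf "max plausible CPU time: %.1f hours\n", $1 / 3600 * t}' /proc/uptime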
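
For symptoms 2 and 3 (hung tasks and the stuck `rmmod`), the kernel log and the module reference counts can be inspected as sketched below; a non-zero use count on `nvidia_uvm` while no `/dev/nvidia*` device is open matches the stuck state described above. Reading the kernel log typically requires root.

    # Hung-task reports involving the nvidia modules, as in the attached
    # nvidia_hung_kern.log
    dmesg --ctime | grep -iE 'hung task|nvidia'

    # Loaded nvidia modules and their reference counts ("Used by" column)
    lsmod | grep '^nvidia'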
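
The upgrade flow described under "Expected behavior" can be sketched roughly as below. Only `nvidia` and `nvidia_uvm` are named in the report; the module names nvidia_drm and nvidia_modeset and the nvidia-persistenced systemd unit are assumptions based on the usual layout of the Ubuntu 535 packages, and on headless nodes the DRM/modeset modules may not be loaded at all.

    # Sketch: unload the old driver modules once nothing uses the GPUs any more,
    # then load the newly installed driver by restarting the persistence daemon.
    set -e

    # Abort if any process still holds a /dev/nvidia* device node open.
    if fuser -s /dev/nvidia* 2>/dev/null; then
        echo "GPU device nodes still in use, not unloading" >&2
        exit 1
    fi

    # Unload in reverse dependency order; modules that are not loaded are skipped.
    # (nvidia_drm and nvidia_modeset are assumed here and may be absent on
    # headless nodes.)
    for m in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do
        lsmod | grep -q "^$m " && modprobe -r "$m"
    done

    # Restarting nvidia-persistenced loads the freshly installed kernel module.
    systemctl restart nvidia-persistenced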
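
The "Driver/library version mismatch" situation described under "Partially retained expected behavior" can be confirmed by comparing the loaded kernel module against the on-disk module and the installed packages, e.g. as below. This is a sketch: `/proc/driver/nvidia/version` only exists while the module is loaded, and the dpkg filter simply pattern-matches the Ubuntu nvidia package names.

    # Version of the kernel module that is currently loaded
    cat /proc/driver/nvidia/version

    # Version of the module installed on disk (what would be loaded next)
    modinfo nvidia | grep -i '^version:'

    # Versions of the installed user-space driver packages
    dpkg -l | awk '/^ii .*nvidia/ {print $2, $3}'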