After upgrading the containerd charm from 1.22 to 1.24, our GPU devices stopped working. The k8s-device-plugin pods were reporting:
2022/07/18 18:59:17 Loading NVML
2022/07/18 18:59:17 Failed to initialize NVML: could not load NVML library.
2022/07/18 18:59:17 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2022/07/18 18:59:17 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2022/07/18 18:59:17 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2022/07/18 18:59:17 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
2022/07/18 18:59:17 Error: failed to initialize NVML: could not load NVML library
The charm upgrade switched config_version from v1 to v2.
Manually editing config.toml so that it contains the following seems to make things work:
[plugins."io.containerd.grpc.v1.cri".containerd]
no_pivot = false
default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName="/usr/bin/nvidia-container-runtime"
https://github.com/charmed-kubernetes/charm-containerd/pull/67/files
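To check whether a node's config actually carries the workaround, something like the following rough sketch may help. It is not part of the charm: the helper name default_runtime and the SAMPLE text are made up for illustration, and it uses a simple line scan rather than a full TOML parser, so it only understands the flat layout shown above.

```python
# Hypothetical helper (not part of the charm or containerd): report which
# runtime the CRI plugin would use by default, given the text of config.toml.
CRI_TABLE = '[plugins."io.containerd.grpc.v1.cri".containerd]'


def default_runtime(config_text):
    """Return default_runtime_name from the CRI containerd table, or None."""
    in_table = False
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Any table header starts a new section; only track the CRI one.
            in_table = (line == CRI_TABLE)
            continue
        if in_table and "=" in line:
            key, _, value = line.partition("=")
            if key.strip() == "default_runtime_name":
                return value.strip().strip('"')
    return None


# Sample text copied from the workaround above (shortened).
SAMPLE = '''
[plugins."io.containerd.grpc.v1.cri".containerd]
no_pivot = false
default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName="/usr/bin/nvidia-container-runtime"
'''

if __name__ == "__main__":
    print(default_runtime(SAMPLE))
```

On a real node the text would come from the generated config file instead of SAMPLE (I am assuming the usual /etc/containerd/config.toml location here; it may differ on your deployment). If the function returns anything other than "nvidia", the device plugin will most likely keep failing to load NVML.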