Upgrade charm from 1.22 to 1.24 causes GPUs to stop working

Bug #1982034 reported by Chris Johnston
This bug affects 1 person
Affects: Containerd Subordinate Charm
Status: Fix Released
Importance: High
Assigned to: Adam Dyess

Bug Description

After upgrading the containerd charm from 1.22 to 1.24, our GPU devices stopped working. The k8s-device-plugin pods were reporting:

2022/07/18 18:59:17 Loading NVML
2022/07/18 18:59:17 Failed to initialize NVML: could not load NVML library.
2022/07/18 18:59:17 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2022/07/18 18:59:17 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2022/07/18 18:59:17 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2022/07/18 18:59:17 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
2022/07/18 18:59:17 Error: failed to initialize NVML: could not load NVML library

The upgrade of the charm switched config_version from v1 to v2.

Manually changing the config.toml seems to make things work:

    [plugins."io.containerd.grpc.v1.cri".containerd]
      no_pivot = false
      default_runtime_name = "nvidia"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName="/usr/bin/nvidia-container-runtime"

Chris Johnston (cjohnston) wrote :
Changed in charm-containerd:
status: New → In Progress
Chris Johnston (cjohnston) wrote :

Changing config_version back to v1 does work.

Chris Johnston (cjohnston) wrote :

subscribed ~field-high

George Kraft (cynerva)
Changed in charm-containerd:
importance: Undecided → High
milestone: none → 1.24+ck1
Changed in charm-containerd:
status: In Progress → Fix Committed
Adam Dyess (addyess)
Changed in charm-containerd:
assignee: nobody → Chris Johnston (cjohnston)
Changed in charm-containerd:
assignee: Chris Johnston (cjohnston) → nobody
Adam Dyess (addyess)
Changed in charm-containerd:
assignee: nobody → Adam Dyess (addyess)
Adam Dyess (addyess)
tags: added: backport-needed
Adam Dyess (addyess)
tags: removed: backport-needed
Adam Dyess (addyess)
Changed in charm-containerd:
status: Fix Committed → Fix Released