Containerd Subordinate Charm

Upgrade charm from 1.22 to 1.24 causes GPUs to stop working

Bug #1982034 reported by Chris Johnston on 2022-07-18

This bug affects 1 person

Affects                       Status        Importance  Assigned to  Milestone
Containerd Subordinate Charm  Fix Released  High        Adam Dyess   Containerd Subordinate Charm 1.24+ck1

Bug Description

After upgrading the containerd charm from 1.22 to 1.24, our GPU devices stopped working. The k8s-device-plugin pods reported:

2022/07/18 18:59:17 Loading NVML
2022/07/18 18:59:17 Failed to initialize NVML: could not load NVML library.
2022/07/18 18:59:17 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2022/07/18 18:59:17 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2022/07/18 18:59:17 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2022/07/18 18:59:17 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
2022/07/18 18:59:17 Error: failed to initialize NVML: could not load NVML library

The upgrade of the charm switched config_version from v1 to v2.

Manually editing config.toml as follows seems to restore GPU support:

    [plugins."io.containerd.grpc.v1.cri".containerd]
      no_pivot = false
      default_runtime_name = "nvidia"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName="/usr/bin/nvidia-container-runtime"

Chris Johnston (cjohnston) wrote on 2022-07-18:

https://github.com/charmed-kubernetes/charm-containerd/pull/67/files

Chris Johnston (cjohnston) on 2022-07-19

Changed in charm-containerd:
status:	New → In Progress

Chris Johnston (cjohnston) wrote on 2022-07-20:

Changing config_version back to v1 does work.

Chris Johnston (cjohnston) wrote on 2022-07-20:

subscribed ~field-high

George Kraft (cynerva) on 2022-07-20

Changed in charm-containerd:
importance:	Undecided → High
milestone:	none → 1.24+ck1

Kevin W Monroe (kwmonroe) on 2022-07-25

Changed in charm-containerd:
status:	In Progress → Fix Committed

Adam Dyess (addyess) on 2022-07-26

Changed in charm-containerd:
assignee:	nobody → Chris Johnston (cjohnston)

Chris Johnston (cjohnston) on 2022-07-26

Changed in charm-containerd:
assignee:	Chris Johnston (cjohnston) → nobody

Adam Dyess (addyess) on 2022-07-26

Changed in charm-containerd:
assignee:	nobody → Adam Dyess (addyess)

Adam Dyess (addyess) on 2022-07-27

tags:	added: backport-needed

Adam Dyess (addyess) on 2022-08-01

tags:	removed: backport-needed

Adam Dyess (addyess) on 2022-08-04

Changed in charm-containerd:
status:	Fix Committed → Fix Released
