Can't bring up GPU worker, docker daemon fails

Bug #1812894 reported by yen
Affects: Kubernetes Worker Charm
Status: Fix Released
Importance: Undecided
Assigned to: Joseph Borg
Milestone: (none)

Bug Description

Hi,

First, thank you so much for creating this awesome charm! I successfully deployed it on-prem on my Dell R720s around 6 months ago.

However, when trying to rebuild it once again, this time with cs:bundle/kubernetes-core-503 (which has charm: cs:~containers/kubernetes-worker-398), I am running into the problem below. It only occurs on workers with a GPU; all the CPU-only workers deploy just fine.

From "juju debug-log --replay --include kubernetes-worker/3"
------
unit-kubernetes-worker-3: 11:55:40 INFO unit.kubernetes-worker/3.juju-log cni:9: Invoking reactive handler: reactive/docker.py:442:signal_workloads_start
unit-kubernetes-worker-3: 11:55:40 DEBUG unit.kubernetes-worker/3.cni-relation-changed Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
...
...
unit-kubernetes-worker-3: 11:55:47 INFO unit.kubernetes-worker/3.juju-log cni:9: Executing ['kubectl', '--kubeconfig=/root/.kube/config', 'apply', '-f', '/root/cdk/addons/default-http-backend.yaml']
unit-kubernetes-worker-3: 11:55:48 DEBUG worker.uniter.jujuc running hook tool "open-port"
unit-kubernetes-worker-3: 11:55:48 DEBUG worker.uniter.jujuc running hook tool "open-port"
unit-kubernetes-worker-3: 11:55:48 DEBUG worker.uniter.jujuc running hook tool "juju-log"
unit-kubernetes-worker-3: 11:55:48 INFO unit.kubernetes-worker/3.juju-log cni:9: Invoking reactive handler: reactive/kubernetes_worker.py:552:apply_node_labels
unit-kubernetes-worker-3: 11:55:48 DEBUG worker.uniter.jujuc running hook tool "juju-log"
unit-kubernetes-worker-3: 11:55:48 INFO unit.kubernetes-worker/3.juju-log cni:9: Skipping malformed option: .
unit-kubernetes-worker-3: 11:55:48 DEBUG unit.kubernetes-worker/3.cni-relation-changed Error from server (NotFound): nodes "bubnicki" not found
unit-kubernetes-worker-3: 11:55:48 DEBUG worker.uniter.jujuc running hook tool "juju-log"
unit-kubernetes-worker-3: 11:55:48 INFO unit.kubernetes-worker/3.juju-log cni:9: Failed to apply label juju-application=kubernetes-worker. Will retry.

------

From "root@Bubnicki:~# journalctl -xe -u docker"
------
Jan 22 17:34:30 Bubnicki dockerd[40216]: unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives are specified both as a flag and in the configuration file: default-runtime: (from flag: nvidia, from file: nvidia)
Jan 22 17:34:30 Bubnicki systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
------

It looks like a conflict between the systemd docker startup configuration and the daemon.json file, but I can't tell which one is supplying the duplicate default-runtime setting.

Below are the daemon.json and the systemd unit file, followed by the check I have in mind. Thank you so much for your help!
------
root@Bubnicki:~# cat /etc/docker/daemon.json
{"runtimes": {"nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}}, "default-runtime": "nvidia"}
------

------
root@Bubnicki:~# cat /lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network.target docker.socket
Requires=docker.socket

[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
EnvironmentFile=-/etc/default/docker
ExecStart=/usr/bin/dockerd -H fd:// $DOCKER_OPTS
ExecReload=/bin/kill -s HUP $MAINPID
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process

[Install]
WantedBy=multi-user.target
------
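
For what it's worth, dockerd refuses to start whenever default-runtime is set both on the dockerd command line and in daemon.json, even when the two values agree (they are both "nvidia" here). Since the unit file sources DOCKER_OPTS from /etc/default/docker, my guess is the duplicate flag comes from there or from a systemd drop-in. This is the kind of check I have in mind (just a sketch; I haven't confirmed yet what /etc/default/docker contains on this node):

------
root@Bubnicki:~# systemctl cat docker
root@Bubnicki:~# grep -rn "default-runtime" /etc/default/docker /etc/docker/daemon.json /etc/systemd/system/docker.service.d/ 2>/dev/null
------

If --default-runtime=nvidia turns up in DOCKER_OPTS or a drop-in as well as in daemon.json, removing it from one of the two places and running "systemctl daemon-reload && systemctl restart docker" should let the daemon come up.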

Best,
Yen

yen (antigenius0910) wrote:

Fixed the problem with the following steps.

On the worker node:
#apt-get remove nvidia-docker2
#apt-get remove nvidia-container-runtime
#apt-get remove docker-ce
#apt-get install docker-ce=18.06.0~ce~3-0~ubuntu
#apt-get install nvidia-container-runtime=2.0.0+docker18.06.0-1
#apt-get install nvidia-docker2=2.0.3+docker18.06.0-1
#pkill -SIGHUP dockerd
#docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

On MAAS:
#juju resolved kubernetes-worker/3
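
Holding the packages on the worker afterwards should keep apt from pulling a mismatched docker-ce / nvidia-docker2 pair back in on the next upgrade, and docker info shows whether the nvidia runtime is registered and set as the default (just a sketch; I haven't re-tested an upgrade with the hold in place):

#apt-mark hold docker-ce nvidia-container-runtime nvidia-docker2
#docker info | grep -i runtime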

Joseph Borg (joeborg) wrote:

We've pinned the versions of the packages you mentioned, so this should be fixed moving forward. Feel free to comment or open a new ticket if you see this issue again.

Thanks!
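
For reference, the manual equivalent of that pin on a worker would be an apt preferences file along these lines, dropped into /etc/apt/preferences.d/ under any name (a sketch using the versions from the comment above; the exact versions the charm pins may differ):

------
Package: docker-ce
Pin: version 18.06.0~ce~3-0~ubuntu
Pin-Priority: 1001

Package: nvidia-container-runtime
Pin: version 2.0.0+docker18.06.0-1
Pin-Priority: 1001

Package: nvidia-docker2
Pin: version 2.0.3+docker18.06.0-1
Pin-Priority: 1001
------

A priority above 1000 keeps apt on those exact versions even if newer ones appear in the archive.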

Changed in charm-kubernetes-worker:
assignee: nobody → Joseph Borg (joeborg)

Joseph Borg (joeborg) wrote:

https://github.com/charmed-kubernetes/bundle/issues/720

Introduced in kubernetes worker 436.

Changed in charm-kubernetes-worker:
status: New → Fix Released