Can't bring up GPU worker, docker daemon fails

Bug #1812894 reported by yen
Affects: Kubernetes Worker Charm
Status: Fix Released
Importance: Undecided
Assigned to: Joseph Borg
Milestone: (none)

Bug Description

Hi,

First, thank you so much for creating this awesome charm! I successfully deployed it on-prem on my Dell R720s around 6 months ago.

However, when trying to rebuild it once again, this time with cs:bundle/kubernetes-core-503 (which has charm: cs:~containers/kubernetes-worker-398), I am running into the problem below. It only occurs on workers with a GPU; all the CPU-only workers deploy just fine.

From "juju debug-log --replay --include kubernetes-worker/3"
------
unit-kubernetes-worker-3: 11:55:40 INFO unit.kubernetes-worker/3.juju-log cni:9: Invoking reactive handler: reactive/docker.py:442:signal_workloads_start
unit-kubernetes-worker-3: 11:55:40 DEBUG unit.kubernetes-worker/3.cni-relation-changed Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
...
...
unit-kubernetes-worker-3: 11:55:47 INFO unit.kubernetes-worker/3.juju-log cni:9: Executing ['kubectl', '--kubeconfig=/root/.kube/config', 'apply', '-f', '/root/cdk/addons/default-http-backend.yaml']
unit-kubernetes-worker-3: 11:55:48 DEBUG worker.uniter.jujuc running hook tool "open-port"
unit-kubernetes-worker-3: 11:55:48 DEBUG worker.uniter.jujuc running hook tool "open-port"
unit-kubernetes-worker-3: 11:55:48 DEBUG worker.uniter.jujuc running hook tool "juju-log"
unit-kubernetes-worker-3: 11:55:48 INFO unit.kubernetes-worker/3.juju-log cni:9: Invoking reactive handler: reactive/kubernetes_worker.py:552:apply_node_labels
unit-kubernetes-worker-3: 11:55:48 DEBUG worker.uniter.jujuc running hook tool "juju-log"
unit-kubernetes-worker-3: 11:55:48 INFO unit.kubernetes-worker/3.juju-log cni:9: Skipping malformed option: .
unit-kubernetes-worker-3: 11:55:48 DEBUG unit.kubernetes-worker/3.cni-relation-changed Error from server (NotFound): nodes "bubnicki" not found
unit-kubernetes-worker-3: 11:55:48 DEBUG worker.uniter.jujuc running hook tool "juju-log"
unit-kubernetes-worker-3: 11:55:48 INFO unit.kubernetes-worker/3.juju-log cni:9: Failed to apply label juju-application=kubernetes-worker. Will retry.

------

From "root@Bubnicki:~# journalctl -xe -u docker"
------
Jan 22 17:34:30 Bubnicki dockerd[40216]: unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives are specified both as a flag and in the configuration file: default-runtime: (from flag: nvidia, from file: nvidia)
Jan 22 17:34:30 Bubnicki systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
------

It looks like a conflict between the systemd docker startup configuration and the daemon.json file, but I can't tell which one is supplying the duplicate default-runtime setting.

Below are the daemon.json and the systemd unit file, followed by the check I have in mind. Thank you so much for your help!
------
root@Bubnicki:~# cat /etc/docker/daemon.json
{"runtimes": {"nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}}, "default-runtime": "nvidia"}
------

------
root@Bubnicki:~# cat /lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network.target docker.socket
Requires=docker.socket

[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
EnvironmentFile=-/etc/default/docker
ExecStart=/usr/bin/dockerd -H fd:// $DOCKER_OPTS
ExecReload=/bin/kill -s HUP $MAINPID
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process

[Install]
WantedBy=multi-user.target
------
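
For what it's worth, dockerd refuses to start whenever default-runtime is set both on the dockerd command line and in daemon.json, even when the two values agree (they are both "nvidia" here). Since the unit file sources DOCKER_OPTS from /etc/default/docker, my guess is the duplicate flag comes from there or from a systemd drop-in. This is the kind of check I have in mind (just a sketch; I haven't confirmed yet what /etc/default/docker contains on this node):

------
root@Bubnicki:~# systemctl cat docker
root@Bubnicki:~# grep -rn "default-runtime" /etc/default/docker /etc/docker/daemon.json /etc/systemd/system/docker.service.d/ 2>/dev/null
------

If --default-runtime=nvidia turns up in DOCKER_OPTS or a drop-in as well as in daemon.json, removing it from one of the two places and running "systemctl daemon-reload && systemctl restart docker" should let the daemon come up.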

Best,
Yen

yen (antigenius0910) wrote:

Fixed the problem with the following steps.

On the worker node:
#apt-get remove nvidia-docker2
#apt-get remove nvidia-container-runtime
#apt-get remove docker-ce
#apt-get install docker-ce=18.06.0~ce~3-0~ubuntu
#apt-get install nvidia-container-runtime=2.0.0+docker18.06.0-1
#apt-get install nvidia-docker2=2.0.3+docker18.06.0-1
#pkill -SIGHUP dockerd
#docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

On MAAS:
#juju resolved kubernetes-worker/3
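
Holding the packages on the worker afterwards should keep apt from pulling a mismatched docker-ce / nvidia-docker2 pair back in on the next upgrade, and docker info shows whether the nvidia runtime is registered and set as the default (just a sketch; I haven't re-tested an upgrade with the hold in place):

#apt-mark hold docker-ce nvidia-container-runtime nvidia-docker2
#docker info | grep -i runtime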

Joseph Borg (joeborg) wrote:

We've pinned the versions of the packages you mentioned, so this should be fixed moving forward. Feel free to comment or open a new ticket if you see this issue again.

Thanks!
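
For reference, the manual equivalent of that pin on a worker would be an apt preferences file along these lines, dropped into /etc/apt/preferences.d/ under any name (a sketch using the versions from the comment above; the exact versions the charm pins may differ):

------
Package: docker-ce
Pin: version 18.06.0~ce~3-0~ubuntu
Pin-Priority: 1001

Package: nvidia-container-runtime
Pin: version 2.0.0+docker18.06.0-1
Pin-Priority: 1001

Package: nvidia-docker2
Pin: version 2.0.3+docker18.06.0-1
Pin-Priority: 1001
------

A priority above 1000 keeps apt on those exact versions even if newer ones appear in the archive.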

Changed in charm-kubernetes-worker:
assignee: nobody → Joseph Borg (joeborg)

Joseph Borg (joeborg) wrote:

https://github.com/charmed-kubernetes/bundle/issues/720

Introduced in kubernetes worker 436.

Changed in charm-kubernetes-worker:
status: New → Fix Released