[docker] Unable to run kubernetes-master with calico integration in a LXD container

Bug #1831249 reported by Dmitrii Shcherbakov
Affects: Calico Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

The Calico charm sets up a service called calico-node to run a docker container as follows:

systemctl cat calico-node.service

https://docs.projectcalico.org/v3.5/usage/configuration/as-service#systemd-service-example

ExecStart=/usr/bin/docker run --net=host --privileged --name=calico-node \
  -e ETCD_ENDPOINTS=https://172.16.7.63:2379 \
  -e ETCD_CA_CERT_FILE=/opt/calicoctl/etcd-ca \
  -e ETCD_CERT_FILE=/opt/calicoctl/etcd-cert \
  -e ETCD_KEY_FILE=/opt/calicoctl/etcd-key \
# ...

The problem is that if kubernetes-master is placed into an LXD container created by Juju, calico-node is unable to start.

Docker fails to launch the container because --privileged is used in the run command.

journalctl -u calico-node.service

May 31 07:39:55 juju-fa887c-11-lxd-2 systemd[1]: calico-node.service: Main process exited, code=exited, status=126/n/a
May 31 07:39:55 juju-fa887c-11-lxd-2 systemd[1]: calico-node.service: Failed with result 'exit-code'.
May 31 07:40:05 juju-fa887c-11-lxd-2 systemd[1]: calico-node.service: Service hold-off time over, scheduling restart.
May 31 07:40:05 juju-fa887c-11-lxd-2 systemd[1]: calico-node.service: Scheduled restart job, restart counter is at 1037.
May 31 07:40:05 juju-fa887c-11-lxd-2 systemd[1]: Stopped calico node.
May 31 07:40:05 juju-fa887c-11-lxd-2 systemd[1]: calico-node.service: Failed to reset devices.list: Operation not permitted
May 31 07:40:05 juju-fa887c-11-lxd-2 systemd[1]: Starting calico node...
May 31 07:40:05 juju-fa887c-11-lxd-2 docker[12173]: calico-node
May 31 07:40:05 juju-fa887c-11-lxd-2 systemd[1]: Started calico node.
May 31 07:40:06 juju-fa887c-11-lxd-2 docker[12189]: /usr/bin/docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"rootfs_linux.go:58: mounting \\\"proc\\\" to rootfs \\\"/var/lib/docker/vfs/dir/a14c912f3f8def9499aa0010ef2e6d11581e68b60562471328eaeec3db5bc1a6\\\" at \\\"/proc\\\" caused \\\"permission denied\\\"\"": unknown.

I enabled nesting on the LXD container by hand (this could also be done in a charm LXD profile, which Juju now supports):

sudo lxc config set juju-fa887c-11-lxd-2 security.nesting true
sudo lxc restart juju-fa887c-11-lxd-2
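The same nesting setting could ship with the charm itself instead of being applied by hand; a minimal sketch of a `lxd-profile.yaml` placed at the charm root (assuming Juju's charm LXD profile support, available since Juju 2.5):

```yaml
# lxd-profile.yaml: Juju applies this profile to the LXD
# containers that host this application's units
config:
  security.nesting: "true"
```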

I then changed ExecStart to run without --privileged, which resulted in a successful startup (even without additional capabilities added to the container):

ExecStart=/usr/bin/docker run --net=host --name=calico-node \

docker logs calico-node
2019-05-31 08:02:10.258 [INFO][9] startup.go 173: Early log level set to info
2019-05-31 08:02:10.258 [INFO][9] client.go 202: Loading config from environment
2019-05-31 08:02:10.258 [INFO][9] startup.go 83: Skipping datastore connection test
2019-05-31 08:02:10.271 [INFO][9] startup.go 259: Building new node resource Name="juju-fa887c-11-lxd-2"
2019-05-31 08:02:10.271 [INFO][9] startup.go 273: Initialise BGP data
2019-05-31 08:02:10.271 [INFO][9] startup.go 362: Using IPv4 address from environment: IP=172.16.7.64
2019-05-31 08:02:10.271 [INFO][9] startup.go 392: IPv4 address 172.16.7.64 discovered on interface eth0
2019-05-31 08:02:10.271 [INFO][9] startup.go 338: Node IPv4 changed, will check for conflicts
2019-05-31 08:02:10.283 [INFO][9] startup.go 530: No AS number configured on node resource, using global value
2019-05-31 08:02:10.283 [INFO][9] etcd.go 111: Ready flag is already set
2019-05-31 08:02:10.284 [INFO][9] client.go 139: Using previously configured cluster GUID
2019-05-31 08:02:10.291 [INFO][9] compat.go 796: Returning configured node to node mesh
2019-05-31 08:02:10.303 [INFO][9] startup.go 131: Using node name: juju-fa887c-11-lxd-2
2019-05-31 08:02:10.412 [INFO][30] client.go 202: Loading config from environment
2019-05-31 08:02:10.429 [INFO][30] ipam.go 120: Auto-assign 1 ipv4, 0 ipv6 addrs for host 'juju-fa887c-11-lxd-2'
2019-05-31 08:02:10.430 [INFO][30] ipam.go 172: Ran out of existing affine blocks for host 'juju-fa887c-11-lxd-2'
2019-05-31 08:02:10.431 [INFO][30] ipam.go 195: Need to allocate 1 more addresses - allocate another block
2019-05-31 08:02:10.431 [INFO][30] ipam_block_reader_writer.go 116: Claiming a new affine block for host 'juju-fa887c-11-lxd-2'
2019-05-31 08:02:10.432 [INFO][30] ipam_block_reader_writer.go 159: Host juju-fa887c-11-lxd-2 claiming block affinity for 192.168.154.192/26
2019-05-31 08:02:10.433 [INFO][30] ipam.go 208: Claimed new block 192.168.154.192/26 - assigning 1 addresses
2019-05-31 08:02:10.434 [INFO][30] ipam_block.go 343: New allocation attribute: {AttrPrimary:<nil> AttrSecondary:map[]}
2019-05-31 08:02:10.435 [INFO][30] ipam.go 285: Auto-assigned 1 out of 1 IPv4s: [192.168.154.192]
2019-05-31 08:02:10.443 [INFO][30] allocate_ipip_addr.go 145: Set IPIP tunnel address IP="192.168.154.192"
Starting libnetwork service
Calico node started successfully

I can see that other projects running Calico add CAP_NET_ADMIN and CAP_SYS_ADMIN to the docker container and do away with --privileged entirely:

https://opendev.org/openstack/openstack-helm-infra/commit/200b5e902b3a176fbfbe669b6a10a254c9b50f5d
https://opendev.org/openstack/openstack-helm-infra/src/commit/c34dbeeec81ca0ab12370f108a7645ef2bca9386/calico/values.yaml#L59-L60

So we could modify the per-application LXD profile for kubernetes-master to include nesting (without setting "privileged") and also pass namespaced capabilities to the docker run command in the Calico charm:

ExecStart=/usr/bin/docker run --cap-add=NET_ADMIN --cap-add=SYS_ADMIN --net=host --name=calico-node \
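This ExecStart change could also be delivered as a systemd drop-in rather than by editing the unit in place; a minimal sketch (e.g. created with `systemctl edit calico-node`, reusing the environment settings shown in the unit above):

```ini
# /etc/systemd/system/calico-node.service.d/override.conf
[Service]
# An empty ExecStart= clears the inherited value before replacing it
ExecStart=
ExecStart=/usr/bin/docker run --cap-add=NET_ADMIN --cap-add=SYS_ADMIN --net=host --name=calico-node \
  -e ETCD_ENDPOINTS=https://172.16.7.63:2379 \
  -e ETCD_CA_CERT_FILE=/opt/calicoctl/etcd-ca \
  -e ETCD_CERT_FILE=/opt/calicoctl/etcd-cert \
  -e ETCD_KEY_FILE=/opt/calicoctl/etcd-key \
# ...
```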

Example (the resulting process, from ps output):

23079 /usr/bin/docker run --cap-add=NET_ADMIN --cap-add=SYS_ADMIN --net=host --name=calico-node -e ETCD_ENDPOINTS=https://172.16.7.63:2379 -e ETCD_CA_CERT_FILE=/opt/calicoctl/etcd-ca -e ETCD_CERT_FILE=/opt/calicoctl/etcd-cert -e ETCD_KEY_FILE=/opt/calicoctl/etcd-key -e NODENAME=juju-fa887c-11-lxd-2 -e IP=172.16.7.64 -e NO_DEFAULT_POOLS= -e AS= -e CALICO_LIBNETWORK_ENABLED=true -e IP6= -e CALICO_NETWORKING_BACKEND=bird -e FELIX_DEFAULTENDPOINTTOHOSTACTION=ACCEPT -v /var/run/calico:/var/run/calico -v /lib/modules:/lib/modules -v /run/docker/plugins:/run/docker/plugins -v /var/run/docker.sock:/var/run/docker.sock -v /var/log/calico:/var/log/calico -v /opt/calicoctl:/opt/calicoctl quay.io/calico/node:v2.6.12

Tags: cpe-onsite
Dmitrii Shcherbakov (dmitriis) wrote:

Looked at this https://github.com/projectcalico/calicoctl/issues/310

It appears that there are some code paths (probably only relevant for worker nodes) that change networking sysctls via /proc/sys/net/ipv4/conf/<interface>/<config_key>:

https://github.com/projectcalico/felix/blob/v3.6.0/dataplane/linux/endpoint_mgr.go#L862-L879

And Docker without --privileged mounts /proc/sys read-only, even with additional capabilities granted:

$ docker exec -it calico-node /bin/sh

/ # mount | grep /proc/sys
proc on /proc/sys type proc (ro,relatime)

/ # capsh --print | grep net_admin
Current: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_admin,cap_net_raw,cap_sys_chroot,cap_sys_admin,cap_mknod,cap_audit_write,cap_setfcap+eip
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_admin,cap_net_raw,cap_sys_chroot,cap_sys_admin,cap_mknod,cap_audit_write,cap_setfcap

/ # capsh --print | grep sys_admin
Current: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_admin,cap_net_raw,cap_sys_chroot,cap_sys_admin,cap_mknod,cap_audit_write,cap_setfcap+eip
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_admin,cap_net_raw,cap_sys_chroot,cap_sys_admin,cap_mknod,cap_audit_write,cap_setfcap

# For comparison, inside the LXD container itself:
ubuntu@juju-fa887c-11-lxd-2:~$ mount | grep /proc
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
lxcfs on /proc/cpuinfo type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
lxcfs on /proc/diskstats type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
lxcfs on /proc/meminfo type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
lxcfs on /proc/stat type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
lxcfs on /proc/swaps type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
lxcfs on /proc/uptime type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
proc on /dev/.lxc/proc type proc (rw,relatime)

There was some work to address this:

https://github.com/moby/moby/issues/21649
https://github.com/moby/moby/pull/21751
https://github.com/moby/moby/issues/36597

https://github.com/moby/moby/pull/36644

https://docs.docker.com/engine/release-notes/#18060-ce
"RawAccess allows a set of paths to be not set as masked or readonly. moby/moby#36644"

CLI integration:
https://github.com/docker/cli/pull/1347 (per-path config support)
https://github.com/docker/cli/pull/1808 (--security-opt systempaths=unconfined)
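With that CLI support available, the unit could keep /proc/sys writable without --privileged; a hedged sketch of what the ExecStart line might become (the flag spelling is taken from the docker/cli pull request above and has not been verified against this deployment):

```ini
ExecStart=/usr/bin/docker run --cap-add=NET_ADMIN --cap-add=SYS_ADMIN \
  --security-opt systempaths=unconfined \
  --net=host --name=calico-node \
# ...
```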

George Kraft (cynerva) wrote:

We have seen success running calico-node in lxd, with --privileged, by using the lxd profile from here: https://github.com/charmed-kubernetes/bundle/wiki/Deploying-on-LXD#the-profile

Make sure you include the bit from the "Privileged containers" section.
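For reference, a privileged-container LXD profile along those lines typically carries settings like these (an illustrative sketch using standard LXD config keys, not the wiki profile verbatim):

```yaml
config:
  security.nesting: "true"
  security.privileged: "true"
  # kernel modules the kubernetes workload expects loaded on the host
  linux.kernel_modules: ip_tables,ip6_tables,netlink_diag,nf_nat,overlay
```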

We have it on our roadmap for this cycle to add LXD profiles to the CDK charms. For now, you will need to apply them manually.

summary: - Unable to run kubernetes-master with calico integration in a LXD
- container
+ [docker] Unable to run kubernetes-master with calico integration in a
+ LXD container