Charm hangs in stuck state "Waiting to retry deploying policy controller"

Bug #1859848 reported by Michael Skalka
Affects: Calico Charm
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned

Bug Description

When deploying CDK using K8s 1.17, one of the three Calico charm units hangs with the error "Waiting to retry deploying policy controller" for an indefinite amount of time.

The juju unit log shows the charm repeatedly attempting to apply the policy controller configuration:

calico_1:/var/log/juju/unit-calico-1.log:

...
2020-01-15 01:33:37 INFO juju-log etcd:14: Command '['kubectl', '--kubeconfig=/root/.kube/config', 'apply', '-f', '/tmp/policy-controller.yaml']' returned non-zero exit status 1.
...
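
To see the actual error behind that non-zero exit status, the same command from the log can be re-run by hand on the unit (assuming /tmp/policy-controller.yaml is still present there; it is a temp file and may already have been cleaned up):

juju ssh calico/1
sudo kubectl --kubeconfig=/root/.kube/config apply -f /tmp/policy-controller.yaml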

Syslog shows the calico-node service attempting to start multiple times:

calico_1:/var/log/syslog:
Jan 15 01:33:35 ip-172-31-24-240 systemd[1]: Starting calico node...
Jan 15 01:33:36 ip-172-31-24-240 charm-env[24525]: ctr: container "calico-node" in namespace "default": not found
Jan 15 01:33:36 ip-172-31-24-240 charm-env[24525]: time="2020-01-15T01:33:36Z" level=error msg="failed to delete container "calico-node"" error="container "calico-node" in namespace "default": not found"
Jan 15 01:33:36 ip-172-31-24-240 charm-env[24525]: ctr: container "calico-node" in namespace "default": not found
Jan 15 01:33:36 ip-172-31-24-240 systemd[1]: Started calico node.
Jan 15 01:33:36 ip-172-31-24-240 systemd[1]: Reloading.
Jan 15 01:33:37 ip-172-31-24-240 containerd[16218]: time="2020-01-15T01:33:37.480441366Z" level=info msg="shim containerd-shim started" address="/containerd-shim/default/calico-node/shim.sock" debug=false pid=24809
...
A bunch of containerd and charm-env output
...
Jan 15 01:33:37 ip-172-31-24-240 charm-env[24596]: Calico node started successfully

At some point containerd restarts and the calico-node process is left behind:

Jan 15 01:35:37 ip-172-31-24-240 systemd[1]: containerd.service: Found left-over process 24966 (calico-node) in control group while starting unit. Ignoring.

Then the service fails when it is restarted later:

Jan 15 01:35:47 ip-172-31-24-240 systemd[1]: Starting calico node...
Jan 15 01:35:48 ip-172-31-24-240 charm-env[5198]: time="2020-01-15T01:35:48Z" level=error msg="failed to delete container "calico-node"" error="cannot delete a non stopped container: {running 0 0001-01-01 00:00:00 +0000 UTC}"
Jan 15 01:35:48 ip-172-31-24-240 charm-env[5198]: ctr: cannot delete a non stopped container: {running 0 0001-01-01 00:00:00 +0000 UTC}
Jan 15 01:35:48 ip-172-31-24-240 systemd[1]: Started calico node.
Jan 15 01:35:49 ip-172-31-24-240 charm-env[5356]: ctr: snapshot "calico-node": already exists
Jan 15 01:35:49 ip-172-31-24-240 systemd[1]: calico-node.service: Main process exited, code=exited, status=1/FAILURE
Jan 15 01:35:49 ip-172-31-24-240 systemd[1]: calico-node.service: Failed with result 'exit-code'.
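
If the stale container is what keeps the restart failing, a manual cleanup along the following lines may unwedge the unit. This is only a sketch: the "default" namespace and the "calico-node" container/snapshot names come from the log above, and the exact state on the unit may differ.

# stop the leftover task, then remove the container and its snapshot
ctr --namespace default task kill calico-node
ctr --namespace default task delete calico-node
ctr --namespace default container delete calico-node
ctr --namespace default snapshot remove calico-node
systemctl restart calico-node.service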

I /think/ this is the source of the hanging juju agent, but I don't know enough about Calico or Kubernetes to be sure. Attaching the bundle and crashdump.

Revision history for this message
Michael Skalka (mskalka) wrote :

Bundle file

Revision history for this message
George Kraft (cynerva) wrote :

What's preventing calico/1 from deploying the policy controller is that the kubernetes API never came up. The kubernetes-master units are stuck with a "Waiting for master components to start" status. The snap.kube-apiserver.daemon service is repeatedly failing to start with:

Error: open /root/cdk/server.crt: no such file or directory

It looks like the kubernetes-master unit was never sent certificates by the CA. Indeed, vault/0 is stuck in a Blocked state with the "Vault needs to be initialized" status message.
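
For anyone retracing this from a live deployment rather than the crashdump, the chain above can be confirmed with something like the following (unit names match this deployment and may differ in yours):

juju status vault kubernetes-master
juju ssh kubernetes-master/0 'sudo journalctl -u snap.kube-apiserver.daemon | tail'
juju ssh kubernetes-master/0 'sudo ls -l /root/cdk/server.crt'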

Have you initialized and unsealed vault as documented here? https://ubuntu.com/kubernetes/docs/using-vault
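
For reference, the documented flow is roughly the following sketch; the vault address, key counts, and tokens are placeholders, and the linked page is authoritative:

export VAULT_ADDR="http://<vault-unit-ip>:8200"
vault operator init -key-shares=5 -key-threshold=3
vault operator unseal          # run once per key, three times in total
export VAULT_TOKEN=<root-token-from-init>
vault token create -ttl=10m
juju run-action --wait vault/0 authorize-charm token=<token-from-create>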

Changed in charm-calico:
status: New → Incomplete