Charm hangs in stuck state "Waiting to retry deploying policy controller"
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Calico Charm | Incomplete | Undecided | Unassigned |
Bug Description
When deploying CDK with Kubernetes 1.17, one of three Calico charm units hangs indefinitely with the error "Waiting to retry deploying policy controller".
The juju unit logs show the service attempting to apply the policy configuration repeatedly:
...
2020-01-15 01:33:37 INFO juju-log etcd:14: Command '['kubectl', '--kubeconfig=
...
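For reference, this is roughly how the logs above can be pulled (a sketch; it assumes the application is named calico and the stuck unit is calico/1, so substitute whatever juju status actually reports):

    # Find which unit is stuck
    juju status calico

    # Replay the unit log for the stuck unit (calico/1 is an assumption;
    # use the unit shown as waiting in juju status)
    juju debug-log --replay --include calico/1

    # Read the machine's syslog for the calico-node service directly
    # (service name taken from the syslog excerpt below)
    juju ssh calico/1 sudo journalctl -u calico-node --no-pager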
Syslog shows the calico-node service attempting to start multiple times:
Jan 15 01:33:35 ip-172-31-24-240 systemd[1]: Starting calico node...
Jan 15 01:33:36 ip-172-31-24-240 charm-env[24525]: ctr: container "calico-node" in namespace "default": not found
Jan 15 01:33:36 ip-172-31-24-240 charm-env[24525]: time="2020-
Jan 15 01:33:36 ip-172-31-24-240 charm-env[24525]: ctr: container "calico-node" in namespace "default": not found
Jan 15 01:33:36 ip-172-31-24-240 systemd[1]: Started calico node.
Jan 15 01:33:36 ip-172-31-24-240 systemd[1]: Reloading.
Jan 15 01:33:37 ip-172-31-24-240 containerd[16218]: time="2020- ... pid=24809
...
A bunch of containerd and charm-env output
...
Jan 15 01:33:37 ip-172-31-24-240 charm-env[24596]: Calico node started successfully
At some point containerd restarts and the calico-node process is left hanging:
Jan 15 01:35:37 ip-172-31-24-240 systemd[1]: containerd.service: Found left-over process 24966 (calico-node) in control group while starting unit. Ignoring.
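To confirm the left-over process on the affected machine, something like this should do (24966 is the pid from the systemd message above and will differ per run):

    # Is the orphaned calico-node process still alive, and who owns it?
    ps -fp 24966

    # Which control group is it in? Still sitting under containerd.service
    # would match the systemd warning above.
    cat /proc/24966/cgroup

    # What does containerd itself think is running in the default namespace?
    sudo ctr --namespace default tasks ls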
Then the service fails when it is restarted later:
Jan 15 01:35:47 ip-172-31-24-240 systemd[1]: Starting calico node...
Jan 15 01:35:48 ip-172-31-24-240 charm-env[5198]: time="2020-
Jan 15 01:35:48 ip-172-31-24-240 charm-env[5198]: ctr: cannot delete a non stopped container: {running 0 0001-01-01 00:00:00 +0000 UTC}
Jan 15 01:35:48 ip-172-31-24-240 systemd[1]: Started calico node.
Jan 15 01:35:49 ip-172-31-24-240 charm-env[5356]: ctr: snapshot "calico-node": already exists
Jan 15 01:35:49 ip-172-31-24-240 systemd[1]: calico-
Jan 15 01:35:49 ip-172-31-24-240 systemd[1]: calico-
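A manual cleanup along these lines might get the service going again (an untested sketch, not a proper fix; the container/snapshot name "calico-node", the "default" namespace, and the calico-node service name are all taken from the errors above):

    # Kill and remove the leftover task, if containerd still tracks one
    sudo ctr --namespace default tasks kill --signal SIGKILL calico-node || true
    sudo ctr --namespace default tasks delete calico-node || true

    # Remove the stale container and the "already exists" snapshot
    sudo ctr --namespace default containers delete calico-node || true
    sudo ctr --namespace default snapshots remove calico-node || true

    # Let systemd start the service from a clean slate
    sudo systemctl restart calico-node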
I /think/ this is the source of the hanging juju agent, but I don't know enough about Calico or Kubernetes to be sure. Attaching the bundle and crashdump.
Bundle file