Calico fails to start, no such file or directory "/var/lib/calico/nodename"
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Calico Charm | Fix Released | High | Kevin W Monroe |
Canal Charm | Fix Released | High | Kevin W Monroe |
Tigera Secure EE Charm | Fix Released | High | Kevin W Monroe |
Bug Description
Using CDK 1.21, a calico unit is stuck waiting with the status "Waiting to retry disabling VXLAN TX checksumming". Looking into the logs on that unit, it is failing to run a calicoctl command:
var/log/
2021-06-14 20:24:05 DEBUG jujuc server.go:211 running hook tool "juju-log" for calico/
2021-06-14 20:24:05 INFO juju-log Traceback (most recent call last):
File "/var/lib/
node = calicoctl_
File "/var/lib/
output = calicoctl(*args)
File "/var/lib/
return check_output(cmd, env=env, stderr=STDOUT)
File "/usr/lib/
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/usr/lib/
raise CalledProcessEr
subprocess.
2021-06-14 20:24:05 DEBUG jujuc server.go:211 running hook tool "juju-log" for calico/
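The traceback above ends in a `CalledProcessError` raised from `check_output(cmd, env=env, stderr=STDOUT)`, so the command's own error text is only visible if the exception's output gets logged. A minimal sketch of capturing that output while debugging; the helper name `run_and_capture` is hypothetical and is not part of the charm:

```python
import subprocess
import sys
from subprocess import STDOUT, CalledProcessError


def run_and_capture(cmd):
    """Run cmd and return (returncode, combined stdout/stderr text).

    Hypothetical debugging helper, not charm code: it mirrors the
    check_output(..., stderr=STDOUT) call seen in the traceback but
    keeps the failing command's output instead of dropping it.
    """
    try:
        output = subprocess.check_output(cmd, stderr=STDOUT)
        return 0, output.decode(errors="replace")
    except CalledProcessError as exc:
        # exc.output holds whatever the command printed before failing.
        return exc.returncode, exc.output.decode(errors="replace")


if __name__ == "__main__":
    # Stand-in for a failing calicoctl call: a child Python process that
    # prints a message and exits non-zero.
    code, text = run_and_capture(
        [sys.executable, "-c", "import sys; print('boom'); sys.exit(3)"]
    )
    print(code, text.strip())
```

On the affected unit, the same pattern could wrap the real calicoctl invocation to show why it exits non-zero.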
Peeking at syslog, the calico service is trying to start repeatedly, but is failing on a missing file:
calico_
Jun 14 20:24:32 duision systemd[1]: Starting calico node...
Jun 14 20:24:32 duision charm-env[654215]: ctr: container "calico-node" in namespace "default": not found
Jun 14 20:24:32 duision charm-env[654233]: time="2021-
Jun 14 20:24:32 duision charm-env[654233]: ctr: container "calico-node" in namespace "default": not found
Jun 14 20:24:32 duision systemd[1]: Started calico node.
Jun 14 20:24:32 duision containerd[138798]: time="2021-
Jun 14 20:24:32 duision containerd[138798]: time="2021-
Jun 14 20:24:32 duision containerd[138798]: time="2021-
Jun 14 20:24:32 duision systemd[1]: run-containerd-
Jun 14 20:24:32 duision systemd[651936]: run-containerd-
Jun 14 20:24:32 duision charm-env[654348]: ctr: OCI runtime create failed: container with id exists: calico-node: unknown
Jun 14 20:24:32 duision systemd[1]: calico-
Jun 14 20:24:32 duision systemd[1]: calico-
Jun 14 20:24:33 duision systemd[1]: session-45.scope: Succeeded.
Jun 14 20:24:33 duision kubelet.
Jun 14 20:24:35 duision systemd[1]: Started Session 49 of user ubuntu.
Jun 14 20:24:36 duision systemd[1]: session-49.scope: Succeeded.
Jun 14 20:24:36 duision containerd[138798]: time="2021-
t:0,}"
Jun 14 20:24:36 duision containerd[138798]: time="2021-
irectory: check that the calico/node container is running and has mounted /var/lib/calico/"
Jun 14 20:24:36 duision containerd[138798]: time="2021-
pt:0,} failed, error" error="failed to setup network for sandbox \"7334d00f9f736
lib/calico/"
Jun 14 20:24:36 duision kubelet.
ff48754f9820378
Jun 14 20:24:36 duision kubelet.
54f98203789aaee
Jun 14 20:24:36 duision kubelet.
54f98203789aaee
Jun 14 20:24:36 duision kubelet.
Logs here: https:/
Test run here: https:/
Changed in charm-calico: | |
milestone: | none → 1.24 |
status: | New → Triaged |
importance: | Undecided → High |
Changed in charm-calico: | |
status: | Triaged → Fix Committed |
Changed in charm-canal: | |
status: | Triaged → Fix Committed |
Changed in charm-tigera-secure-ee: | |
status: | Triaged → Fix Committed |
Changed in charm-calico: | |
assignee: | nobody → Kevin W Monroe (kwmonroe) |
Changed in charm-canal: | |
assignee: | nobody → Kevin W Monroe (kwmonroe) |
Changed in charm-tigera-secure-ee: | |
assignee: | nobody → Kevin W Monroe (kwmonroe) |
Changed in charm-calico: | |
status: | Fix Committed → Fix Released |
Changed in charm-canal: | |
status: | Fix Committed → Fix Released |
Changed in charm-tigera-secure-ee: | |
status: | Fix Committed → Fix Released |
Looking at 3 recent occurrences of this...
https://solutions.qa.canonical.com/testruns/testRun/e9b7200a-ae31-485e-adbd-1568b1119f5f
https://solutions.qa.canonical.com/testruns/testRun/91ea2c66-21fe-45da-b973-6e13a34c3b60
https://solutions.qa.canonical.com/testruns/testRun/5e99e033-12d3-4a7f-a52b-038ba1619de9
In all cases, the first time the calico-node service is started, it gets stopped before the container comes up:
Apr 5 09:14:12 solqa-lab1-server-12 systemd[1]: Starting calico node...
Apr 5 09:14:12 solqa-lab1-server-12 charm-env[532278]: ctr: container "calico-node" in namespace "default": not found
Apr 5 09:14:12 solqa-lab1-server-12 charm-env[532286]: time="2022-04-05T09:14:12Z" level=error msg="failed to delete container \"calico-node\"" error="container \"calico-node\" in namespace \"default\": not found"
Apr 5 09:14:12 solqa-lab1-server-12 charm-env[532286]: ctr: container "calico-node" in namespace "default": not found
Apr 5 09:14:12 solqa-lab1-server-12 systemd[1]: Started calico node.
Apr 5 09:14:12 solqa-lab1-server-12 systemd[1]: Reloading.
Apr 5 09:14:12 solqa-lab1-server-12 systemd[1]: Reloading.
Apr 5 09:14:13 solqa-lab1-server-12 systemd[1]: Stopping calico node...
Apr 5 09:14:13 solqa-lab1-server-12 charm-env[532611]: ctr: container "calico-node" in namespace "default": not found
Apr 5 09:14:13 solqa-lab1-server-12 charm-env[532618]: time="2022-04-05T09:14:13Z" level=error msg="failed to delete container \"calico-node\"" error="container \"calico-node\" in namespace \"default\": not found"
Apr 5 09:14:13 solqa-lab1-server-12 charm-env[532618]: ctr: container "calico-node" in namespace "default": not found
Apr 5 09:14:15 solqa-lab1-server-12 systemd[1]: calico-node.service: Succeeded.
Apr 5 09:14:15 solqa-lab1-server-12 systemd[1]: Stopped calico node.
After that, all attempts to start calico-node fail:
Apr 5 09:14:15 solqa-lab1-server-12 systemd[1]: Starting calico node...
Apr 5 09:14:15 solqa-lab1-server-12 charm-env[532701]: ctr: container "calico-node" in namespace "default": not found
Apr 5 09:14:15 solqa-lab1-server-12 charm-env[532708]: time="2022-04-05T09:14:15Z" level=error msg="failed to delete container \"calico-node\"" error="container \"calico-node\" in namespace \"default\": not found"
Apr 5 09:14:15 solqa-lab1-server-12 charm-env[532708]: ctr: container "calico-node" in namespace "default": not found
Apr 5 09:14:15 solqa-lab1-server-12 systemd[1]: Started calico node.
Apr 5 09:14:15 solqa-lab1-server-12 systemd[1]: Reloading.
Apr 5 09:14:15 solqa-lab1-server-12 charm-env[532755]: ctr: snapshot "calico-node": already exists
Apr 5 09:14:15 solqa-lab1-server-12 systemd[1]: calico-node.service: Main process exited, code=exited, status=1/FAILURE
Apr 5 09:14:15 solqa-lab1-server-12 systemd[1]: calico-node.service: Failed with result 'exit-code'.
The key error seems to be this:
ctr: snapshot "calico-node": already exists
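To check whether other test runs hit the same failure, the syslog can be scanned for this message. A small sketch, assuming the log lines have the same shape as the excerpts in this report; the function name `find_stale_snapshots` is made up for illustration:

```python
import re

# Message format copied from the syslog excerpt in this report.
SNAPSHOT_EXISTS = re.compile(r'ctr: snapshot "(?P<name>[^"]+)": already exists')


def find_stale_snapshots(log_lines):
    """Return snapshot names reported as 'already exists', in first-seen order."""
    seen = []
    for line in log_lines:
        match = SNAPSHOT_EXISTS.search(line)
        if match and match.group("name") not in seen:
            seen.append(match.group("name"))
    return seen


if __name__ == "__main__":
    sample = [
        'Apr 5 09:14:15 solqa-lab1-server-12 systemd[1]: Started calico node.',
        'Apr 5 09:14:15 solqa-lab1-server-12 charm-env[532755]: '
        'ctr: snapshot "calico-node": already exists',
    ]
    print(find_stale_snapshots(sample))
```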
Some lingering state, apparently a leftover containerd snapshot named "calico-node", is preventing the container from starting, and that state isn't getting cleaned up. This looks like a containerd or ctr bug of some sort.
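One way to test that hypothesis on an affected unit would be to remove the leftover snapshot by hand and restart the service. The sketch below only builds the commands and prints them by default. It is an assumption about a manual workaround, not the fix that was released for the charms, and it assumes the default snapshotter and the "default" namespace shown in the logs; `cleanup_commands` and `run_cleanup` are hypothetical names.

```python
import subprocess


def cleanup_commands(name="calico-node", namespace="default"):
    """Build the argv lists for a manual cleanup attempt (assumed workaround)."""
    return [
        # Show existing snapshots so the leftover one can be confirmed first.
        ["ctr", "-n", namespace, "snapshots", "ls"],
        # Remove the snapshot that ctr reports as "already exists".
        ["ctr", "-n", namespace, "snapshots", "rm", name],
        # Restart the unit seen in syslog as calico-node.service.
        ["systemctl", "restart", f"{name}.service"],
    ]


def run_cleanup(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_cleanup(cleanup_commands())  # dry run: prints the commands only
```

Keeping the dry run as the default makes it possible to review the commands before touching containerd state on a live unit.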