Calico fails to start, no such file or directory "/var/lib/calico/nodename"

Bug #1932052 reported by Michael Skalka
This bug affects 3 people
Affects                  Status        Importance  Assigned to      Milestone
Calico Charm             Fix Released  High        Kevin W Monroe
Canal Charm              Fix Released  High        Kevin W Monroe
Tigera Secure EE Charm   Fix Released  High        Kevin W Monroe

Bug Description

Using CDK 1.21, a calico unit is stuck waiting with the status "Waiting to retry disabling VXLAN TX checksumming". Looking into the logs on that unit, it is failing to run a calicoctl command:

var/log/juju/unit-calico-2.log:

2021-06-14 20:24:05 DEBUG jujuc server.go:211 running hook tool "juju-log" for calico/2-update-status-5687765828834794899
2021-06-14 20:24:05 INFO juju-log Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-calico-2/charm/reactive/calico.py", line 512, in configure_node
    node = calicoctl_get('node', node_name)
  File "/var/lib/juju/agents/unit-calico-2/charm/reactive/calico.py", line 700, in calicoctl_get
    output = calicoctl(*args)
  File "/var/lib/juju/agents/unit-calico-2/charm/reactive/calico.py", line 633, in calicoctl
    return check_output(cmd, env=env, stderr=STDOUT)
  File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/calicoctl/calicoctl', 'get', '-o', 'yaml', '--export', 'node', 'duision']' returned non-zero exit status 1.

2021-06-14 20:24:05 DEBUG jujuc server.go:211 running hook tool "juju-log" for calico/2-update-status-5687765828834794899
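
For reference, the code path in the traceback is a thin subprocess wrapper around the calicoctl binary. A minimal sketch reconstructed from the traceback above (the environment handling is an assumption; the actual charm injects its own etcd connection settings into the env):

import os
from subprocess import check_output, STDOUT

CALICOCTL = '/opt/calicoctl/calicoctl'

def calicoctl(*args):
    # Run calicoctl, capturing stdout and stderr together; a non-zero
    # exit raises CalledProcessError, which is the exception in the
    # traceback above.
    cmd = [CALICOCTL] + list(args)
    env = os.environ.copy()  # assumption: the charm adds etcd endpoints/certs here
    return check_output(cmd, env=env, stderr=STDOUT)

def calicoctl_get(*args):
    # calicoctl_get('node', 'duision') runs:
    #   /opt/calicoctl/calicoctl get -o yaml --export node duision
    return calicoctl('get', '-o', 'yaml', '--export', *args)

The get likely fails here because calico-node never came up on the host and so never registered its Node resource, which points at the service failures in syslog below.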

Peeking at syslog, the calico service is trying to start repeatedly, but it is failing on a missing file:

calico_2/var/log/syslog:

Jun 14 20:24:32 duision systemd[1]: Starting calico node...
Jun 14 20:24:32 duision charm-env[654215]: ctr: container "calico-node" in namespace "default": not found
Jun 14 20:24:32 duision charm-env[654233]: time="2021-06-14T20:24:32Z" level=error msg="failed to delete container \"calico-node\"" error="container \"calico-node\" in namespace \"default\": not found"
Jun 14 20:24:32 duision charm-env[654233]: ctr: container "calico-node" in namespace "default": not found
Jun 14 20:24:32 duision systemd[1]: Started calico node.
Jun 14 20:24:32 duision containerd[138798]: time="2021-06-14T20:24:32.811242620Z" level=info msg="starting signal loop" namespace=default path=/run/containerd/io.containerd.runtime.v2.task/default/calico-node pid=654399
Jun 14 20:24:32 duision containerd[138798]: time="2021-06-14T20:24:32.828163436Z" level=info msg="shim disconnected" id=calico-node
Jun 14 20:24:32 duision containerd[138798]: time="2021-06-14T20:24:32.828245659Z" level=error msg="copy shim log" error="read /proc/self/fd/18: file already closed"
Jun 14 20:24:32 duision systemd[1]: run-containerd-io.containerd.runtime.v2.task-default-calico\x2dnode-rootfs.mount: Succeeded.
Jun 14 20:24:32 duision systemd[651936]: run-containerd-io.containerd.runtime.v2.task-default-calico\x2dnode-rootfs.mount: Succeeded.
Jun 14 20:24:32 duision charm-env[654348]: ctr: OCI runtime create failed: container with id exists: calico-node: unknown
Jun 14 20:24:32 duision systemd[1]: calico-node.service: Main process exited, code=exited, status=1/FAILURE
Jun 14 20:24:32 duision systemd[1]: calico-node.service: Failed with result 'exit-code'.
Jun 14 20:24:33 duision systemd[1]: session-45.scope: Succeeded.
Jun 14 20:24:33 duision kubelet.daemon[282123]: E0614 20:24:33.610986 282123 dns.go:136] "Nameserver limits exceeded" err="Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.246.64.7 10.246.64.5 10.246.64.6"
Jun 14 20:24:35 duision systemd[1]: Started Session 49 of user ubuntu.
Jun 14 20:24:36 duision systemd[1]: session-49.scope: Succeeded.
Jun 14 20:24:36 duision containerd[138798]: time="2021-06-14T20:24:36.610198488Z" level=info msg="RunPodsandbox for &PodSandboxMetadata{Name:dashboard-metrics-scraper-8458d7fdf6-8vclq,Uid:77fe1c10-aae3-40b9-be3f-fe455faf0651,Namespace:kubernetes-dashboard,Attempt:0,}"
Jun 14 20:24:36 duision containerd[138798]: time="2021-06-14T20:24:36.697444317Z" level=error msg="Failed to destroy network for sandbox \"7334d00f9f736146d5d58c1980904506f0fc9393dff48754f98203789aaee585\"" error="stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/"
Jun 14 20:24:36 duision containerd[138798]: time="2021-06-14T20:24:36.697771967Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:dashboard-metrics-scraper-8458d7fdf6-8vclq,Uid:77fe1c10-aae3-40b9-be3f-fe455faf0651,Namespace:kubernetes-dashboard,Attempt:0,} failed, error" error="failed to setup network for sandbox \"7334d00f9f736146d5d58c1980904506f0fc9393dff48754f98203789aaee585\": stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/"
Jun 14 20:24:36 duision kubelet.daemon[282123]: E0614 20:24:36.698142 282123 remote_runtime.go:116] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"7334d00f9f736146d5d58c1980904506f0fc9393dff48754f98203789aaee585\": stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/"
Jun 14 20:24:36 duision kubelet.daemon[282123]: E0614 20:24:36.698242 282123 kuberuntime_sandbox.go:68] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"7334d00f9f736146d5d58c1980904506f0fc9393dff48754f98203789aaee585\": stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/" pod="kubernetes-dashboard/dashboard-metrics-scraper-8458d7fdf6-8vclq"
Jun 14 20:24:36 duision kubelet.daemon[282123]: E0614 20:24:36.698294 282123 kuberuntime_manager.go:790] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"7334d00f9f736146d5d58c1980904506f0fc9393dff48754f98203789aaee585\": stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/" pod="kubernetes-dashboard/dashboard-metrics-scraper-8458d7fdf6-8vclq"
Jun 14 20:24:36 duision kubelet.daemon[282123]: E0614 20:24:36.698411 282123 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"dashboard-metrics-scraper-8458d7fdf6-8vclq_kubernetes-dashboard(77fe1c10-aae3-40b9-be3f-fe455faf0651)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"dashboard-metrics-scraper-8458d7fdf6-8vclq_kubernetes-dashboard(77fe1c10-aae3-40b9-be3f-fe455faf0651)\\\": rpc error: code = Unknown desc = failed to setup network for sandbox \\\"7334d00f9f736146d5d58c1980904506f0fc9393dff48754f98203789aaee585\\\": stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/\"" pod="kubernetes-dashboard/dashboard-metrics-scraper-8458d7fdf6-8vclq" podUID=77fe1c10-aae3-40b9-be3f-fe455faf0651

Logs here: https://oil-jenkins.canonical.com/artifacts/022bdf30-db22-4fc7-8ba1-ebb3987c9a4b/generated/generated/kubernetes/juju-crashdump-kubernetes-2021-06-14-20.22.50.tar.gz
Test run here: https://solutions.qa.canonical.com/testruns/testRun/022bdf30-db22-4fc7-8ba1-ebb3987c9a4b
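
For context, the error text itself points at the mechanism: /var/lib/calico/nodename is written by the calico/node container when it starts, and the Calico CNI plugin stats it during pod sandbox setup, so every pod create on the node fails while the service is down. A quick way to check on the affected host (a minimal sketch):

from pathlib import Path

# If calico-node has started successfully, this file holds the node
# name it registered; if it is absent, the CNI "stat ... nodename"
# errors above are expected.
nodename = Path('/var/lib/calico/nodename')
if nodename.exists():
    print('calico-node has started; nodename =', nodename.read_text().strip())
else:
    print('missing /var/lib/calico/nodename: calico-node has not come up')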

Changed in charm-calico:
milestone: none → 1.24
status: New → Triaged
importance: Undecided → High
George Kraft (cynerva) wrote:

Looking at 3 recent occurrences of this...

https://solutions.qa.canonical.com/testruns/testRun/e9b7200a-ae31-485e-adbd-1568b1119f5f
https://solutions.qa.canonical.com/testruns/testRun/91ea2c66-21fe-45da-b973-6e13a34c3b60
https://solutions.qa.canonical.com/testruns/testRun/5e99e033-12d3-4a7f-a52b-038ba1619de9

In all cases, the first time the calico-node service is started, it gets stopped before the container comes up:

Apr 5 09:14:12 solqa-lab1-server-12 systemd[1]: Starting calico node...
Apr 5 09:14:12 solqa-lab1-server-12 charm-env[532278]: ctr: container "calico-node" in namespace "default": not found
Apr 5 09:14:12 solqa-lab1-server-12 charm-env[532286]: time="2022-04-05T09:14:12Z" level=error msg="failed to delete container \"calico-node\"" error="container \"calico-node\" in namespace \"default\": not found"
Apr 5 09:14:12 solqa-lab1-server-12 charm-env[532286]: ctr: container "calico-node" in namespace "default": not found
Apr 5 09:14:12 solqa-lab1-server-12 systemd[1]: Started calico node.
Apr 5 09:14:12 solqa-lab1-server-12 systemd[1]: Reloading.
Apr 5 09:14:12 solqa-lab1-server-12 systemd[1]: Reloading.
Apr 5 09:14:13 solqa-lab1-server-12 systemd[1]: Stopping calico node...
Apr 5 09:14:13 solqa-lab1-server-12 charm-env[532611]: ctr: container "calico-node" in namespace "default": not found
Apr 5 09:14:13 solqa-lab1-server-12 charm-env[532618]: time="2022-04-05T09:14:13Z" level=error msg="failed to delete container \"calico-node\"" error="container \"calico-node\" in namespace \"default\": not found"
Apr 5 09:14:13 solqa-lab1-server-12 charm-env[532618]: ctr: container "calico-node" in namespace "default": not found
Apr 5 09:14:15 solqa-lab1-server-12 systemd[1]: calico-node.service: Succeeded.
Apr 5 09:14:15 solqa-lab1-server-12 systemd[1]: Stopped calico node.

After that, all attempts to start calico-node fail:

Apr 5 09:14:15 solqa-lab1-server-12 systemd[1]: Starting calico node...
Apr 5 09:14:15 solqa-lab1-server-12 charm-env[532701]: ctr: container "calico-node" in namespace "default": not found
Apr 5 09:14:15 solqa-lab1-server-12 charm-env[532708]: time="2022-04-05T09:14:15Z" level=error msg="failed to delete container \"calico-node\"" error="container \"calico-node\" in namespace \"default\": not found"
Apr 5 09:14:15 solqa-lab1-server-12 charm-env[532708]: ctr: container "calico-node" in namespace "default": not found
Apr 5 09:14:15 solqa-lab1-server-12 systemd[1]: Started calico node.
Apr 5 09:14:15 solqa-lab1-server-12 systemd[1]: Reloading.
Apr 5 09:14:15 solqa-lab1-server-12 charm-env[532755]: ctr: snapshot "calico-node": already exists
Apr 5 09:14:15 solqa-lab1-server-12 systemd[1]: calico-node.service: Main process exited, code=exited, status=1/FAILURE
Apr 5 09:14:15 solqa-lab1-server-12 systemd[1]: calico-node.service: Failed with result 'exit-code'.

The key error seems to be this:

ctr: snapshot "calico-node": already exists

There's some lingering state that's preventing the container from starting, and that state isn't getting cleaned up. Seems like a containerd or ctr bug of some sort.
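
If you hit this, the leftover state can be cleared manually with ctr. A hedged cleanup sketch, not the charm's fix: it mirrors the container delete visible in the charm-env log lines above and additionally removes the snapshot that gets left behind:

from subprocess import run

# Hypothetical manual cleanup of the leftover containerd state for the
# "calico-node" container in the "default" namespace. ctr exits
# non-zero when an object is already gone, so don't treat that as fatal.
for sub in (['containers', 'rm', 'calico-node'],
            ['snapshots', 'rm', 'calico-node']):
    run(['ctr', '--namespace', 'default'] + sub, check=False)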

George Kraft (cynerva) wrote:
Changed in charm-canal:
milestone: none → 1.24
Changed in charm-tigera-secure-ee:
milestone: none → 1.24
Changed in charm-canal:
importance: Undecided → High
Changed in charm-tigera-secure-ee:
importance: Undecided → High
Changed in charm-canal:
status: New → Triaged
Changed in charm-tigera-secure-ee:
status: New → Triaged
Kevin W Monroe (kwmonroe) wrote:
Changed in charm-calico:
status: Triaged → Fix Committed
Changed in charm-canal:
status: Triaged → Fix Committed
Changed in charm-tigera-secure-ee:
status: Triaged → Fix Committed
Changed in charm-calico:
assignee: nobody → Kevin W Monroe (kwmonroe)
Changed in charm-canal:
assignee: nobody → Kevin W Monroe (kwmonroe)
Changed in charm-tigera-secure-ee:
assignee: nobody → Kevin W Monroe (kwmonroe)
Bas de Bruijne (basdbruijne) wrote:

We have unfortunately seen 1 occurrence of this so far in the 1.24 release testing:

https://solutions.qa.canonical.com/testruns/testRun/ddea2ceb-0577-4937-80a7-28f048dddd24

There is a small error in our config (we set the calico CIDR to a /2 instead of a /22 range), but I don't think that is related.

George Kraft (cynerva) wrote:

The failure in the crashdump looks unrelated - it hadn't made it far enough to start calico-node yet.

The accidental /2 is likely to cause major problems. That's claiming the entire address range from 192.0.0.0 to 255.255.255.255. Let us know if you see a problem with the /22 and we'll look into it deeper.
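
For scale: a /2 mask keeps only the top two address bits, so any base address whose first two bits are 11 collapses into the same quarter of the IPv4 space. A quick illustration (the 192.168.0.0 base here is hypothetical):

import ipaddress

# strict=False lets ipaddress zero out the host bits for us.
net = ipaddress.ip_network('192.168.0.0/2', strict=False)
print(net)                # 192.0.0.0/2
print(net[0], net[-1])    # 192.0.0.0 255.255.255.255
print(net.num_addresses)  # 1073741824 (a quarter of all IPv4 addresses)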

Changed in charm-calico:
status: Fix Committed → Fix Released
Changed in charm-canal:
status: Fix Committed → Fix Released
Changed in charm-tigera-secure-ee:
status: Fix Committed → Fix Released
Patrizio Bassi (patrizio-bassi) wrote:

I still have this in 1.24, any hints?

unit-calico-5: 16:06:49 INFO unit.calico/5.juju-log Reactive main running for hook update-status
unit-calico-5: 16:06:49 INFO unit.calico/5.juju-log Initializing Leadership Layer (is follower)
unit-calico-5: 16:06:49 INFO unit.calico/5.juju-log Invoking reactive handler: reactive/calico.py:146:update_calicoctl_env
unit-calico-5: 16:06:49 INFO unit.calico/5.juju-log Invoking reactive handler: reactive/calico.py:484:configure_node
unit-calico-5: 16:06:49 INFO unit.calico/5.juju-log status-set: maintenance: Configuring Calico node
unit-calico-5: 16:06:49 INFO unit.calico/5.juju-log b'null\nresource does not exist: Node(juju-3baadf-k8s-11) with error: <nil>\n'
unit-calico-5: 16:06:49 INFO unit.calico/5.juju-log Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-calico-5/charm/reactive/calico.py", line 494, in configure_node
    node = calicoctl_get('node', node_name)
  File "/var/lib/juju/agents/unit-calico-5/charm/reactive/calico.py", line 684, in calicoctl_get
    output = calicoctl(*args)
  File "/var/lib/juju/agents/unit-calico-5/charm/reactive/calico.py", line 617, in calicoctl
    return check_output(cmd, env=env, stderr=STDOUT)
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/calicoctl/calicoctl', 'get', '-o', 'yaml', '--export', 'node', 'juju-3baadf-k8s-11']' returned non-zero exit status 1.

George Kraft (cynerva) wrote:

Hi Patrizio. The "resource does not exist: Node(juju-3baadf-k8s-11)" error indicates that the calico-node service is failing for some reason. To troubleshoot further, I would ssh into the calico/5 unit with `juju ssh calico/5` and then run:

sudo journalctl -u calico-node
sudo cat /var/log/calico/felix/current

Patrizio Bassi (patrizio-bassi) wrote:

Hi George, thank you for your feedback.

It looks like the unit was stuck; I issued a reboot, and after it came back everything ran smoothly.

As this is not a clean workaround, in a few days I will destroy and recreate the unit, and I'll report back if the problem reappears.
