Comment 0 for bug 2053034

Revision history for this message
Amjad Chami (amjad-chami) wrote : A Calico Pod is stuck waiting during deployment

A Calico pod is stuck waiting during a k8s deployment. The logs show

-----------------------------------------------------------
2024-02-09 08:35:22 ERROR unit.calico/4.juju-log server.go:325 b'resource does not exist: Node(ip-172-31-37-35) with error: <nil>\n'
2024-02-09 08:35:22 ERROR unit.calico/4.juju-log server.go:325 b'null\n'
2024-02-09 08:35:22 ERROR unit.calico/4.juju-log server.go:325 Failed to configure node.
Traceback (most recent call last):
  File "./src/charm.py", line 298, in _configure_node
    node = self._calicoctl_get("node", node_name)
  File "./src/charm.py", line 640, in _calicoctl_get
    output = self.calicoctl(*args)
  File "./src/charm.py", line 632, in calicoctl
    return subprocess.check_output(cmd, env=env, stderr=subprocess.PIPE, timeout=timeout)
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/calicoctl/calicoctl', 'get', '-o', 'yaml', '--export', 'node', 'ip-172-31-37-35']' returned non-zero exit status 1.
2024-02-09 08:35:22 ERROR unit.calico/4.juju-log server.go:325 Failed to configure Calico, will retry.
Traceback (most recent call last):
  File "./src/charm.py", line 174, in _install_or_upgrade
    self._configure_calico()
  File "./src/charm.py", line 125, in _configure_calico
    self._configure_node()
  File "./src/charm.py", line 305, in _configure_node
    raise e
  File "./src/charm.py", line 298, in _configure_node
    node = self._calicoctl_get("node", node_name)
  File "./src/charm.py", line 640, in _calicoctl_get
    output = self.calicoctl(*args)
  File "./src/charm.py", line 632, in calicoctl
    return subprocess.check_output(cmd, env=env, stderr=subprocess.PIPE, timeout=timeout)
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/calicoctl/calicoctl', 'get', '-o', 'yaml', '--export', 'node', 'ip-172-31-37-35']' returned non-zero exit status 1.

---------------------------------------------------------------------------------------------
and in the cloud manager logs we see

I0209 04:26:41.428228 1 node_controller.go:484] Successfully initialized node ip-172-31-42-39.ec2.internal with cloud provider
I0209 04:26:41.428263 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
I0209 04:26:41.428621 1 event.go:294] "Event occurred" object="ip-172-31-42-39.ec2.internal" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="Synced" message="Node synced successfully"
E0209 04:26:41.519423 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:41.529643 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:41.619732 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:41.639955 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:41.731775 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:41.772253 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:41.857045 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:41.937298 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:42.018322 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:42.178671 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:42.269708 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:42.590273 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:42.677117 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:43.317318 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:43.467989 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:44.748498 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:44.840428 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:45.070342 1 node_lifecycle_controller.go:164] deleting node since it is no longer present in cloud provider: ip-172-31-37-35
I0209 04:26:45.070488 1 event.go:294] "Event occurred" object="ip-172-31-37-35" fieldPath="" kind="Node" apiVersion="" type="Normal" reason="DeletingNode" message="Deleting node ip-172-31-37-35 because it does not exist in the cloud provider"

---------------------------------------------------------------------------------------------------
It seems like the instance was deleted by no new pod was provisioned and the old ip kept on being reused;

Crashdumps and more information can be found here: https://oil-jenkins.canonical.com/artifacts/f987f8e9-1490-4812-aed3-07d3ba4623d1/index.html