aws cloud controller manager not using fqdn for uninitialized nodes

Bug #2053034 reported by Amjad Chami
This bug affects 1 person
Affects                          Status         Importance  Assigned to  Milestone
AWS Cloud Provider Charm         Invalid        Medium      Unassigned
Calico Charm                     Invalid        Undecided   Unassigned
Kubernetes Control Plane Charm   Fix Committed  Medium      Unassigned

Bug Description

A Calico pod is stuck waiting during a Kubernetes-on-AWS deployment. The Calico unit logs show:

-----------------------------------------------------------
2024-02-09 08:35:22 ERROR unit.calico/4.juju-log server.go:325 b'resource does not exist: Node(ip-172-31-37-35) with error: <nil>\n'
2024-02-09 08:35:22 ERROR unit.calico/4.juju-log server.go:325 b'null\n'
2024-02-09 08:35:22 ERROR unit.calico/4.juju-log server.go:325 Failed to configure node.
Traceback (most recent call last):
  File "./src/charm.py", line 298, in _configure_node
    node = self._calicoctl_get("node", node_name)
  File "./src/charm.py", line 640, in _calicoctl_get
    output = self.calicoctl(*args)
  File "./src/charm.py", line 632, in calicoctl
    return subprocess.check_output(cmd, env=env, stderr=subprocess.PIPE, timeout=timeout)
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/calicoctl/calicoctl', 'get', '-o', 'yaml', '--export', 'node', 'ip-172-31-37-35']' returned non-zero exit status 1.
2024-02-09 08:35:22 ERROR unit.calico/4.juju-log server.go:325 Failed to configure Calico, will retry.
Traceback (most recent call last):
  File "./src/charm.py", line 174, in _install_or_upgrade
    self._configure_calico()
  File "./src/charm.py", line 125, in _configure_calico
    self._configure_node()
  File "./src/charm.py", line 305, in _configure_node
    raise e
  File "./src/charm.py", line 298, in _configure_node
    node = self._calicoctl_get("node", node_name)
  File "./src/charm.py", line 640, in _calicoctl_get
    output = self.calicoctl(*args)
  File "./src/charm.py", line 632, in calicoctl
    return subprocess.check_output(cmd, env=env, stderr=subprocess.PIPE, timeout=timeout)
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/calicoctl/calicoctl', 'get', '-o', 'yaml', '--export', 'node', 'ip-172-31-37-35']' returned non-zero exit status 1.

---------------------------------------------------------------------------------------------
and in the cloud controller manager logs we see:

I0209 04:26:41.428228 1 node_controller.go:484] Successfully initialized node ip-172-31-42-39.ec2.internal with cloud provider
I0209 04:26:41.428263 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
I0209 04:26:41.428621 1 event.go:294] "Event occurred" object="ip-172-31-42-39.ec2.internal" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="Synced" message="Node synced successfully"
E0209 04:26:41.519423 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:41.529643 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:41.619732 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:41.639955 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:41.731775 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:41.772253 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:41.857045 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:41.937298 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:42.018322 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:42.178671 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:42.269708 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:42.590273 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:42.677117 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:43.317318 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:43.467989 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:44.748498 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:44.840428 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:45.070342 1 node_lifecycle_controller.go:164] deleting node since it is no longer present in cloud provider: ip-172-31-37-35
I0209 04:26:45.070488 1 event.go:294] "Event occurred" object="ip-172-31-37-35" fieldPath="" kind="Node" apiVersion="" type="Normal" reason="DeletingNode" message="Deleting node ip-172-31-37-35 because it does not exist in the cloud provider"

---------------------------------------------------------------------------------------------------
It seems like the instance was deleted, but no new pod was provisioned and the old IP kept being reused.

Crashdumps and more information can be found here: https://oil-jenkins.canonical.com/artifacts/f987f8e9-1490-4812-aed3-07d3ba4623d1/index.html

summary: - A Calico Pod is stuck waiting during deployment
+ A Calico Pod is stuck waiting during deployment, unable to reach the ip
+ of a deleted node
description: updated
Revision history for this message
Jeffrey Chang (modern911) wrote (last edit): Re: A Calico Pod is stuck waiting during deployment, unable to reach the ip of a deleted node

OK, the same calicoctl error also happens in a passing test run: https://solutions.qa.canonical.com/testruns/cffca8e5-1fd3-4eb4-b182-3c69b58dccb5
Maybe this is not the cause.

Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

Correct, the calicoctl errors are famous red herrings (especially while the cluster is coming up), but they aren't the root cause. You were in the right spot with the cloud controller logs. Here's the key bit with more context:

./7/baremetal/pod-logs/kube-system-aws-cloud-controller-manager-cqz49-aws-cloud-controller-manager.log:I0209 04:26:39.871909 1 node_controller.go:415] Initializing node ip-172-31-33-226.ec2.internal with cloud provider
./7/baremetal/pod-logs/kube-system-aws-cloud-controller-manager-cqz49-aws-cloud-controller-manager.log:I0209 04:26:40.415003 1 node_controller.go:415] Initializing node ip-172-31-38-62.ec2.internal with cloud provider
./7/baremetal/pod-logs/kube-system-aws-cloud-controller-manager-cqz49-aws-cloud-controller-manager.log:I0209 04:26:40.771236 1 node_controller.go:415] Initializing node ip-172-31-41-51.ec2.internal with cloud provider
./7/baremetal/pod-logs/kube-system-aws-cloud-controller-manager-cqz49-aws-cloud-controller-manager.log:I0209 04:26:41.025044 1 node_controller.go:415] Initializing node ip-172-31-42-39.ec2.internal with cloud provider
./7/baremetal/pod-logs/kube-system-aws-cloud-controller-manager-cqz49-aws-cloud-controller-manager.log:I0209 04:26:41.428263 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider

See that last line? Good nodes look like this:

Initializing node ip-172-31-33-226.ec2.internal with cloud provider

Our failing node looks like this:

Initializing node ip-172-31-37-35 with cloud provider

AWS node names should be fully qualified, so the c-c-m is correctly telling us that the short hostname is not found:

E0209 04:26:40.414974 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
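
For context (an assumption on my part, not taken from the charm code): the AWS provider maps a node name to an EC2 instance by matching the instance's private DNS name, which is always fully qualified. The mismatch is easy to reproduce with the AWS CLI, assuming credentials and the right region:

aws ec2 describe-instances \
  --filters "Name=private-dns-name,Values=ip-172-31-37-35.ec2.internal" \
  --query 'Reservations[].Instances[].InstanceId'
# The same filter with the short name "ip-172-31-37-35" matches nothing,
# which is the same lookup failure the c-c-m is reporting above.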

External cloud providers (like aws) will taint nodes with a NoSchedule effect until they are initialized. Since this never happens for the 37.35 node, the taint remains and no pods can be scheduled. That's the root cause for calico being upset -- it can't deploy the calico resources to that node because it's not schedulable.
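
For anyone reproducing this, the stuck taint is visible directly on the node object (a generic kubectl check, node name taken from this run):

kubectl get node ip-172-31-37-35 \
  -o jsonpath='{.spec.taints[?(@.key=="node.cloudprovider.kubernetes.io/uninitialized")]}'
# While the node is uninitialized this prints something like:
# {"effect":"NoSchedule","key":"node.cloudprovider.kubernetes.io/uninitialized","value":"true"}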

Unfortunately, I can't see where this short hostname is coming from. cloud-init logs clearly show the FQDN being set as the hostname, yet kube-apiserver logs clearly show checks against the short name (e.g. URI="/api/v1/nodes/ip-172-31-37-35?timeout=10s"). A couple of ideas (some quick checks to tell these apart are sketched after this list):

1. Some other co-located charm is altering the hostname between cloud-init and k8s bringup (I see calico, containerd, filebeat, ntp, telegraf, and ubuntu-advantage subordinates on the k-c-p/0 unit).

2. The aws-cloud-provider charm is somehow not getting the provider-id correct for k-c-p/0 (the leader, fwiw) and is causing c-c-m to look for an incorrect node name.

3. The DNS resolver wasn't ready when k8s came up, and the 'dn' part of 'fqdn' wasn't available.
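
A few spot checks on the affected unit would narrow this down; these are generic commands rather than anything charm-specific, run via juju ssh to the k-c-p/0 unit:

hostname        # the short name the node may have registered with
hostname -f     # should be the private DNS name, e.g. ip-172-31-37-35.ec2.internal
# What EC2 itself reports for this host (IMDSv1 form; IMDSv2 needs a session token):
curl -s http://169.254.169.254/latest/meta-data/local-hostname
# And the name the kubelet actually registered with the API server:
kubectl get nodes -o name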

One thing I can say with a high degree of certainty is that this isn't a bug in the calico charm, and we haven't seen this with dozens of charmed k8s aws runs over the last few weeks. I'm adding aws-cloud-provider as an affected project and targeti...


Changed in charm-aws-cloud-provider:
status: New → Incomplete
importance: Undecided → Medium
milestone: none → 1.29+ck1
Changed in charm-calico:
status: New → Invalid
summary: - A Calico Pod is stuck waiting during deployment, unable to reach the ip
- of a deleted node
+ aws cloud controller manager not using fqdn for uninitialized nodes
Changed in charm-aws-cloud-provider:
milestone: 1.29+ck1 → 1.30
Revision history for this message
Adam Dyess (addyess) wrote :

I imagine the integration gap that was closed with AWS may help here. This will show up in 1.30.

commit: https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/commit/0db208586f465943b3528d0420074659b15aeecb
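
The general remedy (a minimal sketch on my part, not necessarily what the linked commit does) is to have the kubelet register under the EC2 private DNS name rather than the short hostname, for example:

# Assumes the node runs the kubelet with the external cloud provider:
NODE_NAME=$(curl -s http://169.254.169.254/latest/meta-data/local-hostname)
kubelet --cloud-provider=external --hostname-override="${NODE_NAME}"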

Changed in charm-kubernetes-master:
milestone: none → 1.30
status: New → Fix Committed
Adam Dyess (addyess)
Changed in charm-kubernetes-master:
importance: Undecided → Medium
Adam Dyess (addyess)
Changed in charm-aws-cloud-provider:
status: Incomplete → Invalid
milestone: 1.30 → none