aws cloud controller manager not using fqdn for uninitialized nodes

Bug #2053034 reported by Amjad Chami
This bug affects 1 person
Affects                          Status         Importance  Assigned to  Milestone
AWS Cloud Provider Charm         Invalid        Medium      Unassigned
Calico Charm                     Invalid        Undecided   Unassigned
Kubernetes Control Plane Charm   Fix Committed  Medium      Unassigned

Bug Description

A Calico pod is stuck waiting during a Kubernetes-on-AWS deployment. The Calico unit logs show:

-----------------------------------------------------------
2024-02-09 08:35:22 ERROR unit.calico/4.juju-log server.go:325 b'resource does not exist: Node(ip-172-31-37-35) with error: <nil>\n'
2024-02-09 08:35:22 ERROR unit.calico/4.juju-log server.go:325 b'null\n'
2024-02-09 08:35:22 ERROR unit.calico/4.juju-log server.go:325 Failed to configure node.
Traceback (most recent call last):
  File "./src/charm.py", line 298, in _configure_node
    node = self._calicoctl_get("node", node_name)
  File "./src/charm.py", line 640, in _calicoctl_get
    output = self.calicoctl(*args)
  File "./src/charm.py", line 632, in calicoctl
    return subprocess.check_output(cmd, env=env, stderr=subprocess.PIPE, timeout=timeout)
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/calicoctl/calicoctl', 'get', '-o', 'yaml', '--export', 'node', 'ip-172-31-37-35']' returned non-zero exit status 1.
2024-02-09 08:35:22 ERROR unit.calico/4.juju-log server.go:325 Failed to configure Calico, will retry.
Traceback (most recent call last):
  File "./src/charm.py", line 174, in _install_or_upgrade
    self._configure_calico()
  File "./src/charm.py", line 125, in _configure_calico
    self._configure_node()
  File "./src/charm.py", line 305, in _configure_node
    raise e
  File "./src/charm.py", line 298, in _configure_node
    node = self._calicoctl_get("node", node_name)
  File "./src/charm.py", line 640, in _calicoctl_get
    output = self.calicoctl(*args)
  File "./src/charm.py", line 632, in calicoctl
    return subprocess.check_output(cmd, env=env, stderr=subprocess.PIPE, timeout=timeout)
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/calicoctl/calicoctl', 'get', '-o', 'yaml', '--export', 'node', 'ip-172-31-37-35']' returned non-zero exit status 1.

---------------------------------------------------------------------------------------------
and in the cloud controller manager logs we see:

I0209 04:26:41.428228 1 node_controller.go:484] Successfully initialized node ip-172-31-42-39.ec2.internal with cloud provider
I0209 04:26:41.428263 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
I0209 04:26:41.428621 1 event.go:294] "Event occurred" object="ip-172-31-42-39.ec2.internal" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="Synced" message="Node synced successfully"
E0209 04:26:41.519423 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:41.529643 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:41.619732 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:41.639955 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:41.731775 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:41.772253 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:41.857045 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:41.937298 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:42.018322 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:42.178671 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:42.269708 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:42.590273 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:42.677117 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:43.317318 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:43.467989 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:44.748498 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider
E0209 04:26:44.840428 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0209 04:26:45.070342 1 node_lifecycle_controller.go:164] deleting node since it is no longer present in cloud provider: ip-172-31-37-35
I0209 04:26:45.070488 1 event.go:294] "Event occurred" object="ip-172-31-37-35" fieldPath="" kind="Node" apiVersion="" type="Normal" reason="DeletingNode" message="Deleting node ip-172-31-37-35 because it does not exist in the cloud provider"

---------------------------------------------------------------------------------------------------
It seems like the instance was deleted, but no new pod was provisioned and the old IP kept being reused.

Crashdumps and more information can be found here: https://oil-jenkins.canonical.com/artifacts/f987f8e9-1490-4812-aed3-07d3ba4623d1/index.html

summary: - A Calico Pod is stuck waiting during deployment
+ A Calico Pod is stuck waiting during deployment, unable to reach the ip
+ of a deleted node
description: updated
Revision history for this message
Jeffrey Chang (modern911) wrote (last edit): Re: A Calico Pod is stuck waiting during deployment, unable to reach the ip of a deleted node

OK, the same calicoctl error also happens in a passing test run: https://solutions.qa.canonical.com/testruns/cffca8e5-1fd3-4eb4-b182-3c69b58dccb5
Maybe this is not the cause.

Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

Correct, the calicoctl errors are famous red herrings (especially while the cluster is coming up), but they aren't the root cause. You were in the right spot with the cloud controller logs. Here's the key bit with more context:

./7/baremetal/pod-logs/kube-system-aws-cloud-controller-manager-cqz49-aws-cloud-controller-manager.log:I0209 04:26:39.871909 1 node_controller.go:415] Initializing node ip-172-31-33-226.ec2.internal with cloud provider
./7/baremetal/pod-logs/kube-system-aws-cloud-controller-manager-cqz49-aws-cloud-controller-manager.log:I0209 04:26:40.415003 1 node_controller.go:415] Initializing node ip-172-31-38-62.ec2.internal with cloud provider
./7/baremetal/pod-logs/kube-system-aws-cloud-controller-manager-cqz49-aws-cloud-controller-manager.log:I0209 04:26:40.771236 1 node_controller.go:415] Initializing node ip-172-31-41-51.ec2.internal with cloud provider
./7/baremetal/pod-logs/kube-system-aws-cloud-controller-manager-cqz49-aws-cloud-controller-manager.log:I0209 04:26:41.025044 1 node_controller.go:415] Initializing node ip-172-31-42-39.ec2.internal with cloud provider
./7/baremetal/pod-logs/kube-system-aws-cloud-controller-manager-cqz49-aws-cloud-controller-manager.log:I0209 04:26:41.428263 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider

See that last line? Good nodes look like this:

Initializing node ip-172-31-33-226.ec2.internal with cloud provider

Our failing node looks like this:

Initializing node ip-172-31-37-35 with cloud provider

AWS node names should be fully qualified, so the c-c-m is correctly telling us that the short hostname is not found:

E0209 04:26:40.414974 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
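
For context (an assumption on my part, not taken from the charm code): the AWS provider maps a node name to an EC2 instance by matching the instance's private DNS name, which is always fully qualified. The mismatch is easy to reproduce with the AWS CLI, assuming credentials and the right region:

aws ec2 describe-instances \
  --filters "Name=private-dns-name,Values=ip-172-31-37-35.ec2.internal" \
  --query 'Reservations[].Instances[].InstanceId'
# The same filter with the short name "ip-172-31-37-35" matches nothing,
# which is the same lookup failure the c-c-m is reporting above.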

External cloud providers (like aws) will taint nodes with a NoSchedule effect until they are initialized. Since this never happens for the 37.35 node, the taint remains and no pods can be scheduled. That's the root cause for calico being upset -- it can't deploy the calico resources to that node because it's not schedulable.
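
For anyone reproducing this, the stuck taint is visible directly on the node object (a generic kubectl check, node name taken from this run):

kubectl get node ip-172-31-37-35 \
  -o jsonpath='{.spec.taints[?(@.key=="node.cloudprovider.kubernetes.io/uninitialized")]}'
# While the node is uninitialized this prints something like:
# {"effect":"NoSchedule","key":"node.cloudprovider.kubernetes.io/uninitialized","value":"true"}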

Unfortunately, I can't see where this short hostname is coming from. cloud-init logs clearly show the FQDN being set as the hostname, yet kube-apiserver logs clearly show checks against the short name (e.g. URI="/api/v1/nodes/ip-172-31-37-35?timeout=10s"). A couple of ideas (some quick checks to tell these apart are sketched after this list):

1. Some other co-located charm is altering the hostname between cloud-init and k8s bringup (I see calico, containerd, filebeat, ntp, telegraf, and ubuntu-advantage subordinates on the k-c-p/0 unit).

2. The aws-cloud-provider charm is somehow not getting the provider-id correct for k-c-p/0 (the leader, fwiw) and is causing c-c-m to look for an incorrect node name.

3. The DNS resolver wasn't ready when k8s came up, and the 'dn' part of 'fqdn' wasn't available.
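
A few spot checks on the affected unit would narrow this down; these are generic commands rather than anything charm-specific, run via juju ssh to the k-c-p/0 unit:

hostname        # the short name the node may have registered with
hostname -f     # should be the private DNS name, e.g. ip-172-31-37-35.ec2.internal
# What EC2 itself reports for this host (IMDSv1 form; IMDSv2 needs a session token):
curl -s http://169.254.169.254/latest/meta-data/local-hostname
# And the name the kubelet actually registered with the API server:
kubectl get nodes -o name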

One thing I can say with a high degree of certainty is that this isn't a bug in the calico charm, and we haven't seen this with dozens of charmed k8s aws runs over the last few weeks. I'm adding aws-cloud-provider as an affected project and targeti...


Changed in charm-aws-cloud-provider:
status: New → Incomplete
importance: Undecided → Medium
milestone: none → 1.29+ck1
Changed in charm-calico:
status: New → Invalid
summary: - A Calico Pod is stuck waiting during deployment, unable to reach the ip
- of a deleted node
+ aws cloud controller manager not using fqdn for uninitialized nodes
Changed in charm-aws-cloud-provider:
milestone: 1.29+ck1 → 1.30
Revision history for this message
Adam Dyess (addyess) wrote :

I imagine the integration gap that was closed with AWS may help here. This will show up in 1.30.

commit: https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/commit/0db208586f465943b3528d0420074659b15aeecb
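
The general remedy (a minimal sketch on my part, not necessarily what the linked commit does) is to have the kubelet register under the EC2 private DNS name rather than the short hostname, for example:

# Assumes the node runs the kubelet with the external cloud provider:
NODE_NAME=$(curl -s http://169.254.169.254/latest/meta-data/local-hostname)
kubelet --cloud-provider=external --hostname-override="${NODE_NAME}"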

Changed in charm-kubernetes-master:
milestone: none → 1.30
status: New → Fix Committed
Adam Dyess (addyess)
Changed in charm-kubernetes-master:
importance: Undecided → Medium
Adam Dyess (addyess)
Changed in charm-aws-cloud-provider:
status: Incomplete → Invalid
milestone: 1.30 → none