Comment 2 for bug 2053034

Kevin W Monroe (kwmonroe) wrote : Re: A Calico Pod is stuck waiting during deployment, unable to reach the ip of a deleted node

Correct, the calicoctl errors are notorious red herrings (especially while the cluster is coming up), but they aren't the root cause. You were in the right spot with the cloud controller logs. Here's the key bit with more context:

./7/baremetal/pod-logs/kube-system-aws-cloud-controller-manager-cqz49-aws-cloud-controller-manager.log:I0209 04:26:39.871909 1 node_controller.go:415] Initializing node ip-172-31-33-226.ec2.internal with cloud provider
./7/baremetal/pod-logs/kube-system-aws-cloud-controller-manager-cqz49-aws-cloud-controller-manager.log:I0209 04:26:40.415003 1 node_controller.go:415] Initializing node ip-172-31-38-62.ec2.internal with cloud provider
./7/baremetal/pod-logs/kube-system-aws-cloud-controller-manager-cqz49-aws-cloud-controller-manager.log:I0209 04:26:40.771236 1 node_controller.go:415] Initializing node ip-172-31-41-51.ec2.internal with cloud provider
./7/baremetal/pod-logs/kube-system-aws-cloud-controller-manager-cqz49-aws-cloud-controller-manager.log:I0209 04:26:41.025044 1 node_controller.go:415] Initializing node ip-172-31-42-39.ec2.internal with cloud provider
./7/baremetal/pod-logs/kube-system-aws-cloud-controller-manager-cqz49-aws-cloud-controller-manager.log:I0209 04:26:41.428263 1 node_controller.go:415] Initializing node ip-172-31-37-35 with cloud provider

See that last line? Good nodes look like this:

Initializing node ip-172-31-33-226.ec2.internal with cloud provider

Our failing node looks like this:

Initializing node ip-172-31-37-35 with cloud provider

AWS node names should be fully qualified, so the c-c-m is correctly telling us that the short hostname can't be matched to an instance:

E0209 04:26:40.414974 1 node_controller.go:229] error syncing 'ip-172-31-37-35': failed to get provider ID for node ip-172-31-37-35 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing

External cloud providers (like aws) taint nodes with a NoSchedule effect until they are initialized. Since initialization never completes for the 37.35 node, the taint remains and no pods can be scheduled. That's the root cause of calico being upset -- it can't deploy the calico resources to that node because the node isn't schedulable.
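
For reference, the taint can be confirmed from the cluster side with kubectl (quick sketch; the taint key below is the standard one kubelet applies when running with an external cloud provider):

# show the taints on the stuck node; while uninitialized it should carry
#   node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
kubectl describe node ip-172-31-37-35 | grep -A 2 -i taints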

Unfortunately, I can't see where this short hostname is coming from. cloud-init logs clearly show the fqdn being set as the hostname, yet kube-apiserver logs clearly show checks against the short name (e.g. URI="/api/v1/nodes/ip-172-31-37-35?timeout=10s"). A couple of ideas (diagnostic sketch after the list):

1. Some other co-located charm is altering the hostname between cloud-init and k8s bringup (I see calico, containerd, filebeat, ntp, telegraf, and ubuntu-advantage subordinates on the k-c-p/0 unit).

2. The aws-cloud-provider charm is somehow not getting the provider-id correct for k-c-p/0 (the leader, fwiw) and is causing c-c-m to look for an incorrect node name.

3. The DNS resolver wasn't ready when k8s came up, and the 'dn' part of 'fqdn' wasn't available.
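
If someone hits this again, here's a rough sketch of what I'd compare on the affected unit -- standard commands and the usual EC2 metadata path, with the node name taken from this report (IMDSv2 may require a session token first):

# what the OS thinks the name is (cloud-init should have set the fqdn)
hostname
hostname -f

# what EC2 reports as the private DNS name for this instance
curl -s http://169.254.169.254/latest/meta-data/local-hostname

# what actually got registered in k8s (node name and provider ID)
kubectl get node ip-172-31-37-35 -o jsonpath='{.metadata.name}{"\n"}{.spec.providerID}{"\n"}'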

One thing I can say with a high degree of certainty is that this isn't a bug in the calico charm; we haven't seen this across dozens of charmed k8s aws runs over the last few weeks. I'm adding aws-cloud-provider as an affected project and targeting 1.29+ck1, but I'm not sure there's actually a todo for that charm either. Hopefully a simpler reproducer will reveal itself.