EKS 1.30: new nodes sometimes get "tls: internal error"

Bug #2069854 reported by Sergei Jeldosev
This bug affects 1 person
Affects        Status   Importance   Assigned to   Milestone
cloud-images   New      Undecided    Unassigned

Bug Description

After upgrading to 1.30 and EKS AMI ami-053f81caf652e695c, the node CSR remains in the Pending state for some new nodes. A new certificate is issued only after approximately 10 minutes.
This causes "tls: internal error" errors for ~10 minutes. If the kubelet service is restarted manually, everything starts working immediately. We use Karpenter for autoscaling, and it happens on roughly every fifth node.
It looks like there is a timing issue somewhere. We use userdata but don't restart kubelet or containerd.
AWS VPC CNI: 1.18.2
kube-proxy v1.30.0-minimal-eksbuild.3

Tags: eks
Thomas Bechtold (toabctl) wrote :

Hi Sergei,

Thanks for filing the bug report. Can you please provide log files (journal, kubelet-eks service, relevant pods, ...), and possibly more details about your setup and the detailed steps to reproduce this?

Sergei Jeldosev (sergeij) wrote :

Hi Thomas,

Sure, I have attached the kubelet logs. Everything seems to start in the same order as on 1.29 (Focal), except that 1.29 had no TLS issues.
In the log file you can see messages like "http: TLS handshake failed from 10.10.254.110:59266: No service certificate available for kubelet".
At 13:24:01, after the kubelet is restarted, everything starts working and the errors disappear. (As a workaround, we added another cron job that checks the logs for TLS errors on startup and restarts the kubelet.)
I'm not sure how exactly to reproduce it, because it happens randomly on some nodes.
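
For readers who want to try the same workaround, here is a minimal sketch of what such a cron check could look like. It is not the reporter's actual script. The error string is taken from the log line quoted above; the unit name `snap.kubelet-eks.daemon`, the 5-minute window, and the function names are assumptions for illustration and may need adjusting for your image.

```shell
#!/bin/sh
# Sketch of the cron workaround: restart the kubelet if its recent log shows
# the missing serving certificate error quoted in this report.

# Pattern copied from the log message in the report.
PATTERN='No service certificate available for kubelet'

# Returns 0 if the log text on stdin contains the error, 1 otherwise.
has_tls_error() {
  grep -q -- "$PATTERN"
}

# Assumption: the kubelet runs as the systemd unit "snap.kubelet-eks.daemon";
# override with KUBELET_UNIT if your image uses a different unit name.
check_and_restart() {
  unit="${KUBELET_UNIT:-snap.kubelet-eks.daemon}"
  if journalctl -u "$unit" --since '5 minutes ago' --no-pager | has_tls_error; then
    systemctl restart "$unit"
  fi
}

# Only act when invoked with --run, so the file can be sourced safely.
if [ "${1:-}" = "--run" ]; then
  check_and_restart
fi
```

Run from cron with the `--run` argument. This only masks the symptom; the Pending CSR itself still needs to be explained.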

We also see the following for the node's certificate signing requests. The first one is in status "Pending" and the second one is "Approved,Issued" after the kubelet restart. Without the cron workaround the same happens automatically, but only after ~10 minutes.

kubectl get csr
NAME        AGE   SIGNERNAME                      REQUESTOR                                         REQUESTEDDURATION   CONDITION
csr-pkvdm   17h   kubernetes.io/kubelet-serving   system:node:x-x-x-x.eu-north-1.compute.internal   <none>              Pending
csr-qhb6n   17m   kubernetes.io/kubelet-serving   system:node:x-x-x-x.eu-north-1.compute.internal   <none>              Approved,Issued
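
To spot affected nodes quickly, the listing above can be filtered for kubelet-serving CSRs that are still Pending. The following is a rough sketch that assumes the default table output of `kubectl get csr` (signer in the third column, condition in the last); the function name `pending_serving_csrs` is made up for this example.

```shell
#!/bin/sh
# Print the names of kubelet-serving CSRs whose condition is still "Pending".
# Reads the default table output of `kubectl get csr` on stdin.
pending_serving_csrs() {
  awk 'NR > 1 && $3 == "kubernetes.io/kubelet-serving" && $NF == "Pending" { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get csr | pending_serving_csrs
```

With the two rows shown above as input, this prints only `csr-pkvdm`.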
