cloud-images

EKS 1.30: new nodes sometimes get "tls: internal error"

Bug #2069854 reported by Sergei Jeldosev on 2024-06-19

This bug affects 1 person

Affects        Status   Importance   Assigned to   Milestone
cloud-images   New      Undecided    Unassigned

Bug Description

After upgrading to EKS 1.30 with AMI ami-053f81caf652e695c, the node CSR stays in the Pending state on some new nodes. A new certificate is only issued after approximately 10 minutes.
This causes "tls: internal error" errors for ~10 minutes. If the kubelet service is restarted manually, everything starts working immediately. We use Karpenter for autoscaling, and this happens on roughly every fifth node.
This looks like a timing issue somewhere. We use userdata, but it does not restart kubelet or containerd.
AWS VPC CNI: 1.18.2
kube-proxy v1.30.0-minimal-eksbuild.3


Thomas Bechtold (toabctl) wrote on 2024-06-20:

Hi Sergei,

Thanks for filing the bug report. Can you please provide log files (journal, kubelet-eks service, relevant pods, ...), and if possible more details about your setup and the detailed steps to reproduce this?


Sergei Jeldosev (sergeij) wrote on 2024-06-20:

Attachment: kubelet.txt (69.0 KiB, text/plain)

Hi Thomas,

Sure, kubelet logs are attached. Everything seems to start in the same order as on 1.29 Focal, except that there are no TLS issues on 1.29.
In the log file you can see the message "http: TLS handshake failed from 10.10.254.110:59266: No service certificate available for kubelet".
At 13:24:01, after the kubelet is restarted, everything starts working and the errors disappear. As a workaround we added another cron job that checks the logs for TLS errors on startup and restarts the kubelet.
I'm not sure how to reproduce it reliably, because it happens randomly on some nodes.
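
For anyone who wants to try the same workaround, here is a minimal sketch of that kind of check in Python. It is not the exact script from our nodes. It looks for the "No service certificate available for kubelet" message quoted above in recent journal output and restarts the kubelet if it finds it. The unit name `kubelet` and the five-minute window are assumptions and need to be adjusted to the node image in use.

```python
import subprocess

# Message quoted in this report; the kubelet logs it while it has no serving certificate.
TLS_MARKER = "No service certificate available for kubelet"

# Assumption: hypothetical systemd unit name, adjust it to the node image in use.
KUBELET_UNIT = "kubelet"


def needs_restart(log_lines, marker=TLS_MARKER):
    """Return True if any log line contains the TLS handshake failure marker."""
    return any(marker in line for line in log_lines)


def recent_kubelet_logs(unit=KUBELET_UNIT, since="5 minutes ago"):
    """Read recent journal lines for the kubelet unit as plain message text."""
    result = subprocess.run(
        ["journalctl", "-u", unit, "--since", since, "--no-pager", "-o", "cat"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def main():
    """Restart the kubelet once if the marker shows up in the recent logs."""
    if needs_restart(recent_kubelet_logs()):
        subprocess.run(["systemctl", "restart", KUBELET_UNIT], check=True)
```

Calling `main()` once shortly after boot matches the "on startup" check described above. If it runs repeatedly, the error lines logged before the restart would still fall inside the time window and trigger another restart, so a periodic version would have to ignore lines older than the last restart.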

We also see the following for the node's certificate signing requests. The first one is in status "Pending" and the other one is "Approved,Issued" after the kubelet restart. Without the cron workaround the same thing happens automatically, but only after ~10 minutes.

kubectl get csr
NAME        AGE   SIGNERNAME                      REQUESTOR                                         REQUESTEDDURATION   CONDITION
csr-pkvdm   17h   kubernetes.io/kubelet-serving   system:node:x-x-x-x.eu-north-1.compute.internal   <none>              Pending
csr-qhb6n   17m   kubernetes.io/kubelet-serving   system:node:x-x-x-x.eu-north-1.compute.internal   <none>              Approved,Issued
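
To spot affected nodes without reading the listing by eye, the output above can be filtered for kubelet-serving requests that are still Pending. The following is only a sketch: the function name `pending_serving_csrs` is made up for illustration, and it assumes the six whitespace-separated columns shown above with a header row first.

```python
def pending_serving_csrs(kubectl_output):
    """Return (name, requestor) pairs for kubelet-serving CSRs still in Pending."""
    pending = []
    lines = kubectl_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) != 6:
            continue  # ignore rows that do not match the six-column layout
        name, _age, signer, requestor, _duration, condition = fields
        if signer == "kubernetes.io/kubelet-serving" and condition == "Pending":
            pending.append((name, requestor))
    return pending


# Sample taken from the `kubectl get csr` output in this report.
SAMPLE = """\
NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
csr-pkvdm 17h kubernetes.io/kubelet-serving system:node:x-x-x-x.eu-north-1.compute.internal <none> Pending
csr-qhb6n 17m kubernetes.io/kubelet-serving system:node:x-x-x-x.eu-north-1.compute.internal <none> Approved,Issued
"""

print(pending_serving_csrs(SAMPLE))
# prints [('csr-pkvdm', 'system:node:x-x-x-x.eu-north-1.compute.internal')]
```

On a live cluster the input would come from running `kubectl get csr` and passing its stdout to the function. Parsing the text table is fragile, so this is meant for a quick check rather than for automation.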
