racy configuration of kubelet with recent images - eks nodes failing to taint / label on creation
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
cloud-images | Fix Released | High | George Kraft |
Bug Description
Hi. I'm using EKS with a self-managed nodepool, set up via Terraform.
Over the last 72 hours or so, there has been a high incidence (but not 100%) of new instances registering themselves with k8s, but not honouring the --register-with-taints and --node-labels arguments passed to kubelet.
These instances are booted with the following user-data:
[[[
#!/bin/bash
set -e
B64_CLUSTER_CA=…
API_SERVER_URL=…
/etc/eks/bootstrap.sh …
]]]
On inspection of a broken node, kubelet is running from snap with those command-line arguments.
My *suspicion* is that the user-data script is racing: perhaps kubelet comes up once without the appropriate arguments. That would cause the node to be registered; the process is then replaced, but by that point the node resource already exists in k8s and isn't updated. (Certainly, restarting the kubelet process doesn't address the issue.)
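A quick way to confirm a node hit this is to inspect its taints and labels directly (the node name below is a placeholder):
[[[
# Check whether the node registered without its taints / labels;
# "ip-10-0-0-1.ec2.internal" is a placeholder node name.
kubectl get node ip-10-0-0-1.ec2.internal -o jsonpath='{.spec.taints}'
kubectl get node ip-10-0-0-1.ec2.internal --show-labels
]]]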
I see that https:/
summary: changed from "amazon/ubuntu-eks/k8s_1.24/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230607 - eks nodes failing to taint / label on creation" to "racy configuration of kubelet with recent images - eks nodes failing to taint / label on creation"
The problem here is two "snap set" commands, one after the other, in bootstrap.sh.
The first gives kubelet all the information to contact the API server and register itself, but not the extra args. This restarts kubelet and begins a race.
The second adds the extra-args and restarts kubelet.
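Schematically, the racy sequence looks like this (a sketch: the snap and option names here are illustrative, not the literal bootstrap.sh contents):
[[[
# Illustrative reconstruction of the racy pattern; each "snap set"
# restarts kubelet (presumably via the snap's configure hook).
snap set kubelet cluster-ca="$B64_CLUSTER_CA" api-server="$API_SERVER_URL"
# restart #1: kubelet may register the node here, with no taints / labels
snap set kubelet args="--register-with-taints=… --node-labels=…"
# restart #2: too late if the node was registered by the first instance
]]]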
--register-with-taints and --node-labels have no effect if they are applied after the node has registered.
If the first restart of kubelet gets as far as registering the node, those taints / labels will *never* apply.
The fix is to merge the two `snap set` commands into one.
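That is, something like this (same illustrative option names as above):
[[[
# Single "snap set": kubelet restarts once, already holding its
# registration details and extra args together, so there is no
# window in which it can register without the taints / labels.
snap set kubelet \
    cluster-ca="$B64_CLUSTER_CA" \
    api-server="$API_SERVER_URL" \
    args="--register-with-taints=… --node-labels=…"
]]]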
A more principled fix would be to not start kubelet at all until it's configured properly.
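For instance (a sketch; snap stop/start manage the snap's service):
[[[
# Sketch: keep kubelet down until it is fully configured.
snap stop kubelet    # make sure kubelet isn't running half-configured
snap set kubelet …   # apply the complete configuration in one step
snap start kubelet   # the first start happens only when fully configured
]]]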