cloud-images

Bug #2045791
Comment #0

Lorelei Rupp (loreleirupp) wrote on 2023-12-06:

We have an EKS cluster in AWS running Kubernetes 1.25.

I tried to connect managed node groups with the following base AMIs:

ubuntu-eks/k8s_1.25/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20231201 -- fails to label the node
ubuntu-eks/k8s_1.26/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20231201 -- fails to label the node
ubuntu-eks/k8s_1.26/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20231204.1 -- fails to join the cluster, with an error in user-data.log on line 506 in the bootstrap.sh script

I guess I am just unlucky in that I tried to roll out a new Ubuntu AMI to our cluster this week.

I believe these bugs are related: https://bugs.launchpad.net/cloud-images/+bug/2040477 and https://bugs.launchpad.net/cloud-images/+bug/2045311

As they called out, /etc/eks/bootstrap.sh has changed in these newer AMIs, and it has issues in every version I have tried.

At first, only the labeling was not working.

I could see that the kubelet was not being started with the node labels.
On a working 1.24 image it looks like this:
$ ps -ef | grep kube
root 3833 1 1 13:33 ? 00:04:08 /snap/kubelet-eks/198/kubelet --node-labels=ec2.amazonaws.com/as-label-env=dev2,ec2.amazonaws.com/as-label-type=paravision-processor_gpu --address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook --authorization-mode=Webhook --cgroup-driver=cgroupfs --client-ca-file=/etc/kubernetes/pki/ca.crt --cloud-provider=aws --cluster-dns=172.20.0.10 --cluster-domain=cluster.local --config=/etc/kubernetes/kubelet/kubelet-config.json --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock --feature-gates=RotateKubeletServerCertificate=true --kubeconfig=/var/lib/kubelet/kubeconfig --node-ip=10.0.20.16 --pod-infra-container-image=602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause:3.5 --register-node --resolv-conf=/run/systemd/resolve/resolv.conf

Whereas on one of the first two AMIs listed above, it shows:

$ ps -ef | grep kube
root 4059 1 1 Dec05 ? 00:24:01 /snap/kubelet-eks/202/kubelet --address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook --authorization-mode=Webhook --cgroup-driver=cgroupfs --client-ca-file=/etc/kubernetes/pki/ca.crt --cloud-provider=aws --cluster-dns=172.20.0.10 --cluster-domain=cluster.local --config=/etc/kubernetes/kubelet/kubelet-config.json --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock --kubeconfig=/var/lib/kubelet/kubeconfig --node-ip=10.0.21.117 --pod-infra-container-image=602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause:3.5 --register-node --resolv-conf=/run/systemd/resolve/resolv.conf
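
The difference between the two listings comes down to whether `--node-labels=` appears on the kubelet command line. A minimal sketch of that check, run against shortened copies of the two command lines above; the helper name `has_node_labels` is hypothetical and not part of any EKS tooling:

```shell
#!/bin/bash
# Hypothetical helper: succeeds if a kubelet command line carries --node-labels=.
has_node_labels() {
  case " $1 " in
    *" --node-labels="*) return 0 ;;
    *) return 1 ;;
  esac
}

# Shortened copies of the two command lines from the ps output above.
working="/snap/kubelet-eks/198/kubelet --node-labels=ec2.amazonaws.com/as-label-env=dev2 --address=0.0.0.0"
broken="/snap/kubelet-eks/202/kubelet --address=0.0.0.0 --anonymous-auth=false"

for cmdline in "$working" "$broken"; do
  if has_node_labels "$cmdline"; then
    echo "labels present"
  else
    echo "labels missing"
  fi
done
```

On a node, the live command line could be passed in instead, for example `has_node_labels "$(ps -o args= -C kubelet)"`, assuming the process name is `kubelet`.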

-------------------
Seeing those bug reports, I tried the latest AMI just now, and that one doesn't even connect to our cluster.

Here is the error from user-data.log:
....
2023-12-06 16:15:26,674:__main__:INFO:No more changes in progress ...
2023-12-06 16:15:26,676:__main__:INFO:result for change: {'id': '28', 'kind': 'configure-snap', 'summary': 'Change configuration of "kubelet-eks" snap', 'status': 'Done', 'tasks': [{'id': '151', 'kind': 'run-hook', 'summary': 'Run configure hook of "kubelet-eks" snap', 'status': 'Done', 'progress': {'label': '', 'done': 1, 'total': 1}, 'spawn-time': '2023-12-06T16:15:25.546818263Z', 'ready-time': '2023-12-06T16:15:26.659445389Z'}], 'ready': True, 'spawn-time': '2023-12-06T16:15:25.546834552Z', 'ready-time': '2023-12-06T16:15:26.659446614Z'}
usage: snapdhelper.py configure [-h] snapname key value
snapdhelper.py configure: error: the following arguments are required: value
Exited with error on line 506
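
The last message points at line 506 of the script that exited, presumably /etc/eks/bootstrap.sh. A minimal sketch for printing the lines around a reported line number so they can be attached to a report like this one; `show_context` is a hypothetical helper name, and the path is an assumption based on the script called from the user data:

```shell
#!/bin/bash
# Hypothetical helper: print lines (N - radius) through (N + radius) of a file,
# each prefixed with its line number. The radius defaults to 5.
show_context() {
  local file="$1" line="$2" radius="${3:-5}"
  local start=$(( line - radius ))
  local end=$(( line + radius ))
  if (( start < 1 )); then
    start=1
  fi
  awk -v s="$start" -v e="$end" \
    'NR >= s && NR <= e { printf "%d: %s\n", NR, $0 }' "$file"
}

# Assumed path; nothing is printed if the file is not present on this machine.
if [ -r /etc/eks/bootstrap.sh ]; then
  show_context /etc/eks/bootstrap.sh 506
fi
```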

-----------------
Also, our user-data script is the same for all of these. For example:

#!/bin/bash
#
# This script is meant to be run in the User Data of each EKS worker instance that hosts applications. It registers the
# instance with the proper EKS cluster based on data provided by Terraform. Note that this script assumes it is running
# from an AMI that is derived from the EKS optimized AMIs that AWS provides.

set -e

# Send the log output from this script to user-data.log, syslog, and the console
# From: https://alestic.com/2010/12/ec2-user-data-output/
exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1

# Here we call the bootstrap script to register the EKS worker node to the control plane.
# Maps tags to labels for tags with the specific label prefix defined in var.worker_label_prefix
# https://github.com/gruntwork-io/terraform-aws-eks/tree/master/modules/eks-scripts
function register_eks_worker {
  NODE_LABELS="ec2.amazonaws.com/as-label-env=dev2,ec2.amazonaws.com/as-label-type=paravision-processor_gpu"
  /etc/eks/bootstrap.sh \
    --apiserver-endpoint "https://C870147FDA923006BED90BC4DE7A2B34.gr7.us-east-2.eks.amazonaws.com" \
    --b64-cluster-ca "XXXXX" --kubelet-extra-args "--node-labels=\"$NODE_LABELS\"" \
    "saas-dev2-eks"
}

function run {
  register_eks_worker
}

run
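
For reference, the escaped quotes in the `--kubelet-extra-args` value above are part of the string that bootstrap.sh receives. Whether the newer bootstrap.sh handles these embedded quotes differently is not established here; this minimal sketch only shows the exact string that is passed, next to a variant without the embedded quotes:

```shell
#!/bin/bash
NODE_LABELS="ec2.amazonaws.com/as-label-env=dev2,ec2.amazonaws.com/as-label-type=paravision-processor_gpu"

# Same quoting as the user-data script: the escaped quotes become part of the value.
with_quotes="--node-labels=\"$NODE_LABELS\""
# Variant without the embedded quotes, for comparison.
without_quotes="--node-labels=$NODE_LABELS"

printf '%s\n' "$with_quotes"
printf '%s\n' "$without_quotes"
```

The first line printed contains literal double quotes around the label list; the second does not.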

Happy to attach more logs etc if you just let me know what you want. Hoping someone can help me!
