New Ubuntu EKS images fail to join the cluster and label nodes

Bug #2045791 reported by Lorelei Rupp
This bug affects 2 people
Affects: cloud-images
Status: Fix Released
Importance: Critical
Assigned to: Thomas Bechtold
Milestone: (none)

Bug Description

We have an EKS cluster in AWS running Kubernetes 1.25.

I tried to connect managed node groups with the following base AMIs:

ubuntu-eks/k8s_1.25/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20231201 -- fails to label the node
ubuntu-eks/k8s_1.26/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20231201 -- fails to label the node
ubuntu-eks/k8s_1.26/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20231204.1 -- fails to join the cluster, with an error in user-data.log at line 506 of the bootstrap.sh script

I guess I am just unlucky in that I tried to roll a new Ubuntu AMI out to our cluster this week.

I believe these bugs are related: https://bugs.launchpad.net/cloud-images/+bug/2040477 and https://bugs.launchpad.net/cloud-images/+bug/2045311

As those reports call out, /etc/eks/bootstrap.sh has changed in these newer AMIs, and it has issues in all the versions I have tried.

At first it was just the labeling that was not working.

I could see the kubelet was simply not being started with the node labels.
On a working 1.24 image it looks like this:
$ ps -ef | grep kube
root 3833 1 1 13:33 ? 00:04:08 /snap/kubelet-eks/198/kubelet --node-labels=ec2.amazonaws.com/as-label-env=dev2,ec2.amazonaws.com/as-label-type=paravision-processor_gpu --address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook --authorization-mode=Webhook --cgroup-driver=cgroupfs --client-ca-file=/etc/kubernetes/pki/ca.crt --cloud-provider=aws --cluster-dns=172.20.0.10 --cluster-domain=cluster.local --config=/etc/kubernetes/kubelet/kubelet-config.json --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock --feature-gates=RotateKubeletServerCertificate=true --kubeconfig=/var/lib/kubelet/kubeconfig --node-ip=10.0.20.16 --pod-infra-container-image=xxx.dkr.ecr.us-east-2.amazonaws.com/eks/pause:3.5 --register-node --resolv-conf=/run/systemd/resolve/resolv.conf

Whereas on the first two images above, it shows this (note the missing --node-labels argument):

$ ps -ef | grep kube
root 4059 1 1 Dec05 ? 00:24:01 /snap/kubelet-eks/202/kubelet --address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook --authorization-mode=Webhook --cgroup-driver=cgroupfs --client-ca-file=/etc/kubernetes/pki/ca.crt --cloud-provider=aws --cluster-dns=172.20.0.10 --cluster-domain=cluster.local --config=/etc/kubernetes/kubelet/kubelet-config.json --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock --kubeconfig=/var/lib/kubelet/kubeconfig --node-ip=10.0.21.117 --pod-infra-container-image=xxx.dkr.ecr.us-east-2.amazonaws.com/eks/pause:3.5 --register-node --resolv-conf=/run/systemd/resolve/resolv.conf
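
In case it helps anyone else compare images, here is a small sketch of my own (assuming a Linux host, reading /proc directly) that reports whether the running kubelet received a --node-labels flag:

import glob
import os

# Scan /proc for kubelet processes and report their --node-labels
# flag, if any was passed on the command line.
for path in glob.glob("/proc/[0-9]*/cmdline"):
    try:
        with open(path, "rb") as f:
            args = f.read().split(b"\0")
    except OSError:
        continue  # the process exited while we were scanning
    if args and b"kubelet" in os.path.basename(args[0]):
        labels = [a.decode() for a in args if a.startswith(b"--node-labels")]
        print(args[0].decode(), labels or "no --node-labels flag")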

-------------------
After seeing those bug reports I grabbed the latest AMI just now, and that one doesn't even connect to our cluster.

Here is the user-data.log error:
....
2023-12-06 16:15:26,674:__main__:INFO:No more changes in progress ...
2023-12-06 16:15:26,676:__main__:INFO:result for change: {'id': '28', 'kind': 'configure-snap', 'summary': 'Change configuration of "kubelet-eks" snap', 'status': 'Done', 'tasks': [{'id': '151', 'kind': 'run-hook', 'summary': 'Run configure hook of "kubelet-eks" snap', 'status': 'Done', 'progress': {'label': '', 'done': 1, 'total': 1}, 'spawn-time': '2023-12-06T16:15:25.546818263Z', 'ready-time': '2023-12-06T16:15:26.659445389Z'}], 'ready': True, 'spawn-time': '2023-12-06T16:15:25.546834552Z', 'ready-time': '2023-12-06T16:15:26.659446614Z'}
usage: snapdhelper.py configure [-h] snapname key value
snapdhelper.py configure: error: the following arguments are required: value
Exited with error on line 506

-----------------
Also, our user-data script is the same for all of these AMIs; for example, it looks like this:

#!/bin/bash
#
# This script is meant to be run in the User Data of each EKS worker instance that hosts applications. It registers the
# instance with the proper EKS cluster based on data provided by Terraform. Note that this script assumes it is running
# from an AMI that is derived from the EKS optimized AMIs that AWS provides.

set -e

# Send the log output from this script to user-data.log, syslog, and the console
# From: https://alestic.com/2010/12/ec2-user-data-output/
exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1

# Here we call the bootstrap script to register the EKS worker node to the control plane.
# Maps tags to labels for tags with the specific label prefix defined in var.worker_label_prefix
# https://github.com/gruntwork-io/terraform-aws-eks/tree/master/modules/eks-scripts
function register_eks_worker {
  NODE_LABELS="ec2.amazonaws.com/as-label-env=dev2,ec2.amazonaws.com/as-label-type=paravision-processor_gpu"
  /etc/eks/bootstrap.sh \
    --apiserver-endpoint "https://C870147FDA923006BED90BC4DE7A2B34.gr7.us-east-2.eks.amazonaws.com" \
    --b64-cluster-ca "XXXXX" --kubelet-extra-args "--node-labels=\"$NODE_LABELS\"" \
    "saas-dev2-eks"
}

function run {
  register_eks_worker
}

run

Happy to attach more logs etc. if you let me know what you want. Hoping someone can help me!

Tags: cpc-3546
description: updated
Revision history for this message
Lorelei Rupp (loreleirupp) wrote:

Reverting to ubuntu-eks/k8s_1.26/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20231101 does not have any issues: it connects to the cluster AND labels properly.

Revision history for this message
Robby Pocase (rpocase) wrote:

@Lorelei thanks for the detailed bug report! We'll follow up ASAP. This seems like a case that we don't have good coverage for with the new bootstrap method. We'll get that addressed and fixed soon.

Changed in cloud-images:
importance: Undecided → Critical
status: New → Confirmed
tags: added: cpc-3546
Revision history for this message
Thomas Bechtold (toabctl) wrote:

Looks like we've run into a long-standing CPython bug here with our new snapdhelper.py script: https://github.com/python/cpython/issues/58572
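
To illustrate, here is a minimal self-contained sketch of the failure mode. The parser below is hypothetical, modelled only on the usage line in the error log ("snapdhelper.py configure [-h] snapname key value"); argparse classifies any argument that starts with "-" as an option, so the positional "value" never gets filled:

import argparse

# Hypothetical stand-in for the "configure" subcommand parser of
# snapdhelper.py, based on the usage line in the error log.
parser = argparse.ArgumentParser(prog="snapdhelper.py configure")
parser.add_argument("snapname")
parser.add_argument("key")
parser.add_argument("value")

# The value starts with "-", so argparse treats it as an (unknown)
# option and the required positional "value" is left unfilled:
parser.parse_args(["kubelet-eks", "kubelet-extra-args",
                   "--node-labels=foo"])
# -> snapdhelper.py configure: error: the following arguments are
#    required: value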

Revision history for this message
Thomas Bechtold (toabctl) wrote:

@Lorelei, as a workaround you can add a space before --node-labels. So:

--kubelet-extra-args " --node-labels=\"$NODE_LABELS\""

should work. Could you try that?
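
For reference, a quick check with the same hypothetical parser as above showing why the leading space helps: a value that begins with a space no longer starts with "-", so argparse accepts it as a plain positional:

import argparse

# Same hypothetical stand-in parser as in the previous comment.
parser = argparse.ArgumentParser(prog="snapdhelper.py configure")
parser.add_argument("snapname")
parser.add_argument("key")
parser.add_argument("value")

args = parser.parse_args(["kubelet-eks", "kubelet-extra-args",
                          " --node-labels=foo"])
print(args.value)  # prints ' --node-labels=foo'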

Revision history for this message
chaitanya (vchetu) wrote:

The workaround with the extra space works:

--kubelet-extra-args " --node-labels=\"$NODE_LABELS\""

Changed in cloud-images:
assignee: nobody → Thomas Bechtold (toabctl)
status: Confirmed → In Progress
Revision history for this message
Thomas Bechtold (toabctl) wrote:

New images have been released (serial 20231213.1) that should fix the problem. I'll close the bug; please reopen it if you still run into this issue.

Changed in cloud-images:
status: In Progress → Fix Released