1.25 bootstrap.sh from 1129+ breaks on max-pods setting

Bug #2045311 reported by joe miller
This bug affects 4 people

Affects: cloud-images
Status: Fix Released
Importance: High
Assigned to: Thomas Bechtold

Bug Description

We are seeing EC2 nodes fail to become Ready due to errors in the bootstrap.sh script.

The error from /var/log/user-data.log:
```
2023-11-30 18:21:32,986:__main__:INFO:result for change: {'id': '8', 'kind': 'configure-snap', 'summary': 'Change configuration of "kubelet-eks" snap', 'status': 'Error', 'tasks': [{'id': '113', 'kind': 'run-hook', 'summary': 'Run configure hook of "kubelet-eks" snap', 'status': 'Error', 'log': ['2023-11-30T18:21:27Z ERROR invalid option name: "--max-pods"'], 'progress': {'label': '', 'done': 1, 'total': 1}, 'spawn-time': '2023-11-30T18:21:27.933832076Z', 'ready-time': '2023-11-30T18:21:27.979626575Z'}], 'ready': True, 'err': 'cannot perform the following tasks:\n- Run configure hook of "kubelet-eks" snap (invalid option name: "--max-pods")', 'spawn-time': '2023-11-30T18:21:27.933868886Z', 'ready-time': '2023-11-30T18:21:27.979627625Z'}
```

Images we have tried:
- GOOD: amazon/ubuntu-eks/k8s_1.25/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20231128
- BAD: amazon/ubuntu-eks/k8s_1.25/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20231129
- BAD: amazon/ubuntu-eks/k8s_1.25/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20231130

The error appears to come from a change in bootstrap.sh. If we replace bootstrap.sh with the version from before the switch to `snapd-helper.py`, things work fine.
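For what it's worth, our reading of the error (a guess, not the actual bootstrap.sh/snapd-helper.py logic) is that the kubelet extra args are being forwarded to the snap's configuration verbatim, and snap option names may not begin with dashes. A minimal sketch of the kind of rewriting that would be needed, using the argument values from our node config:

```
# Illustrative sketch only -- not the real bootstrap.sh/snapd-helper.py code.
# Snap configuration keys cannot start with "-", so kubelet-style flags
# would need their leading dashes stripped before being handed to `snap set`.
KUBELET_EXTRA_ARGS='--max-pods=110 --register-with-taints=node.cilium.io/agent-not-ready=true:NoExecute'

for arg in $KUBELET_EXTRA_ARGS; do
    key="${arg%%=*}"    # e.g. "--max-pods"
    value="${arg#*=}"   # e.g. "110"
    key="${key#--}"     # "max-pods" -- acceptable as a snap option name
    echo "snap set kubelet-eks ${key}=${value}"
done
```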

More details here: https://gist.github.com/joemiller/22b6ac4d60910e9c997957be09504c99

Revision history for this message
Thomas Bechtold (toabctl) wrote :

Hi Joe,

Thanks for reporting this bug. Could you share the exact call to the /etc/eks/bootstrap.sh script, please?

Changed in cloud-images:
status: New → Incomplete
importance: Undecided → High
assignee: nobody → Thomas Bechtold (toabctl)
Revision history for this message
joe miller (joeym) wrote :

Here is the `bash -x` output from the user-data script that called /etc/eks/bootstrap.sh:

```
+ /etc/eks/bootstrap.sh mycluster-aws-useast1-2 --apiserver-endpoint https://REDACTED.yl4.us-east-1.eks.amazonaws.com --b64-cluster-ca TRUNCATED_B64_CA --use-max-pods false --kubelet-extra-args '--node-labels=karpenter.sh/capacity-type=on-demand,karpenter.sh/provisioner-name=catchall --register-with-taints=node.cilium.io/agent-not-ready=true:NoExecute --max-pods=110'

Using containerd as the container runtime
Aliasing EKS k8s snap commands
Added:
  - kubelet-eks.kubelet as kubelet
Added:
  - kubectl-eks.kubectl as kubectl
Stopping k8s daemons until configured
Stopped.
Cluster "kubernetes" set.
Container runtime is containerd
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5: resolving |--------------------------------------|
elapsed: 0.1 s total: 0.0 B (0.0 B/s)
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5: resolved |++++++++++++++++++++++++++++++++++++++|
index-sha256:529cf6b1b6e5b76e901abc43aee825badbd93f9c5ee5f1e316d46a83abbce5a2: downloading |--------------------------------------| 0.0 B/741.0 B
elapsed: 0.2 s total: 0.0 B (0.0 B/s)
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5: resolved |++++++++++++++++++++++++++++++++++++++|
index-sha256:529cf6b1b6e5b76e901abc43aee825badbd93f9c5ee5f1e316d46a83abbce5a2: done |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:666eebd093e91212426aeba3b89002911d2c981fefd8806b1a0ccb4f1b639a60: downloading |--------------------------------------| 0.0 B/526.0 B
elapsed: 0.3 s total: 741.0 (2.4 KiB/s)
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5: resolved |++++++++++++++++++++++++++++++++++++++|
index-sha256:529cf6b1b6e5b76e901abc43aee825badbd93f9c5ee5f1e316d46a83abbce5a2: done |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:666eebd093e91212426aeba3b89002911d2c981fefd8806b1a0ccb4f1b639a60: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:0692f38991d53a0c28679148f99de26a44d630fda984b41f63c5e19f839d15a6: downloading |--------------------------------------| 0.0 B/289.6 KiB
config-sha256:6996f8da07bd405c6f82a549ef041deda57d1d658ec20a78584f9f436c9a3bb7: downloading |--------------------------------------| 0.0 B/901.0 B
elapsed: 0.4 s total: 1.2 Ki (3.0 KiB/s)
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5: resolved |++++++++++++++++++++++++++++++++++++++|
index-sha256:529cf6b1b6e5b76e901abc43aee825badbd93f9c5ee5f1e316d46a83abbce5a2: done |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:666eebd093e91212426aeba3b89002911d2c981fefd8806b1a0ccb4f1b639a60: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:0692f38991d53a0c28679148f99de26a44d630fda984b41f63c5e19f839d15a6: downloading...

```

Revision history for this message
joe miller (joeym) wrote :

We have heard from the Karpenter folks that they have seen this too in their E2E tests, FWIW.

Changed in cloud-images:
status: Incomplete → Confirmed
Revision history for this message
Robby Pocase (rpocase) wrote :

For now, I've reverted the offending commit while we work out what was causing this specific failure. There should be new AMIs out tonight. We will update when they are available.

Revision history for this message
Thomas Bechtold (toabctl) wrote :

Images with serial 20231201 should fix this problem. Closing this bug. Please reopen if you still have problems.

Changed in cloud-images:
status: Confirmed → Fix Released
Revision history for this message
Jason Deal (jdeal) wrote :

Karpenter is still seeing failures with the new AMI release, though it is with the 1.28 rather than the 1.25 AMI. This is with the following AMI: ubuntu-eks/k8s_1.28/images/hvm-ssd/ubuntu-focal-20.04-arm64-server-20231201.

The error in the logs is the same as before:
```
01T19:43:02Z ERROR invalid option name: "--max-pods"'], 'progress': {'label': '', 'done': 1, 'total': 1}, 'spawn-time': '2023-12-01T19:43:02.548711134Z', 'ready-time': '2023-12-01T19:43:02.568060204Z'}], 'ready': True, 'err': 'cannot perform the following tasks:\n- Run configure hook of "kubelet-eks" snap (invalid option name: "--max-pods")', 'spawn-time': '2023-12-01T19:43:02.54874168Z', 'ready-time': '2023-12-01T19:43:02.568060948Z'}
```
The call to `/etc/eks/bootstrap.sh`:
```
/etc/eks/bootstrap.sh jmdeal-dev --apiserver-endpoint <redacted> --b64-cluster-ca <redacted> --dns-cluster-ip <redacted> --use-max-pods false --kubelet-extra-args '--node-labels="karpenter.sh/capacity-type=spot,karpenter.sh/nodepool=default" --max-pods=8'
```

Changed in cloud-images:
status: Fix Released → Confirmed
Revision history for this message
Thomas Bechtold (toabctl) wrote :

Sorry, the change we made wasn't picked up by the pipeline. We have now added test coverage for kubelet-extra-args, so this case should be caught by CI going forward. New images will likely be published today.
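Roughly, the new check boots a node from the candidate image and verifies that a flag passed via --kubelet-extra-args actually reaches the running kubelet. A simplified sketch (not the actual test code; $ENDPOINT and $CLUSTER_CA are placeholders):

```
# Simplified sketch of the new coverage, not the actual test code.
# $ENDPOINT and $CLUSTER_CA are placeholders for the test cluster's values.
/etc/eks/bootstrap.sh test-cluster \
    --apiserver-endpoint "$ENDPOINT" \
    --b64-cluster-ca "$CLUSTER_CA" \
    --use-max-pods false \
    --kubelet-extra-args '--max-pods=110'

# Fail if the flag never reached the running kubelet.
pgrep -af kubelet | grep -q -- '--max-pods=110'
```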

Revision history for this message
Thomas Bechtold (toabctl) wrote :

AMIs with serial 20231204.1 should fix this. Please let us know if you still see any problems.
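If you want to double-check on your side, the pod capacity reported by the node should match the --max-pods value passed via --kubelet-extra-args; for example (the node name below is just a placeholder):

```
# Example check; the node name is a placeholder.
kubectl get node ip-10-0-0-1.ec2.internal \
    -o jsonpath='{.status.capacity.pods}{"\n"}'
```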

Changed in cloud-images:
status: Confirmed → Fix Released
Revision history for this message
Jason Deal (jdeal) wrote :

Initial tests look good! I'll need to unpin the 11/28 AMI from our end-to-end tests to get wider coverage; I'll update here if there are any issues, but I don't expect any.
