Bug #2045311 “1.25 bootstrap.sh from 1129+ breaks on max-pods se...” : Bugs : cloud-images

Revision history for this message

Thomas Bechtold (toabctl) wrote on 2023-11-30:

#1

Hi Joe,

thanks for reporting this bug. Could you share the exact call of the /etc/eks/bootstrap.sh script please?

Changed in cloud-images:
status:	New → Incomplete
importance:	Undecided → High
assignee:	nobody → Thomas Bechtold (toabctl)

Revision history for this message

joe miller (joeym) wrote on 2023-11-30:

#2

Download full text (6.6 KiB)

here is the `bash -x` output from the user-data script that called /etc/eks/bootstrap.sh

+ /etc/eks/bootstrap.sh mycluster-aws-useast1-2 --apiserver-endpoint https://REDACTED.yl4.us-east-1.eks.amazonaws.com --b64-cluster-ca TRUNCATED_B64_CA --use-max-pods false --kubelet-extra-args '--node-labels=karpenter.sh/capacity-type=on-demand,karpenter.sh/provisioner-name=c
atchall --register-with-taints=node.cilium.io/agent-not-ready=true:NoExecute --max-pods=110'

Using containerd as the container runtime
Aliasing EKS k8s snap commands
Added:
- kubelet-eks.kubelet as kubelet
Added:
- kubectl-eks.kubectl as kubectl
Stopping k8s daemons until configured
Stopped.
Cluster "kubernetes" set.
Container runtime is containerd
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5: resolving |--------------------------------------|
elapsed: 0.1 s total: 0.0 B (0.0 B/s)
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5: resolved |++++++++++++++++++++++++++++++++++++++|
index-sha256:529cf6b1b6e5b76e901abc43aee825badbd93f9c5ee5f1e316d46a83abbce5a2: downloading |--------------------------------------| 0.0 B/741.0 B
elapsed: 0.2 s total: 0.0 B (0.0 B/s)
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5: resolved |++++++++++++++++++++++++++++++++++++++|
index-sha256:529cf6b1b6e5b76e901abc43aee825badbd93f9c5ee5f1e316d46a83abbce5a2: done |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:666eebd093e91212426aeba3b89002911d2c981fefd8806b1a0ccb4f1b639a60: downloading |--------------------------------------| 0.0 B/526.0 B
elapsed: 0.3 s total: 741.0 (2.4 KiB/s)
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5: resolved |++++++++++++++++++++++++++++++++++++++|
index-sha256:529cf6b1b6e5b76e901abc43aee825badbd93f9c5ee5f1e316d46a83abbce5a2: done |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:666eebd093e91212426aeba3b89002911d2c981fefd8806b1a0ccb4f1b639a60: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:0692f38991d53a0c28679148f99de26a44d630fda984b41f63c5e19f839d15a6: downloading |--------------------------------------| 0.0 B/289.6 KiB
config-sha256:6996f8da07bd405c6f82a549ef041deda57d1d658ec20a78584f9f436c9a3bb7: downloading |--------------------------------------| 0.0 B/901.0 B
elapsed: 0.4 s total: 1.2 Ki (3.0 KiB/s)
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5: resolved |++++++++++++++++++++++++++++++++++++++|
index-sha256:529cf6b1b6e5b76e901abc43aee825badbd93f9c5ee5f1e316d46a83abbce5a2: done |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:666eebd093e91212426aeba3b89002911d2c981fefd8806b1a0ccb4f1b639a60: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:0692f38991d53a0c28679148f99de26a44d630fda984b41f63c5e19f839d15a6: downloading...

here is the `bash -x` output  from the user-data script that called /etc/eks/bootstrap.sh

+ /etc/eks/bootstrap.sh mycluster-aws-useast1-2 --apiserver-endpoint https://REDACTED.yl4.us-east-1.eks.amazonaws.com --b64-cluster-ca TRUNCATED_B64_CA --use-max-pods false --kubelet-extra-args '--node-labels=karpenter.sh/capacity-type=on-demand,karpenter.sh/provisioner-name=c
atchall --register-with-taints=node.cilium.io/agent-not-ready=true:NoExecute --max-pods=110'

Using containerd as the container runtime
Aliasing EKS k8s snap commands
Added:
  - kubelet-eks.kubelet as kubelet
Added:
  - kubectl-eks.kubectl as kubectl
Stopping k8s daemons until configured
Stopped.
Cluster "kubernetes" set.
Container runtime is containerd
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5: resolving      |--------------------------------------|
elapsed: 0.1 s                                              total:   0.0 B (0.0 B/s)
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5:                    resolved       |++++++++++++++++++++++++++++++++++++++|
index-sha256:529cf6b1b6e5b76e901abc43aee825badbd93f9c5ee5f1e316d46a83abbce5a2: downloading    |--------------------------------------|    0.0 B/741.0 B
elapsed: 0.2 s                                                                 total:   0.0 B (0.0 B/s)
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5:                       resolved       |++++++++++++++++++++++++++++++++++++++|
index-sha256:529cf6b1b6e5b76e901abc43aee825badbd93f9c5ee5f1e316d46a83abbce5a2:    done           |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:666eebd093e91212426aeba3b89002911d2c981fefd8806b1a0ccb4f1b639a60: downloading    |--------------------------------------|    0.0 B/526.0 B
elapsed: 0.3 s                                                                    total:  741.0  (2.4 KiB/s)
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5:                       resolved       |++++++++++++++++++++++++++++++++++++++|
index-sha256:529cf6b1b6e5b76e901abc43aee825badbd93f9c5ee5f1e316d46a83abbce5a2:    done           |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:666eebd093e91212426aeba3b89002911d2c981fefd8806b1a0ccb4f1b639a60: done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:0692f38991d53a0c28679148f99de26a44d630fda984b41f63c5e19f839d15a6:    downloading    |--------------------------------------|    0.0 B/289.6 KiB
config-sha256:6996f8da07bd405c6f82a549ef041deda57d1d658ec20a78584f9f436c9a3bb7:   downloading    |--------------------------------------|    0.0 B/901.0 B
elapsed: 0.4 s                                                                    total:  1.2 Ki (3.0 KiB/s)
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5:                       resolved       |++++++++++++++++++++++++++++++++++++++|
index-sha256:529cf6b1b6e5b76e901abc43aee825badbd93f9c5ee5f1e316d46a83abbce5a2:    done           |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:666eebd093e91212426aeba3b89002911d2c981fefd8806b1a0ccb4f1b639a60: done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:0692f38991d53a0c28679148f99de26a44d630fda984b41f63c5e19f839d15a6:    downloading    |--------------------------------------|    0.0 B/289.6 KiB
config-sha256:6996f8da07bd405c6f82a549ef041deda57d1d658ec20a78584f9f436c9a3bb7:   done           |++++++++++++++++++++++++++++++++++++++|
elapsed: 0.5 s                                                                    total:  2.1 Ki (4.1 KiB/s)
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5:                       resolved       |++++++++++++++++++++++++++++++++++++++|
index-sha256:529cf6b1b6e5b76e901abc43aee825badbd93f9c5ee5f1e316d46a83abbce5a2:    done           |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:666eebd093e91212426aeba3b89002911d2c981fefd8806b1a0ccb4f1b639a60: done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:0692f38991d53a0c28679148f99de26a44d630fda984b41f63c5e19f839d15a6:    done           |++++++++++++++++++++++++++++++++++++++|
config-sha256:6996f8da07bd405c6f82a549ef041deda57d1d658ec20a78584f9f436c9a3bb7:   done           |++++++++++++++++++++++++++++++++++++++|
elapsed: 0.6 s                                                                    total:  2.1 Ki (3.5 KiB/s)
nvidia-smi not found
Configuring kubelet snap
2023-11-30 18:21:26,956:__main__:INFO:No more changes in progress ...
2023-11-30 18:21:27,871:__main__:INFO:received '202/Accepted' from snapd for POST on /v2/snaps (change-id: 7)
2023-11-30 18:21:27,879:__main__:INFO:No more changes in progress ...
2023-11-30 18:21:27,880:__main__:INFO:result for change: {'id': '7', 'kind': 'refresh-snap', 'summary': 'Refresh all snaps: no updates', 'status': 'Done', 'ready': True, 'spawn-time': '2023-11-30T18:21:27.863112183Z', 'ready-time': '2023-11-30T18:21:27.863124863Z', 'data': {'
snap-names': []}}
2023-11-30 18:21:27,932:__main__:INFO:Setting kubelet-eks config to: {'cluster-dns': '172.20.0.10', 'container-runtime': 'remote', 'container-runtime-endpoint': 'unix:///run/containerd/containerd.sock', 'address': '0.0.0.0', 'anonymous-auth': 'false', 'authentication-token-we
bhook': 'true', 'authorization-mode': 'Webhook', 'cgroup-driver': 'cgroupfs', 'client-ca-file': '/etc/kubernetes/pki/ca.crt', 'cloud-provider': 'aws', 'cluster-domain': 'cluster.local', 'cni-bin-dir': '/opt/cni/bin', 'cni-conf-dir': '/etc/cni/net.d', 'config': '/etc/kubernete
s/kubelet/kubelet-config.json', 'kubeconfig': '/var/lib/kubelet/kubeconfig', 'node-ip': '10.8.56.177', 'network-plugin': 'cni', 'register-node': 'true', 'resolv-conf': '/run/systemd/resolve/resolv.conf', 'pod-infra-container-image': 'REDACTED.dkr.ecr.us-east-1.amazonaws.c
om/eks/pause:3.5', '--max-pods': '110'}
2023-11-30 18:21:27,970:__main__:INFO:received '202/Accepted' from snapd for PUT on /v2/snaps/kubelet-eks/conf (change-id: 8)
2023-11-30 18:21:27,979:__main__:INFO:Still 1 changes in progress ...
2023-11-30 18:21:32,984:__main__:INFO:No more changes in progress ...
2023-11-30 18:21:32,986:__main__:INFO:result for change: {'id': '8', 'kind': 'configure-snap', 'summary': 'Change configuration of "kubelet-eks" snap', 'status': 'Error', 'tasks': [{'id': '113', 'kind': 'run-hook', 'summary': 'Run configure hook of "kubelet-eks" snap', 'statu
s': 'Error', 'log': ['2023-11-30T18:21:27Z ERROR invalid option name: "--max-pods"'], 'progress': {'label': '', 'done': 1, 'total': 1}, 'spawn-time': '2023-11-30T18:21:27.933832076Z', 'ready-time': '2023-11-30T18:21:27.979626575Z'}], 'ready': True, 'err': 'cannot perform the
following tasks:\n- Run configure hook of "kubelet-eks" snap (invalid option name: "--max-pods")', 'spawn-time': '2023-11-30T18:21:27.933868886Z', 'ready-time': '2023-11-30T18:21:27.979627625Z'}

Revision history for this message

joe miller (joeym) wrote on 2023-11-30:

#3

We have heard from the Karpenter folks that they have seen this too in their E2E tests, FWIW

Thomas Bechtold (toabctl) on 2023-11-30

Changed in cloud-images:
status:	Incomplete → Confirmed

Revision history for this message

Robby Pocase (rpocase) wrote on 2023-11-30:

#4

For now, I've reverted the offending commit while we work out what was causing this specific failure. There should be new AMIs out tonight. We will update when they are available.

Revision history for this message

Thomas Bechtold (toabctl) wrote on 2023-12-01:

#5

Images with serial 20231201 should fix this problem. Closing this bug. Please reopen if you still have problems.

Changed in cloud-images:
status:	Confirmed → Fix Released

Revision history for this message

Jason Deal (jdeal) wrote on 2023-12-01:

#6

Karpenter is still seeing failures with the new AMI release, though it is with the 1.28 rather than the 1.25 AMI. This is with the following AMI: ubuntu-eks/k8s_1.28/images/hvm-ssd/ubuntu-focal-20.04-arm64-server-20231201.

The error in the logs is the same as before:
```
01T19:43:02Z ERROR invalid option name: "--max-pods"'], 'progress': {'label': '', 'done': 1, 'total': 1}, 'spawn-time': '2023-12-01T19:43:02.548711134Z', 'ready-time': '2023-12-01T19:43:02.568060204Z'}], 'ready': True, 'err': 'cannot perform the following tasks:\n- Run configure hook of "kubelet-eks" snap (invalid option name: "--max-pods")', 'spawn-time': '2023-12-01T19:43:02.54874168Z', 'ready-time': '2023-12-01T19:43:02.568060948Z'}

```
The call to `/etc/eks/bootstrap.sh`:
```
etc/eks/bootstrap.sh jmdeal-dev --apiserver-endpoint <redacted> --b64-cluster-ca <redacted> --dns-cluster-ip <redacted> --use-max-pods false --kubelet-extra-args '--node-labels="karpenter.sh/capacity-type=spot,karpenter.sh/nodepool=default" --max-pods=8'
```

Thomas Bechtold (toabctl) on 2023-12-04

Changed in cloud-images:
status:	Fix Released → Confirmed

Revision history for this message

Thomas Bechtold (toabctl) wrote on 2023-12-04:

#7

Sorry, the change we did wasn't picked up by the pipeline. We added now test coverage for kubelet-extra-args so this should be covered by CI now. New images will likely be published today.

Revision history for this message

Thomas Bechtold (toabctl) wrote on 2023-12-04:

#8

AMIs with serial 20231204.1 should fix this. Please let us know if you still see any problems.

Changed in cloud-images:
status:	Confirmed → Fix Released

Revision history for this message

Jason Deal (jdeal) wrote on 2023-12-04:

#9

Initial tests look good! I'll need to unpin the 11/28 AMI from our end-to-end tests to get some wider test coverage, I'll update here if there are any issues but I don't expect there to be.

cloud-images

1.25 bootstrap.sh from 1129+ breaks on max-pods setting

Bug Description

Other bug subscribers

Remote bug watches