Bug #2061187 “EKS: mark pause-container as pinned in containerd” : Bugs : cloud-images

Mike (msarfaty-twilio) wrote on 2024-07-10 (last edit on 2024-07-10):

#1

Can we get any update here? The workaround is fine, but with highly multi-tenant and short-lived workloads we still encounter periods where the pause container is garbage collected.

edit: additionally, we tried to shim in a public image by adding `[env, PAUSE_CONTAINER_IMAGE=registry.k8s.io/pause, /etc/eks/bootstrap.sh, ...]` to our cloud-init config, but the script that pulls the image down errors out because it expects the image to be in ECR, not in an arbitrary registry.

Thomas Bechtold (toabctl) wrote on 2024-07-10:

#2

Hi Mike,

- Which image exactly are you using when you hit this problem? (I wonder whether it happens only with Focal-based images, or also with Jammy-based images.)
- Can you please share the error you saw when you set a different pause container image?

Mike (msarfaty-twilio) wrote on 2024-07-16:

#3

Hey Thomas, thanks for the response. I was out at the end of last week, sorry for the delay.

Here's the output where the script errors (running with -x):
```
+ [[ containerd = \c\o\n\t\a\i\n\e\r\d ]]
+ echo 'Container runtime is containerd'
Container runtime is containerd
+ mkdir -p /etc/systemd/system/containerd.service.d
+ cat
+ systemctl daemon-reload
+ sed s,SANDBOX_IMAGE,registry.k8s.io/pause:3.5,g
+ systemctl restart containerd
+ /usr/local/share/eks/pull-sandbox-image.sh

Provided region_name '5' doesn't match a supported format.
Attempt 1 of 3

Provided region_name '5' doesn't match a supported format.
Attempt 2 of 3

Provided region_name '5' doesn't match a supported format.
Attempt 3 of 3

Provided region_name '5' doesn't match a supported format.
Unable to retrieve the ECR password.
++ err_report 410
++ echo 'Exited with error on line 410'
Exited with error on line 410
```

and here is `pull-sandbox-image.sh` run with -x:
```
bash -x /usr/local/share/eks/pull-sandbox-image.sh
+ set -euo pipefail
+ source /dev/fd/63
++ grep sandbox_image /etc/containerd/config.toml
++ tr -d ' '
++ sandbox_image=registry.k8s.io/pause:3.5
++ sudo ctr --namespace k8s.io image ls
++ grep registry.k8s.io/pause:3.5
+ [[ '' != '' ]]
+ /etc/eks/containerd/pull-image.sh registry.k8s.io/pause:3.5

Provided region_name '5' doesn't match a supported format.
```

and lastly, `pull-image.sh` run with -x:
```
bash -x /etc/eks/containerd/pull-image.sh registry.k8s.io/pause:3.5
+ img=registry.k8s.io/pause:3.5
++ echo registry.k8s.io/pause:3.5
++ cut -f4 -d .
+ region=5
+ MAX_RETRIES=3
++ retry aws ecr get-login-password --region 5
++ local rc=0
+++ seq 0 3
++ for attempt in $(seq 0 $MAX_RETRIES)
++ rc=0
++ [[ 0 -gt 0 ]]
++ aws ecr get-login-password --region 5

Provided region_name '5' doesn't match a supported format.
++ rc=255
++ [[ 255 -eq 0 ]]
++ [[ 0 -eq 3 ]]
++ local jitter=1
++ local sleep_sec=11
++ sleep 11
```

You can see that neither script checks that the sandbox image actually comes from a private ECR registry before attempting the pull: `pull-image.sh` takes the fourth dot-separated field of the image name as the region, which for `registry.k8s.io/pause:3.5` is `5`. I think this would also fail with the ECR public image, since there is no region name in `public.ecr.aws...`. The chain fails while the bootstrap script is running with -e, which cuts the script short, so the kubelet is never started.
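
For illustration, here is a minimal sketch of the kind of check that seems to be missing. It is not the actual EKS script: the function name `ecr_region_for_image` is hypothetical, and the hostname pattern assumes the usual `<account>.dkr.ecr.<region>.amazonaws.com` form for private ECR (FIPS endpoints and other variants are not handled). It only extracts a region when the registry host looks like private ECR, and otherwise signals that no ECR login should be attempted.

```shell
# Hypothetical helper, not part of the EKS scripts: print the region only
# when the image host looks like a private ECR registry.
ecr_region_for_image() {
  host=${1%%/*}   # strip the repository path and tag, keep the registry host
  case "$host" in
    *.dkr.ecr.*.amazonaws.com | *.dkr.ecr.*.amazonaws.com.cn)
      # <account>.dkr.ecr.<region>.amazonaws.com -> field 4 is the region
      printf '%s\n' "$host" | cut -f4 -d .
      ;;
    *)
      return 1    # not private ECR: no region to extract, no ECR login needed
      ;;
  esac
}

img=registry.k8s.io/pause:3.5
if region=$(ecr_region_for_image "$img"); then
  echo "private ECR image, region: $region"
else
  echo "not a private ECR image, skipping ECR login: $img"
fi
```

With a guard like this, `registry.k8s.io/pause:3.5` and `public.ecr.aws...` images would take the second branch instead of passing `5` (or some other non-region string) to `aws ecr get-login-password --region`.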