EKS: mark pause-container as pinned in containerd

Bug #2061187 reported by Robby Pocase
This bug affects 6 people
Affects: cloud-images
Status: New
Importance: Medium
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

LP:2060537 [0] pointed out that in EKS 1.29 the pause container is being pruned by garbage collection. A workaround has been provided in the form of a periodic cron poll. Longer term, we should update the containerd configuration to pin the pause container [1] at launch, which will prevent it from being pruned at all.

[0] - https://bugs.launchpad.net/cloud-images/+bug/2060537
[1] - https://github.com/containerd/containerd/issues/6352
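
For illustration, here is a minimal sketch of what pinning could look like from the command line. It assumes a containerd version whose CRI plugin honors the `io.cri-containerd.pinned=pinned` image label; the function names are hypothetical and are not part of the EKS bootstrap scripts. The second function only prints the `ctr` command so it can be reviewed before being run as root.

```shell
# Sketch only: function names are hypothetical, not part of the EKS scripts.

# Print the sandbox_image value configured in a containerd config.toml.
sandbox_image_from_config() {
  sed -n -E 's/^[[:space:]]*sandbox_image[[:space:]]*=[[:space:]]*"([^"]+)".*$/\1/p' "$1" | head -n 1
}

# Print (rather than run) the ctr command that labels an image as pinned.
# Assumption: the CRI plugin reports images carrying this label as pinned,
# so image garbage collection leaves them alone.
pin_image_command() {
  printf 'ctr --namespace k8s.io images label %s io.cri-containerd.pinned=pinned\n' "$1"
}
```

Running `pin_image_command "$(sandbox_image_from_config /etc/containerd/config.toml)"` prints the command for whatever sandbox image the node is configured with.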

Mike (msarfaty-twilio) wrote:

Can we get any update here? The workaround is fine, but with highly multi-tenant and short-lived workloads we still encounter periods where the pause container is garbage collected.

Edit: additionally, we tried to shim in a public image by adding `[env, PAUSE_CONTAINER_IMAGE=registry.k8s.io/pause, /etc/eks/bootstrap.sh, ...]` to our cloud-init config, but the script that pulls the image down fails because it expects the image to be in ECR, not in an arbitrary registry.

Thomas Bechtold (toabctl) wrote :

Hi Mike,

- Which image exactly are you using when you hit this problem? (I wonder whether it happens only with Focal-based images, or also with Jammy-based images.)
- Can you share the error you saw when you set a different pause container image, please?

Mike (msarfaty-twilio) wrote :

Hey Thomas, thanks for the response. I was out at the end of last week, sorry for the delay.

Here's the output where the script errors (running with -x):
```
+ [[ containerd = \c\o\n\t\a\i\n\e\r\d ]]
+ echo 'Container runtime is containerd'
Container runtime is containerd
+ mkdir -p /etc/systemd/system/containerd.service.d
+ cat
+ systemctl daemon-reload
+ sed s,SANDBOX_IMAGE,registry.k8s.io/pause:3.5,g
+ systemctl restart containerd
+ /usr/local/share/eks/pull-sandbox-image.sh

Provided region_name '5' doesn't match a supported format.
Attempt 1 of 3

Provided region_name '5' doesn't match a supported format.
Attempt 2 of 3

Provided region_name '5' doesn't match a supported format.
Attempt 3 of 3

Provided region_name '5' doesn't match a supported format.
Unable to retrieve the ECR password.
++ err_report 410
++ echo 'Exited with error on line 410'
Exited with error on line 410
```

and the pull-sandbox-image script with -x:
```
bash -x /usr/local/share/eks/pull-sandbox-image.sh
+ set -euo pipefail
+ source /dev/fd/63
++ grep sandbox_image /etc/containerd/config.toml
++ tr -d ' '
++ sandbox_image=registry.k8s.io/pause:3.5
++ sudo ctr --namespace k8s.io image ls
++ grep registry.k8s.io/pause:3.5
+ [[ '' != '' ]]
+ /etc/eks/containerd/pull-image.sh registry.k8s.io/pause:3.5

Provided region_name '5' doesn't match a supported format.
```

and lastly, `pull-image.sh` with -x:
```
bash -x /etc/eks/containerd/pull-image.sh registry.k8s.io/pause:3.5
+ img=registry.k8s.io/pause:3.5
++ echo registry.k8s.io/pause:3.5
++ cut -f4 -d .
+ region=5
+ MAX_RETRIES=3
++ retry aws ecr get-login-password --region 5
++ local rc=0
+++ seq 0 3
++ for attempt in $(seq 0 $MAX_RETRIES)
++ rc=0
++ [[ 0 -gt 0 ]]
++ aws ecr get-login-password --region 5

Provided region_name '5' doesn't match a supported format.
++ rc=255
++ [[ 255 -eq 0 ]]
++ [[ 0 -eq 3 ]]
++ local jitter=1
++ local sleep_sec=11
++ sleep 11
```

You can see that neither script checks that the sandbox image actually comes from a private ECR registry before attempting the pull. I think this would also fail with the ECR public image, since there is no region name in `public.ecr.aws...`. This chain fails while the bootstrap script is running with -e, which cuts the script short, so the kubelet is never started.
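
To make the failure concrete: `cut -f4 -d .` on `registry.k8s.io/pause:3.5` yields `5`, which is what ends up in `--region`. A rough sketch of the kind of guard that seems to be missing is below. The function names are hypothetical, the account ID in the usage note is a placeholder, and the pattern only covers the common private ECR hostname form (it does not try to handle FIPS endpoints, for example). It also assumes that images from other registries can be pulled without an ECR login.

```shell
# Sketch only: function names are hypothetical, not part of the EKS scripts.

# Print the region for a private ECR image reference; fail for anything else.
ecr_region_of() {
  local region
  region=$(printf '%s\n' "$1" |
    sed -n -E 's#^[0-9]{12}\.dkr\.ecr\.([a-z0-9-]+)\.amazonaws\.com(\.cn)?/.*$#\1#p')
  [ -n "$region" ] || return 1
  printf '%s\n' "$region"
}

# Print how an image would be pulled; this only reports the decision.
# Assumption: non-ECR registries need no ECR password.
pull_plan() {
  local region
  if region=$(ecr_region_of "$1"); then
    printf 'ecr-login region=%s\n' "$region"
  else
    printf 'anonymous-pull\n'
  fi
}
```

With this, `pull_plan 123456789012.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5` reports an ECR login for `us-west-2`, while `registry.k8s.io/pause:3.5` and `public.ecr.aws/...` references fall through to the plain pull path instead of calling `aws ecr get-login-password` with a bogus region.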

Mike (msarfaty-twilio) wrote :

The AMI we are using is the focal one, `ami-09971b6dfd1dcac90`

ubuntu-eks/k8s_1.29/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20240522

Mike (msarfaty-twilio) wrote :

Hey Thomas, do you need any other info from us here?

Robby Pocase (rpocase) wrote :

@Mike apologies for the delay! We have a fix in place for focal that is in images newer than the one you are using. Could you try the latest 1.29 focal image (or at least a serial newer than 20270712)?

Robby Pocase (rpocase) wrote :

Probably obvious, but the serial listed in the last reply should have been `20240712`.
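
For anyone checking their own nodes, a small sketch of comparing an image serial against `20240712` follows. It assumes the serial is the trailing `YYYYMMDD` component of the image name, as in the name quoted earlier in this thread, and it reads "newer" as strictly greater. The function names are hypothetical.

```shell
# Sketch only: function names are hypothetical.

# Print the trailing serial of an image name (everything after the last "-").
image_serial() {
  printf '%s\n' "${1##*-}"
}

# Succeed when the image's serial is strictly newer than the given serial.
serial_newer_than() {
  [ "$(image_serial "$1")" -gt "$2" ]
}
```

For example, `serial_newer_than ubuntu-eks/k8s_1.29/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20240522 20240712` fails, which matches the image reported above predating the fix.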
