EKS AMI Failing due to Pause Image 403, not Pinned

Bug #2060537 reported by Shaped Technologies
This bug affects 4 people
Affects: cloud-images
Status: Fix Released
Importance: High
Assigned to: Robby Pocase

Bug Description

Hi,

The latest Ubuntu EKS AMI (I'm using ca-central) has the same issue the AL2 AMI had recently: https://github.com/awslabs/amazon-eks-ami/issues/1597

The pause/sandbox image is being garbage-collected (GC'd) because it is not properly pinned.

Once the image is GC'd, pods can no longer be deployed on the affected node, rendering that node SNAFU.
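
For anyone triaging an affected node, a quick check looks roughly like this (a sketch only; the label shown is what containerd 1.7.3+ applies, and the exact output format varies by version):

```
# List the pause/sandbox image as containerd sees it. On containerd >= 1.7.3
# the CRI plugin labels it io.cri-containerd.pinned=pinned, which is what
# keeps image garbage collection from deleting it; on older versions that
# label (and the protection) is simply absent.
sudo ctr -n k8s.io images ls | grep pause
```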

Tags: cpc-4128
Robby Pocase (rpocase)
tags: added: cpc-4128
Revision history for this message
Robby Pocase (rpocase) wrote :

Thanks for filing this! We are actively working on a fix now and hope to have new images out within the next couple days.

Changed in cloud-images:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Robby Pocase (rpocase)
Revision history for this message
Shaped Technologies (jason-shaped) wrote :

Thanks; I was hoping it was already known since it popped up on the AWS side, but it's been an issue since 1.29 and hadn't been fixed in any of the newer AMIs, so I figured I'd drop a note.

I think that perhaps, with the AL2 AMI, part of the issue was this:

> I think the issue here is the version of containerd being used by Amazon Linux does not have pinned image support, which was added in 1.7.3: <email address hidden>

Looks like the Ubuntu AMI I'm on is currently using 1.7.2; I'm not sure how this was handled previously, but the fix might be as simple as bumping to 1.7.3+.
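
For reference, this is roughly how I checked on the node (a sketch; the package name is my assumption about what the AMI ships):

```
# Version of containerd actually running on the node; pinned-image support
# landed upstream in 1.7.3, so anything older leaves the pause image
# eligible for garbage collection.
containerd --version

# Which package/version the Ubuntu archive is currently serving:
apt-cache policy containerd
```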

Anyhow, thanks for getting on it. I'll keep an eye on https://cloud-images.ubuntu.com/docs/aws/eks/ for a new image (unless there's somewhere better to watch?).

Revision history for this message
Robby Pocase (rpocase) wrote :

@Shaped you're correct, bumping the version would definitely be required for a full fix. I've reviewed the packages (containerd, containerd-app) in jammy-updates and they definitely do not include the relevant commit [0]. I'm not entirely sure whether we'll be able to backport this, but I am talking to internal resources to see if it's an option. Otherwise, we'll likely only be able to provide a band-aid for this in Jammy/Focal.

[0] - https://github.com/containerd/containerd/commit/699d6701ae17bd3c12a7f86f5d9470cedf210169

Revision history for this message
Robby Pocase (rpocase) wrote :

I've gotten word internally that the pinned-image fix is in flight, but it will take some time to land (likely after noble). I'll keep this updated as I hear more. We'll also pursue a workaround that keeps the pause image from being permanently garbage-collected.

Revision history for this message
Shaped Technologies (jason-shaped) wrote (last edit ):

I'm curious what prevented it from being GC'd in prior versions, if pinning is a new thing...? Not something I'd looked into until it broke ;)

Another potential workaround is to point the sandbox image at one that doesn't require auth, such as `public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest`, but I'm not 100% sure whether that's configured on the image side or the EKS side.
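
If it is on the image side, it might look something like this (a rough sketch only, assuming the default /etc/containerd/config.toml location; the EKS bootstrap may template that file differently):

```
# Show where the sandbox image is configured today:
grep -n 'sandbox_image' /etc/containerd/config.toml

# Point the CRI plugin at the auth-free EKS Distro mirror and restart
# containerd, so a re-pull after GC no longer needs ECR credentials:
sudo sed -i 's|sandbox_image = .*|sandbox_image = "public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest"|' /etc/containerd/config.toml
sudo systemctl restart containerd
```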

Best of luck figuring something out!

Revision history for this message
Robby Pocase (rpocase) wrote (last edit ):

I just wanted to give a brief update. We've got an internal workaround that is waiting for review (effectively just a cron entry that periodically pulls the container). I expect it to be merged tomorrow, in time for the 20240411 serials (for both the Jammy and Focal versions).
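
For illustration, such an entry could look roughly like the sketch below; the actual implementation in the AMI, as well as the registry account, region, and tag, may all differ:

```
# Hypothetical /etc/cron.d entry (illustrative only): periodically re-pull
# the pause image with fresh ECR credentials so image garbage collection
# can never leave the node without a pullable sandbox image.
*/15 * * * * root /usr/bin/crictl pull --creds "AWS:$(/usr/bin/aws ecr get-login-password --region ca-central-1)" 602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5
```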

Revision history for this message
Robby Pocase (rpocase) wrote :

Unfortunately we seem to have missed the 20240411 window. I'll check back tomorrow and make sure the workaround is in place.

Changed in cloud-images:
status: Confirmed → In Progress
Revision history for this message
Shaped Technologies (jason-shaped) wrote :

Probably not the most ideal workaround, but I understand choosing that path over my suggestions given your responsibilities and everything; the `public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest` image is still hosted on AWS, though, so that should arguably be fine.

Has the cron job been fully tested? When we hit the bug, the issue wasn't specifically that the image was missing, but that the credentials required to re-pull it weren't available, which resulted in a 403 on the re-pull. Hopefully you've considered that in the fix; I just wanted to mention it so all bases are covered.

Too bad the 20240411 window was missed, but I'll keep checking back. Thanks for the updates and the quick work!

Revision history for this message
Andrea Scarpino (ilpianista) wrote (last edit ):

We also ran into this problem (though we got a 401). It happened on 2 of 30 nodes, which still had 25% of their disk space free.

Our workaround has been to create a DaemonSet which uses that image.
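
A minimal sketch of that approach (names, namespace, and the image tag are illustrative; the image reference could equally be the regional ECR pause image):

```
# Run a tiny pod on every node that references the pause image directly,
# so kubelet's image GC always considers it in use.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: pause-image-keeper
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: pause-image-keeper
  template:
    metadata:
      labels:
        app: pause-image-keeper
    spec:
      containers:
        - name: pause
          image: public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest
          resources:
            requests:
              cpu: 1m
              memory: 8Mi
EOF
```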

Revision history for this message
Robby Pocase (rpocase) wrote (last edit ):

The 20240412 AMIs have been published and contain the workaround.

@shaped - the cron has been tested, and credentials are used to ensure you pull the pause container closest to your region. It runs every 15 minutes, so there is still a fairly wide window in which you could hit this; if it turns out high-load users are still hitting it frequently, we will shorten the interval.
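
For anyone who wants to spot-check a node from a newer serial, something like this should surface the entry (the cron file name and location here are assumptions and may differ between serials):

```
# Look for the periodic pause-image pull installed by the workaround.
grep -ri 'pause' /etc/cron.d/ /etc/crontab 2>/dev/null
```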

We'll keep monitoring for a full fix through containerd, but we'll track that in a separate ticket [0].

[0] - https://bugs.launchpad.net/cloud-images/+bug/2061187

Changed in cloud-images:
status: In Progress → Fix Committed
status: Fix Committed → Fix Released