EKS AMI 1.20 - failed to create pod sandbox

Bug #1932116 reported by Maurizio Valenzisi
This bug affects 1 person

Affects: cloud-images
Status: Fix Released
Importance: High
Assigned to: Thomas Bechtold

Bug Description

With the newest AMI for EKS with K8s 1.20.4,
some pods fail to start with the error:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "32f94d8f159e3483ae3d2121881443a95ea65924c4a1f2ae2e6a0f4257f64da1" network for pod "fluent-bit-wn2pf": networkPlugin cni failed to set up pod "fluent-bit-wn2pf_kube-infrastructure" network: failed to find plugin "loopback" in path [/opt/cni/bin]

Cloud: AWS
AMI ID: ami-0e7501dc4c24cda7c
EKS K8s version: 1.20
Platform version: eks.1

Revision history for this message
Maurizio Valenzisi (mvalenzisi) wrote :

bootstrap.sh:

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==BOUNDARY=="

--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/bash
set -x
CLUSTER_NAME=devops-ops-k8s-cluster
NODE_LABELS=--node-labels=node-selector-key=non-production,eks.amazonaws.com/nodegroup-image=ami-0e7501dc4c24cda7c,cluster=devops-ops-k8s-ekscluster-id,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup=jenkins-20210614,environment-type=non-production
NODE_TAINTS=
/etc/eks/bootstrap.sh $CLUSTER_NAME --kubelet-extra-args "$NODE_LABELS $NODE_TAINTS"

--==BOUNDARY==--

Revision history for this message
Thomas Bechtold (toabctl) wrote :

Hi Maurizio,

thanks for filing the bug report.

For the 1.20 images, we dropped the CNI plugins from the image. The reason is that EKS installs the CNI plugins automatically via the aws-node daemonset when the amazon-vpc-cni-k8s plugin is used [1].

1) How did you create the cluster? Can you share the command(s) you use?
2) which CNI plugin are you using?

[1] https://docs.aws.amazon.com/eks/latest/userguide/pod-networking.html

Changed in cloud-images:
status: New → Incomplete
Revision history for this message
Maurizio Valenzisi (mvalenzisi) wrote :

Hi Thomas,

1) I create the clusters via a number of CloudFormation templates. Please find attached the two most relevant ones, which I use to create the cluster and the nodegroup that uses the Ubuntu image.

2) We use the default AWS CNI plugin (https://github.com/aws/amazon-vpc-cni-k8s).

Revision history for this message
Thomas Bechtold (toabctl) wrote :

Hi Maurizio,

you wrote "some pods". So does it work sometimes? I wonder if this is a race condition where the aws-node daemonset (which installs /opt/cni/bin/loopback) did not finish running before you tried to create a pod.
Could you share:
- the kubelet snap logs
- the aws-node daemonset logs
- after a pod has failed, wait a minute and check what's in /opt/cni/bin on the node?
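For reference, data like this can be gathered with commands along these lines. These are a sketch, not from the original report: the kubelet snap service name and the aws-node label are assumptions for the Ubuntu EKS image, and `<pod-suffix>` is a placeholder for the node-specific pod name:

```shell
# Kubelet logs on the node (Ubuntu EKS images run kubelet as a snap;
# the exact systemd unit name may differ on your image)
journalctl -u snap.kubelet-eks.daemon.service --no-pager | tail -n 200

# aws-node daemonset pods and logs (pod name is node-specific)
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide
kubectl -n kube-system logs aws-node-<pod-suffix>

# On the node, after a pod has failed, list the installed CNI binaries
ls -l /opt/cni/bin
```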

Revision history for this message
Maurizio Valenzisi (mvalenzisi) wrote :

Hi Thomas,

yes, only some pods fail to start. Actually, it's not quite accurate to say that they fail; they're stuck in ContainerCreating status.

- the last chunk of the kubelet log is attached

- log of aws-node

# kubectl logs -f -n kube-system aws-node-jrbtw
Copying portmap binary ... Starting IPAM daemon in the background ... ok.
Checking for IPAM connectivity ... ok.
Copying additional CNI plugin binaries and config files ... ok.
Foregrounding IPAM daemon ...
ERROR: logging before flag.Parse: W0615 14:25:55.286171 10 reflector.go:341] <email address hidden>/tools/cache/reflector.go:99: watch of *v1.Pod ended with: too old resource version: 142380942 (142920328)
ERROR: logging before flag.Parse: E0615 14:40:25.944155 10 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
ERROR: logging before flag.Parse: W0616 07:37:34.494052 10 reflector.go:341] <email address hidden>/tools/cache/reflector.go:99: watch of *v1.Pod ended with: too old resource version: 142938294 (142942818)
ERROR: logging before flag.Parse: W0616 08:18:28.510504 10 reflector.go:341] <email address hidden>/tools/cache/reflector.go:99: watch of *v1.Pod ended with: too old resource version: 143246162 (143250888)

- content of /opt/cni/bin

# ls /opt/cni/bin
aws-cni aws-cni-support.sh portmap
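The listing above shows no "loopback" binary, matching the sandbox error. As a sketch (pod and namespace names taken from this report), the events of a stuck pod can be inspected directly:

```shell
# Show only the Events section, where the FailedCreatePodSandBox
# message with the missing-plugin error appears
kubectl -n kube-infrastructure describe pod fluent-bit-wn2pf \
  | sed -n '/Events:/,$p'
```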

Revision history for this message
Maurizio Valenzisi (mvalenzisi) wrote :

It looks like the problem might be the CNI version that we are using (1.6.3)
https://github.com/aws/amazon-vpc-cni-k8s/issues/64
The fix was merged in 1.7.0 https://github.com/aws/amazon-vpc-cni-k8s/pull/955

I will try to upgrade the CNI and see if it solves the issue.
Interestingly, version 1.6.3 is being installed even though the default is 1.7.5, and I do not specify any plugin version in my template.
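As a hedged sketch of such an upgrade: the running VPC CNI version can be checked on the aws-node daemonset, and the plugin upgraded by applying the manifest shipped with the desired release. The manifest filename below is a placeholder; consult the amazon-vpc-cni-k8s release notes for the actual download URL:

```shell
# Check which VPC CNI image the aws-node daemonset is running
kubectl -n kube-system describe daemonset aws-node | grep -i image

# Apply the manifest from the v1.7.5 release (download it from the
# amazon-vpc-cni-k8s GitHub releases; the filename here is a placeholder)
kubectl apply -f aws-k8s-cni-v1.7.5.yaml

# Watch the daemonset roll out the new version
kubectl -n kube-system rollout status daemonset aws-node
```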

Revision history for this message
Thomas Bechtold (toabctl) wrote :

OK, that information helps a lot! Please let us know whether a newer CNI plugin version fixes the problem.
But given that the 1.6.3 CNI plugin version is still supported by AWS, I think we need to add the CNI plugins back to the image.

Revision history for this message
Thomas Bechtold (toabctl) wrote :

@Maurizio, there are now new images (serial 20210618, website will be updated soon) which include the CNI plugins again. Closing this bug now. Please reopen if you still see this problem with the new images.
Thanks a lot for helping to debug this!

Changed in cloud-images:
assignee: nobody → Thomas Bechtold (toabctl)
status: Incomplete → In Progress
importance: Undecided → High
Changed in cloud-images:
status: In Progress → Fix Released
Revision history for this message
Maurizio Valenzisi (mvalenzisi) wrote :

I can confirm that after updating the VPC CNI plugin to v1.7.5 the problem is solved.
