EKS AMI 1.20 - failed to create pod sandbox

Bug #1932116 reported by Maurizio Valenzisi
This bug affects 1 person

Affects: cloud-images
Status: Fix Released
Importance: High
Assigned to: Thomas Bechtold

Bug Description

With the newest AMI for EKS with K8s 1.20.4,
some pods fail to start with the error:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "32f94d8f159e3483ae3d2121881443a95ea65924c4a1f2ae2e6a0f4257f64da1" network for pod "fluent-bit-wn2pf": networkPlugin cni failed to set up pod "fluent-bit-wn2pf_kube-infrastructure" network: failed to find plugin "loopback" in path [/opt/cni/bin]

Cloud: AWS
AMI ID: ami-0e7501dc4c24cda7c
EKS K8s version: 1.20
Platform version: eks.1

Revision history for this message
Maurizio Valenzisi (mvalenzisi) wrote :

bootstrap.sh:

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==BOUNDARY=="

--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/bash
set -x
CLUSTER_NAME=devops-ops-k8s-cluster
NODE_LABELS=--node-labels=node-selector-key=non-production,eks.amazonaws.com/nodegroup-image=ami-0e7501dc4c24cda7c,cluster=devops-ops-k8s-ekscluster-id,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup=jenkins-20210614,environment-type=non-production
NODE_TAINTS=
/etc/eks/bootstrap.sh $CLUSTER_NAME --kubelet-extra-args "$NODE_LABELS $NODE_TAINTS"

--==BOUNDARY==--

Revision history for this message
Thomas Bechtold (toabctl) wrote :

Hi Maurizio,

thanks for filing the bug report.

For the 1.20 images, we dropped the CNI plugins from the image. The reason is that EKS installs the CNI plugins automatically via the aws-node daemonset when the amazon-vpc-cni-k8s plugin is used [1].

1) How did you create the cluster? Can you share the command(s) you use?
2) which CNI plugin are you using?

[1] https://docs.aws.amazon.com/eks/latest/userguide/pod-networking.html

Changed in cloud-images:
status: New → Incomplete
Revision history for this message
Maurizio Valenzisi (mvalenzisi) wrote :

Hi Thomas,

1) I create the clusters via a number of CloudFormation templates. Please find attached the two most relevant ones, which I use to create the cluster and the nodegroup that uses the Ubuntu image.

2) We use the default AWS CNI plugin (https://github.com/aws/amazon-vpc-cni-k8s).

Revision history for this message
Thomas Bechtold (toabctl) wrote :

Hi Maurizio,

you wrote "some pods". So does it work sometimes? I wonder if this is a race condition where the aws-node daemonset (which installs /opt/cni/bin/loopback) did not finish running before you tried to create a pod.
Could you share:
- the kubelet snap logs
- the aws-node daemonset logs
- after a pod has failed, wait a minute and check what's in /opt/cni/bin on the node?
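For reference, data like this can be gathered with commands along these lines. These are a sketch, not from the original report: the kubelet snap service name and the aws-node label are assumptions for the Ubuntu EKS image, and `<pod-suffix>` is a placeholder for the node-specific pod name:

```shell
# Kubelet logs on the node (Ubuntu EKS images run kubelet as a snap;
# the exact systemd unit name may differ on your image)
journalctl -u snap.kubelet-eks.daemon.service --no-pager | tail -n 200

# aws-node daemonset pods and logs (pod name is node-specific)
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide
kubectl -n kube-system logs aws-node-<pod-suffix>

# On the node, after a pod has failed, list the installed CNI binaries
ls -l /opt/cni/bin
```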

Revision history for this message
Maurizio Valenzisi (mvalenzisi) wrote :

Hi Thomas,

yes, only some pods fail to start. Actually, it's not quite accurate to say that they fail; they're stuck in ContainerCreating status.

- the last chunk of the kubelet log is attached

- log of aws-node

# kubectl logs -f -n kube-system aws-node-jrbtw
Copying portmap binary ... Starting IPAM daemon in the background ... ok.
Checking for IPAM connectivity ... ok.
Copying additional CNI plugin binaries and config files ... ok.
Foregrounding IPAM daemon ...
ERROR: logging before flag.Parse: W0615 14:25:55.286171 10 reflector.go:341] <email address hidden>/tools/cache/reflector.go:99: watch of *v1.Pod ended with: too old resource version: 142380942 (142920328)
ERROR: logging before flag.Parse: E0615 14:40:25.944155 10 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
ERROR: logging before flag.Parse: W0616 07:37:34.494052 10 reflector.go:341] <email address hidden>/tools/cache/reflector.go:99: watch of *v1.Pod ended with: too old resource version: 142938294 (142942818)
ERROR: logging before flag.Parse: W0616 08:18:28.510504 10 reflector.go:341] <email address hidden>/tools/cache/reflector.go:99: watch of *v1.Pod ended with: too old resource version: 143246162 (143250888)

- content of /opt/cni/bin

# ls /opt/cni/bin
aws-cni aws-cni-support.sh portmap
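The listing above shows no "loopback" binary, matching the sandbox error. As a sketch (pod and namespace names taken from this report), the events of a stuck pod can be inspected directly:

```shell
# Show only the Events section, where the FailedCreatePodSandBox
# message with the missing-plugin error appears
kubectl -n kube-infrastructure describe pod fluent-bit-wn2pf \
  | sed -n '/Events:/,$p'
```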

Revision history for this message
Maurizio Valenzisi (mvalenzisi) wrote :

It looks like the problem might be the CNI version that we are using (1.6.3)
https://github.com/aws/amazon-vpc-cni-k8s/issues/64
The fix was merged in 1.7.0 https://github.com/aws/amazon-vpc-cni-k8s/pull/955

I will try to upgrade the CNI and see if it solves the issue.
Interestingly, version 1.6.3 is being installed even though the default is 1.7.5, and I do not specify any plugin version in my template.
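As a hedged sketch of such an upgrade: the running VPC CNI version can be checked on the aws-node daemonset, and the plugin upgraded by applying the manifest shipped with the desired release. The manifest filename below is a placeholder; consult the amazon-vpc-cni-k8s release notes for the actual download URL:

```shell
# Check which VPC CNI image the aws-node daemonset is running
kubectl -n kube-system describe daemonset aws-node | grep -i image

# Apply the manifest from the v1.7.5 release (download it from the
# amazon-vpc-cni-k8s GitHub releases; the filename here is a placeholder)
kubectl apply -f aws-k8s-cni-v1.7.5.yaml

# Watch the daemonset roll out the new version
kubectl -n kube-system rollout status daemonset aws-node
```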

Revision history for this message
Thomas Bechtold (toabctl) wrote :

OK, that information helps a lot! Please let us know whether a newer CNI plugin version fixes the problem.
But given that the 1.6.3 CNI plugin version is still supported by AWS, I think we need to add the CNI plugins back to the image.

Revision history for this message
Thomas Bechtold (toabctl) wrote :

@Maurizio, there are now new images (serial 20210618, website will be updated soon) which include the CNI plugins again. Closing this bug now. Please reopen if you still see this problem with the new images.
Thanks a lot for helping to debug this!

Changed in cloud-images:
assignee: nobody → Thomas Bechtold (toabctl)
status: Incomplete → In Progress
importance: Undecided → High
Changed in cloud-images:
status: In Progress → Fix Released
Revision history for this message
Maurizio Valenzisi (mvalenzisi) wrote :

I can confirm that after updating the VPC CNI plugin to v1.7.5 the problem is solved.
