Nodes don't join the cluster because of reboot

Bug #2017782 reported by Samir Akarioh
Affects:     cloud-images
Status:      Fix Released
Importance:  Medium
Assigned to: Thomas Bechtold

Bug Description

Reproduce the bug
=================

Creation of the nodegroup
-------------------------

I created a cluster with this file:

::

    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      name: samirakariohtest
      region: eu-north-1
      version: '1.24'
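
For context, with that config saved as cluster.yaml (the filename used in the nodegroup step below), the cluster itself would be created with something like:

::

    eksctl create cluster -f cluster.yaml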

After that I created a launch template with the cluster's security group, my SSH key, instance type t3.2xlarge, and this user data:

::

    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

    --==MYBOUNDARY==
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
    sudo /etc/eks/bootstrap.sh samirakariohtest

    --==MYBOUNDARY==
    Content-Type: text/cloud-config; charset="us-ascii"

    power_state:
      mode: reboot
    ubuntu_advantage:
      token: <your-token>
      enable:
        - fips

    --==MYBOUNDARY==--
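
For reference, a launch template carrying this user data can be created with the AWS CLI along these lines (the template name and file name are placeholders; the security group and key pair fields are omitted for brevity, and UserData must be base64-encoded):

::

    # userdata.mime holds the multipart document above (placeholder name);
    # "samirakarioh-fips-lt" is likewise a placeholder template name.
    aws ec2 create-launch-template \
      --launch-template-name samirakarioh-fips-lt \
      --launch-template-data \
        "{\"UserData\": \"$(base64 -w0 userdata.mime)\"}"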

After that I added this to my eksctl file:

::

    managedNodeGroups:
      - name: managed-samirakarioh-1
        launchTemplate:
          id: lt-016d58f3445bd3069

and I created the nodegroup with:

::

    eksctl create nodegroup -f cluster.yaml --cfn-disable-rollback

The error
---------

In the CloudFormation console I saw this error on the nodegroup stack:

::

    Resource handler returned message: "[Issue(Code=NodeCreationFailure, Message=Unhealthy nodes in the kubernetes cluster, ResourceIds=[i-05ac556a6d1d02534, i-00ca36e2807d96d1e])]
    (Service: null, Status Code: 0, Request ID: null)" (RequestToken: c299b416-d923-a0ca-8b8b-63b555a0666b, HandlerErrorCode: GeneralServiceException)
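
The same failure detail can also be queried from the EKS API:

::

    aws eks describe-nodegroup \
      --cluster-name samirakariohtest \
      --nodegroup-name managed-samirakarioh-1 \
      --query 'nodegroup.health.issues'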

Try to debug
------------

I ran:

::

    $ kubectl get nodes

    NAME                                            STATUS     ROLES    AGE   VERSION
    ip-192-168-31-235.eu-north-1.compute.internal   NotReady   <none>   18m   v1.24.9
    ip-192-168-46-23.eu-north-1.compute.internal    NotReady   <none>   18m   v1.24.9
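
To surface just the failing Ready condition for every node in one go, a jsonpath query like this also works:

::

    kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].message}{"\n"}{end}'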

For each node I ran kubectl describe. First node:

::

    Name: ip-192-168-31-235.eu-north-1.compute.internal
    Roles: <none>
    Labels: alpha.eksctl.io/cluster-name=samirakariohtest
                        alpha.eksctl.io/nodegroup-name=managed-samirakarioh-1
                        beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/instance-type=t3.2xlarge
                        beta.kubernetes.io/os=linux
                        eks.amazonaws.com/capacityType=ON_DEMAND
                        eks.amazonaws.com/nodegroup=managed-samirakarioh-1
                        eks.amazonaws.com/nodegroup-image=ami-0a56ed2c62f9d6179
                        eks.amazonaws.com/sourceLaunchTemplateId=lt-016d58f3445bd3069
                        eks.amazonaws.com/sourceLaunchTemplateVersion=4
                        failure-domain.beta.kubernetes.io/region=eu-north-1
                        failure-domain.beta.kubernetes.io/zone=eu-north-1b
                        k8s.io/cloud-provider-aws=ef811e5323187d0bc7394b2bd9d6f165
                        kubernetes.io/arch=amd64
                        kubernetes.io/hostname=ip-192-168-31-235
                        kubernetes.io/os=linux
                        node.kubernetes.io/instance-type=t3.2xlarge
                        topology.kubernetes.io/region=eu-north-1
                        topology.kubernetes.io/zone=eu-north-1b
    Annotations: node.alpha.kubernetes.io/ttl: 0
                        volumes.kubernetes.io/controller-managed-attach-detach: true
    CreationTimestamp: Wed, 26 Apr 2023 13:21:37 +0200
    Taints: node.kubernetes.io/not-ready:NoSchedule
    Unschedulable: false
    Lease:
    HolderIdentity: ip-192-168-31-235.eu-north-1.compute.internal
    AcquireTime: <unset>
    RenewTime: Wed, 26 Apr 2023 13:40:41 +0200
    Conditions:
      Type            Status  LastHeartbeatTime                LastTransitionTime               Reason                      Message
      ----            ------  -----------------                ------------------               ------                      -------
      MemoryPressure  False   Wed, 26 Apr 2023 13:37:57 +0200  Wed, 26 Apr 2023 13:21:32 +0200  KubeletHasSufficientMemory  kubelet has sufficient memory available
      DiskPressure    False   Wed, 26 Apr 2023 13:37:57 +0200  Wed, 26 Apr 2023 13:21:32 +0200  KubeletHasNoDiskPressure    kubelet has no disk pressure
      PIDPressure     False   Wed, 26 Apr 2023 13:37:57 +0200  Wed, 26 Apr 2023 13:21:32 +0200  KubeletHasSufficientPID     kubelet has sufficient PID available
      Ready           False   Wed, 26 Apr 2023 13:37:57 +0200  Wed, 26 Apr 2023 13:21:32 +0200  KubeletNotReady             container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
    Addresses:
    InternalIP: 192.168.31.235
    ExternalIP: 16.171.27.242
    Hostname: ip-192-168-31-235.eu-north-1.compute.internal
    InternalDNS: ip-192-168-31-235.eu-north-1.compute.internal
    ExternalDNS: ec2-16-171-27-242.eu-north-1.compute.amazonaws.com
    Capacity:
    attachable-volumes-aws-ebs: 25
    cpu: 8
    ephemeral-storage: 20145724Ki
    hugepages-1Gi: 0
    hugepages-2Mi: 0
    memory: 32525216Ki
    pods: 58
    Allocatable:
    attachable-volumes-aws-ebs: 25
    cpu: 7910m
    ephemeral-storage: 17492557384
    hugepages-1Gi: 0
    hugepages-2Mi: 0
    memory: 31508384Ki
    pods: 58
    System Info:
    Machine ID: ec24db936fe7b8f610a5a41a8b202768
    System UUID: ec24db93-6fe7-b8f6-10a5-a41a8b202768
    Boot ID: e05710a4-c425-4b8f-a2a9-4af20536d322
    Kernel Version: 5.4.0-1021-aws-fips
    OS Image: Ubuntu 20.04.5 LTS
    Operating System: linux
    Architecture: amd64
    Container Runtime Version: containerd://1.5.9
    Kubelet Version: v1.24.9
    Kube-Proxy Version: v1.24.9
    ProviderID: aws:///eu-north-1b/i-05ac556a6d1d02534
    Non-terminated Pods: (2 in total)
      Namespace    Name              CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
      ---------    ----              ------------  ----------  ---------------  -------------  ---
      kube-system  aws-node-5297r    25m (0%)      0 (0%)      0 (0%)           0 (0%)         19m
      kube-system  kube-proxy-54jbh  100m (1%)     0 (0%)      0 (0%)           0 (0%)         19m
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      Resource                    Requests   Limits
      --------                    --------   ------
      cpu                         125m (1%)  0 (0%)
      memory                      0 (0%)     0 (0%)
      ephemeral-storage           0 (0%)     0 (0%)
      hugepages-1Gi               0 (0%)     0 (0%)
      hugepages-2Mi               0 (0%)     0 (0%)
      attachable-volumes-aws-ebs  0          0
    Events:
      Type     Reason                   Age                From             Message
      ----     ------                   ----               ----             -------
      Normal   Starting                 18m                kube-proxy
      Normal   Starting                 19m                kubelet          Starting kubelet.
      Warning  InvalidDiskCapacity      19m                kubelet          invalid capacity 0 on image filesystem
      Normal   NodeHasSufficientMemory  19m (x2 over 19m)  kubelet          Node ip-192-168-31-235.eu-north-1.compute.internal status is now: NodeHasSufficientMemory
      Normal   NodeHasNoDiskPressure    19m (x2 over 19m)  kubelet          Node ip-192-168-31-235.eu-north-1.compute.internal status is now: NodeHasNoDiskPressure
      Normal   NodeHasSufficientPID     19m (x2 over 19m)  kubelet          Node ip-192-168-31-235.eu-north-1.compute.internal status is now: NodeHasSufficientPID
      Normal   NodeAllocatableEnforced  19m                kubelet          Updated Node Allocatable limit across pods
      Normal   RegisteredNode           19m                node-controller  Node ip-192-168-31-235.eu-north-1.compute.internal event: Registered Node ip-192-168-31-235.eu-north-1.compute.internal in Controller
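
The Ready=False condition ("cni plugin not initialized") together with the not-ready taint points at the CNI, so the aws-node DaemonSet pods are the next thing to check (the label below is the one the AWS VPC CNI DaemonSet uses):

::

    kubectl get pods -n kube-system -l k8s-app=aws-node -o wide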

Revision history for this message
George Kraft (cynerva) wrote :

I discussed this with Samir and had a look through logs. The aws-node pod is in CrashLoopBackOff because the aws-vpc-cni-init container seems stuck waiting for ipamd:

::

    2023-04-26T13:10:24.619437213Z stdout F {"level":"info","ts":"2023-04-26T13:10:24.618Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
    2023-04-26T13:10:26.639431058Z stdout F {"level":"info","ts":"2023-04-26T13:10:26.636Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
    2023-04-26T13:10:28.649483016Z stdout F {"level":"info","ts":"2023-04-26T13:10:28.648Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
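
(For reference, logs like these can be pulled from the init container with something along these lines, using the pod name from the describe output above:)

::

    kubectl logs -n kube-system aws-node-5297r -c aws-vpc-cni-init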

ipamd doesn't log any obvious errors, but it does seem to be restarting constantly; the last thing it logs before each restart is:

{"level":"info","ts":"2023-04-26T13:08:52.336Z","caller":"ipamd/ipamd.go:509","msg":"Reading ipam state from CRI"}
{"level":"debug","ts":"2023-04-26T13:08:52.336Z","caller":"datastore/data_store.go:389","msg":"Getting running pod sandboxes from \"unix:///var/run/dockershim.sock\""}

This looks similar to https://github.com/aws/amazon-vpc-cni-k8s/issues/2133, which was fixed in the Amazon EKS node's bootstrap.sh via https://github.com/awslabs/amazon-eks-ami/pull/921/files

I believe a similar fix needs to be applied to the bootstrap.sh used in Ubuntu EKS nodes.
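
Judging from the ipamd log above (it reads pod sandboxes from unix:///var/run/dockershim.sock), the gist of the upstream fix is to make that legacy socket path resolve on containerd-based nodes. A rough sketch of the idea, not the exact patch:

::

    # Sketch only; see the amazon-eks-ami PR above for the authoritative
    # change. ipamd still dials the legacy dockershim socket path, so make
    # it resolve to the containerd socket.
    sudo ln -sf /run/containerd/containerd.sock /var/run/dockershim.sock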

Revision history for this message
Robby Pocase (rpocase) wrote :

Tom has made the relevant base image changes and this fix should be available tomorrow.

Changed in cloud-images:
importance: Undecided → Medium
assignee: nobody → Thomas Bechtold (toabctl)
status: New → Fix Committed
Revision history for this message
Thomas Bechtold (toabctl) wrote :

Images with serial 20230517 do contain a fix.
Please let us know if that problem still occurs.
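
On a running node, the image serial can be checked directly; Ubuntu cloud images normally record it in /etc/cloud/build.info:

::

    # Run on the node; the serial should be 20230517 or later
    cat /etc/cloud/build.info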

Changed in cloud-images:
status: Fix Committed → Fix Released