Nodes don't join the cluster because of reboot

Bug #2017782 reported by Samir Akarioh
Affects:     cloud-images
Status:      Fix Released
Importance:  Medium
Assigned to: Thomas Bechtold

Bug Description

Reproduce the bug
=================

Creation of the nodegroup
-------------------------

I created a cluster with this file:

::

    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      name: samirakariohtest
      region: eu-north-1
      version: '1.24'
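
For context, with that config saved as cluster.yaml (the filename used in the nodegroup step below), the cluster itself would be created with something like:

::

    eksctl create cluster -f cluster.yaml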

After that I created a launch template with the cluster's security group, my SSH key, instance type t3.2xlarge, and this user data:

::

    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

    --==MYBOUNDARY==
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
    sudo /etc/eks/bootstrap.sh samirakariohtest

    --==MYBOUNDARY==
    Content-Type: text/cloud-config; charset="us-ascii"

    power_state:
      mode: reboot
    ubuntu_advantage:
      token: <your-token>
      enable:
        - fips

    --==MYBOUNDARY==--
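
For reference, a launch template carrying this user data can be created with the AWS CLI along these lines (the template name and file name are placeholders; the security group and key pair fields are omitted for brevity, and UserData must be base64-encoded):

::

    # userdata.mime holds the multipart document above (placeholder name);
    # "samirakarioh-fips-lt" is likewise a placeholder template name.
    aws ec2 create-launch-template \
      --launch-template-name samirakarioh-fips-lt \
      --launch-template-data \
        "{\"UserData\": \"$(base64 -w0 userdata.mime)\"}"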

After that I added this to my eksctl file:

::

    managedNodeGroups:
      - name: managed-samirakarioh-1
        launchTemplate:
          id: lt-016d58f3445bd3069

and I created the nodegroup with:

::

    eksctl create nodegroup -f cluster.yaml --cfn-disable-rollback

The error
---------

In the CloudFormation console I saw this error on the nodegroup stack:

::

    Resource handler returned message: "[Issue(Code=NodeCreationFailure, Message=Unhealthy nodes in the kubernetes cluster, ResourceIds=[i-05ac556a6d1d02534, i-00ca36e2807d96d1e])]
    (Service: null, Status Code: 0, Request ID: null)" (RequestToken: c299b416-d923-a0ca-8b8b-63b555a0666b, HandlerErrorCode: GeneralServiceException)
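
The same failure detail can also be queried from the EKS API:

::

    aws eks describe-nodegroup \
      --cluster-name samirakariohtest \
      --nodegroup-name managed-samirakarioh-1 \
      --query 'nodegroup.health.issues'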

Try to debug
------------

I ran:

::

    $ kubectl get nodes

    NAME                                            STATUS     ROLES    AGE   VERSION
    ip-192-168-31-235.eu-north-1.compute.internal   NotReady   <none>   18m   v1.24.9
    ip-192-168-46-23.eu-north-1.compute.internal    NotReady   <none>   18m   v1.24.9
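
To surface just the failing Ready condition for every node in one go, a jsonpath query like this also works:

::

    kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].message}{"\n"}{end}'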

For each node I ran kubectl describe. First node:

::

    Name: ip-192-168-31-235.eu-north-1.compute.internal
    Roles: <none>
    Labels: alpha.eksctl.io/cluster-name=samirakariohtest
                        alpha.eksctl.io/nodegroup-name=managed-samirakarioh-1
                        beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/instance-type=t3.2xlarge
                        beta.kubernetes.io/os=linux
                        eks.amazonaws.com/capacityType=ON_DEMAND
                        eks.amazonaws.com/nodegroup=managed-samirakarioh-1
                        eks.amazonaws.com/nodegroup-image=ami-0a56ed2c62f9d6179
                        eks.amazonaws.com/sourceLaunchTemplateId=lt-016d58f3445bd3069
                        eks.amazonaws.com/sourceLaunchTemplateVersion=4
                        failure-domain.beta.kubernetes.io/region=eu-north-1
                        failure-domain.beta.kubernetes.io/zone=eu-north-1b
                        k8s.io/cloud-provider-aws=ef811e5323187d0bc7394b2bd9d6f165
                        kubernetes.io/arch=amd64
                        kubernetes.io/hostname=ip-192-168-31-235
                        kubernetes.io/os=linux
                        node.kubernetes.io/instance-type=t3.2xlarge
                        topology.kubernetes.io/region=eu-north-1
                        topology.kubernetes.io/zone=eu-north-1b
    Annotations: node.alpha.kubernetes.io/ttl: 0
                        volumes.kubernetes.io/controller-managed-attach-detach: true
    CreationTimestamp: Wed, 26 Apr 2023 13:21:37 +0200
    Taints: node.kubernetes.io/not-ready:NoSchedule
    Unschedulable: false
    Lease:
    HolderIdentity: ip-192-168-31-235.eu-north-1.compute.internal
    AcquireTime: <unset>
    RenewTime: Wed, 26 Apr 2023 13:40:41 +0200
    Conditions:
      Type            Status  LastHeartbeatTime                LastTransitionTime               Reason                      Message
      ----            ------  -----------------                ------------------               ------                      -------
      MemoryPressure  False   Wed, 26 Apr 2023 13:37:57 +0200  Wed, 26 Apr 2023 13:21:32 +0200  KubeletHasSufficientMemory  kubelet has sufficient memory available
      DiskPressure    False   Wed, 26 Apr 2023 13:37:57 +0200  Wed, 26 Apr 2023 13:21:32 +0200  KubeletHasNoDiskPressure    kubelet has no disk pressure
      PIDPressure     False   Wed, 26 Apr 2023 13:37:57 +0200  Wed, 26 Apr 2023 13:21:32 +0200  KubeletHasSufficientPID     kubelet has sufficient PID available
      Ready           False   Wed, 26 Apr 2023 13:37:57 +0200  Wed, 26 Apr 2023 13:21:32 +0200  KubeletNotReady             container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
    Addresses:
    InternalIP: 192.168.31.235
    ExternalIP: 16.171.27.242
    Hostname: ip-192-168-31-235.eu-north-1.compute.internal
    InternalDNS: ip-192-168-31-235.eu-north-1.compute.internal
    ExternalDNS: ec2-16-171-27-242.eu-north-1.compute.amazonaws.com
    Capacity:
    attachable-volumes-aws-ebs: 25
    cpu: 8
    ephemeral-storage: 20145724Ki
    hugepages-1Gi: 0
    hugepages-2Mi: 0
    memory: 32525216Ki
    pods: 58
    Allocatable:
    attachable-volumes-aws-ebs: 25
    cpu: 7910m
    ephemeral-storage: 17492557384
    hugepages-1Gi: 0
    hugepages-2Mi: 0
    memory: 31508384Ki
    pods: 58
    System Info:
    Machine ID: ec24db936fe7b8f610a5a41a8b202768
    System UUID: ec24db93-6fe7-b8f6-10a5-a41a8b202768
    Boot ID: e05710a4-c425-4b8f-a2a9-4af20536d322
    Kernel Version: 5.4.0-1021-aws-fips
    OS Image: Ubuntu 20.04.5 LTS
    Operating System: linux
    Architecture: amd64
    Container Runtime Version: containerd://1.5.9
    Kubelet Version: v1.24.9
    Kube-Proxy Version: v1.24.9
    ProviderID: aws:///eu-north-1b/i-05ac556a6d1d02534
    Non-terminated Pods: (2 in total)
      Namespace    Name              CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
      ---------    ----              ------------  ----------  ---------------  -------------  ---
      kube-system  aws-node-5297r    25m (0%)      0 (0%)      0 (0%)           0 (0%)         19m
      kube-system  kube-proxy-54jbh  100m (1%)     0 (0%)      0 (0%)           0 (0%)         19m
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      Resource                    Requests   Limits
      --------                    --------   ------
      cpu                         125m (1%)  0 (0%)
      memory                      0 (0%)     0 (0%)
      ephemeral-storage           0 (0%)     0 (0%)
      hugepages-1Gi               0 (0%)     0 (0%)
      hugepages-2Mi               0 (0%)     0 (0%)
      attachable-volumes-aws-ebs  0          0
    Events:
      Type     Reason                   Age                From             Message
      ----     ------                   ----               ----             -------
      Normal   Starting                 18m                kube-proxy
      Normal   Starting                 19m                kubelet          Starting kubelet.
      Warning  InvalidDiskCapacity      19m                kubelet          invalid capacity 0 on image filesystem
      Normal   NodeHasSufficientMemory  19m (x2 over 19m)  kubelet          Node ip-192-168-31-235.eu-north-1.compute.internal status is now: NodeHasSufficientMemory
      Normal   NodeHasNoDiskPressure    19m (x2 over 19m)  kubelet          Node ip-192-168-31-235.eu-north-1.compute.internal status is now: NodeHasNoDiskPressure
      Normal   NodeHasSufficientPID     19m (x2 over 19m)  kubelet          Node ip-192-168-31-235.eu-north-1.compute.internal status is now: NodeHasSufficientPID
      Normal   NodeAllocatableEnforced  19m                kubelet          Updated Node Allocatable limit across pods
      Normal   RegisteredNode           19m                node-controller  Node ip-192-168-31-235.eu-north-1.compute.internal event: Registered Node ip-192-168-31-235.eu-north-1.compute.internal in Controller
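
The Ready=False condition ("cni plugin not initialized") together with the not-ready taint points at the CNI, so the aws-node DaemonSet pods are the next thing to check (the label below is the one the AWS VPC CNI DaemonSet uses):

::

    kubectl get pods -n kube-system -l k8s-app=aws-node -o wide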

Revision history for this message
George Kraft (cynerva) wrote :

I discussed this with Samir and had a look through logs. The aws-node pod is in CrashLoopBackOff because the aws-vpc-cni-init container seems stuck waiting for ipamd:

::

    2023-04-26T13:10:24.619437213Z stdout F {"level":"info","ts":"2023-04-26T13:10:24.618Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
    2023-04-26T13:10:26.639431058Z stdout F {"level":"info","ts":"2023-04-26T13:10:26.636Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
    2023-04-26T13:10:28.649483016Z stdout F {"level":"info","ts":"2023-04-26T13:10:28.648Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
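
(For reference, logs like these can be pulled from the init container with something along these lines, using the pod name from the describe output above:)

::

    kubectl logs -n kube-system aws-node-5297r -c aws-vpc-cni-init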

ipamd doesn't log any obvious errors, but it does seem to be restarting constantly; the last thing it logs before each restart is:

{"level":"info","ts":"2023-04-26T13:08:52.336Z","caller":"ipamd/ipamd.go:509","msg":"Reading ipam state from CRI"}
{"level":"debug","ts":"2023-04-26T13:08:52.336Z","caller":"datastore/data_store.go:389","msg":"Getting running pod sandboxes from \"unix:///var/run/dockershim.sock\""}

This looks similar to https://github.com/aws/amazon-vpc-cni-k8s/issues/2133, which was fixed in the Amazon EKS node's bootstrap.sh via https://github.com/awslabs/amazon-eks-ami/pull/921/files

I believe a similar fix needs to be applied to the bootstrap.sh used in Ubuntu EKS nodes.
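
Judging from the ipamd log above (it reads pod sandboxes from unix:///var/run/dockershim.sock), the gist of the upstream fix is to make that legacy socket path resolve on containerd-based nodes. A rough sketch of the idea, not the exact patch:

::

    # Sketch only; see the amazon-eks-ami PR above for the authoritative
    # change. ipamd still dials the legacy dockershim socket path, so make
    # it resolve to the containerd socket.
    sudo ln -sf /run/containerd/containerd.sock /var/run/dockershim.sock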

Revision history for this message
Robby Pocase (rpocase) wrote :

Tom has made the relevant base image changes and this fix should be available tomorrow.

Changed in cloud-images:
importance: Undecided → Medium
assignee: nobody → Thomas Bechtold (toabctl)
status: New → Fix Committed
Revision history for this message
Thomas Bechtold (toabctl) wrote :

Images with serial 20230517 do contain a fix.
Please let us know if that problem still occurs.
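
On a running node, the image serial can be checked directly; Ubuntu cloud images normally record it in /etc/cloud/build.info:

::

    # Run on the node; the serial should be 20230517 or later
    cat /etc/cloud/build.info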

Changed in cloud-images:
status: Fix Committed → Fix Released