Nodes don't join the cluster after a reboot
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
cloud-images | Fix Released | Medium | Thomas Bechtold |
Bug Description
Reproduce the bug
=================
Creation of the nodegroup
-------------------------
I created a cluster with this file:
::
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: samirakariohtest
  region: eu-north-1
  version: '1.24'
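For reference, assuming this file is saved as cluster.yaml (the same file the nodegroup is added to below), the cluster is created with:
::
eksctl create cluster -f cluster.yaml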
After that I created a launch template that uses the cluster's security group, my SSH key, instance type t3.2xlarge, and the following user data:
::
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=

--=
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
sudo /etc/eks/bootstrap.sh samirakariohtest

--=
Content-Type: text/cloud-config; charset="us-ascii"

power_state:
  mode: reboot
ubuntu_advantage:
  token: <your-token>
  enable:
    - fips

--=
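The cloud-config part enables FIPS through the Ubuntu Pro client and then reboots so the FIPS kernel is picked up. As a sanity check (not captured from the affected node; standard cloud-init and ubuntu-advantage-tools commands), the result can be verified on the instance with:
::
cloud-init status --long
pro status        # fips should show up as enabled
uname -r          # should report an -aws-fips kernel after the reboot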
After that I added this to my eksctl file:
::
managedNodeGroups:
  - name: managed-
    launchTemplate:
      id: lt-016d58f3445b
and I created the nodegroup with:
::
eksctl create nodegroup -f cluster.yaml --cfn-disable-rollback
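The stack creation fails (see the next section); the same events can also be watched from the CLI. A sketch, assuming eksctl's usual eksctl-<cluster>-nodegroup-<name> stack naming:
::
aws cloudformation describe-stack-events \
  --region eu-north-1 \
  --stack-name eksctl-samirakariohtest-nodegroup-<name> \
  --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`]'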
The error
---------
In the CloudFormation console I saw this error on the nodegroup stack:
::
Resource handler returned message: "[Issue(
(Service: null, Status Code: 0, Request ID: null)" (RequestToken: c299b416-
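The message is cut off here, but the full issue list is usually available from the managed nodegroup's health field (the nodegroup name is elided above, so a placeholder is used):
::
aws eks describe-nodegroup \
  --region eu-north-1 \
  --cluster-name samirakariohtest \
  --nodegroup-name <nodegroup-name> \
  --query 'nodegroup.health.issues'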
Try to debug
------------
I ran:
::
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-
ip-
For each node I ran kubectl describe. First node:
::
Name: ip-192-
Roles: <none>
Labels: alpha.eksctl.
Annotations: node.alpha.
CreationTimestamp:
Taints: node.kubernetes
Unschedulable: false
Lease:
HolderIdentity: ip-192-
AcquireTime: <unset>
RenewTime: Wed, 26 Apr 2023 13:40:41 +0200
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 26 Apr 2023 13:37:57 +0200 Wed, 26 Apr 2023 13:21:32 +0200 KubeletHasSufficientMemory
DiskPressure False Wed, 26 Apr 2023 13:37:57 +0200 Wed, 26 Apr 2023 13:21:32 +0200 KubeletHasNoDiskPressure
PIDPressure False Wed, 26 Apr 2023 13:37:57 +0200 Wed, 26 Apr 2023 13:21:32 +0200 KubeletHasSufficientPID
Ready False Wed, 26 Apr 2023 13:37:57 +0200 Wed, 26 Apr 2023 13:21:32 +0200 KubeletNotReady container runtime network not ready: NetworkReady=false reason:
Addresses:
InternalIP: 192.168.31.235
ExternalIP: 16.171.27.242
Hostname: ip-192-
InternalDNS: ip-192-
ExternalDNS: ec2-16-
Capacity:
attachable-
cpu: 8
ephemeral-
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32525216Ki
pods: 58
Allocatable:
attachable-
cpu: 7910m
ephemeral-
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 31508384Ki
pods: 58
System Info:
Machine ID: ec24db936fe7b8f
System UUID: ec24db93-
Boot ID: e05710a4-
Kernel Version: 5.4.0-1021-aws-fips
OS Image: Ubuntu 20.04.5 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.5.9
Kubelet Version: v1.24.9
Kube-Proxy Version: v1.24.9
ProviderID: aws:///
Non-terminated Pods: (2 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system aws-node-5297r 25m (0%) 0 (0%) 0 (0%) 0 (0%) 19m
kube-system kube-proxy-54jbh 100m (1%) 0 (0%) 0 (0%) 0 (0%) 19m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 125m (1%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 18m kube-proxy
Normal Starting 19m kubelet Starting kubelet.
Warning InvalidDiskCapacity 19m kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory
Normal NodeHasNoDiskPressure
Normal NodeHasSufficientPID
Normal NodeAllocatableEnforced
Normal RegisteredNode 19m node-controller Node ip-192-
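The NotReady condition with NetworkReady=false points at the CNI rather than at the kubelet itself, so the next step was the aws-node pod. For reference, the logs below can be pulled with kubectl or, as the "stdout F" CRI prefix suggests, read on the node from /var/log/containers (labels and container names per the stock AWS VPC CNI DaemonSet):
::
kubectl -n kube-system get pods -l k8s-app=aws-node
kubectl -n kube-system logs aws-node-5297r -c aws-node
# or, on the node itself:
sudo tail /var/log/containers/aws-node-*_kube-system_*.log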
I discussed this with Samir and had a look through logs. The aws-node pod is in CrashLoopBackOff because the aws-vpc-cni-init container seems stuck waiting for ipamd:
2023-04-26T13:10:24.619437213Z stdout F {"level":"info","ts":"2023-04-26T13:10:24.618Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
2023-04-26T13:10:26.639431058Z stdout F {"level":"info","ts":"2023-04-26T13:10:26.636Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
2023-04-26T13:10:28.649483016Z stdout F {"level":"info","ts":"2023-04-26T13:10:28.648Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
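A quick way to confirm this is a crash loop rather than a slow start is the restart count on the pod:
::
kubectl -n kube-system get pod aws-node-5297r \
  -o jsonpath='{.status.containerStatuses[*].restartCount}'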
ipamd doesn't log any obvious errors, but it does seem to be restarting constantly; the last thing it logs is:
{"level" :"info" ,"ts":" 2023-04- 26T13:08: 52.336Z" ,"caller" :"ipamd/ ipamd.go: 509","msg" :"Reading ipam state from CRI"} :"debug" ,"ts":" 2023-04- 26T13:08: 52.336Z" ,"caller" :"datastore/ data_store. go:389" ,"msg": "Getting running pod sandboxes from \"unix: ///var/ run/dockershim. sock\"" }
{"level"
This looks similar to https://github.com/aws/amazon-vpc-cni-k8s/issues/2133, which was fixed in the Amazon EKS node's bootstrap.sh via https://github.com/awslabs/amazon-eks-ami/pull/921/files
I believe a similar fix needs to be applied to the bootstrap.sh used in Ubuntu EKS nodes.
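If the mechanism matches the upstream issue, the essence of the fix is to recreate that symlink on every boot instead of only once during bootstrap. A minimal sketch of the idea, not the actual upstream patch (the unit name here is made up):
::
# run once on the node, or bake into the image
cat <<'EOF' | sudo tee /etc/systemd/system/dockershim-symlink.service
[Unit]
Description=Recreate dockershim.sock symlink for the VPC CNI (illustrative)
After=containerd.service

[Service]
Type=oneshot
ExecStart=/usr/bin/ln -sf /run/containerd/containerd.sock /run/dockershim.sock

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now dockershim-symlink.service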