Rarely fails to join EKS cluster

Bug #2012689 reported by DingGGu
This bug affects 6 people
Affects: cloud-images | Status: Fix Released | Importance: High | Assigned to: Thomas Bechtold

Bug Description

We are running multiple clusters.
Nodes in clusters that frequently scale in and out sometimes fail to join the cluster.

Looking at /var/log/user-data.log, running `snap start kubelet-eks` from /etc/eks/bootstrap.sh fails. Looking at journalctl, it seems kubelet is running without any of its arguments specified at all.

If I manually run /etc/eks/bootstrap.sh after the nodes are orphaned, they join the cluster just fine.

I think this is a timing issue between starting the snap and setting its arguments.

Using the 1.24 AMI, ami-04c00a6fc53487c5a.

An interesting log line where the snap does not read its arguments:
kubelet-eks.daemon[935]: cat: /var/snap/kubelet-eks/92/args: No such file or directory

kubelet runs fail with the same errors:
kubelet-eks.daemon[889]: I0307 19:24:58.886750 889 util_unix.go:104] "Using this format as endpoint is deprecated, please consider using full url format." deprecatedFormat="" fullURLFormat="unix://"
kubelet-eks.daemon[889]: W0307 19:24:58.888995 889 clientconn.go:1331] [core] grpc:

Some logs from journalctl:
systemd[1]: Started containerd container runtime.
systemd[1]: Started Service for snap application amazon-ssm-agent.amazon-ssm-agent.
systemd[1]: Reloading.
systemd[1]: Started Service for snap application kubelet-eks.daemon.
systemd[1]: Started snap.kubelet-eks.hook.configure.3540f36b-29a1-4974-8c41-31995a6c637e.scope.
kubelet-eks.daemon[935]: cat: /var/snap/kubelet-eks/92/args: No such file or directory
amazon-ssm-agent.amazon-ssm-agent[833]: Error occurred fetching the seelog config file path: open /etc/amazon/ssm/seelog.xml: no such file or directory
amazon-ssm-agent.amazon-ssm-agent[833]: Initializing new seelog logger
amazon-ssm-agent.amazon-ssm-agent[833]: New Seelog Logger Creation Complete
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 WARN Error adding the directory '/etc/amazon/ssm' to watcher: no such file or directory
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO Proxy environment variables:
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO https_proxy:
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO http_proxy:
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO no_proxy:
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO Agent will take identity from EC2
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO [amazon-ssm-agent] using named pipe channel for IPC
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO [amazon-ssm-agent] using named pipe channel for IPC
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO [amazon-ssm-agent] using named pipe channel for IPC
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO [amazon-ssm-agent] amazon-ssm-agent - v3.1.1732.0
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO [amazon-ssm-agent] OS: linux, Arch: amd64
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO [CredentialRefresher] Identity does not require credential refresher
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:54 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] [WorkerProvider] Worker ssm-agent-worker is not running, starting worker process
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:54 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] [WorkerProvider] Worker ssm-agent-worker (pid:948) started
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:54 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] Monitor long running worker health every 60 seconds
systemd[1]: snap.kubelet-eks.hook.configure.3540f36b-29a1-4974-8c41-31995a6c637e.scope: Succeeded.
dbus-daemon[531]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' requested by ':1.9' (uid=0 pid=539 comm="/usr/lib/snapd/snapd " label="unconfined")
systemd[1]: Starting Time & Date Service...
dbus-daemon[531]: [system] Successfully activated service 'org.freedesktop.timedate1'
systemd[1]: Started Time & Date Service.
systemd[1]: Started Kubernetes systemd probe.
kubelet-eks.daemon[889]: I0307 19:24:58.858562 889 server.go:399] "Kubelet version" kubeletVersion="v1.24.9"
kubelet-eks.daemon[889]: I0307 19:24:58.858619 889 server.go:401] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
kubelet-eks.daemon[889]: I0307 19:24:58.858841 889 server.go:562] "Standalone mode, no API client"
systemd[1]: run-r5647ef4c140746af8f68048b9b657df0.scope: Succeeded.
kubelet-eks.daemon[889]: I0307 19:24:58.886249 889 server.go:450] "No api server defined - no events will be sent to API server"
kubelet-eks.daemon[889]: I0307 19:24:58.886266 889 server.go:648] "--cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to /"
kubelet-eks.daemon[889]: I0307 19:24:58.886544 889 container_manager_linux.go:262] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
kubelet-eks.daemon[889]: I0307 19:24:58.886618 889 container_manager_linux.go:267] "Creating Container Manager object based on Node Config" nodeConfig={RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: KubeletOOMScoreAdj:-999 ContainerRuntime: CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:cgroupfs KubeletRootDir:/var/lib/kubelet ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: ReservedSystemCPUs: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[] SystemReserved:map[] HardEvictionThresholds:[{Signal:nodefs.inodesFree Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>} {Signal:imagefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.15} GracePeriod:0s MinReclaim:<nil>} {Signal:memory.available Operator:LessThan Value:{Quantity:100Mi Percentage:0} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.1} GracePeriod:0s MinReclaim:<nil>}]} QOSReserved:map[] ExperimentalCPUManagerPolicy:none ExperimentalCPUManagerPolicyOptions:map[] ExperimentalTopologyManagerScope:container ExperimentalCPUManagerReconcilePeriod:10s ExperimentalMemoryManagerPolicy:None ExperimentalMemoryManagerReservedMemory:[] ExperimentalPodPidsLimit:-1 EnforceCPULimits:true CPUCFSQuotaPeriod:100ms ExperimentalTopologyManagerPolicy:none}
kubelet-eks.daemon[889]: I0307 19:24:58.886635 889 topology_manager.go:133] "Creating topology manager with policy per scope" topologyPolicyName="none" topologyScopeName="container"
kubelet-eks.daemon[889]: I0307 19:24:58.886644 889 container_manager_linux.go:302] "Creating device plugin manager" devicePluginEnabled=true
kubelet-eks.daemon[889]: I0307 19:24:58.886706 889 state_mem.go:36] "Initialized new in-memory state store"
kubelet-eks.daemon[889]: I0307 19:24:58.886750 889 util_unix.go:104] "Using this format as endpoint is deprecated, please consider using full url format." deprecatedFormat="" fullURLFormat="unix://"
kubelet-eks.daemon[889]: W0307 19:24:58.888995 889 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to { <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix: missing address". Reconnecting...

Related branches

DingGGu (dinggggu)
description: updated
Robby Pocase (rpocase)
Changed in cloud-images:
importance: Undecided → High
Revision history for this message
Robby Pocase (rpocase) wrote :

Thanks for filing this bug, @dinggggu, and sorry for the extremely delayed response. We are working internally to consistently reproduce this and try to identify a root cause. We will update ASAP.

Revision history for this message
Robby Pocase (rpocase) wrote :

@DingGGu Are you able to provide a bit more information on the topology of your clusters?

> The cluster that frequently scale-in and out sometimes fail to join the cluster.

I take this to mean you have at least one autoscaling cluster where new nodes fail to join the cluster. Is that correct? If so, could you share your autoscaling deployment configuration (scrubbed of sensitive details if necessary)?

Revision history for this message
DingGGu (dinggggu) wrote :

Hello. I'm using Karpenter for autoscaling. Build-server (i.e. CI) workers run as Kubernetes Pods: a Pod is created when a build request is received, and while the Pod is Pending, Karpenter starts a new instance.
Dozens of build requests can arrive at once, in which case multiple instances are started and then, when the jobs are done, removed by Karpenter.
Sometimes an instance cannot join the cluster.

Karpenter does not replace bootstrap.sh; here is the instance user data:

--//
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash -xe
exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1
/etc/eks/bootstrap.sh '#REDACT#' --apiserver-endpoint 'https://#REDACT#.eks.amazonaws.com' --b64-cluster-ca '....' \
--container-runtime containerd \
--kubelet-extra-args '--node-labels=karpenter.sh/capacity-type=on-demand,karpenter.sh/provisioner-name=#REDACT# --register-with-taints=dedicated=#REDACT#:NoSchedule'
--//--

Revision history for this message
Kevin W Monroe (kwmonroe) wrote (last edit ):

It appears the kubelet daemon is attempting to start before its $SNAP_DATA/args file is present. This is generated by the configure hook, so indeed it seems like a timing issue where 'start' precedes 'configure'.

A potential workaround is to manually restart kubelet-eks on the failing node (by this time, the args file should be present):

sudo systemctl restart snap.kubelet-eks.daemon
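
For illustration, a minimal sketch of that workaround as a script, assuming the standard snapd "current" symlink under /var/snap (the roughly two-minute timeout is arbitrary):

# Sketch: wait for the configure hook to write the args file, then restart
# the kubelet-eks daemon; give up after ~2 minutes.
for i in $(seq 1 60); do
    [ -s /var/snap/kubelet-eks/current/args ] && break
    sleep 2
done
sudo systemctl restart snap.kubelet-eks.daemon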

Revision history for this message
Thomas Bechtold (toabctl) wrote :

It would be good to see some timestamps for the logs.

Could you share:

- /var/log/cloud-init.log
- /var/log/cloud-init-output.log
- journalctl -u snapd.seeded.service
- journalctl -u snapd.service
- journalctl -u snap.kubelet-eks.daemon.service

please?

Revision history for this message
DingGGu (dinggggu) wrote :

Since it's been a while, there are currently no logs for that. I'll let you know if it happens again.

As Kevin said, if I run sudo systemctl restart snap.kubelet-eks.daemon manually, kubelet runs fine. Since running it manually long after boot works, this looks like a timing issue with the snap during the boot sequence.

Revision history for this message
DingGGu (dinggggu) wrote :

- /var/log/cloud-init.log

2023-04-19 16:22:50,628 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-aaaaaaaaaaaaaaaaaa/sem/config_scripts_user - wb: [644] 25 bytes
2023-04-19 16:22:50,628 - helpers.py[DEBUG]: Running config-scripts-user using lock (<FileLock using file '/var/lib/cloud/instances/i-aaaaaaaaaaaaaaaaaa/sem/config_scripts_user'>)
2023-04-19 16:22:50,628 - subp.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/part-001'] with allowed return codes [0] (shell=False, capture=False)
2023-04-19 16:23:02,236 - subp.py[DEBUG]: Unexpected error while running command.
Command: ['/var/lib/cloud/instance/scripts/part-001']
Exit code: 1
Reason: -
Stdout: -
Stderr: -
2023-04-19 16:23:02,236 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2023-04-19 16:23:02,237 - handlers.py[DEBUG]: finish: modules-final/config-scripts-user: FAIL: running config-scripts-user with frequency once-per-instance
2023-04-19 16:23:02,237 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
2023-04-19 16:23:02,237 - util.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/config/modules.py", line 246, in _run_modules
    ran, _r = cc.run(
  File "/usr/lib/python3/dist-packages/cloudinit/cloud.py", line 67, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 185, in run
    results = functor(*args)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py", line 54, in handle
    subp.runparts(runparts_path)
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 427, in runparts
    raise RuntimeError(
RuntimeError: Runparts: 1 failures (part-001) in 1 attempted commands

- /var/log/cloud-init-output.log
Cloud-init v. 23.1.1-0ubuntu0~20.04.1 running 'modules:config' at Wed, 19 Apr 2023 16:22:49 +0000. Up 19.55 seconds.
+ exec
++ tee /var/log/user-data.log
++ logger -t user-data -s
Cloud-init v. 23.1.1-0ubuntu0~20.04.1 running 'modules:final' at Wed, 19 Apr 2023 16:22:50 +0000. Up 20.32 seconds.
2023-04-19 16:23:02,236 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2023-04-19 16:23:02,237 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
Cloud-init v. 23.1.1-0ubuntu0~20.04.1 finished at Wed, 19 Apr 2023 16:23:02 +0000. Datasource DataSourceEc2Local. Up 32.06 seconds

- journalctl -u snapd.seeded.service
-- Reboot --
Apr 19 16:22:49 ip-*-*-*-* systemd[1]: Starting Wait until snapd is fully seeded...
Apr 19 16:22:49 ip-*-*-*-* systemd[1]: Finished Wait until snapd is fully seeded.

- journalctl -u snapd.service
-- Reboot --
Apr 19 16:22:42 ip-*-*-*-* systemd[1]: Starting Snap ...

Revision history for this message
DingGGu (dinggggu) wrote (last edit ):

Please ignore the `kubelet-eks.daemon[935]: cat: /var/snap/kubelet-eks/92/args: No such file or directory` log output in my post.

This log occurred when creating an image from an EKS Ubuntu image using Packer.

Currently, the args are set by the snap:

> cat /var/snap/kubelet-eks/104/args

--node-labels=instance-lifecycle=mixed,*=*,karpenter.sh/capacity-type=spot,karpenter.sh/provisioner-name=* --register-with-taints=dedicated=*:NoSchedule
--address="0.0.0.0"
--anonymous-auth=false
--authentication-token-webhook
--authorization-mode="Webhook"
--cgroup-driver="cgroupfs"
--client-ca-file="/etc/kubernetes/pki/ca.crt"
--cloud-provider="aws"
--cluster-dns="172.20.0.10"
--cluster-domain="cluster.local"
--config="/etc/kubernetes/kubelet/kubelet-config.json"
--container-runtime="remote"
--container-runtime-endpoint="unix:///run/containerd/containerd.sock"
--feature-gates="RotateKubeletServerCertificate=true"
--kubeconfig="/var/lib/kubelet/kubeconfig"
--max-pods="234"
--node-ip="*"
--pod-infra-container-image="602401143452.dkr.ecr.ap-northeast-2.amazonaws.com/eks/pause:3.5"
--register-node
--resolv-conf="/run/systemd/resolve/resolv.conf"

It's interesting that --register-with-taints is not returned as its own line in args. However, there are times when the node registers normally, so this is not a problem, right?

---

After manually running "sudo systemctl restart snap.kubelet-eks.daemon", the node registered normally.

Revision history for this message
George Kraft (cynerva) wrote (last edit ):

Highlighting the most relevant logs from comment #7. Snapd fails to start the kubelet-eks.daemon service:

Apr 19 16:23:02 ip-*-*-*-* snapd[862]: taskrunner.go:289: [change 12 "Run service command \"start\" for services [\"daemon\"] of snap \"kubelet-eks\"" task] failed: systemctl command [start snap.kubelet-eks.daemon.service] failed with exit status 1: Job for snap.kubelet-eks.daemon.service failed because the control process exited with error code.

And matching that timestamp up with the kubelet-eks.daemon logs:

Apr 19 16:23:02 ip-*-*-*-* systemd[1]: snap.kubelet-eks.daemon.service: Start request repeated too quickly.
Apr 19 16:23:02 ip-*-*-*-* systemd[1]: snap.kubelet-eks.daemon.service: Failed with result 'exit-code'.
Apr 19 16:23:02 ip-*-*-*-* systemd[1]: Failed to start Service for snap application kubelet-eks.daemon.

I think it goes like this: The kubelet-eks snap is initially installed, but hasn't been configured. During this time the service repeatedly crashes because it's missing important configuration. It hits the systemd start rate limit, at which point systemd blocks further start attempts. Later, bootstrap.sh runs `snap set ...` to configure the service, and `snap start kubelet-eks` to start it. The start attempt fails because the service is still being start rate limited by systemd.
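
If that is the failure mode, a hedged mitigation sketch (not an official fix) would be to clear systemd's failure and start-limit state right before bootstrap.sh asks snapd to start the service:

# Clear the accumulated failure/start-rate-limit state so the next start
# attempt isn't rejected with "Start request repeated too quickly".
sudo systemctl reset-failed snap.kubelet-eks.daemon.service
sudo snap start kubelet-eks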

Revision history for this message
Thomas Bechtold (toabctl) wrote :

Images with serial 20230517 contain a fix.
Please let us know if the problem still occurs.

Changed in cloud-images:
status: New → Fix Released
assignee: nobody → Thomas Bechtold (toabctl)
Revision history for this message
Robby Pocase (rpocase) wrote :

We rolled back the primary fix (changing service start timing with the snapcraft definition) because it seems to have caused regressions in existing clusters. We'll revisit this ASAP.

Changed in cloud-images:
status: Fix Released → In Progress
Revision history for this message
Robby Pocase (rpocase) wrote :

Kevin just linked an MP aimed at fully fixing this issue. Once it is merged, the plan is to rebuild all snaps, which WILL trigger a refresh again, but we don't expect it to introduce the same problematic state that was noted in [0].

[0] - https://bugs.launchpad.net/cloud-images/+bug/2020072

Revision history for this message
Nikita Somikov (qwedcftyu) wrote (last edit ):

Hello. Has the problem been fixed completely? I'm asking here because my report [0] is marked as a duplicate of this one.

[0] - https://bugs.launchpad.net/cloud-images/+bug/2020072

Revision history for this message
Robby Pocase (rpocase) wrote :

@Nikita - sorry for the delayed confirmation. To the best of my knowledge, the latest AMI and snaps SHOULD resolve this issue. Definitely let us know if you see it happening again on new AMIs/the latest snap.

Changed in cloud-images:
status: In Progress → Fix Released
Revision history for this message
Nikita Somikov (qwedcftyu) wrote :

Now we're seeing strange behaviour when starting new nodes: no arguments are passed to the kubelet args file, and we are seeing cloud-init errors (see the screenshots).
This time, though, we only see this behaviour on our own AMIs that use your AMI as the source and add some extra installations via Packer (I didn't change anything in the Packer templates).
It looks like the problem is now a combination of your final fix and our use of Packer. Maybe I should add some delays to the Packer scripts?

Revision history for this message
Nikita Somikov (qwedcftyu) wrote :
Revision history for this message
Nikita Somikov (qwedcftyu) wrote :
Revision history for this message
Nikita Somikov (qwedcftyu) wrote :
Revision history for this message
Thomas Bechtold (toabctl) wrote :

Thanks Nikita for the screenshots.
Could you share your Packer config and the snapd logs, please?

Revision history for this message
Nikita Somikov (qwedcftyu) wrote :
Revision history for this message
Nikita Somikov (qwedcftyu) wrote :
Revision history for this message
Nikita Somikov (qwedcftyu) wrote :
Revision history for this message
Nikita Somikov (qwedcftyu) wrote :
Revision history for this message
Nikita Somikov (qwedcftyu) wrote :
Revision history for this message
Nikita Somikov (qwedcftyu) wrote :

Yes, attached the details above

Changed in cloud-images:
assignee: Thomas Bechtold (toabctl) → nobody
assignee: nobody → Thomas Bechtold (toabctl)
Revision history for this message
jan grant (jangrant) wrote :

We've got a similar bug, and think we have the root cause - see https://bugs.launchpad.net/cloud-images/+bug/2023284

Revision history for this message
DingGGu (dinggggu) wrote :

It started happening again today, with different symptoms.

--- cloud-init.log
2023-06-13 21:03:41,434 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2023-06-13 21:03:41,434 - handlers.py[DEBUG]: finish: modules-final/config-scripts-user: FAIL: running config-scripts-user with frequency once-per-instance
2023-06-13 21:03:41,434 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
2023-06-13 21:03:41,434 - util.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/config/modules.py", line 246, in _run_modules
    ran, _r = cc.run(
  File "/usr/lib/python3/dist-packages/cloudinit/cloud.py", line 67, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 185, in run
    results = functor(*args)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py", line 54, in handle
    subp.runparts(runparts_path)
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 427, in runparts
    raise RuntimeError(
RuntimeError: Runparts: 1 failures (part-001) in 1 attempted commands

--- user-data.log
+ /etc/eks/bootstrap.sh REDACTED --apiserver-endpoint REDACTED --b64-cluster-ca REDACTED --container-runtime containerd --kubelet-extra-args '--node-labels=REDACTED --register-with-taints=dedicated=REDACTED:NoSchedule'
Using containerd as the container runtime
Aliasing EKS k8s snap commands
Added:
  - kubelet-eks.kubelet as kubelet
Added:
  - kubectl-eks.kubectl as kubectl
Stopping k8s daemons until configured
error: snap "kubelet-eks" has "auto-refresh" change in progress
Exited with error on line 353

---

And there are no snap.kubelet-eks.daemon journal logs on the system.

I also create a custom AMI via Packer; the source AMI was the latest image, serial 20230517.

Revision history for this message
Robby Pocase (rpocase) wrote :

@DingGGu Thanks for the update. As a possible workaround, adding "snap refresh kubelet" to the beginning of your Packer build steps (if not already present) should let you verify whether the latest serial resolves your issue. The latest serial for 1.24 is 20230607. We should also have a new serial later today that includes a fix for the bug Jan listed. I'm not positive it will fix your issue, but if it's the same root cause I would expect it to resolve your lingering issues. Thanks for your patience on this!

Revision history for this message
George Kraft (cynerva) wrote :

The timing of that last occurrence coincides with the new wave of kubelet-eks snaps we released yesterday to fix LP:2023284. I suspect this auto-refresh error may happen every time we release updated snaps, and I don't think there's anything we can do within the snap to prevent it. Seems like either AMIs will need to be built with `snap refresh --hold kubelet-eks` to prevent the snap from auto-refreshing, or the bootstrap.sh script will need to catch errors from `snap set` and retry in case an auto-refresh is occurring.
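
For illustration, a rough sketch of that retry approach (this is not the actual bootstrap.sh patch, and the kubelet-args key shown is purely hypothetical):

# Retry a snap command while an auto-refresh change may be in progress.
snap_retry() {
    local attempt
    for attempt in $(seq 1 30); do
        "$@" && return 0
        echo "snap busy (auto-refresh in progress?), retrying in 10s" >&2
        sleep 10
    done
    return 1
}
snap_retry snap set kubelet-eks kubelet-args="$KUBELET_EXTRA_ARGS"   # hypothetical key
snap_retry snap start kubelet-eks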

Revision history for this message
Robby Pocase (rpocase) wrote :

We had discussion around adding `snap refresh --hold`, but felt that keeping the ability to force changes to a channel in times like these was paramount. The general expectation is that a `snap refresh` should never result in downtime for an active service (and if it does, it should be fixable within the snap for most cases). I am still open to the option, but I worry that we would not have enough visibility to advertise critical updates to EKS users with long running clusters.

The suggested bootstrap.sh fix seems perfectly reasonable and I think we could approach that in the near term. I'll make sure it gets prioritized soon.

Revision history for this message
DingGGu (dinggggu) wrote :

Thanks for the reply.
Setting `snap refresh --hold kubelet-eks` in the Packer initialization script does not work on my side.

I think snapd upgrades the package during the boot sequence.

Any ideas for fixing this issue before a new image is released?

Revision history for this message
Dave (dpedu) wrote :

I am still seeing this issue with the 20230518 build of the 1.22 EKS ami, named amazon/ubuntu-eks/k8s_1.22/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230518, in us-east-1.

Revision history for this message
Robby Pocase (rpocase) wrote :

@Dave - that makes sense. It looks like we haven't had a 20.04 build come through since this fix was released. I'll check into that and make sure one is released ASAP. We are tracking additional fallout from this fix in LP:2023284. I've subscribed you, so you should get notifications when I drop an AMI list.

Revision history for this message
Robby Pocase (rpocase) wrote :

@Dave - I just noticed your EKS version. 1.22 is past EOL and we no longer build images for it. To fully resolve this issue you will need to upgrade to EKS 1.23

Revision history for this message
Nikita Somikov (qwedcftyu) wrote :

@Robby, I'm also on version 1.22, but there were no such problems before, so there should be a way to solve this. How can I solve the problem in the short term without upgrading Kubernetes?

Revision history for this message
Thomas Bechtold (toabctl) wrote :

> @Robby, I'm also on version 1.22, but there were no such problems before, so there should be a way to solve this. How can I solve the problem in the short term without upgrading Kubernetes?

You can use older images for 1.22.

Revision history for this message
Nikita Somikov (qwedcftyu) wrote :

@Thomas, should I use an image older than 2023.05?

Revision history for this message
Thomas Bechtold (toabctl) wrote :

> @Thomas, should I use an image older than 2023.05?

You need to use a version that has a different patch version of kubelet-eks (so it doesn't auto-update itself). You will then have a patch-version mismatch between the control plane and the worker nodes, but that might be OK.
I would try something from 202304.

Revision history for this message
Nikita Somikov (qwedcftyu) wrote :

ubuntu-eks/k8s_1.22/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230420 unfortunately also failed (around 1 node in 10). It started with snap revision 95, but updated (though not on all nodes) to revision 160 during startup.

Revision history for this message
Stijn (sdehaes) wrote :

We are seeing issues during startup with the kubectl-eks snap, similar to the issues seen here with kubelet-eks. It is making our nodes fail to boot. We are currently using the latest Ubuntu EKS 1.26 images.

Revision history for this message
Prem Sompura (premsompura) wrote :

Facing a similar issue, even with ubuntu-eks/k8s_1.22/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230113. We use a custom bootstrap.sh script, so what I have done to fix this is disable auto-refresh temporarily. You can do that by adding `snap set system refresh.metered=hold` to the bootstrap.sh script, and after kubelet starts you can re-enable auto-refresh with `snap set system refresh.metered=null`.
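
For reference, a minimal user-data sketch of that workaround wrapped around the bootstrap call ($CLUSTER_NAME and $EXTRA_ARGS are placeholders; treat this as a best-effort mitigation rather than a guaranteed fix):

#!/bin/bash -xe
snap set system refresh.metered=hold          # hold auto-refresh before bootstrapping
/etc/eks/bootstrap.sh "$CLUSTER_NAME" --kubelet-extra-args "$EXTRA_ARGS"
snap set system refresh.metered=null          # restore the default afterwards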

Revision history for this message
Robby Pocase (rpocase) wrote :

@premsompura's suggestion is still going to be the most consistent workaround for now.

> 1.26 new serials
@sdehaes I just checked and we still don't have a release out. I'll make sure we have eyes on these pipelines today to get 1.26 (and the remaining serials) ushered through. We also still don't have a release out for 1.24 or 1.25.

Revision history for this message
Thomas Bechtold (toabctl) wrote :

We have updated images for 1.23, 1.24, 1.25, and 1.26 (serial 20230623) that will hopefully help with this issue. Please try those.

Revision history for this message
Prem Sompura (premsompura) wrote :

Anyone using a 1.22 image can use this command in the bootstrap.sh script to avoid the auto-refresh issue: `snap refresh --hold=1h kubelet-eks kubectl-eks`
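
For illustration, that hold would sit at the top of the user-data, before bootstrap.sh, roughly like this (cluster name and kubelet arguments are placeholders; this assumes a snapd recent enough to support --hold):

#!/bin/bash -xe
# Hold kubelet-eks/kubectl-eks refreshes for an hour so an auto-refresh
# cannot race the bootstrap, then join the cluster as usual.
snap refresh --hold=1h kubelet-eks kubectl-eks
/etc/eks/bootstrap.sh "$CLUSTER_NAME" --kubelet-extra-args "$EXTRA_ARGS"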
