Rarely fails to join EKS cluster

Bug #2012689 reported by DingGGu
This bug affects 6 people
Affects: cloud-images | Status: Fix Released | Importance: High | Assigned to: Thomas Bechtold

Bug Description

We are running multiple clusters.
Nodes in clusters that frequently scale in and out sometimes fail to join the cluster.

Looking at /var/log/user-data.log, running `snap start kubelet-eks` from /etc/eks/bootstrap.sh fails. Looking at journalctl, it seems kubelet is running without any of its arguments specified at all.

If I manually run /etc/eks/bootstrap.sh after the nodes are orphaned, they join the cluster just fine.

I think this is a timing issue between starting the snap and setting its arguments.

Using the 1.24 AMI, ami-04c00a6fc53487c5a.

An interesting log line where the snap does not read its arguments:
kubelet-eks.daemon[935]: cat: /var/snap/kubelet-eks/92/args: No such file or directory

kubelet runs fail with the same errors:
kubelet-eks.daemon[889]: I0307 19:24:58.886750 889 util_unix.go:104] "Using this format as endpoint is deprecated, please consider using full url format." deprecatedFormat="" fullURLFormat="unix://"
kubelet-eks.daemon[889]: W0307 19:24:58.888995 889 clientconn.go:1331] [core] grpc:

Some logs from journalctl:
systemd[1]: Started containerd container runtime.
systemd[1]: Started Service for snap application amazon-ssm-agent.amazon-ssm-agent.
systemd[1]: Reloading.
systemd[1]: Started Service for snap application kubelet-eks.daemon.
systemd[1]: Started snap.kubelet-eks.hook.configure.3540f36b-29a1-4974-8c41-31995a6c637e.scope.
kubelet-eks.daemon[935]: cat: /var/snap/kubelet-eks/92/args: No such file or directory
amazon-ssm-agent.amazon-ssm-agent[833]: Error occurred fetching the seelog config file path: open /etc/amazon/ssm/seelog.xml: no such file or directory
amazon-ssm-agent.amazon-ssm-agent[833]: Initializing new seelog logger
amazon-ssm-agent.amazon-ssm-agent[833]: New Seelog Logger Creation Complete
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 WARN Error adding the directory '/etc/amazon/ssm' to watcher: no such file or directory
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO Proxy environment variables:
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO https_proxy:
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO http_proxy:
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO no_proxy:
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO Agent will take identity from EC2
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO [amazon-ssm-agent] using named pipe channel for IPC
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO [amazon-ssm-agent] using named pipe channel for IPC
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO [amazon-ssm-agent] using named pipe channel for IPC
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO [amazon-ssm-agent] amazon-ssm-agent - v3.1.1732.0
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO [amazon-ssm-agent] OS: linux, Arch: amd64
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:53 INFO [CredentialRefresher] Identity does not require credential refresher
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:54 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] [WorkerProvider] Worker ssm-agent-worker is not running, starting worker process
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:54 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] [WorkerProvider] Worker ssm-agent-worker (pid:948) started
amazon-ssm-agent.amazon-ssm-agent[833]: 2023-03-07 19:24:54 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] Monitor long running worker health every 60 seconds
systemd[1]: snap.kubelet-eks.hook.configure.3540f36b-29a1-4974-8c41-31995a6c637e.scope: Succeeded.
dbus-daemon[531]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' requested by ':1.9' (uid=0 pid=539 comm="/usr/lib/snapd/snapd " label="unconfined")
systemd[1]: Starting Time & Date Service...
dbus-daemon[531]: [system] Successfully activated service 'org.freedesktop.timedate1'
systemd[1]: Started Time & Date Service.
systemd[1]: Started Kubernetes systemd probe.
kubelet-eks.daemon[889]: I0307 19:24:58.858562 889 server.go:399] "Kubelet version" kubeletVersion="v1.24.9"
kubelet-eks.daemon[889]: I0307 19:24:58.858619 889 server.go:401] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
kubelet-eks.daemon[889]: I0307 19:24:58.858841 889 server.go:562] "Standalone mode, no API client"
systemd[1]: run-r5647ef4c140746af8f68048b9b657df0.scope: Succeeded.
kubelet-eks.daemon[889]: I0307 19:24:58.886249 889 server.go:450] "No api server defined - no events will be sent to API server"
kubelet-eks.daemon[889]: I0307 19:24:58.886266 889 server.go:648] "--cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to /"
kubelet-eks.daemon[889]: I0307 19:24:58.886544 889 container_manager_linux.go:262] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
kubelet-eks.daemon[889]: I0307 19:24:58.886618 889 container_manager_linux.go:267] "Creating Container Manager object based on Node Config" nodeConfig={RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: KubeletOOMScoreAdj:-999 ContainerRuntime: CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:cgroupfs KubeletRootDir:/var/lib/kubelet ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: ReservedSystemCPUs: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[] SystemReserved:map[] HardEvictionThresholds:[{Signal:nodefs.inodesFree Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>} {Signal:imagefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.15} GracePeriod:0s MinReclaim:<nil>} {Signal:memory.available Operator:LessThan Value:{Quantity:100Mi Percentage:0} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.1} GracePeriod:0s MinReclaim:<nil>}]} QOSReserved:map[] ExperimentalCPUManagerPolicy:none ExperimentalCPUManagerPolicyOptions:map[] ExperimentalTopologyManagerScope:container ExperimentalCPUManagerReconcilePeriod:10s ExperimentalMemoryManagerPolicy:None ExperimentalMemoryManagerReservedMemory:[] ExperimentalPodPidsLimit:-1 EnforceCPULimits:true CPUCFSQuotaPeriod:100ms ExperimentalTopologyManagerPolicy:none}
kubelet-eks.daemon[889]: I0307 19:24:58.886635 889 topology_manager.go:133] "Creating topology manager with policy per scope" topologyPolicyName="none" topologyScopeName="container"
kubelet-eks.daemon[889]: I0307 19:24:58.886644 889 container_manager_linux.go:302] "Creating device plugin manager" devicePluginEnabled=true
kubelet-eks.daemon[889]: I0307 19:24:58.886706 889 state_mem.go:36] "Initialized new in-memory state store"
kubelet-eks.daemon[889]: I0307 19:24:58.886750 889 util_unix.go:104] "Using this format as endpoint is deprecated, please consider using full url format." deprecatedFormat="" fullURLFormat="unix://"
kubelet-eks.daemon[889]: W0307 19:24:58.888995 889 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to { <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix: missing address". Reconnecting...

Related branches

DingGGu (dinggggu)
description: updated
Robby Pocase (rpocase)
Changed in cloud-images:
importance: Undecided → High
Revision history for this message
Robby Pocase (rpocase) wrote :

Thanks for filing this bug, @dinggggu, and sorry for the extremely delayed response. We are working internally to consistently reproduce this and try to identify a root cause. We will update ASAP.

Revision history for this message
Robby Pocase (rpocase) wrote :

@DingGGu Are you able to provide a bit more information on the topology of your clusters?

> The cluster that frequently scale-in and out sometimes fail to join the cluster.

I take this to mean you have at least one autoscaling cluster where new nodes fail to join the cluster. Is that correct? If so, could you share your autoscaling deployment configuration (scrubbed of sensitive details if necessary)?

Revision history for this message
DingGGu (dinggggu) wrote :

Hello. I'm using Karpenter for autoscaling. Build-server (i.e. CI) workers run as Kubernetes Pods: a Pod is created when a build request is received, and while the Pod is Pending, Karpenter starts a new instance.
Dozens of build requests can arrive at once, in which case multiple instances are started and then, when the jobs are done, removed by Karpenter.
Sometimes an instance cannot join the cluster.

Karpenter does not replace bootstrap.sh; here is the instance user data:

--//
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash -xe
exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1
/etc/eks/bootstrap.sh '#REDACT#' --apiserver-endpoint 'https://#REDACT#.eks.amazonaws.com' --b64-cluster-ca '....' \
--container-runtime containerd \
--kubelet-extra-args '--node-labels=karpenter.sh/capacity-type=on-demand,karpenter.sh/provisioner-name=#REDACT# --register-with-taints=dedicated=#REDACT#:NoSchedule'
--//--

Revision history for this message
Kevin W Monroe (kwmonroe) wrote (last edit ):

It appears the kubelet daemon is attempting to start before its $SNAP_DATA/args file is present. This is generated by the configure hook, so indeed it seems like a timing issue where 'start' precedes 'configure'.

A potential workaround is to manually restart kubelet-eks on the failing node (by this time, the args file should be present):

sudo systemctl restart snap.kubelet-eks.daemon
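
For illustration, a minimal sketch of that workaround as a script, assuming the standard snapd "current" symlink under /var/snap (the roughly two-minute timeout is arbitrary):

# Sketch: wait for the configure hook to write the args file, then restart
# the kubelet-eks daemon; give up after ~2 minutes.
for i in $(seq 1 60); do
    [ -s /var/snap/kubelet-eks/current/args ] && break
    sleep 2
done
sudo systemctl restart snap.kubelet-eks.daemon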

Revision history for this message
Thomas Bechtold (toabctl) wrote :

It would be good to see some timestamps for the logs.

Could you share:

- /var/log/cloud-init.log
- /var/log/cloud-init-output.log
- journalctl -u snapd.seeded.service
- journalctl -u snapd.service
- journalctl -u snap.kubelet-eks.daemon.service

please?

Revision history for this message
DingGGu (dinggggu) wrote :

Since it's been a while, there are currently no logs for that. I'll let you know if it happens again.

As Kevin said, if I run sudo systemctl restart snap.kubelet-eks.daemon manually, kubelet runs fine. Since running it manually long after boot works, this looks like a timing issue with the snap during the boot sequence.

Revision history for this message
DingGGu (dinggggu) wrote :

- /var/log/cloud-init.log

2023-04-19 16:22:50,628 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-aaaaaaaaaaaaaaaaaa/sem/config_scripts_user - wb: [644] 25 bytes
2023-04-19 16:22:50,628 - helpers.py[DEBUG]: Running config-scripts-user using lock (<FileLock using file '/var/lib/cloud/instances/i-aaaaaaaaaaaaaaaaaa/sem/config_scripts_user'>)
2023-04-19 16:22:50,628 - subp.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/part-001'] with allowed return codes [0] (shell=False, capture=False)
2023-04-19 16:23:02,236 - subp.py[DEBUG]: Unexpected error while running command.
Command: ['/var/lib/cloud/instance/scripts/part-001']
Exit code: 1
Reason: -
Stdout: -
Stderr: -
2023-04-19 16:23:02,236 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2023-04-19 16:23:02,237 - handlers.py[DEBUG]: finish: modules-final/config-scripts-user: FAIL: running config-scripts-user with frequency once-per-instance
2023-04-19 16:23:02,237 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
2023-04-19 16:23:02,237 - util.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/config/modules.py", line 246, in _run_modules
    ran, _r = cc.run(
  File "/usr/lib/python3/dist-packages/cloudinit/cloud.py", line 67, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 185, in run
    results = functor(*args)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py", line 54, in handle
    subp.runparts(runparts_path)
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 427, in runparts
    raise RuntimeError(
RuntimeError: Runparts: 1 failures (part-001) in 1 attempted commands

- /var/log/cloud-init-output.log
Cloud-init v. 23.1.1-0ubuntu0~20.04.1 running 'modules:config' at Wed, 19 Apr 2023 16:22:49 +0000. Up 19.55 seconds.
+ exec
++ tee /var/log/user-data.log
++ logger -t user-data -s
Cloud-init v. 23.1.1-0ubuntu0~20.04.1 running 'modules:final' at Wed, 19 Apr 2023 16:22:50 +0000. Up 20.32 seconds.
2023-04-19 16:23:02,236 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2023-04-19 16:23:02,237 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
Cloud-init v. 23.1.1-0ubuntu0~20.04.1 finished at Wed, 19 Apr 2023 16:23:02 +0000. Datasource DataSourceEc2Local. Up 32.06 seconds

- journalctl -u snapd.seeded.service
-- Reboot --
Apr 19 16:22:49 ip-*-*-*-* systemd[1]: Starting Wait until snapd is fully seeded...
Apr 19 16:22:49 ip-*-*-*-* systemd[1]: Finished Wait until snapd is fully seeded.

- journalctl -u snapd.service
-- Reboot --
Apr 19 16:22:42 ip-*-*-*-* systemd[1]: Starting Snap ...

Revision history for this message
DingGGu (dinggggu) wrote (last edit ):

Please ignore the `kubelet-eks.daemon[935]: cat: /var/snap/kubelet-eks/92/args: No such file or directory` log output in my post.

This log occurred when creating an image from an EKS Ubuntu image using Packer.

Currently, the args are set by the snap:

> cat /var/snap/kubelet-eks/104/args

--node-labels=instance-lifecycle=mixed,*=*,karpenter.sh/capacity-type=spot,karpenter.sh/provisioner-name=* --register-with-taints=dedicated=*:NoSchedule
--address="0.0.0.0"
--anonymous-auth=false
--authentication-token-webhook
--authorization-mode="Webhook"
--cgroup-driver="cgroupfs"
--client-ca-file="/etc/kubernetes/pki/ca.crt"
--cloud-provider="aws"
--cluster-dns="172.20.0.10"
--cluster-domain="cluster.local"
--config="/etc/kubernetes/kubelet/kubelet-config.json"
--container-runtime="remote"
--container-runtime-endpoint="unix:///run/containerd/containerd.sock"
--feature-gates="RotateKubeletServerCertificate=true"
--kubeconfig="/var/lib/kubelet/kubeconfig"
--max-pods="234"
--node-ip="*"
--pod-infra-container-image="602401143452.dkr.ecr.ap-northeast-2.amazonaws.com/eks/pause:3.5"
--register-node
--resolv-conf="/run/systemd/resolve/resolv.conf"

It's interesting that --register-with-taints is not returned as its own line in args. However, there are times when the node registers normally, so this is not a problem, right?

---

After manually running "sudo systemctl restart snap.kubelet-eks.daemon", the node registered normally.

Revision history for this message
George Kraft (cynerva) wrote (last edit ):

Highlighting the most relevant logs from comment #7. Snapd fails to start the kubelet-eks.daemon service:

Apr 19 16:23:02 ip-*-*-*-* snapd[862]: taskrunner.go:289: [change 12 "Run service command \"start\" for services [\"daemon\"] of snap \"kubelet-eks\"" task] failed: systemctl command [start snap.kubelet-eks.daemon.service] failed with exit status 1: Job for snap.kubelet-eks.daemon.service failed because the control process exited with error code.

And matching that timestamp up with the kubelet-eks.daemon logs:

Apr 19 16:23:02 ip-*-*-*-* systemd[1]: snap.kubelet-eks.daemon.service: Start request repeated too quickly.
Apr 19 16:23:02 ip-*-*-*-* systemd[1]: snap.kubelet-eks.daemon.service: Failed with result 'exit-code'.
Apr 19 16:23:02 ip-*-*-*-* systemd[1]: Failed to start Service for snap application kubelet-eks.daemon.

I think it goes like this: The kubelet-eks snap is initially installed, but hasn't been configured. During this time the service repeatedly crashes because it's missing important configuration. It hits the systemd start rate limit, at which point systemd blocks further start attempts. Later, bootstrap.sh runs `snap set ...` to configure the service, and `snap start kubelet-eks` to start it. The start attempt fails because the service is still being start rate limited by systemd.
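
If that is the failure mode, a hedged mitigation sketch (not an official fix) would be to clear systemd's failure and start-limit state right before bootstrap.sh asks snapd to start the service:

# Clear the accumulated failure/start-rate-limit state so the next start
# attempt isn't rejected with "Start request repeated too quickly".
sudo systemctl reset-failed snap.kubelet-eks.daemon.service
sudo snap start kubelet-eks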

Revision history for this message
Thomas Bechtold (toabctl) wrote :

Images with serial 20230517 contain a fix.
Please let us know if the problem still occurs.

Changed in cloud-images:
status: New → Fix Released
assignee: nobody → Thomas Bechtold (toabctl)
Revision history for this message
Robby Pocase (rpocase) wrote :

We rolled back the primary fix (changing service start timing with the snapcraft definition) because it seems to have caused regressions in existing clusters. We'll revisit this ASAP.

Changed in cloud-images:
status: Fix Released → In Progress
Revision history for this message
Robby Pocase (rpocase) wrote :

Kevin just linked an MP aimed at fully fixing this issue. Once it is merged, the plan is to rebuild all snaps, which WILL trigger a refresh again, but we don't expect it to introduce the same problematic state that was noted in [0].

[0] - https://bugs.launchpad.net/cloud-images/+bug/2020072

Revision history for this message
Nikita Somikov (qwedcftyu) wrote (last edit ):

Hello. Has the problem been fixed completely? I'm asking here because my report [0] is marked as a duplicate of this one.

[0] - https://bugs.launchpad.net/cloud-images/+bug/2020072

Revision history for this message
Robby Pocase (rpocase) wrote :

@Nikita - sorry for the delayed confirmation. To the best of my knowledge, the latest AMI and snaps SHOULD resolve this issue. Definitely let us know if you see it happening again on new AMIs/the latest snap.

Changed in cloud-images:
status: In Progress → Fix Released
Revision history for this message
Nikita Somikov (qwedcftyu) wrote :

Now we're seeing strange behaviour when starting new nodes: no arguments are passed to the kubelet args file, and we are seeing cloud-init errors (see the screenshots).
This time, though, we only see this behaviour on our own AMIs that use your AMI as the source and add some extra installations via Packer (I didn't change anything in the Packer templates).
It looks like the problem is now a combination of your final fix and our use of Packer. Maybe I should add some delays to the Packer scripts?

Revision history for this message
Nikita Somikov (qwedcftyu) wrote :
Revision history for this message
Nikita Somikov (qwedcftyu) wrote :
Revision history for this message
Nikita Somikov (qwedcftyu) wrote :
Revision history for this message
Thomas Bechtold (toabctl) wrote :

Thanks Nikita for the screenshots.
Could you share your Packer config and the snapd logs, please?

Revision history for this message
Nikita Somikov (qwedcftyu) wrote :
Revision history for this message
Nikita Somikov (qwedcftyu) wrote :
Revision history for this message
Nikita Somikov (qwedcftyu) wrote :
Revision history for this message
Nikita Somikov (qwedcftyu) wrote :
Revision history for this message
Nikita Somikov (qwedcftyu) wrote :
Revision history for this message
Nikita Somikov (qwedcftyu) wrote :

Yes, attached the details above

Changed in cloud-images:
assignee: Thomas Bechtold (toabctl) → nobody
assignee: nobody → Thomas Bechtold (toabctl)
Revision history for this message
jan grant (jangrant) wrote :

We've got a similar bug, and think we have the root cause - see https://bugs.launchpad.net/cloud-images/+bug/2023284

Revision history for this message
DingGGu (dinggggu) wrote :

It started happening again today, with different symptoms.

--- cloud-init.log
2023-06-13 21:03:41,434 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2023-06-13 21:03:41,434 - handlers.py[DEBUG]: finish: modules-final/config-scripts-user: FAIL: running config-scripts-user with frequency once-per-instance
2023-06-13 21:03:41,434 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
2023-06-13 21:03:41,434 - util.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/config/modules.py", line 246, in _run_modules
    ran, _r = cc.run(
  File "/usr/lib/python3/dist-packages/cloudinit/cloud.py", line 67, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 185, in run
    results = functor(*args)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py", line 54, in handle
    subp.runparts(runparts_path)
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 427, in runparts
    raise RuntimeError(
RuntimeError: Runparts: 1 failures (part-001) in 1 attempted commands

--- user-data.log
+ /etc/eks/bootstrap.sh REDACTED --apiserver-endpoint REDACTED --b64-cluster-ca REDACTED --container-runtime containerd --kubelet-extra-args '--node-labels=REDACTED --register-with-taints=dedicated=REDACTED:NoSchedule'
Using containerd as the container runtime
Aliasing EKS k8s snap commands
Added:
  - kubelet-eks.kubelet as kubelet
Added:
  - kubectl-eks.kubectl as kubectl
Stopping k8s daemons until configured
error: snap "kubelet-eks" has "auto-refresh" change in progress
Exited with error on line 353

---

And there are no snap.kubelet-eks.daemon journal logs on the system.

I also create a custom AMI via Packer; the source AMI was the latest image, serial 20230517.

Revision history for this message
Robby Pocase (rpocase) wrote :

@DingGGu Thanks for the update. As a possible workaround, adding "snap refresh kubelet" to the beginning of your Packer build steps (if not already present) should let you verify whether the latest serial resolves your issue. The latest serial for 1.24 is 20230607. We should also have a new serial later today that includes a fix for the bug Jan listed. I'm not positive it will fix your issue, but if it's the same root cause I would expect it to resolve your lingering issues. Thanks for your patience on this!

Revision history for this message
George Kraft (cynerva) wrote :

The timing of that last occurrence coincides with the new wave of kubelet-eks snaps we released yesterday to fix LP:2023284. I suspect this auto-refresh error may happen every time we release updated snaps, and I don't think there's anything we can do within the snap to prevent it. Seems like either AMIs will need to be built with `snap refresh --hold kubelet-eks` to prevent the snap from auto-refreshing, or the bootstrap.sh script will need to catch errors from `snap set` and retry in case an auto-refresh is occurring.
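
For illustration, a rough sketch of that retry approach (this is not the actual bootstrap.sh patch, and the kubelet-args key shown is purely hypothetical):

# Retry a snap command while an auto-refresh change may be in progress.
snap_retry() {
    local attempt
    for attempt in $(seq 1 30); do
        "$@" && return 0
        echo "snap busy (auto-refresh in progress?), retrying in 10s" >&2
        sleep 10
    done
    return 1
}
snap_retry snap set kubelet-eks kubelet-args="$KUBELET_EXTRA_ARGS"   # hypothetical key
snap_retry snap start kubelet-eks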

Revision history for this message
Robby Pocase (rpocase) wrote :

We had discussion around adding `snap refresh --hold`, but felt that keeping the ability to force changes to a channel in times like these was paramount. The general expectation is that a `snap refresh` should never result in downtime for an active service (and if it does, it should be fixable within the snap for most cases). I am still open to the option, but I worry that we would not have enough visibility to advertise critical updates to EKS users with long running clusters.

The suggested bootstrap.sh fix seems perfectly reasonable and I think we could approach that in the near term. I'll make sure it gets prioritized soon.

Revision history for this message
DingGGu (dinggggu) wrote :

Thanks for the reply.
Setting `snap refresh --hold kubelet-eks` in the Packer initialization script does not work on my side.

I think snapd upgrades the package during the boot sequence.

Any ideas for fixing this issue before a new image is released?

Revision history for this message
Dave (dpedu) wrote :

I am still seeing this issue with the 20230518 build of the 1.22 EKS ami, named amazon/ubuntu-eks/k8s_1.22/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230518, in us-east-1.

Revision history for this message
Robby Pocase (rpocase) wrote :

@Dave - that makes sense. It looks like we haven't had a 20.04 build come through since this fix was released. I'll check into that and make sure one is released ASAP. We are tracking additional fallout from this fix in LP:2023284. I've subscribed you, so you should get notifications when I drop an AMI list.

Revision history for this message
Robby Pocase (rpocase) wrote :

@Dave - I just noticed your EKS version. 1.22 is past EOL and we no longer build images for it. To fully resolve this issue you will need to upgrade to EKS 1.23

Revision history for this message
Nikita Somikov (qwedcftyu) wrote :

@Robby, I'm also on version 1.22, but there were no such problems before, so there should be a way to solve this. How can I solve the problem in the short term without upgrading Kubernetes?

Revision history for this message
Thomas Bechtold (toabctl) wrote :

> @Robby, I'm also on version 1.22, but there were no such problems before, so there should be a way to solve this. How can I solve the problem in the short term without upgrading Kubernetes?

You can use older images for 1.22.

Revision history for this message
Nikita Somikov (qwedcftyu) wrote :

@Thomas, should I use an image older than 2023.05?

Revision history for this message
Thomas Bechtold (toabctl) wrote :

> @Thomas, should I use an image older than 2023.05?

You need to use a version that has a different patch version of kubelet-eks (so it doesn't auto-update itself). You will then have a patch-version mismatch between the control plane and the worker nodes, but that might be OK.
I would try something from 202304.

Revision history for this message
Nikita Somikov (qwedcftyu) wrote :

ubuntu-eks/k8s_1.22/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230420 unfortunately also failed (around 1 node in 10). It started with snap revision 95, but updated (though not on all nodes) to revision 160 during startup.

Revision history for this message
Stijn (sdehaes) wrote :

We are seeing issues during startup with the kubectl-eks snap, similar to the issues seen here with kubelet-eks. It is making our nodes fail to boot. We are currently using the latest Ubuntu EKS 1.26 images.

Revision history for this message
Prem Sompura (premsompura) wrote :

Facing a similar issue, even with ubuntu-eks/k8s_1.22/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230113. We use a custom bootstrap.sh script, so what I have done to fix this is disable auto-refresh temporarily. You can do that by adding `snap set system refresh.metered=hold` to the bootstrap.sh script, and after kubelet starts you can re-enable auto-refresh with `snap set system refresh.metered=null`.
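
For reference, a minimal user-data sketch of that workaround wrapped around the bootstrap call ($CLUSTER_NAME and $EXTRA_ARGS are placeholders; treat this as a best-effort mitigation rather than a guaranteed fix):

#!/bin/bash -xe
snap set system refresh.metered=hold          # hold auto-refresh before bootstrapping
/etc/eks/bootstrap.sh "$CLUSTER_NAME" --kubelet-extra-args "$EXTRA_ARGS"
snap set system refresh.metered=null          # restore the default afterwards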

Revision history for this message
Robby Pocase (rpocase) wrote :

@premsompura's suggestion is still going to be the most consistent workaround for now.

> 1.26 new serials
@sdehaes I just checked and we still don't have a release out. I'll make sure we have eyes on these pipelines today to get 1.26 (and the remaining serials) ushered through. We also still don't have a release out for 1.24 or 1.25.

Revision history for this message
Thomas Bechtold (toabctl) wrote :

We have updated images for 1.23, 1.24, 1.25, and 1.26 (serial 20230623) that will hopefully help with this issue. Please try those.

Revision history for this message
Prem Sompura (premsompura) wrote :

Anyone using a 1.22 image can use this command in the bootstrap.sh script to avoid the auto-refresh issue: `snap refresh --hold=1h kubelet-eks kubectl-eks`
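
For illustration, that hold would sit at the top of the user-data, before bootstrap.sh, roughly like this (cluster name and kubelet arguments are placeholders; this assumes a snapd recent enough to support --hold):

#!/bin/bash -xe
# Hold kubelet-eks/kubectl-eks refreshes for an hour so an auto-refresh
# cannot race the bootstrap, then join the cluster as usual.
snap refresh --hold=1h kubelet-eks kubectl-eks
/etc/eks/bootstrap.sh "$CLUSTER_NAME" --kubelet-extra-args "$EXTRA_ARGS"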
