racy configuration of kubelet with recent images - eks nodes failing to taint / label on creation

Bug #2023284 reported by jan grant
This bug affects 3 people
Affects: cloud-images
Status: Fix Released
Importance: High
Assigned to: George Kraft
Milestone: (none)

Bug Description

Hi. I'm using EKS with a self-managed nodepool, set up via Terraform.

Over the last 72 hours or so, there has been a high incidence (but not 100%) of new instances registering themselves with k8s but not honouring the --register-with-taints or --node-labels flags.

These instances are booted with the following user-data:

[[[
#!/bin/bash
set -e
B64_CLUSTER_CA=
API_SERVER_URL=
/etc/eks/bootstrap.sh serverless-dev-europe-1 --container-runtime containerd --kubelet-extra-args '--register-with-taints=foo-node=true:NoSchedule --node-labels=foo-node=true' --b64-cluster-ca $B64_CLUSTER_CA --apiserver-endpoint $API_SERVER_URL
]]]

On inspection of a broken node, kubelet is running from snap with those command-line arguments.

My *suspicion* is that the user-data script is racing: perhaps kubelet comes up once without the appropriate arguments. That would cause the node to be registered; by the time the process is replaced, the node resource already exists in k8s and isn't updated. (Certainly, restarting the kubelet process doesn't address the issue.)

I see that https://bugs.launchpad.net/cloud-images/+bug/2012689 is mentioned as a recent fix; I'm wondering whether this is related.
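
For reference, this is how the missing taints/labels can be confirmed on a broken node (the node name below is just a placeholder):

[[[
# Placeholder node name; on an affected node the foo-node taint and label are absent.
kubectl get node ip-10-0-0-1.eu-west-1.compute.internal \
    -o jsonpath='{.spec.taints}{"\n"}{.metadata.labels}{"\n"}'
]]]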

Revision history for this message
jan grant (jangrant) wrote :

The problem here is two "snap set" commands, one after the other, in bootstrap.sh.

The first gives kubelet all the information to contact the API server and register itself, but not the extra args. This restarts kubelet and begins a race.

The second adds the extra-args and restarts kubelet.

--register-with-taints and --node-labels have no effect if they are applied after the node has registered.

If the first restart of kubelet gets as far as registering the node, those taints / labels will *never* apply.

The fix is to merge the two `snap set` commands into one.

A more principled fix would be to not start kubelet at all until it is fully configured.
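
For illustration, a merged call might look something like the sketch below. The taint and label keys match the workaround further down this thread; the endpoint/CA key names are placeholders, since the actual option names bootstrap.sh passes to `snap set` aren't shown in this report.

[[[
# Sketch only: merge everything into ONE "snap set" so kubelet restarts once,
# fully configured. The first two key names below are placeholders, not the
# real options bootstrap.sh uses.
snap set kubelet-eks \
    apiserver-endpoint="$API_SERVER_URL" \
    cluster-ca="$B64_CLUSTER_CA" \
    register-with-taints=foo-node=true:NoSchedule \
    node-labels=foo-node=true
]]]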

jan grant (jangrant)
summary: - amazon/ubuntu-eks/k8s_1.24/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230607 - eks nodes failing to taint / label on creation
+ racy configuration of kubelet with recent images - eks nodes failing to taint / label on creation
Revision history for this message
jan grant (jangrant) wrote :

Our workaround is to preconfigure the critical flags in snap before launching the bootstrap script; the user-data now contains this:

[[[
snap set kubelet-eks register-with-taints=foo-node=true:NoSchedule
snap set kubelet-eks node-labels=foo-node=true
B64_CLUSTER_CA=
API_SERVER_URL=
/etc/eks/bootstrap.sh yet-another-cluster --container-runtime containerd --kubelet-extra-args '--register-with-taints=foo-node=true:NoSchedule --node-labels=foo-node=true' --b64-cluster-ca $B64_CLUSTER_CA --apiserver-endpoint $API_SERVER_URL
]]]

and this is reliably working around the race.

Revision history for this message
Robby Pocase (rpocase) wrote :

Hey Jan - thanks for filing this! The thorough notes are really appreciated. I agree with your assessment. This is an unforeseen consequence of [1]. I'll aim to work on a more permanent resolution ASAP. For a full fix, I believe we need to do two things.

* Update bootstrap.sh as mentioned (use a single `snap set` call; the original implementation assumed `snap set` was atomic and did not restart the service)
* Follow up with the MP owner to discuss how to adjust this so it doesn't introduce this inconsistency for users of older AMI deployments (e.g. fleets that refresh the snap to consume the other fixes without a full AMI upgrade)

[1] - https://git.launchpad.net/snap-kubelet/commit/?id=6c89724888bc5427bbc8828620de0d3c509a5884

Revision history for this message
Robby Pocase (rpocase) wrote :

@Nikita I've added you as a watcher here, as the first proposed fix may be sufficient to address your Packer concerns. I'll let you know when we have AMIs out with this fix (hopefully tomorrow, but that depends on whether I can fully cycle back to bootstrap.sh). If this does not address it, we should follow up with a new LP bug to focus our troubleshooting.

Revision history for this message
Badal (badaldavda8) wrote :

I get the following output when bootstrapping self-managed nodes:

/etc/eks/bootstrap.sh myeksABC123 --kubelet-extra-args --max-pods=29 --b64-cluster-ca ABC123.gr7.eu-west-1.eks.amazonaws.com --use-max-pods false
Using containerd as the container runtime
Aliasing EKS k8s snap commands
error: snap "kubelet-eks" has "auto-refresh" change in progress
Exited with error on line 349

I added a sleep of 30 seconds, but I still get the same error. This is with Ubuntu EKS AMI v1.26.4.

Output of `snap tasks --last=auto-refresh`:
Status Spawn Ready Summary
Done today at 19:26 UTC today at 19:26 UTC Ensure prerequisites for "snapd" are available
Done today at 19:26 UTC today at 19:26 UTC Download snap "snapd" (19361) from channel "latest/stable"
Done today at 19:26 UTC today at 19:26 UTC Fetch and check assertions for snap "snapd" (19361)
Done today at 19:26 UTC today at 19:26 UTC Mount snap "snapd" (19361)
Done today at 19:26 UTC today at 19:26 UTC Run pre-refresh hook of "snapd" snap if present
Done today at 19:26 UTC today at 19:26 UTC Stop snap "snapd" services
Done today at 19:26 UTC today at 19:26 UTC Remove aliases for snap "snapd"
Done today at 19:26 UTC today at 19:26 UTC Make current revision for snap "snapd" unavailable
Done today at 19:26 UTC today at 19:26 UTC Copy snap "snapd" data
Done today at 19:26 UTC today at 19:26 UTC Setup snap "snapd" (19361) security profiles
Done today at 19:26 UTC today at 19:26 UTC Make snap "snapd" (19361) available to the system
Done today at 19:26 UTC today at 19:26 UTC Automatically connect eligible plugs and slots of snap "snapd"
Done today at 19:26 UTC today at 19:26 UTC Set automatic aliases for snap "snapd"
Done today at 19:26 UTC today at 19:26 UTC Setup snap "snapd" aliases
Done today at 19:26 UTC today at 19:26 UTC Run post-refresh hook of "snapd" snap if present
Done today at 19:26 UTC today at 19:26 UTC Start snap "snapd" (19361) services
Done today at 19:26 UTC today at 19:26 UTC Clean up "snapd" (19361) install
Done today at 19:26 UTC today at 19:26 UTC Run health check of "snapd" snap
Done today at 19:26 UTC today at 19:26 UTC Ensure prerequisites for "core18" are available
Done today at 19:26 UTC today at 19:27 UTC Download snap "core18" (2785) from channel "latest/stable"
Done today at 19:26 UTC today at 19:27 UTC Fetch and check assertions for snap "core18" (2785)
Done today at 19:26 UTC today at 19:27 UTC Mount snap "core18" (2785)
Done today at 19:26 UTC today at 19:27 UTC Run pre-refresh hook of "core18" snap if present
Done today at 19:26 UTC today at 19:27 UTC Stop snap "core18" services
Done today at 19:26 UTC today at 19:27 UTC Remove aliases for snap "core18"
Done today at 19:26 UTC today at 19:27 UTC Make current revision for snap "core18" unavailable
Done today at 19:26 UTC today at 19:27 UTC Copy snap "core18" data
Done today at 19:26 UTC today at 19:27 UTC Setup snap "core18" (2785) security profiles
Done today at 19:26 UTC today at 19:27 UTC Make snap "core18" (2785) available to the system
Done today at 19:26 UTC today at 19:27 UT...
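
One possible mitigation for the "auto-refresh change in progress" error, untested here, is to wait for snapd to finish any in-flight changes before invoking bootstrap.sh. A rough sketch (it assumes the current column layout of `snap changes`):

[[[
#!/bin/bash
# Untested sketch: block until snapd reports no change in progress, so
# bootstrap.sh's "snap set kubelet-eks ..." isn't rejected mid auto-refresh.
# Relies on the column layout of `snap changes` output (ID, Status, ...).
while snap changes 2>/dev/null | grep -Eq '^[0-9]+ +(Do|Doing|Undo|Undoing|Wait)\b'; do
    sleep 10
done
# ...then run /etc/eks/bootstrap.sh with the usual arguments.
]]]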

Revision history for this message
Nikita Somikov (qwedcftyu) wrote :

@Robby, thanks!

Revision history for this message
Robby Pocase (rpocase) wrote :

George reverted some of the kubelet snap changes yesterday. New serials should roll out today with this fix included, and it is expected to resolve the problems cited in this bug report. For the time being, we are holding off on bootstrap.sh changes, since George beat us to the punch on one of the possible solution paths (thanks again!). I'll be monitoring pipelines and will try to notify on the latest serials as soon as they become available.

Changed in cloud-images:
assignee: nobody → George Kraft (cynerva)
importance: Undecided → High
status: New → Fix Committed
Revision history for this message
Robby Pocase (rpocase) wrote :

I just checked our latest serials and the new builds still haven't been produced for some reason. I'll touch base with the team early AM to make sure the build is triggered properly and new serials roll out.

Revision history for this message
Robby Pocase (rpocase) wrote :

Quick update - new serials are out for 1.23 (20230616), but we are seeing test failures on 1.24-1.26. These seem related to not using a kubelet version that matches EKS exactly. We are working on this and hope to have new builds out by Monday, but it is likely to slip to Tuesday because the request for new snap channels came in late on Friday and most of the people on the teams responsible for the snaps and the EKS image will be out Monday (US holiday).

Revision history for this message
Krishna Venkata (krishna-venkata) wrote :

@rpocase do you have an update on the 1.24 AMIs? When will they be available?

Revision history for this message
Robby Pocase (rpocase) wrote :

@krishna 1.24 SHOULD be out today. We've hit multiple (completely unrelated) issues getting these released, but I have dedicated eyes on the EKS pipelines today.

Revision history for this message
Thomas Bechtold (toabctl) wrote :

We have updated images for 1.23, 1.24, 1.25 and 1.26 (serial 20230623) which will hopefully help with this issue. Please try those.

Changed in cloud-images:
status: Fix Committed → Fix Released