support k8s-worker being deployed as different names

Bug #1906732 reported by Syed Mohammad Adnan Karim
This bug affects 2 people
Affects: Kubernetes Control Plane Charm
Status: Fix Released
Importance: Undecided
Assigned to: Kevin W Monroe
Milestone: 1.20+ck1

Bug Description

With the latest charms I was able to get an all-green juju status but only 3/9 kubernetes-worker nodes would register with the cluster when I ran kubectl get nodes.

Pastebin of failure scenario with newest kubernetes-master charm:
https://pastebin.canonical.com/p/gjV6vyFrx8/

This seems to be an issue with kubernetes-master charm versions newer than 850 because I was able to successfully deploy and get all workers registered with kubernetes-master-850.

Pastebin of success with kubernetes-master-850:
https://pastebin.canonical.com/p/Q3MSnVJXHS/

The bundle I used to reproduce this can be found here:
https://pastebin.canonical.com/p/TRNgs8gpn2/

As can be seen from the bundle, I also have different kubernetes-worker types (each one is configured with different labels and taints), which may be a cause of the problem on the newer charm versions.

The k8s version is 1.17, but I recall facing the same issue even with 1.19.

description: updated
Revision history for this message
Syed Mohammad Adnan Karim (karimsye) wrote :
Changed in charm-kubernetes-master:
assignee: nobody → Kevin W Monroe (kwmonroe)
status: New → In Progress
Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

Your first pastebin link has some data that seems weird to me. The node list includes:

node02ob100 Ready <none> 37m v1.17.14

From the same output, that node is machine 5:

5 started 172.27.100.106 node02ob100 bionic

But machine 5 is k8s-master/1:

kubernetes-master/1* active idle 5 172.27.100.106

I'm confused why a k8s-master would show up as a cluster node since kubelet doesn't run on masters by default. A few questions for you:

- Is the failing deployment re-using old machines? I ask because your second pastebin (the successful one) shows 172.27.100.106 as the IP address for k8s-worker-ref/0, which would have been a valid node for a different deployment.

- Did you see this failure during an upgrade or new deployment of the latest stable charms? If the former, can you provide the charm revs that you started with?

- Are you attempting to run kubelet on master units? If so, what is your process for installing/configuring kubelet?

- If you have a failed env available, ssh to the available nodes (3/9 in your original failure case) and check the "server:" entry in /root/cdk/kubeconfig. Is it pointing to the expected load balancer / master address? (See the example command after this list.)

- In the failed env, where did you run "kubectl get no" from -- within the cluster or a separate management workstation? The ~/.kube/config file does change with the current stable charms, so it's possible you have an old kubeconfig that is pointing to the wrong cluster and/or auth mechanism.
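
For reference, that kubeconfig check can be run from the Juju client with something along these lines (the unit name is just an example; the file is root-owned, hence sudo):

juju ssh kubernetes-worker/0 sudo grep server: /root/cdk/kubeconfig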

Fwiw, I deployed the stable charms with 1.19 and tainted/labelled the workers; that had no effect on them being recognized as cluster nodes. I'll keep poking through the crashdump for more clues, but answers to the above would help narrow this down.

Changed in charm-kubernetes-master:
status: In Progress → Incomplete
Revision history for this message
Syed Mohammad Adnan Karim (karimsye) wrote :

I am also confused at that pastebin now and I will redeploy and verify if I gave the correct info or not.

- Is the failing deployment re-using old machines? I ask because your second pastebin (the successful one) shows 172.27.100.106 as the IP address for k8s-worker-ref/0, which would have been a valid node for a different deployment.

I am deploying on a MAAS cloud and I released all the nodes (and destroyed and recreated the juju model) before deploying, so it should not be re-using a machine that was used for something else.

- Did you see this failure during an upgrade or new deployment of the latest stable charms? If the former, can you provide the charm revs that you started with?

I saw it first during a new deployment; later on I tried to do an upgrade and that also failed. The charm versions I upgraded from are:

    charm: cs:~containers/kubernetes-master-850
    charm: cs:~containers/kubernetes-worker-682
    charm: cs:hacluster-72
    charm: cs:~containers/flannel-506
    charm: cs:etcd-540
    charm: cs:~containers/easyrsa-333
    charm: cs:~containers/docker-86
    charm: cs:~containers/kubeapi-load-balancer (latest)

- In the failed env, where did you run "kubectl get no" from -- within the cluster or a separate management workstation? The ~/.kube/config file does change with the current stable charms, so it's possible you have an old kubeconfig that is pointing to the wrong cluster and/or auth mechanism.

I tried running it from the k8s-masters within the cluster and also from one of the MAAS nodes that has access to the cluster; the result was the same. I also made sure to juju scp the kubeconfig file every time.
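
For example, I re-fetch it with something like the standard Charmed Kubernetes step (master unit name as appropriate):

juju scp kubernetes-master/0:config ~/.kube/config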

Revision history for this message
Syed Mohammad Adnan Karim (karimsye) wrote :

FYI, I am not deploying kubelets on the masters.

Revision history for this message
Chris Sanders (chris.sanders) wrote :

@Syed have you done the redeploy or can you tell us when you expect to?

Revision history for this message
Syed Mohammad Adnan Karim (karimsye) wrote :

I have not redeployed yet but I expect to do it by end of day today.
Hopefully in a few hours or so.

Revision history for this message
Syed Mohammad Adnan Karim (karimsye) wrote :

I have redeployed the kubernetes cluster with the latest charms (my failure scenario) and I can confirm that the first pastebin was incorrect. The result is still that only 3 worker nodes were registered:
https://pastebin.canonical.com/p/B56nbRCJQQ/

I also updated my deployment/bundle with one more machine (kubernetes-worker-zinc):
https://pastebin.canonical.com/p/tYs3Wbt2rt/

Here is an updated juju-crashdump: https://drive.google.com/file/d/1zLG7yMxn5xg6tPsqLyb69tUIvfgLmKoy/view?usp=sharing

Changed in charm-kubernetes-master:
status: Incomplete → In Progress
summary: - kubernetes-master charm versions newer than 850 not registering all
- worker nodes
+ support k8s-worker being deployed as different names
Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

This looks to be bug 1906094. There's a bit more debug data in this bug, so we'll work it here. Paul discovered that the userid for worker tokens relied on unit numbers -- these are not unique enough when worker charms are deployed as different application names. This leads to a situation where:

kubernetes-worker-aud/0
kubernetes-worker-consul/0
kubernetes-worker-med/0
etc...

Would collide with each other as their tokens would all be identified as "kubelet-0". In effect, only one unit numbered /n across all kubernetes-worker-$foo applications would register as a cluster node. This matches what we see in your latest logs, where:

node07ob100 (kubernetes-worker-zinc/0) is kubelet-0
node10ob100 (kubernetes-worker-zinc/1) is kubelet-1
kubernetes-worker-consul-3 is kubelet-2

If you wanted to verify this, you could add a unit to kubernetes-worker-consul, which would create a /3 unit. You should see that new unit in `kubectl get nodes` since there are no other /3 units for any k8s-worker applications.
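
For example, something like this should show the new node (application name taken from your bundle):

juju add-unit kubernetes-worker-consul
# once the new /3 unit settles, it should appear in the node list:
kubectl get nodes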

One workaround would be to ensure the unit numbers are unique across all k8s-worker applications. Fair warning, this is gross:

juju deploy -n 3 kubernetes-worker kubernetes-worker-foo
juju deploy -n 6 kubernetes-worker kubernetes-worker-bar
juju remove-unit kubernetes-worker-bar/0
juju remove-unit kubernetes-worker-bar/1
juju remove-unit kubernetes-worker-bar/2

That would leave you with foo/[0-2] and bar/[3-5]. That of course doesn't scale, and it would quickly fail if you ever added a unit that resulted in an overlapping unit number.
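
To illustrate how it would fail (a hypothetical continuation of the foo/bar layout above):

juju add-unit kubernetes-worker-foo
# this creates kubernetes-worker-foo/3, which collides with kubernetes-worker-bar/3:
# both map to the same "kubelet-3" identity, so only one registers as a node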

We should have a proper fix out in the first bugfix release of CK 1.20.

Changed in charm-kubernetes-master:
milestone: none → 1.20+ck1
Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

Quick progress report here. This is technically fixed with bug #1911445 where we added a unique string to secret names. On top of that, I'll be adjusting the secret names for this bug to make it more obvious what secret is associated with a given worker.
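
Once that's out, a rough sanity check after upgrading would be to list the auth secrets and confirm each worker application has its own distinctly named entry, e.g. something like the following (assuming the secrets live in the kube-system namespace; the exact names are whatever the charm generates):

kubectl get secrets -n kube-system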

Revision history for this message
Kevin W Monroe (kwmonroe) wrote :
tags: added: review-needed
Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
tags: removed: review-needed
Revision history for this message
George Kraft (cynerva) wrote :
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released