[CK1.22] Kubernetes master leader stuck on "Applying system:monitoring RBAC role"
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Kubernetes Control Plane Charm | Fix Released | Critical | Cory Johns |
Bug Description
During a CK 1.22 release test run the k8s-master units fail to fully come up, with the leader being stuck in maintenance on applying RBAC roles:
kubernetes-master/0 waiting idle 0/lxd/1 10.246.64.219 6443/tcp Waiting for auth-webhook tokens
calico/7 waiting idle 10.246.64.219 Waiting to retry Calico node configuration
containerd/7 active idle 10.246.64.219 Container runtime available
hacluster-
kubernetes-master/1 waiting executing 2/lxd/1 10.246.64.215 6443/tcp Waiting for auth-webhook tokens
calico/6 waiting idle 10.246.64.215 Waiting to retry Calico node configuration
containerd/6 active idle 10.246.64.215 Container runtime available
hacluster-
kubernetes-
calico/8 waiting idle 10.246.64.216 Waiting to retry Calico node configuration
containerd/8 active idle 10.246.64.216 Container runtime available
hacluster-
In the leader's juju log, we can see it attempting to connect to <something>:
...
2021-08-26 14:57:25 INFO juju-log Executing ['kubectl', '--kubeconfig=
2021-08-26 14:57:28 WARNING update-status Unable to connect to the server: dial tcp 10.246.64.82:6443: connect: no route to host
2021-08-26 14:57:28 DEBUG jujuc server.go:211 running hook tool "juju-log" for kubernetes-
2021-08-26 14:57:28 INFO juju-log Traceback (most recent call last):
File "/var/lib/
kubectl(
File "/var/lib/
return check_output(
File "/usr/lib/
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/usr/lib/
raise CalledProcessEr
subprocess.
2021-08-26 14:57:28 DEBUG jujuc server.go:211 running hook tool "juju-log" for kubernetes-
2021-08-26 14:57:28 INFO juju-log Waiting to retry applying system:monitoring RBAC role
...
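The truncated traceback indicates the charm shells out to kubectl via subprocess.check_output, which raises CalledProcessError on any non-zero exit; the charm catches that and retries on a later hook. A minimal sketch of that pattern (the function names and the in-process retry loop are illustrative, not the charm's actual code):

```python
import subprocess

def run_cli(*cmd):
    # check_output raises CalledProcessError on a non-zero exit status,
    # e.g. when kubectl cannot reach the API server ("no route to host").
    return subprocess.check_output(cmd)

def apply_with_retry(cmd, attempts=3):
    """Try a CLI command a few times; give up (returning None) if it keeps failing."""
    for _ in range(attempts):
        try:
            return run_cli(*cmd)
        except subprocess.CalledProcessError:
            # Charm equivalent: log "Waiting to retry applying
            # system:monitoring RBAC role" and defer to the next hook.
            continue
    return None
```

In the charm the retry actually happens across hook invocations rather than in a tight loop, which is why the unit sits in waiting/maintenance until a kubectl call finally succeeds.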
From the available logs I don't know offhand what is at the 10.246.64.82 address; however, this seems similar to LP#1929234, which we have seen on both bare metal (as in this case) and on AWS.
Test run: https:/
Crashdump: https:/
Bundle: https:/
tags: | added: cdo-release-blocker |
tags: | added: cdo-qa foundations-engine |
Changed in charm-kubernetes-master: | |
status: | Fix Committed → Fix Released |
On AWS, the kubernetes-master charm is stuck because kubectl calls are failing, but only sometimes:
# kubectl get po
error: You must be logged in to the server (Unauthorized)
# kubectl get po
No resources found in default namespace.
This is happening because /root/.kube/config is pointing to the IP of kubeapi-load-balancer, which is distributing traffic between multiple kubernetes-master units. The two units do not agree on what the admin token is:
$ juju ssh kubernetes-master/0 sudo cat /root/.kube/config | grep token
token: admin::RAO...c8r
$ juju ssh kubernetes-master/1 sudo cat /root/.kube/config | grep token
token: admin::e2z...jYz
The auth webhook reads the local /root/.kube/config to determine what the admin token is, so, if the request from kubernetes-master/0 lands on kubernetes-master/1 or vice versa, the request fails.
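The failure mode can be sketched in a few lines: each unit's webhook only accepts the admin token recorded in its own kubeconfig, so whether a given request succeeds depends entirely on which backend the load balancer picks. (The make_webhook helper is illustrative; the tokens are the abbreviated ones from the grep output above.)

```python
def make_webhook(local_admin_token):
    # Hedged sketch: each unit's auth webhook validates a presented bearer
    # token against the admin token in its *local* /root/.kube/config.
    def authenticate(presented_token):
        return presented_token == local_admin_token
    return authenticate

unit0 = make_webhook("admin::RAO...c8r")  # kubernetes-master/0's admin token
unit1 = make_webhook("admin::e2z...jYz")  # kubernetes-master/1's admin token

# kubectl on kubernetes-master/0 presents unit 0's token; the load balancer
# may route the request to either unit:
request_token = "admin::RAO...c8r"
assert unit0(request_token) is True    # lands on unit 0: authorized
assert unit1(request_token) is False   # lands on unit 1: 401 Unauthorized
```

This also explains the intermittent behavior on AWS above: the same kubectl command alternates between success and "Unauthorized" depending on load-balancer routing.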
The two kubernetes-master units are unable to progress to the point where they eventually agree on what the token is. They try and fail, repeatedly, to create the admin token as a Kubernetes secret:
unit-kubernetes-master-1: 13:09:43 INFO unit.kubernetes-master/1.juju-log WARN: Unable to create secret for admin
unit-kubernetes-master-0: 13:10:20 INFO unit.kubernetes-master/0.juju-log WARN: Unable to create secret for admin
...
unit-kubernetes-master-0: 18:03:11 INFO unit.kubernetes-master/0.juju-log WARN: Unable to create secret for admin
unit-kubernetes-master-1: 18:18:43 INFO unit.kubernetes-master/1.juju-log WARN: Unable to create secret for admin
In the past, this wasn't a problem because kubernetes-master kubectl requests went straight to the local IP, not kubeapi-load-balancer. So the request would always land on the local unit, where the local admin token is guaranteed to work.
We'll need to either revert kubernetes-master back to using its local IP, or fix the admin token handling to be less reliant on a successful connection to the Kubernetes API.
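The first option is essentially a one-line change to where the generated kubeconfig points. A hedged sketch of the choice (the helper name is illustrative; the addresses and port 6443 are taken from the status output above):

```python
def kubeconfig_server(local_ip, lb_ip, use_local=True):
    # Reverting to the local IP guarantees each unit's kubectl request lands
    # on the unit whose auth webhook holds the matching admin token; pointing
    # at the load balancer only works once all units agree on the token.
    host = local_ip if use_local else lb_ip
    return f"https://{host}:6443"

# e.g. on kubernetes-master/0:
kubeconfig_server("10.246.64.219", "10.246.64.82")
# -> "https://10.246.64.219:6443"
```

The second option (making token handling less reliant on the API being reachable) would avoid reintroducing any assumptions about local routing, at the cost of a larger change to the charm.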