[CK1.22] Kubernetes master leader stuck on "Applying system:monitoring RBAC role"
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Kubernetes Control Plane Charm | Fix Released | Critical | Cory Johns |
Bug Description
During a CK 1.22 release test run the k8s-master units fail to fully come up, with the leader being stuck in maintenance on applying RBAC roles:
kubernetes-master/0 waiting idle 0/lxd/1 10.246.64.219 6443/tcp Waiting for auth-webhook tokens
calico/7 waiting idle 10.246.64.219 Waiting to retry Calico node configuration
containerd/7 active idle 10.246.64.219 Container runtime available
hacluster-
kubernetes-master/1 waiting executing 2/lxd/1 10.246.64.215 6443/tcp Waiting for auth-webhook tokens
calico/6 waiting idle 10.246.64.215 Waiting to retry Calico node configuration
containerd/6 active idle 10.246.64.215 Container runtime available
hacluster-
kubernetes-
calico/8 waiting idle 10.246.64.216 Waiting to retry Calico node configuration
containerd/8 active idle 10.246.64.216 Container runtime available
hacluster-
In the leader's juju log, we can see it attempting to connect to <something>:
...
2021-08-26 14:57:25 INFO juju-log Executing ['kubectl', '--kubeconfig=
2021-08-26 14:57:28 WARNING update-status Unable to connect to the server: dial tcp 10.246.64.82:6443: connect: no route to host
2021-08-26 14:57:28 DEBUG jujuc server.go:211 running hook tool "juju-log" for kubernetes-
2021-08-26 14:57:28 INFO juju-log Traceback (most recent call last):
File "/var/lib/
kubectl(
File "/var/lib/
return check_output(
File "/usr/lib/
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/usr/lib/
raise CalledProcessEr
subprocess.
2021-08-26 14:57:28 DEBUG jujuc server.go:211 running hook tool "juju-log" for kubernetes-
2021-08-26 14:57:28 INFO juju-log Waiting to retry applying system:monitoring RBAC role
...
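The truncated traceback indicates the charm shells out to kubectl via subprocess.check_output, which raises CalledProcessError on any non-zero exit; the charm catches that and retries on a later hook. A minimal sketch of that pattern (the function names and the in-process retry loop are illustrative, not the charm's actual code):

```python
import subprocess

def run_cli(*cmd):
    # check_output raises CalledProcessError on a non-zero exit status,
    # e.g. when kubectl cannot reach the API server ("no route to host").
    return subprocess.check_output(cmd)

def apply_with_retry(cmd, attempts=3):
    """Try a CLI command a few times; give up (returning None) if it keeps failing."""
    for _ in range(attempts):
        try:
            return run_cli(*cmd)
        except subprocess.CalledProcessError:
            # Charm equivalent: log "Waiting to retry applying
            # system:monitoring RBAC role" and defer to the next hook.
            continue
    return None
```

In the charm the retry actually happens across hook invocations rather than in a tight loop, which is why the unit sits in waiting/maintenance until a kubectl call finally succeeds.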
From the available logs I don't know offhand what is at the 10.246.64.82 address; however, this seems similar to LP#1929234, which we have seen on both bare metal (as in this case) and on AWS.
Test run: https:/
Crashdump: https:/
Bundle: https:/
tags: | added: cdo-release-blocker |
tags: | added: cdo-qa foundations-engine |
Changed in charm-kubernetes-master: | |
status: | Fix Committed → Fix Released |
On AWS, the kubernetes-master charm is stuck because kubectl calls are failing, but only sometimes:
# kubectl get po
error: You must be logged in to the server (Unauthorized)
# kubectl get po
No resources found in default namespace.
This is happening because /root/.kube/config is pointing to the IP of kubeapi-load-balancer, which is distributing traffic between multiple kubernetes-master units. The two units do not agree on what the admin token is:
$ juju ssh kubernetes-master/0 sudo cat /root/.kube/config | grep token
token: admin::RAO...c8r
$ juju ssh kubernetes-master/1 sudo cat /root/.kube/config | grep token
token: admin::e2z...jYz
The auth webhook reads the local /root/.kube/config to determine what the admin token is, so, if the request from kubernetes-master/0 lands on kubernetes-master/1 or vice versa, the request fails.
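The failure mode can be sketched in a few lines: each unit's webhook only accepts the admin token recorded in its own kubeconfig, so whether a given request succeeds depends entirely on which backend the load balancer picks. (The make_webhook helper is illustrative; the tokens are the abbreviated ones from the grep output above.)

```python
def make_webhook(local_admin_token):
    # Hedged sketch: each unit's auth webhook validates a presented bearer
    # token against the admin token in its *local* /root/.kube/config.
    def authenticate(presented_token):
        return presented_token == local_admin_token
    return authenticate

unit0 = make_webhook("admin::RAO...c8r")  # kubernetes-master/0's admin token
unit1 = make_webhook("admin::e2z...jYz")  # kubernetes-master/1's admin token

# kubectl on kubernetes-master/0 presents unit 0's token; the load balancer
# may route the request to either unit:
request_token = "admin::RAO...c8r"
assert unit0(request_token) is True    # lands on unit 0: authorized
assert unit1(request_token) is False   # lands on unit 1: 401 Unauthorized
```

This also explains the intermittent behavior on AWS above: the same kubectl command alternates between success and "Unauthorized" depending on load-balancer routing.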
The two kubernetes-master units are unable to progress to the point where they eventually agree on what the token is. They try and fail, repeatedly, to create the admin token as a Kubernetes secret:
unit-kubernetes-master-1: 13:09:43 INFO unit.kubernetes-master/1.juju-log WARN: Unable to create secret for admin
unit-kubernetes-master-0: 13:10:20 INFO unit.kubernetes-master/0.juju-log WARN: Unable to create secret for admin
...
unit-kubernetes-master-0: 18:03:11 INFO unit.kubernetes-master/0.juju-log WARN: Unable to create secret for admin
unit-kubernetes-master-1: 18:18:43 INFO unit.kubernetes-master/1.juju-log WARN: Unable to create secret for admin
In the past, this wasn't a problem because kubernetes-master kubectl requests went straight to the local IP, not kubeapi-load-balancer. So the request would always land on the local unit, where the local admin token is guaranteed to work.
We'll need to either revert kubernetes-master back to using its local IP, or fix the admin token handling to be less reliant on a successful connection to the Kubernetes API.
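The first option is essentially a one-line change to where the generated kubeconfig points. A hedged sketch of the choice (the helper name is illustrative; the addresses and port 6443 are taken from the status output above):

```python
def kubeconfig_server(local_ip, lb_ip, use_local=True):
    # Reverting to the local IP guarantees each unit's kubectl request lands
    # on the unit whose auth webhook holds the matching admin token; pointing
    # at the load balancer only works once all units agree on the token.
    host = local_ip if use_local else lb_ip
    return f"https://{host}:6443"

# e.g. on kubernetes-master/0:
kubeconfig_server("10.246.64.219", "10.246.64.82")
# -> "https://10.246.64.219:6443"
```

The second option (making token handling less reliant on the API being reachable) would avoid reintroducing any assumptions about local routing, at the cost of a larger change to the charm.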