[CK1.22] Kubernetes master leader stuck on "Applying system:monitoring RBAC role"

Bug #1941763 reported by Michael Skalka
Affects: Kubernetes Control Plane Charm
Status: Fix Released
Importance: Critical
Assigned to: Cory Johns
Milestone: 1.22

Bug Description

During a CK 1.22 release test run the kubernetes-master units fail to fully come up, with the leader stuck in maintenance while applying RBAC roles:

kubernetes-master/0 waiting idle 0/lxd/1 10.246.64.219 6443/tcp Waiting for auth-webhook tokens
  calico/7 waiting idle 10.246.64.219 Waiting to retry Calico node configuration
  containerd/7 active idle 10.246.64.219 Container runtime available
  hacluster-kubernetes-master/1 waiting idle 10.246.64.219 Resource: res_kube_scheduler_snap.kube_scheduler.daemon not yet configured
kubernetes-master/1 waiting executing 2/lxd/1 10.246.64.215 6443/tcp Waiting for auth-webhook tokens
  calico/6 waiting idle 10.246.64.215 Waiting to retry Calico node configuration
  containerd/6 active idle 10.246.64.215 Container runtime available
  hacluster-kubernetes-master/0* active executing 10.246.64.215 Unit is ready and clustered
kubernetes-master/2* maintenance idle 4/lxd/1 10.246.64.216 6443/tcp Applying system:monitoring RBAC role
  calico/8 waiting idle 10.246.64.216 Waiting to retry Calico node configuration
  containerd/8 active idle 10.246.64.216 Container runtime available
  hacluster-kubernetes-master/2 waiting idle 10.246.64.216 Resource: res_kube_scheduler_snap.kube_scheduler.daemon not yet configured

In the leader's juju log we can see it attempting to connect to something it cannot reach:

...
2021-08-26 14:57:25 INFO juju-log Executing ['kubectl', '--kubeconfig=/root/.kube/config', 'apply', '-f', '/root/cdk/system-monitoring-rbac-role.yaml']
2021-08-26 14:57:28 WARNING update-status Unable to connect to the server: dial tcp 10.246.64.82:6443: connect: no route to host
2021-08-26 14:57:28 DEBUG jujuc server.go:211 running hook tool "juju-log" for kubernetes-master/2-update-status-418852934066114952
2021-08-26 14:57:28 INFO juju-log Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-kubernetes-master-2/charm/reactive/kubernetes_master.py", line 1874, in apply_system_monitoring_rbac_role
    kubectl("apply", "-f", path)
  File "/var/lib/juju/agents/unit-kubernetes-master-2/charm/lib/charms/layer/kubernetes_common.py", line 258, in kubectl
    return check_output(command)
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['kubectl', '--kubeconfig=/root/.kube/config', 'apply', '-f', '/root/cdk/system-monitoring-rbac-role.yaml']' returned non-zero exit status 1.

2021-08-26 14:57:28 DEBUG jujuc server.go:211 running hook tool "juju-log" for kubernetes-master/2-update-status-418852934066114952
2021-08-26 14:57:28 INFO juju-log Waiting to retry applying system:monitoring RBAC role
...
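
For reference, a minimal sketch of the failing path reconstructed from the traceback above, assuming the charm's kubectl helper simply shells out via check_output; everything not shown in the traceback or the log (such as how the retry status is reported) is an assumption:

# Sketch of the failing path from the traceback; simplified from
# reactive/kubernetes_master.py and lib/charms/layer/kubernetes_common.py.
from subprocess import CalledProcessError, check_output

def kubectl(*args):
    # The charm points kubectl at the kubeconfig it rendered for root.
    command = ['kubectl', '--kubeconfig=/root/.kube/config'] + list(args)
    return check_output(command)

def apply_system_monitoring_rbac_role():
    path = '/root/cdk/system-monitoring-rbac-role.yaml'
    try:
        kubectl('apply', '-f', path)
    except CalledProcessError:
        # kubectl exits non-zero when the API server is unreachable
        # ("no route to host" against 10.246.64.82:6443), so the charm
        # just logs and retries on a later hook run.
        print('Waiting to retry applying system:monitoring RBAC role')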

From the available logs I don't know offhand what is at the 10.246.64.82 address; however, this seems similar to LP#1929234, which we have seen on both bare metal (as in this case) and AWS.

Test run: https://solutions.qa.canonical.com/testruns/testRun/896b19db-62e9-4d83-8393-6592bf1ce3b6
Crashdump: https://oil-jenkins.canonical.com/artifacts/896b19db-62e9-4d83-8393-6592bf1ce3b6/generated/generated/kubernetes/juju-crashdump-kubernetes-2021-08-26-14.50.32.tar.gz
Bundle: https://oil-jenkins.canonical.com/artifacts/896b19db-62e9-4d83-8393-6592bf1ce3b6/generated/generated/kubernetes/bundle.yaml

Michael Skalka (mskalka)
tags: added: cdo-release-blocker
tags: added: cdo-qa foundations-engine
George Kraft (cynerva) wrote:

On AWS, the kubernetes-master charm is stuck because kubectl calls are failing, but only sometimes:

# kubectl get po
error: You must be logged in to the server (Unauthorized)
# kubectl get po
No resources found in default namespace.

This is happening because /root/.kube/config is pointing to the IP of kubeapi-load-balancer, which is distributing traffic between multiple kubernetes-master units. The two units do not agree on what the admin token is:

$ juju ssh kubernetes-master/0 sudo cat /root/.kube/config | grep token
    token: admin::RAO...c8r
$ juju ssh kubernetes-master/1 sudo cat /root/.kube/config | grep token
    token: admin::e2z...jYz

The auth webhook reads the local /root/.kube/config to determine what the admin token is, so if a request from kubernetes-master/0 lands on kubernetes-master/1, or vice versa, the request fails.
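
In other words (an illustrative sketch, not the charm's actual webhook code): each unit's webhook only accepts the admin token found in its own local kubeconfig, so a token minted on the other unit is rejected:

# Illustrative only: why a request authenticated with unit 0's token fails
# when the load balancer routes it to unit 1's webhook.
import yaml  # PyYAML

def local_admin_token(kubeconfig='/root/.kube/config'):
    with open(kubeconfig) as f:
        cfg = yaml.safe_load(f)
    # Assumption: the admin credential is the first (only) user entry.
    return cfg['users'][0]['user']['token']

def authenticate(presented_token):
    # kubernetes-master/0 presents "admin::RAO...c8r"; if the LB sends the
    # request to kubernetes-master/1, whose kubeconfig holds
    # "admin::e2z...jYz", this check fails and the API returns Unauthorized.
    return presented_token == local_admin_token()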

The two kubernetes-master units are unable to progress to the point where they eventually agree on what the token is. They try and fail, repeatedly, to create the admin token as a Kubernetes secret:

unit-kubernetes-master-1: 13:09:43 INFO unit.kubernetes-master/1.juju-log WARN: Unable to create secret for admin
unit-kubernetes-master-0: 13:10:20 INFO unit.kubernetes-master/0.juju-log WARN: Unable to create secret for admin
...
unit-kubernetes-master-0: 18:03:11 INFO unit.kubernetes-master/0.juju-log WARN: Unable to create secret for admin
unit-kubernetes-master-1: 18:18:43 INFO unit.kubernetes-master/1.juju-log WARN: Unable to create secret for admin
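
That leaves both units in a loop along these lines (a rough sketch; only the log message and the retry behaviour come from the report, the helper name and manifest path are illustrative):

# Rough sketch of the stuck loop around secret creation.
from subprocess import CalledProcessError, check_output

def create_admin_secret(manifest='/root/cdk/admin-token-secret.yaml'):
    try:
        check_output(['kubectl', '--kubeconfig=/root/.kube/config',
                      'apply', '-f', manifest])
        return True
    except CalledProcessError:
        # The call goes through kubeapi-load-balancer; whenever it lands on
        # the other master unit, that unit's webhook rejects this unit's
        # token, the charm logs "Unable to create secret for admin", and it
        # tries again later. Neither unit ever wins.
        return False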

In the past, this wasn't a problem because kubernetes-master kubectl requests went straight to the local IP, not kubeapi-load-balancer. So the request would always land on the local unit, where the local admin token is guaranteed to work.

We'll need to either revert kubernetes-master back to using its local IP, or fix the admin token handling to be less reliant on a successful connection to the Kubernetes API.
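
A minimal sketch of the first option, assuming the charm renders /root/.kube/config itself; the helper name, cluster/context names, and file paths here are illustrative, not the charm's actual API:

# Illustrative: render the charm's local kubeconfig against the unit's own
# kube-apiserver instead of the kubeapi-load-balancer address.
import yaml

def render_local_kubeconfig(path='/root/.kube/config',
                            local_ip='127.0.0.1',
                            token='admin::<local-token>',
                            ca='/root/cdk/ca.crt'):
    cfg = {
        'apiVersion': 'v1',
        'kind': 'Config',
        'clusters': [{'name': 'juju-cluster',
                      # Key point: talk to the local API server so the local
                      # admin token is always the one being validated.
                      'cluster': {'server': 'https://{}:6443'.format(local_ip),
                                  'certificate-authority': ca}}],
        'users': [{'name': 'admin', 'user': {'token': token}}],
        'contexts': [{'name': 'juju-context',
                      'context': {'cluster': 'juju-cluster',
                                  'user': 'admin'}}],
        'current-context': 'juju-context',
    }
    with open(path, 'w') as f:
        yaml.safe_dump(cfg, f)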

Cory Johns (johnsca) wrote:

Having the control plane nodes talk to the LB address was not intended; it is a bug introduced with the Azure LB support. I'll get that sorted.

Changed in charm-kubernetes-master:
assignee: nobody → Cory Johns (johnsca)
milestone: none → 1.22
importance: Undecided → Critical
status: New → In Progress
Cory Johns (johnsca) wrote (last edit):
George Kraft (cynerva) wrote:
Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released