Webhook unable to load secrets causes node NotReady flapping

Bug #1926534 reported by Chris Johnston
This bug report is a duplicate of: Bug #1927145: Make auth-webhook async.
This bug affects 1 person
Affects: Kubernetes Control Plane Charm
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

If the webhook is unable to load secrets for any reason, the service blocks for 10s until the secrets lookup times out, then forwards the auth request to Keystone, where service account auth would fail anyway. Each such request ties up a gunicorn worker for 10s+, so 13 concurrent requests are enough for things to start failing. This should be more robust and, where possible, fail faster so that gunicorn stays able to handle requests.

[2021-04-28 06:30:44 +0000] [14185] [DEBUG] POST /v1beta1
[2021-04-28 06:30:44 +0000] [14185] [DEBUG] REQ: {'kind': 'TokenReview', 'apiVersion': 'authentication.k8s.io/v1beta1', 'metadata': {'creationTimestamp': None}, 'spec': {'token': '********', 'audiences': ['https://kubernetes.default.svc']}, 'status': {'user': {}}}
[2021-04-28 06:30:44 +0000] [14185] [INFO] Checking token
[2021-04-28 06:30:44 +0000] [14185] [INFO] Checking secret
[2021-04-28 06:30:54 +0000] [14185] [INFO] Unable to load secrets: Command '['/snap/bin/kubectl', '--kubeconfig=/root/.kube/config', 'get', 'secrets', '-n', 'kube-system', '-o', 'json']' timed out after 10 seconds.
[2021-04-28 06:30:54 +0000] [14185] [INFO] Checking Keystone
[2021-04-28 06:30:54 +0000] [14185] [DEBUG] Forwarding to: https://k8s-auth-svc:8443/webhook
[2021-04-28 06:30:54 +0000] [14185] [DEBUG] SSLError with server; skipping cert validation
[2021-04-28 06:30:56 +0000] [14185] [DEBUG] NAK: {'kind': 'TokenReview', 'apiVersion': 'authentication.k8s.io/v1beta1', 'metadata': {'creationTimestamp': None}, 'spec': {'token': '********', 'audiences': ['https://kubernetes.default.svc']}, 'status': {'authenticated': False}}
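
For illustration, the blocking path above looks roughly like the following sketch. This assumes the handler shells out to kubectl the way the log suggests; the function and variable names here are hypothetical, not the charm's actual identifiers.

import json
import subprocess

KUBECTL = ["/snap/bin/kubectl", "--kubeconfig=/root/.kube/config"]

def check_secret(token):
    # Runs inside a synchronous gunicorn worker: subprocess.run blocks the
    # worker for up to the full 10s before raising TimeoutExpired -- the
    # stall visible in the log above.
    try:
        out = subprocess.run(
            KUBECTL + ["get", "secrets", "-n", "kube-system", "-o", "json"],
            capture_output=True, timeout=10, check=True)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as e:
        print("Unable to load secrets: {}".format(e))
        return None  # only now does the service fall through to Keystone
    # Token-matching logic elided; the point is the 10s block per request.
    return json.loads(out.stdout)["items"]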

Revision history for this message
Kevin W Monroe (kwmonroe) wrote:

The '13' from the description comes from the gunicorn recommendation [0] of num_workers = (2*cores)+1. They also note that 4-12 workers should be plenty, so the k8s-master charm caps the core count [1] at 6, giving num_workers = (2*6)+1 = 13.
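
For reference, that arithmetic as a sketch (the charm's real implementation is linked at [1]):

import os

def num_workers(cores=None):
    cores = min(cores or os.cpu_count(), 6)  # charm caps the core count at 6
    return 2 * cores + 1                     # gunicorn's (2*cores)+1 recommendation

assert num_workers(8) == 13  # any machine with >= 6 cores lands on 13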

The auth-webhook service uses gunicorn's default synchronous workers, which I think is causing the problem here. As noted, if 'kubectl get secrets' times out, it blocks a worker for 10s. Worse, if an external authn service (like Keystone) is configured, the worker can block for up to 30s waiting for an external response. An active k8s cluster may issue dozens of authn requests per second, which can quickly tie up all 13 auth-webhook workers on a k8s-master.

At that point, the master gunicorn process fails, auth-webhook restarts, and requests are once again processed. It sounds like in this case there is a large backlog of queued authn requests, so even after auth-webhook restarts, it is quickly overrun again.
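
Back-of-the-envelope, using the numbers above (13 workers, 10-30s per blocked request, and taking 'dozens per second' as roughly two dozen, which is an assumed rate for illustration):

workers = 13
block_seconds = 10          # kubectl timeout; up to 30 with external authn
requests_per_second = 24    # assumed stand-in for "dozens per second"

print("all %d workers busy after ~%.2fs; each blocked for %ds+"
      % (workers, workers / requests_per_second, block_seconds))
# -> all 13 workers busy after ~0.54s; each blocked for 10s+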

One immediate workaround may be to increase auth-webhook's 'num_workers' in the $CHARM_DIR template, e.g.:

/var/lib/juju/agents/unit-kubernetes-master-0/charm/templates/cdk.master.auth-webhook.service

I would try hard-coding this value to 16, 24, 32, etc. to see if there's a worker count that can stay ahead of the authn request queue. Note that this value will be overwritten on upgrade-charm, so this is just a temporary suggestion.

Long term, the fix should be to use async gunicorn workers, make our 'check_secrets' method fail faster, and find a way for 'kubectl get secrets' to be authenticated without tying up another auth-webhook worker.
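
Since gunicorn config files are plain Python, the async piece of that could look something like the sketch below. gevent is one possible worker class, not a decision that has been made, and making 'check_secrets' fail faster would be a separate change in the service itself (e.g. a shorter kubectl timeout):

import multiprocessing

# Keep the (2*cores)+1 sizing with the existing cap of 6 cores.
workers = min(multiprocessing.cpu_count(), 6) * 2 + 1

# Async workers: a request blocked on I/O no longer pins a whole worker.
# Requires the gevent package to be installed alongside gunicorn.
worker_class = "gevent"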

[0] https://docs.gunicorn.org/en/stable/design.html#how-many-workers
[1] https://github.com/charmed-kubernetes/charm-kubernetes-master/blob/3ade399bb01caf103c9b3449bc1e500e90f98dc7/reactive/kubernetes_master.py#L1048

George Kraft (cynerva)
Changed in charm-kubernetes-master:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Kevin W Monroe (kwmonroe) wrote:

This should now be fixed with the work that went into:

https://bugs.launchpad.net/charm-kubernetes-master/+bug/1927145

Marking this bug as a dupe of that one.
