Kubernetes-control-plane fails to get correct certs from vault

Bug #2009515 reported by Moises Emilio Benzan Mora
Affects: Kubernetes Control Plane Charm
Status: Fix Released
Importance: High
Assigned to: Unassigned
Milestone: 1.29

Bug Description

Just seen this on a K8s Jammy AWS run: the kubernetes control plane/scheduler nodes fail to get a correct certificate from the vault charm, leaving the scheduler unable to query resources. As a result, all pods are stuck in Pending.

Relevant logs from the scheduler's journal:

Mar 04 06:53:14 ip-172-31-33-162 kube-scheduler.daemon[138112]: E0304 06:53:14.739507 138112 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.CSIDriver: failed to list *v1.CSIDriver: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1/csidrivers?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:21 ip-172-31-33-162 kube-scheduler.daemon[138112]: W0304 06:53:21.255804 138112 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.CSIStorageCapacity: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1/csistoragecapacities?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:21 ip-172-31-33-162 kube-scheduler.daemon[138112]: E0304 06:53:21.255844 138112 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.CSIStorageCapacity: failed to list *v1.CSIStorageCapacity: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1/csistoragecapacities?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:22 ip-172-31-33-162 kube-scheduler.daemon[138112]: W0304 06:53:22.668905 138112 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.StatefulSet: Get "https://127.0.0.1:6443/apis/apps/v1/statefulsets?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:22 ip-172-31-33-162 kube-scheduler.daemon[138112]: E0304 06:53:22.668943 138112 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.StatefulSet: failed to list *v1.StatefulSet: Get "https://127.0.0.1:6443/apis/apps/v1/statefulsets?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:22 ip-172-31-33-162 kube-scheduler.daemon[138112]: W0304 06:53:22.833359 138112 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.PersistentVolumeClaim: Get "https://127.0.0.1:6443/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:22 ip-172-31-33-162 kube-scheduler.daemon[138112]: E0304 06:53:22.833400 138112 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.PersistentVolumeClaim: failed to list *v1.PersistentVolumeClaim: Get "https://127.0.0.1:6443/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:27 ip-172-31-33-162 kube-scheduler.daemon[138112]: W0304 06:53:27.195447 138112 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.Service: Get "https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:27 ip-172-31-33-162 kube-scheduler.daemon[138112]: E0304 06:53:27.195484 138112 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:27 ip-172-31-33-162 kube-scheduler.daemon[138112]: W0304 06:53:27.823391 138112 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.Pod: Get "https://127.0.0.1:6443/api/v1/pods?fieldSelector=status.phase%21%3DSucceeded%2Cstatus.phase%21%3DFailed&limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:27 ip-172-31-33-162 kube-scheduler.daemon[138112]: E0304 06:53:27.823430 138112 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Pod: failed to list *v1.Pod: Get "https://127.0.0.1:6443/api/v1/pods?fieldSelector=status.phase%21%3DSucceeded%2Cstatus.phase%21%3DFailed&limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:33 ip-172-31-33-162 kube-scheduler.daemon[138112]: W0304 06:53:33.659088 138112 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.ReplicationController: Get "https://127.0.0.1:6443/api/v1/replicationcontrollers?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:33 ip-172-31-33-162 kube-scheduler.daemon[138112]: E0304 06:53:33.659124 138112 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.ReplicationController: failed to list *v1.ReplicationController: Get "https://127.0.0.1:6443/api/v1/replicationcontrollers?limit=500&resourceVersion=0": x509: certificate signed by unknown authority

Test run: https://solutions.qa.canonical.com/v2/testruns/e4753baa-9a6b-4b3f-ae31-5d6fd3c57064/
Artifacts: https://oil-jenkins.canonical.com/artifacts/e4753baa-9a6b-4b3f-ae31-5d6fd3c57064/index.html
Crashdump: https://oil-jenkins.canonical.com/artifacts/e4753baa-9a6b-4b3f-ae31-5d6fd3c57064/generated/generated/kubernetes-aws/juju-crashdump-kubernetes-aws-2023-03-04-06.52.23.tar.gz
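
For future triage, here is a minimal diagnostic sketch (not charm code) that reproduces the verification failure directly on a control-plane unit: it loads the CA embedded in a kubeconfig and checks whether the local kube-apiserver's serving certificate verifies against it. The kubeconfig path and the assumption that the CA is embedded as certificate-authority-data are mine, not taken from the charm.

#!/usr/bin/env python3
# Diagnostic sketch only (not charm code): check whether the local
# kube-apiserver's serving certificate verifies against the CA embedded in a
# given kubeconfig. The path below matches the root kubeconfig seen in the
# charm logs; adjust it to whichever kubeconfig is under suspicion.
import base64
import socket
import ssl
import tempfile

import yaml  # PyYAML, assumed to be available on the unit

KUBECONFIG = "/root/.kube/config"
APISERVER = ("127.0.0.1", 6443)

with open(KUBECONFIG) as f:
    cfg = yaml.safe_load(f)

# Assumes the kubeconfig embeds the CA as base64 certificate-authority-data.
ca_pem = base64.b64decode(
    cfg["clusters"][0]["cluster"]["certificate-authority-data"]
)

with tempfile.NamedTemporaryFile(suffix=".crt") as ca_file:
    ca_file.write(ca_pem)
    ca_file.flush()
    ctx = ssl.create_default_context(cafile=ca_file.name)
    ctx.check_hostname = False  # only the chain of trust matters here
    try:
        with socket.create_connection(APISERVER, timeout=5) as sock:
            with ctx.wrap_socket(sock):
                print("OK: apiserver certificate verifies against the kubeconfig CA")
    except ssl.SSLCertVerificationError as err:
        # Same class of failure as the journal entries above.
        print(f"MISMATCH: {err}")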

George Kraft (cynerva) wrote :

This is a race condition between build_kubeconfig, start_control_plane, and configure_apiserver.

In build_kubeconfig, a new client kubeconfig was written[1] with the new CA. Later in the same handler, the charm tried to fetch kube-scheduler's token from a secret[2], and that fetch failed:

2023-03-04 02:53:50 INFO unit.kubernetes-control-plane/0.juju-log server.go:316 certificates:55: Executing ['kubectl', '--kubeconfig=/root/.kube/config', 'get', 'secrets', '-n', 'kube-system', '--field-selector', 'type=juju.is/token-auth', '-o', 'json']
2023-03-04 02:53:50 WARNING unit.kubernetes-control-plane/0.certificates-relation-changed logger.go:60 E0304 02:53:50.359454 135532 memcache.go:238] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": x509: certificate signed by unknown authority
2023-03-04 02:53:50 WARNING unit.kubernetes-control-plane/0.certificates-relation-changed logger.go:60 E0304 02:53:50.365873 135532 memcache.go:238] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": x509: certificate signed by unknown authority
2023-03-04 02:53:50 WARNING unit.kubernetes-control-plane/0.certificates-relation-changed logger.go:60 E0304 02:53:50.369305 135532 memcache.go:238] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": x509: certificate signed by unknown authority
2023-03-04 02:53:50 WARNING unit.kubernetes-control-plane/0.certificates-relation-changed logger.go:60 Unable to connect to the server: x509: certificate signed by unknown authority

This is because the client kubeconfig had the new CA, but kube-apiserver had not been restarted yet, so it was still serving with a server certificate from the old CA. Since build_kubeconfig could not obtain the secret, it skipped writing a new kubeconfig for kube-scheduler.

During start_control_plane, the charm restarted kube-scheduler to pick up the new CA. However, since no new kubeconfig had been written for kube-scheduler, it started with the old kubeconfig instead, still using the old CA.

Later, configure_apiserver ran, which restarted kube-apiserver with the new server certificate. This fixed the charm's ability to get secrets, but the damage had already been done. Kube-scheduler was never restarted again.

[1]: https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/d9f276f1e54c22f3f5d739c82f1a3b5894d140c7/reactive/kubernetes_control_plane.py#L2151-L2157
[2]: https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/d9f276f1e54c22f3f5d739c82f1a3b5894d140c7/reactive/kubernetes_control_plane.py#L2198-L2206
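
To make the ordering hazard concrete, here is an illustrative reactive-style sketch. This is not the fix that landed, and all flag names and helpers are hypothetical; it only shows the shape of a guard: record that the scheduler kubeconfig could not be refreshed instead of silently skipping it, then retry (and restart kube-scheduler) once kube-apiserver has been restarted with the new server certificate.

# Illustrative sketch only -- not the actual change. Flag names and helper
# functions below are hypothetical.
from charms.reactive import when, set_flag, clear_flag


def write_scheduler_kubeconfig():
    """Hypothetical helper: fetch kube-scheduler's token via the API and
    write its kubeconfig. Returns False when the API call fails (e.g. the
    x509 error above while kube-apiserver still serves the old CA)."""
    return False  # placeholder body for this sketch


def restart_scheduler():
    """Hypothetical helper wrapping a restart of the kube-scheduler service."""
    pass  # placeholder body for this sketch


@when("certificates.changed")  # hypothetical flag
def build_kubeconfig():
    if not write_scheduler_kubeconfig():
        # Could not reach the API with the new CA yet; remember that the
        # scheduler kubeconfig is stale rather than silently skipping it.
        set_flag("kubernetes-control-plane.scheduler-kubeconfig.stale")


@when("kube-apiserver.restarted",  # hypothetical flag
      "kubernetes-control-plane.scheduler-kubeconfig.stale")
def retry_scheduler_kubeconfig():
    # Once kube-apiserver serves a certificate from the new CA, the token
    # lookup succeeds; rewrite the kubeconfig and bounce the scheduler so it
    # actually picks it up.
    if write_scheduler_kubeconfig():
        restart_scheduler()
        clear_flag("kubernetes-control-plane.scheduler-kubeconfig.stale")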

Changed in charm-kubernetes-master:
importance: Undecided → High
status: New → Triaged
George Kraft (cynerva) wrote :

This issue is ultimately rooted in race conditions arising from the Reactive framework's flag-based code dispatch. We'll be rewriting the kubernetes charms soon to use the Operator framework, and I don't think this issue will remain relevant after the conversion.
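
For context on why the conversion should help, here is a hedged sketch of what the ops (Operator framework) shape of this logic could look like. Class, event, and helper names are illustrative rather than the rewritten charm's actual code, but the point is that an event handler can state the rollout order explicitly in one place, so kube-apiserver is restarted with the new server certificate before the clients that authenticate against it are rebuilt and restarted.

# Illustrative ops sketch only; assumes a "certificates" relation declared
# in metadata. Helper bodies are placeholders.
import logging

import ops
from ops.main import main

logger = logging.getLogger(__name__)


class KubernetesControlPlaneCharm(ops.CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(
            self.on.certificates_relation_changed, self._on_certificates_changed
        )

    def _on_certificates_changed(self, event):
        # Explicit ordering removes the race: roll out the new CA and server
        # certificate and restart kube-apiserver first, so the scheduler
        # kubeconfig rebuild (which talks to the API) succeeds, then restart
        # kube-scheduler only after its kubeconfig is fresh.
        self._write_ca_and_server_cert()
        self._restart("kube-apiserver")
        self._write_scheduler_kubeconfig()
        self._restart("kube-scheduler")

    # Placeholder helpers (hypothetical, for illustration only).

    def _write_ca_and_server_cert(self):
        logger.info("would write the CA and server certificate here")

    def _write_scheduler_kubeconfig(self):
        logger.info("would rebuild the kube-scheduler kubeconfig here")

    def _restart(self, service):
        logger.info("would restart %s here", service)


if __name__ == "__main__":
    main(KubernetesControlPlaneCharm)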

Changed in charm-kubernetes-master:
milestone: none → 1.28
George Kraft (cynerva) wrote :

To put it more precisely: this should be fixed by the conversion.

Adam Dyess (addyess)
Changed in charm-kubernetes-master:
milestone: 1.28 → 1.28+ck1
Adam Dyess (addyess)
Changed in charm-kubernetes-master:
milestone: 1.28+ck1 → 1.29
Changed in charm-kubernetes-master:
status: Triaged → Fix Committed
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released