While testing k8s 1.26 on AWS with Jammy and Calico, the test_clusterip_service_endpoint test failed to connect to the hello-world pod; the connect timed out.
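For context, the failing check is essentially a connect-with-timeout against the service's ClusterIP. A minimal sketch of that kind of probe (service name, port, and timeout here are illustrative, not the actual test code):

    # Illustrative ClusterIP connectivity probe, not the actual
    # test_clusterip_service_endpoint code. Assumes kubectl access and a
    # hello-world service in the default namespace.
    import json
    import subprocess
    import urllib.request

    def cluster_ip(service="hello-world", namespace="default"):
        # Look up the ClusterIP assigned to the service.
        out = subprocess.check_output(
            ["kubectl", "get", "svc", service, "-n", namespace, "-o", "json"])
        return json.loads(out)["spec"]["clusterIP"]

    def probe(ip, port=80, timeout=10):
        # The failure mode seen here: urlopen raises on a connect timeout.
        with urllib.request.urlopen(f"http://{ip}:{port}/", timeout=timeout) as r:
            return r.status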
Looking at the status of the hello-world pod, kubernetes-control-plane/0 seems to be the one having problems launching the pod. It starts creating the pod:
Dec 13 20:40:49 ip-172-31-34-74 kube-controller-manager.daemon[126430]: I1213 20:40:49.442691 126430 replica_set.go:571] "Too few replicas" replicaSet="default/hello-world-7b8bb957c6" need=1 creating=1
and then the scheduler binds the pod to the node after a few more POSTs/PUTs:
Dec 13 20:40:49 ip-172-31-34-74 kube-scheduler.daemon[122674]: I1213 20:40:49.459258 122674 schedule_one.go:252] "Successfully bound pod to node" pod="default/hello-world-7b8bb957c6-2gll9" node="ip-172-31-36-18.ec2.internal" evaluatedNodes=5 feasibleNodes=3
and then the deployment controller comes back with:
Dec 13 20:40:49 ip-172-31-34-74 kube-controller-manager.daemon[126430]: I1213 20:40:49.463994 126430 deployment_controller.go:491] "Error syncing deployment" deployment="default/hello-world" err="Operation cannot be fulfilled on deployments.apps \"hello-world\": the object has been modified; please apply your changes to the latest version and try again"
a few more successful POSTs and GETs follow, and then the controller manager comes back and deletes the pod:
Dec 13 20:40:53 ip-172-31-34-74 kube-controller-manager.daemon[126430]: I1213 20:40:53.844319 126430 replica_set.go:607] "Too many replicas" replicaSet="default/hello-world-7b8bb957c6" need=0 deleting=1
Dec 13 20:40:53 ip-172-31-34-74 kube-controller-manager.daemon[126430]: I1213 20:40:53.844361 126430 replica_set.go:235] "Found related ReplicaSets" replicaSet="default/hello-world-7b8bb957c6" relatedReplicaSets="[default/hello-world-7b8bb957c6 default/hello-world-8d968cd7b]"
Dec 13 20:40:53 ip-172-31-34-74 kube-controller-manager.daemon[126430]: I1213 20:40:53.844429 126430 controller_utils.go:592] "Deleting pod" controller="hello-world-7b8bb957c6" pod="default/hello-world-7b8bb957c6-2gll9"
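The "object has been modified" message is the apiserver's optimistic-concurrency check: a write carried a stale resourceVersion and got a 409 Conflict, which is normally harmless because the controller re-reads and retries. A sketch of the same read-modify-write retry from the client side, using the kubernetes Python client (the scale operation is just an example):

    # Read-modify-write retry that resolves "the object has been modified"
    # conflicts; the controller-manager does the equivalent internally.
    from kubernetes import client, config
    from kubernetes.client.exceptions import ApiException

    def scale_deployment(name, namespace, replicas, retries=5):
        config.load_kube_config()
        apps = client.AppsV1Api()
        for _ in range(retries):
            dep = apps.read_namespaced_deployment(name, namespace)
            dep.spec.replicas = replicas
            try:
                # replace() sends the stored resourceVersion; if someone else
                # updated the object in between, the apiserver returns 409.
                return apps.replace_namespaced_deployment(name, namespace, dep)
            except ApiException as e:
                if e.status != 409:  # 409 Conflict: re-read and retry
                    raise
        raise RuntimeError("too many conflicts")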
I'm not quite sure what caused this error; it could have been a one-off.
testrun at:
https://solutions.qa.canonical.com/v2/testruns/7ce07815-6d4e-4866-af87-40c190554bd2/
crashdump at:
https://oil-jenkins.canonical.com/artifacts/7ce07815-6d4e-4866-af87-40c190554bd2/generated/generated/kubernetes-aws/juju-crashdump-kubernetes-aws-2022-12-13-20.43.43.tar.gz
bundle at:
https://oil-jenkins.canonical.com/artifacts/7ce07815-6d4e-4866-af87-40c190554bd2/generated/generated/kubernetes-aws/bundle.yaml
The kube-proxy service on kubernetes-control-plane/1 seems to have gotten stuck in a bad state with x509 errors:
Dec 13 20:44:56 ip-172-31-33-85 kube-proxy.daemon[135314]: W1213 20:44:56.142414 135314 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.Service: Get "https://172.31.33.85:6443/api/v1/services?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Dec 13 20:44:56 ip-172-31-33-85 kube-proxy.daemon[135314]: E1213 20:44:56.142478 135314 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://172.31.33.85:6443/api/v1/services?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&limit=500&resourceVersion=0": x509: certificate signed by unknown authority
172.31.33.85 is also kubernetes-control-plane/1, so kube-apiserver and kube-proxy on the same host disagree about the CA.
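One way to confirm the disagreement is to compare the CA embedded in kube-proxy's kubeconfig against the CA file on disk. A rough sketch (the /root/cdk paths are assumptions about where Charmed Kubernetes keeps these files on control-plane units):

    # Does the certificate-authority-data in kube-proxy's kubeconfig match
    # the CA on disk? Paths are assumptions; adjust for your deployment.
    import base64
    import yaml

    KUBECONFIG = "/root/cdk/kubeproxyconfig"
    CA_FILE = "/root/cdk/ca.crt"

    with open(KUBECONFIG) as f:
        kc = yaml.safe_load(f)
    embedded = base64.b64decode(
        kc["clusters"][0]["cluster"]["certificate-authority-data"])

    with open(CA_FILE, "rb") as f:
        on_disk = f.read()

    # A mismatch here lines up with the x509 "certificate signed by unknown
    # authority" errors above.
    print("match" if embedded.strip() == on_disk.strip() else "MISMATCH")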
This is a race condition. Timeline:
20:22:22 foundation.log: vault reissue-certificates action finishes
20:28:30 kcp1: certificates-relation-changed hook starts
20:28:46 kcp1: build_kubeconfig handler writes new kubeconfig for kube-proxy
20:28:53 kcp1: store_ca handler writes new CA to disk
20:29:05 kcp1: start_control_plane handler restarts kube-proxy
20:31:47 kcp1: certificates-relation-changed hook finishes
The kubeconfig written by build_kubeconfig had the old CA, because store_ca had not run yet. So when kube-proxy got restarted, it had the old CA in its kubeconfig. No further restarts of kube-proxy occurred, so it was stuck with the old CA config forever.
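In other words, the bug is handler ordering: build_kubeconfig consumed the CA before store_ca refreshed it, and nothing re-triggered build_kubeconfig afterwards. A toy sketch of one possible fix, keying the kubeconfig on a hash of the stored CA so a later invocation notices the change (handler names mirror the charm; the rest is illustrative, not the charm's actual code):

    # Toy sketch: regenerate the kubeconfig whenever the CA content changes,
    # instead of at most once per hook. Illustrative, not the charm's code.
    import hashlib

    state = {"ca": b"", "kubeconfig_ca_hash": None}

    def store_ca(new_ca: bytes):
        state["ca"] = new_ca  # stands in for writing the new CA to disk

    def build_kubeconfig():
        ca_hash = hashlib.sha256(state["ca"]).hexdigest()
        if ca_hash == state["kubeconfig_ca_hash"]:
            return  # kubeconfig already embeds the current CA
        state["kubeconfig_ca_hash"] = ca_hash
        # ...render kubeconfig embedding state["ca"] and restart kube-proxy...

    # Buggy order: build_kubeconfig(); store_ca(new_ca)  -> stale kubeconfig.
    # With the hash check, any later build_kubeconfig() call sees the new CA
    # and rewrites the kubeconfig before restarting kube-proxy.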