kube-ovn controller pod is in CrashLoopBackOff after CIDR expansion

Bug #1995139 reported by Adam Dyess
Affects                          Status        Importance  Assigned to   Milestone
Charmed Kubernetes Testing       Fix Released  High        George Kraft
Kubernetes Control Plane Charm   Fix Released  High        George Kraft

Bug Description

While running the end-to-end validation tests from jenkins [0], a test failure usually manifests after the CIDR expansion tests rather than during them. This was revealed by a CrashLoopBackOff pod in the kube-ovn-controller deployment [1]. I checked the deployment config [2] and it had the correct expanded CIDR.

Steps I believe are necessary to reproduce:
```
juju add-model kubernetes-ovn
juju deploy charmed-kubernetes --overlay overlays/kube-ovn.yaml
juju-wait
tox -e py -- .tox/py/bin/pytest jobs/integration/validation.py --cloud $CLOUD --controller $CONTROLLER --model kubernetes-ovn -k "cidr_expansion and toggle_metrics"
```

[0]: https://github.com/charmed-kubernetes/jenkins/blob/main/jobs/integration/validation.py
[1]: https://paste.ubuntu.com/p/ytMZMfRGd8/
[2]: https://paste.ubuntu.com/p/tt9vYcPTrm/

Revision history for this message
Adam Dyess (addyess) wrote :

After investigating the crash, and while writing this bug report, the deployment eventually stabilized and kube-ovn-controller came up without crashes. This may be why the rest of the tests in the suite continue normally. Perhaps recovery takes longer than expected after the CIDR expansion begins?

Revision history for this message
Adam Dyess (addyess) wrote :

After it was stable for 15m, I ran ONLY the `toggle_metrics` test again, and kube-ovn-controller began crashing again. It may not be COMPLETELY related to this test, but something about what changing the metrics-server config does within the control-plane charm exacerbates the crash.

Revision history for this message
Adam Dyess (addyess) wrote (last edit ):

```
containerd                go1.18  active  5  containerd                stable  41   no   Container runtime available
easyrsa                   3.0.1   active  1  easyrsa                   stable  26   no   Certificate Authority connected.
etcd                      3.4.5   active  3  etcd                      stable  718  no   Healthy with 3 known peers
kube-ovn                          active  5  kube-ovn                  edge    34   no
kubeapi-load-balancer     1.18.0  active  1  kubeapi-load-balancer     stable  42   yes  Loadbalancer ready.
kubernetes-control-plane  1.25.3  active  2  kubernetes-control-plane  edge    208  no   Kubernetes control-plane running.
kubernetes-worker         1.25.3  active  3  kubernetes-worker         edge    72   yes  Kubernetes worker running.
```

Revision history for this message
Adam Dyess (addyess) wrote :

Things look good in the kube-controller logs while it adds the metrics-server pod [0].

But then the api-server is restarted by kubernetes-control-plane and I lose the logs. When the api-server is back up, I can grab some of the crash logs [1].

[0]: https://paste.ubuntu.com/p/XJRXDngbR3/
[1]: https://paste.ubuntu.com/p/KfmwBfqYxK/

Revision history for this message
George Kraft (cynerva) wrote :

I can repro this. The kube-ovn-controller pod enters CrashLoopBackOff because of disrupted access to the Kubernetes API.

When test_service_cidr_expansion runs, there's a brief period where kube-apiserver serves with an old certificate that doesn't have the new 10.152.182.1 address in its SANs. This causes x509 errors that cause kube-ovn-controller to crash.

When test_toggle_metrics runs, it toggles the enable-metrics config, which causes kube-apiserver to get reconfigured and restarted. This causes "connection refused" errors that cause kube-ovn-controller to crash.

Each time it crashes, the backoff gets exponentially worse, up to a cap of 5 minutes between each restart. That's longer than the timeout of test_toggle_metrics.
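For context, the kubelet's restart backoff compounds quickly. A minimal sketch of the schedule, assuming the Kubernetes defaults of a 10s initial delay, doubling on each crash, capped at 300s (the exact constants are kubelet internals, not taken from this bug):

```python
def crashloop_delays(crashes, base=10, cap=300):
    """Return the per-restart backoff delays (seconds) after `crashes` crashes.

    Sketch of kubelet's CrashLoopBackOff: start at `base`, double each
    time, never exceed `cap`.
    """
    delays = []
    d = base
    for _ in range(crashes):
        delays.append(min(d, cap))
        d *= 2
    return delays

# After 8 crashes the pod has spent over 20 minutes just waiting to restart:
# crashloop_delays(8) -> [10, 20, 40, 80, 160, 300, 300, 300]  (sum: 1210s)
```

This is why a test timeout shorter than the accumulated backoff will expire even though the pod eventually recovers on its own.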

I recommend fixing this in two ways:
1. Raise the test_toggle_metrics timeout to 10 minutes
2. Update kubernetes-control-plane so it doesn't reconfigure kube-apiserver every time enable-metrics changes
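The first recommendation amounts to polling with a longer deadline. A generic poll-with-deadline sketch of the kind of wait test_toggle_metrics could use (`wait_until` and `check` are hypothetical names, not the actual test code; the real test would poll pod status via the Kubernetes API):

```python
import time

def wait_until(check, timeout=600, interval=10):
    """Poll `check` until it returns True or `timeout` seconds elapse.

    Returns True on success, False if the deadline passes first.
    With timeout=600 this matches the recommended 10-minute window,
    long enough to outlast a 5-minute CrashLoopBackOff delay.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```

Any `check` callable works, e.g. one that returns True once the kube-ovn-controller pod reports Ready.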

no longer affects: charm-kube-ovn
Changed in charmed-kubernetes-testing:
milestone: none → 1.25+ck3
Changed in charm-kubernetes-master:
milestone: none → 1.25+ck3
Changed in charmed-kubernetes-testing:
status: New → Triaged
Changed in charm-kubernetes-master:
status: New → Triaged
Changed in charmed-kubernetes-testing:
assignee: nobody → George Kraft (cynerva)
Changed in charm-kubernetes-master:
assignee: nobody → George Kraft (cynerva)
Changed in charmed-kubernetes-testing:
status: Triaged → In Progress
Changed in charm-kubernetes-master:
status: Triaged → In Progress
Changed in charmed-kubernetes-testing:
importance: Undecided → High
Changed in charm-kubernetes-master:
importance: Undecided → High
Revision history for this message
George Kraft (cynerva) wrote :
Changed in charmed-kubernetes-testing:
status: In Progress → Fix Committed
Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
Changed in charmed-kubernetes-testing:
milestone: 1.25+ck3 → 1.26
Changed in charm-kubernetes-master:
milestone: 1.25+ck3 → 1.26
Adam Dyess (addyess)
Changed in charmed-kubernetes-testing:
status: Fix Committed → Fix Released
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released