baremetal + kube_ovn: k8s-cp stuck waiting for 46 pods to start

Bug #2006957 reported by Alexander Balderson
Affects: Kubernetes Control Plane Charm
Status: Fix Released
Importance: High
Assigned to: Mateo Florido
Milestone: 1.27

Bug Description

On a baremetal deployment of Charmed Kubernetes using kube_ovn as the CNI, all 3 k8s control planes are stuck waiting for 46 pods to start. All the charms are running latest/stable and are on the k8s 1.26 snap channel. This is running in our standard environment with no changes, where we see passes regularly.

From the logs in /var/log/pods I see a few notes about ovn-central pods failing to find a leader or to connect to any meaningful addresses, but since there are logs, I'd assume the pods are actually up?

The logs we collect from kubectl after the run show that there are 138 pods across all the k8s control planes, which makes sense with 46 pods on each of the 3 control-plane nodes, but I'm unsure why the pods are failing to start.

The test run can be found at:
https://solutions.qa.canonical.com/v2/testruns/5ed81b47-3413-44ba-8207-d16cc4afb030/
bundle at:
https://solutions.qa.canonical.com/v2/testruns/5ed81b47-3413-44ba-8207-d16cc4afb030/
and crashdump at:
https://oil-jenkins.canonical.com/artifacts/5ed81b47-3413-44ba-8207-d16cc4afb030/generated/generated/kubernetes-maas/juju-crashdump-kubernetes-maas-2023-02-10-01.34.49.tar.gz

Revision history for this message
George Kraft (cynerva) wrote :

I'm still looking into this. It's a bizarre one. There are 23 ovn-central pods and 29 kube-ovn-monitor pods. There should only be 3 of each. The extra pods are failing with:

Status: Failed
Reason: NodeAffinity
Message: Pod was rejected: Predicate NodeAffinity failed

These pods have anti-affinity policies that prevent more than 1 of each from being placed on any given node. The failure makes sense, but it does not make sense why there are so many pods. The charm correctly requested 3 replicas.
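
A minimal sketch of how one might confirm this, assuming the kubernetes Python client and that the duplicate pods live in the kube-system namespace (namespace and label names are assumptions, not taken from the crashdump):

    # Hypothetical sketch: count Failed pods by app label and failure reason,
    # to confirm the duplicate ovn-central / kube-ovn-monitor pods were
    # rejected with "Predicate NodeAffinity failed".
    from collections import Counter
    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() when run in-cluster
    v1 = client.CoreV1Api()

    failed = v1.list_namespaced_pod(
        "kube-system", field_selector="status.phase=Failed").items
    # Expect entries like ('ovn-central', 'NodeAffinity') with large counts.
    print(Counter(
        ((p.metadata.labels or {}).get("app"), p.status.reason) for p in failed))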

From the kube-controller-manager logs, it looks like the ReplicaSet controller spam-created pods during a 10-second window in which one of the control-plane nodes was NotReady. The extra pods were never deleted.

I think I should be able to reproduce this by forcing a kubernetes-control-plane node into a NotReady state during kube-ovn deployment. Let me give that a try.

Revision history for this message
George Kraft (cynerva) wrote :

I'm not having any luck reproducing this. Under normal conditions, if there are too many pods, the NodeAffinity error doesn't occur because kube-scheduler is smart enough not to schedule multiple conflicting pods onto the same node. Somehow, this deployment hit a corner case where kube-scheduler thought that the node had room for a new pod, but kubelet thought that it had a conflict.

Revision history for this message
George Kraft (cynerva) wrote :

Regardless of how the deployment got here, I think the kubernetes-control-plane charm either should not block on the Failed pods (which will not retry and will not be cleaned up by Kubernetes), or it should clean up Failed pods on behalf of the user.
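
For illustration, a minimal sketch of the second option (cleaning up Failed pods on behalf of the user), assuming the kubernetes Python client and that the affected pods are in the kube-system namespace; this is not the charm's actual implementation:

    # Hypothetical cleanup sketch: delete pods that are Failed with reason
    # NodeAffinity, since Kubernetes will neither retry nor garbage-collect them.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    for pod in v1.list_namespaced_pod(
            "kube-system", field_selector="status.phase=Failed").items:
        if pod.status.reason == "NodeAffinity":
            v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
            print("deleted", pod.metadata.namespace + "/" + pod.metadata.name)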

Changed in charm-kubernetes-master:
importance: Undecided → Critical
importance: Critical → High
status: New → Triaged
Revision history for this message
Alexander Balderson (asbalderson) wrote :

Thanks for the triage, George.

I can say that we've only seen this one time, so something weird definitely happened, and I'm not certain we can reproduce it either, but I'll let you know if we see it again.

Changed in charm-kubernetes-master:
assignee: nobody → Mateo Florido (mateoflorido)
Revision history for this message
Mateo Florido (mateoflorido) wrote :
Changed in charm-kubernetes-master:
status: Triaged → Fix Committed
Changed in charm-kubernetes-master:
milestone: none → 1.27
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released