Canonical Juju

[k8s] Pod Limits, Requests and QOS setup

Bug #2039215 reported by Pedro Guimarães on 2023-10-12

This bug affects 3 people

Affects          Status    Importance   Assigned to   Milestone
Canonical Juju   Triaged   Wishlist     Unassigned

Bug Description

There are multiple open issues requesting APIs to control limits and requests:
* https://bugs.launchpad.net/juju/+bug/2016242
* https://bugs.launchpad.net/juju/+bug/2023782
* https://bugs.launchpad.net/juju/+bug/1919976

This bug aims to highlight the consequences of setting these values, and how to set them.

Juju does allow limits and requests to be configured via constraints. However, there are two main problems: (1) it builds both with the same values; and (2) it does not set them on the init container, which means it is not possible to configure Pods with QOSClass=Guaranteed, as described in: https://github.com/kubernetes/kubernetes/issues/96044

As described in LP#1919976, comment #8, it is possible to set these parameters post deployment, for example:
$ kubectl patch sts postgresql-k8s -n test -p '{"spec":{"template":{"spec":{"containers":[{"name":"postgresql", "resources":{"limits":{"memory":"2Gi", "cpu": "1"}, "requests":{"memory": "2Gi", "cpu": "1"} }}]}}}}'

That will trigger a RESTART in the workload.
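
To avoid hand-writing the JSON for each container, the patch body can be generated. The following is a minimal sketch, not a Juju API: build_resources_patch is a hypothetical helper, and the container names "charm" and "charm-init" are placeholders. It assumes the strategic merge patch matches initContainers by name in the same way it matches containers.

```python
import json
import shlex


def build_resources_patch(containers, init_containers, cpu, memory):
    """Return a patch dict that sets identical limits and requests."""
    resources = {
        "limits": {"cpu": cpu, "memory": memory},
        "requests": {"cpu": cpu, "memory": memory},
    }
    pod_spec = {
        "containers": [{"name": n, "resources": resources} for n in containers],
    }
    # Only add the initContainers key when there is something to patch.
    if init_containers:
        pod_spec["initContainers"] = [
            {"name": n, "resources": resources} for n in init_containers
        ]
    return {"spec": {"template": {"spec": pod_spec}}}


# "charm" and "charm-init" are placeholder names (assumption).
patch = build_resources_patch(["postgresql", "charm"], ["charm-init"], "1", "2Gi")
print("kubectl patch sts postgresql-k8s -n test -p " + shlex.quote(json.dumps(patch)))
```

The printed line is a kubectl command of the same shape as the one above, extended to every listed container.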

---------------------------------------------------------------------------------------

I've built a quick test environment with microk8s 1.27.5 in classic confinement. Here is what I have observed.

As-is, when we deploy a workload, each Pod of the statefulset comes up with neither limits nor requests. Each Pod has its QOSClass set to:
$ kubectl get po -n test postgresql-k8s-0 -o=yaml | grep -i qosclass
qosClass: BestEffort

Checking the specifics of that workload (postgresql): https://pastebin.ubuntu.com/p/XQrSn4hxbp/

Patching one of the containers to have limits and requests (with the same size, and for both CPU and memory):
$ kubectl patch sts postgresql-k8s -n test -p '{"spec":{"template":{"spec":{"containers":[{"name":"postgresql", "resources":{"limits":{"memory":"2Gi", "cpu": "1"}, "requests":{"memory": "2Gi", "cpu": "1"} }}]}}}}'
statefulset.apps/postgresql-k8s patched

This results in QOSClass=Burstable: https://pastebin.ubuntu.com/p/QS3Qh2jh3V/

It also results in pods with a better OOM score and enforced max memory.
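
The OOM score adjustment can be checked directly on the node by reading /proc/<pid>/oom_score_adj for a process of the container. A minimal sketch follows; read_oom_score_adj is a hypothetical helper, and the PID has to be looked up separately on the node.

```python
from pathlib import Path


def read_oom_score_adj(pid="self", proc_root="/proc"):
    """Return the oom_score_adj of a process as an int (-1000 to 1000)."""
    text = Path(proc_root, str(pid), "oom_score_adj").read_text()
    return int(text.strip())


# Example, on the node, for a PID found with ps (the PID is an assumption):
#   read_oom_score_adj(12345)
```

Comparing the value before and after the patch shows whether the container moved to a lower (safer) adjustment.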

If all containers are patched, including the init-containers, then the pod moves to QOSClass=Guaranteed.
To discover all containers' names, use:
$ kubectl get sts -o=json -n test postgresql-k8s | jq '.spec.template.spec.initContainers[].name'
$ kubectl get sts -o=json -n test postgresql-k8s | jq '.spec.template.spec.containers[].name'
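
The same lookup can be done in a script once the JSON has been captured. This is a minimal sketch: container_names is a hypothetical helper, and the sample document is a trimmed, made-up stand-in for the real `kubectl get sts -o=json` output.

```python
import json


def container_names(statefulset):
    """Return (init container names, container names) from a StatefulSet dict."""
    pod_spec = statefulset["spec"]["template"]["spec"]
    init = [c["name"] for c in pod_spec.get("initContainers", [])]
    main = [c["name"] for c in pod_spec.get("containers", [])]
    return init, main


# Made-up sample; real output carries many more fields and possibly other names.
sample = json.loads("""
{"spec": {"template": {"spec": {
  "initContainers": [{"name": "charm-init"}],
  "containers": [{"name": "charm"}, {"name": "postgresql"}]
}}}}
""")
init_names, main_names = container_names(sample)
print(init_names, main_names)
```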

Patching all the containers AND init-containers so that limits and requests have the same values, for both memory and CPU, results in QOSClass=Guaranteed:
https://pastebin.ubuntu.com/p/87Qb99hSTW/
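
The three outcomes observed above follow from the QOS rules, which can be sketched as below. This is a simplified illustration, not the Kubernetes implementation: qos_class is a hypothetical helper, quantities are compared as plain strings (so "1" and "1000m" would wrongly count as different), and only cpu and memory are considered.

```python
def qos_class(containers):
    """Classify a pod from the resources of all its containers.

    Pass init containers too. Each item is a dict such as
    {"limits": {...}, "requests": {...}}; either key may be absent.
    """
    tracked = ("cpu", "memory")
    any_set = False
    guaranteed = True
    for res in containers:
        limits = res.get("limits", {})
        # A request that is not given defaults to the limit.
        requests = {**limits, **res.get("requests", {})}
        if any(r in limits or r in requests for r in tracked):
            any_set = True
        for r in tracked:
            if r not in limits or requests.get(r) != limits[r]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


same = {
    "limits": {"cpu": "1", "memory": "2Gi"},
    "requests": {"cpu": "1", "memory": "2Gi"},
}
print(qos_class([{}, {}, {}]))        # BestEffort: nothing set anywhere
print(qos_class([same, {}, {}]))      # Burstable: only one container patched
print(qos_class([same, same, same]))  # Guaranteed: every container patched
```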

---------------------------------------------------------------------------------

As discussed in LP#2023782, limits and requests create cgroups to enforce these limitations. Each container receives its own cgroup. For example, the postgresql container and its processes are isolated in one cgroup:
https://pastebin.ubuntu.com/p/nkMhDwJQ64/

The charm container, in turn, has a different cgroup: https://pastebin.ubuntu.com/p/Rp55fByJJp/
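
To confirm which cgroup a given process belongs to, /proc/<pid>/cgroup can be read on the node. Each line has the form hierarchy-ID:controller-list:cgroup-path. A minimal parsing sketch follows; parse_proc_cgroup is a hypothetical helper and the sample paths are made up, not taken from the pastebins.

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup content into (id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


# Made-up cgroup v2 samples for two processes in different containers.
postgresql_proc = "0::/kubepods/pod-example/container-a\n"
charm_proc = "0::/kubepods/pod-example/container-b\n"

pg_path = parse_proc_cgroup(postgresql_proc)[0][2]
charm_path = parse_proc_cgroup(charm_proc)[0][2]
print(pg_path != charm_path)  # True when the processes sit in different cgroups
```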

---------------------------------------------------------------------------------

Conclusions:
1) It is possible to set QOS values post-deployment, via charms
2) Moving from BestEffort to Guaranteed drastically reduces the oom_score_adj. According to the documentation [1], the lower the oom_score_adj, the lower the chance that the OOM killer will kill the process
3) Guaranteed also gives the best chance to avoid eviction from Kubernetes [2]
4) However, if the workload crosses its memory limit, it has a very high chance of being OOM-killed [4]

Therefore, for workloads such as databases, running as Guaranteed is highly desirable.
I recommend we also set the Juju controller as Guaranteed if we can guarantee an upper bound on its memory consumption. That reduces the chance of it being OOM-killed.

As a side note, Pods must be Guaranteed to allow other setups, such as static CPU (i.e. pinning workloads to cores) [3]. That can be interesting for other workloads.
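
With the static policy, exclusive cores go only to containers in Guaranteed Pods whose CPU request is a whole number of cores. A minimal sketch of that check follows; both function names are hypothetical, and only plain and millicore ("m") quantities are handled.

```python
def cpu_millis(quantity):
    """Convert a CPU quantity such as "2", "0.5" or "1500m" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def gets_exclusive_cpus(qos, cpu_request):
    """True when the static CPU policy would pin this container to cores."""
    millis = cpu_millis(cpu_request)
    return qos == "Guaranteed" and millis > 0 and millis % 1000 == 0


print(gets_exclusive_cpus("Guaranteed", "1"))     # True
print(gets_exclusive_cpus("Guaranteed", "500m"))  # False: fractional request
print(gets_exclusive_cpus("Burstable", "1"))      # False: not Guaranteed
```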

From the Juju team, we need either APIs to configure the values above, OR a way for charms to edit these values, with defined moments at which Juju must override them (e.g. at upgrades).

-------------------------------------------------------------------------------

[1] https://www.kernel.org/doc/Documentation/filesystems/proc.txt -- Chapter 3:
Acceptable values range from -1000
(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). This allows userspace to
polarize the preference for oom killing either by always preferring a certain
task or completely disabling it. The lowest possible value, -1000, is
equivalent to disabling oom killing entirely for that task since it will always
report a badness score of 0.

[2] https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#guaranteed
[3] https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy

[4] https://docs.kernel.org/admin-guide/cgroup-v2.html
Memory usage hard limit. This is the main mechanism to limit memory usage of a cgroup. If a cgroup's memory usage reaches this limit and can't be reduced, the OOM killer is invoked in the cgroup. Under certain circumstances, the usage may go over the limit temporarily.

[5] Reference from Kubernetes source code where it decides to keep the Guaranteed status, as long as Limits and Requests have the same value:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kubectl/pkg/util/qos/qos.go#L96-L103


Pedro Guimarães (pguimaraes) wrote on 2023-10-12:

Regarding:
* https://bugs.launchpad.net/juju/+bug/2035102

It is also possible to set terminationGracePeriodSeconds post-deployment of the charm with:

$ kubectl patch sts postgresql-k8s -n test -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds": VALUE_OF_CHOICE}}}}'

That will trigger a restart in the statefulset.

Pedro Guimarães (pguimaraes) on 2023-10-12
description: updated

Pedro Guimarães (pguimaraes) on 2023-10-13
description: updated

Joseph Phillips (manadart) on 2023-10-19
Changed in juju:
status: New → Triaged
importance: Undecided → Wishlist

Pedro Guimarães (pguimaraes) on 2024-01-23
tags: added: canonical-data-platform-eng
