kubelet image garbage collection settings too high with no mechanism to reconfigure

Bug #1977754 reported by Jim Gauld
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      Jim Gauld

Bug Description

Brief Description
-----------------
The /var/lib/docker file-system alarm threshold is hit before the kubelet's configurable image garbage collection (imageGC) kicks in. A global 80% alarm threshold applies to all file-systems, whereas the kubelet default imageGC high threshold is 85%.

If there is extraneous remnant docker data, or if a customer has large images (e.g., 2.5GB), little room is left once usage goes beyond 85%, which is where pod hard-evictions occur and new pods can no longer be scheduled on the node due to Node pressure. The kubelet default hard-eviction limit for the image file-system is 15% available, i.e., 85% used. This means there is effectively zero room left at 86% /var/lib/docker usage before pods stop scheduling, so the remaining docker space cannot be used.

Need to reduce the imageGC high threshold below 80%.
Should configure the hard-eviction imagefs.available limit to a more reasonable value like 1GiB or 2GiB instead of 15% (which translates to 4.5GiB of a 30GiB file-system).
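
The arithmetic behind these numbers can be sketched as below. This is only an illustration, assuming the 30GiB /var/lib/docker file-system from the example above; the function and constant names are hypothetical and are not kubelet APIs.

```python
# Illustrative arithmetic only; names are hypothetical, not kubelet APIs.
FS_SIZE_GIB = 30.0            # /var/lib/docker size used in the example above
IMAGE_GC_HIGH_PCT = 85        # kubelet default imageGC high threshold
EVICTION_HARD_AVAIL_PCT = 15  # kubelet default hard-eviction limit for imagefs
PLATFORM_ALARM_PCT = 80       # global file-system alarm threshold


def eviction_usage_pct(avail_pct: float) -> float:
    """Usage percentage at which available space drops to the eviction limit."""
    return 100.0 - avail_pct


def reserved_gib(fs_size_gib: float, avail_pct: float) -> float:
    """Space that stays unused because hard eviction triggers first."""
    return fs_size_gib * avail_pct / 100.0


evict_at = eviction_usage_pct(EVICTION_HARD_AVAIL_PCT)
unusable = reserved_gib(FS_SIZE_GIB, EVICTION_HARD_AVAIL_PCT)
headroom = evict_at - IMAGE_GC_HIGH_PCT
gc_after_alarm = IMAGE_GC_HIGH_PCT > PLATFORM_ALARM_PCT

print(f"hard eviction begins at {evict_at:.0f}% usage")
print(f"unusable space: {unusable:.1f} GiB of {FS_SIZE_GIB:.0f} GiB")
print(f"points between imageGC start and eviction: {headroom:.0f}")
print(f"imageGC starts only after the platform alarm: {gc_after_alarm}")
```

With the defaults, imageGC and hard eviction both trigger at the same usage level, which is why there is no room for imageGC to recover space before pods are evicted.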

Once a system has been installed via 'kubeadm', there is currently no mechanism to change any kubelet configuration settings from their initial values.

Need a mechanism to update kubelet-config values and persist those changes on kubernetes nodes.

Severity
--------
Major: Cannot update kubelet-config settings. Sites require manual monitoring and periodic manual steps to recover /var/lib/docker usage. There is potential for a site to 'blow up'.

Steps to Reproduce
------------------
Fresh install ISO. Manually pull in various docker images.
The initial n3000-opae docker image and other manual docker images/pulls remain.

Expected Behavior
------------------
Expect to see kubelet logs showing image garbage collection (imageGC) kicking in and automatically removing images before the 80% file-system alarms are raised. Expect no /var/lib/docker file-system alarms.

daemon.log.7.gz:2022-05-26T16:30:41.296 controller-0 kubelet[94391]: info I0526 16:30:41.296674 94391 image_gc_manager.go:304] "Disk usage on image filesystem is over the high threshold, trying to free bytes down to the low threshold" usage=85 highThreshold=85 amountToFree=1394143232 lowThreshold=80
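
To pull the threshold values out of such kubelet lines when scanning logs, something like the following could be used. It is a rough sketch that only assumes the key=value layout seen in the line above; the helper name is made up for illustration.

```python
import re

# Sample line copied from the kubelet log above.
LOG_LINE = (
    'daemon.log.7.gz:2022-05-26T16:30:41.296 controller-0 kubelet[94391]: info '
    'I0526 16:30:41.296674 94391 image_gc_manager.go:304] "Disk usage on image '
    'filesystem is over the high threshold, trying to free bytes down to the low '
    'threshold" usage=85 highThreshold=85 amountToFree=1394143232 lowThreshold=80'
)

# Integer key=value fields present in this particular message.
FIELD_RE = re.compile(r"\b(usage|highThreshold|amountToFree|lowThreshold)=(\d+)")


def parse_image_gc_line(line: str) -> dict:
    """Return the integer key=value fields found in an imageGC threshold line."""
    return {key: int(value) for key, value in FIELD_RE.findall(line)}


fields = parse_image_gc_line(LOG_LINE)
print(fields)
print(f"amountToFree is {fields['amountToFree'] / 2**30:.2f} GiB")
```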

Actual Behavior
----------------
See /var/lib/docker disk usage hitting 85% before imageGC removes any images.
See pods being evicted, and then new pods failing to schedule due to Node pressure, when usage hits 86%. Unable to actually use the last 4.5GiB of the /var/lib/docker filesystem.
The /var/lib/docker/x directory may contain remnants of the initial install, whereas all the new CRI content is under /var/lib/docker/io.containerd.* .

Reproducibility
---------------
100%. We always get default imageGC and hard-eviction settings.
We always see some docker usage from initial install that never gets removed.
Occasionally we see huge docker usage outside of CRI due to stuff that was never removed.

System Configuration
--------------------
AIO-DX. All K8S configurations.

Branch/Pull Time/Commit
-----------------------
BUILD_DATE="2022-06-01 13:08:06 -0400"

Last Pass
---------
Day one issue.

Timestamp/Logs
--------------
The collectd file-system logs show usage changing over time, with step jumps:
zgrep collectd daemon.log.4.gz daemon.log.3.gz daemon.log.2.gz daemon.log.1.gz daemon.log | grep -e reading | grep docker
daemon.log.3.gz:2022-01-12T13:51:06.853 controller-0 collectd[2718413]: info alarm notifier reading: 72.47 % usage - /var/lib/docker
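
A small parser for these collectd readings might look like the sketch below. It assumes the line format shown above; the regular expression and helper name are illustrative, and the 80% value is the global alarm threshold from the description.

```python
import re

# Sample line copied from the collectd output above.
SAMPLE = (
    "daemon.log.3.gz:2022-01-12T13:51:06.853 controller-0 collectd[2718413]: "
    "info alarm notifier reading: 72.47 % usage - /var/lib/docker"
)

ALARM_PCT = 80.0  # global file-system alarm threshold

READING_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) \S+ collectd\[\d+\]: "
    r"info alarm notifier reading: (?P<pct>\d+(?:\.\d+)?) % usage - (?P<mount>\S+)"
)


def parse_reading(line):
    """Return (timestamp, usage_pct, mount), or None if the line does not match."""
    match = READING_RE.search(line)
    if match is None:
        return None
    return match.group("ts"), float(match.group("pct")), match.group("mount")


reading = parse_reading(SAMPLE)
if reading is not None:
    ts, pct, mount = reading
    state = "over" if pct >= ALARM_PCT else "under"
    print(f"{ts} {mount}: {pct:.2f}% ({state} the {ALARM_PCT:.0f}% alarm threshold)")
```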

When kubelet image GC runs, will see logs like:
2022-04-22T20:02:54.303 controller-0 kubelet[2311223]: info I0422 20:02:54.303461 2311223 image_gc_manager.go:376] [imageGCManager]: Removing image "sha256:5d0da3dc976460b72c77d94c8a1ad043720b0416bfc16c52c45d4847e53fadb6" to free 83521519 bytes

Will see overall file-system usage like this:
Filesystem Type 1M-blocks Used Available Use% Mounted on
/dev/mapper/cgts--vg-docker--lv xfs 30705 3294 27412 11% /var/lib/docker/

When too much space is consumed outside of CRI (e.g., 6GB or much more), image GC can no longer free enough space, and logs like this appear:
daemon.log.11.gz:2022-01-12T00:35:15.718 controller-0 kubelet[88933]: info E0112 00:35:15.718627 88933 kubelet.go:1305] Image garbage collection failed once. Stats initialization may not have completed yet: failed to garbage collect required amount of images. Wanted to free 1316421632 bytes, but freed 297819 bytes
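
The shortfall in such a failure can be extracted as sketched below. This assumes only the "Wanted to free ... but freed ..." wording seen in the line above; the helper name is hypothetical.

```python
import re

# Sample line copied from the kubelet log above.
LOG_LINE = (
    "daemon.log.11.gz:2022-01-12T00:35:15.718 controller-0 kubelet[88933]: info "
    "E0112 00:35:15.718627 88933 kubelet.go:1305] Image garbage collection failed "
    "once. Stats initialization may not have completed yet: failed to garbage "
    "collect required amount of images. Wanted to free 1316421632 bytes, but "
    "freed 297819 bytes"
)

FAIL_RE = re.compile(r"Wanted to free (\d+) bytes, but freed (\d+) bytes")


def gc_shortfall(line):
    """Return (wanted, freed, shortfall) in bytes, or None if no match."""
    match = FAIL_RE.search(line)
    if match is None:
        return None
    wanted, freed = int(match.group(1)), int(match.group(2))
    return wanted, freed, wanted - freed


result = gc_shortfall(LOG_LINE)
if result is not None:
    wanted, freed, shortfall = result
    print(f"wanted {wanted} bytes, freed {freed} bytes, short by {shortfall} bytes")
```

A large shortfall like this one suggests the space is held by data that kubelet image GC does not manage, which matches the non-CRI docker remnants described above.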

Test Activity
-------------
Feature Testing, Evaluation

Workaround
----------
Periodically and manually clean up /var/lib/docker using commands like:
docker system prune --force
crictl rmi --prune

Can manually inspect and remove individual images too,
e.g., "docker rmi x", "crictl rmi x".

Manually clean up exited containers from evicted pods like this:
crictl ps --state=Exited --quiet | xargs -r -I {} crictl rm {}

This does not address being able to reconfigure kubelet settings.
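
For sites that script the periodic cleanup, the commands above could be wrapped roughly as follows. This is a sketch under assumptions: the function names are invented for illustration, the commands are exactly the ones listed above, and nothing runs unless run_cleanup() is called explicitly on a node that has docker and crictl installed.

```python
import subprocess


def build_cleanup_commands(exited_container_ids):
    """Return the workaround commands as argv lists, in the order shown above."""
    commands = [
        ["docker", "system", "prune", "--force"],
        ["crictl", "rmi", "--prune"],
    ]
    # One 'crictl rm' per exited container, mirroring the xargs pipeline.
    commands += [["crictl", "rm", cid] for cid in exited_container_ids]
    return commands


def list_exited_containers():
    """Ask crictl for the IDs of exited containers (requires crictl on the node)."""
    out = subprocess.run(
        ["crictl", "ps", "--state=Exited", "--quiet"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()


def run_cleanup():
    """Execute the full workaround; call this only on a Kubernetes node."""
    for command in build_cleanup_commands(list_exited_containers()):
        subprocess.run(command, check=True)


# Print the fixed part of the plan without executing anything.
for command in build_cleanup_commands([]):
    print(" ".join(command))
```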

Jim Gauld (jgauld)
Changed in starlingx:
assignee: nobody → Jim Gauld (jgauld)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/844305
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/9400162956a032747866cb2dc3a1a052bf014a5b
Submitter: "Zuul (22348)"
Branch: master

commit 9400162956a032747866cb2dc3a1a052bf014a5b
Author: Jim Gauld <email address hidden>
Date: Wed Jun 1 10:24:46 2022 -0400

    Updated kubelet imageGC and evictionHard settings

    Configure kubelet-config settings for image garbage collection and
    hard eviction. New settings reduce likelihood of Node-Pressure
    Eviction that occurs essentially near 86% /var/lib/docker usage.

    The upstream default imageGCHighThresholdPercent of 85 is too high,
    especially with evictionHard imagefs.available default of 15%.

    The new image garbage collection parameters are engineered below
    the system global default 80% file-system threshold. This allows
    kubelet imageGC to cleanup space prior to hitting /var/lib/docker
    alarms.

    The evictionHard imagefs.available is reduced to 2Gi,
    from the previous setting 15% which translated to 4.5Gi.

    TESTING:
    PASS - AIO-DX fresh install gets updated kubelet config
    PASS - manually fill /var/lib/docker to exceed imageGC and
           verify GC operates
    PASS - manually fill /var/lib/docker to exceed 'size - 2Gi'
           and verify Node-Pressure eviction

    Partial-Bug: 1977754

    Signed-off-by: Jim Gauld <email address hidden>
    Change-Id: I5c5c7ba5dfcd8f854084ee954338d974726ea453

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/844298
Committed: https://opendev.org/starlingx/stx-puppet/commit/a430b9a203ef938bf8cc7030d03b13ddbba6513a
Submitter: "Zuul (22348)"
Branch: master

commit a430b9a203ef938bf8cc7030d03b13ddbba6513a
Author: Jim Gauld <email address hidden>
Date: Wed May 18 11:51:14 2022 -0400

    Add method to reconfigure kubelet at runtime

    This adds two runtime methods to reconfigure kubelet:
    platform::kubernetes::master::update_kubelet_params::runtime
    - this updates the kubelet-config ConfigMap with new parameters

    platform::kubernetes::update_kubelet_config::runtime
    - on each node, 'kubeadm upgrade node phase kubelet-config'
      is used to regenerate the /var/lib/kubelet/config.yaml file,
      then kubelet is restarted.

    Along with this new configuration update mechanism, new kubelet-config
    values from puppet are formatted with update script, i.e.,
    imageGCHighThresholdPercent: 79
    imageGCLowThresholdPercent: 75
    evictionHard:
      imagefs.available: 2Gi

    These new settings reduce the likelihood of Node-Pressure Eviction
    that occurs essentially near 86% /var/lib/docker usage. The
    upstream default imageGCHighThresholdPercent of 85 is too high,
    especially with evictionHard imagefs.available default of 15%.

    The new image garbage collection parameters are engineered below
    the system global default 80% file-system threshold. This allows
    kubelet imageGC to cleanup space prior to hitting /var/lib/docker
    alarms.

    The evictionHard imagefs.available is reduced to 2Gi,
    from the previous setting 15% which translated to 4.5Gi.

    TESTING:
    PASS - manually fill /var/lib/docker to exceed imageGC and
           verify GC operates
    PASS - AIO-DX fresh install gets updated kubelet config
    PASS - AIO-DX apply/remove designer patch with updated kubelet config
    PASS - 'system kube-config-kubelet' updates K8S nodes kubelet config
    PASS - AIO-DX reinstall controller-1 has updated kubelet config
    PASS - AIO-DX install new worker node gets updated kubelet config

    Partial-Bug: 1977754

    Signed-off-by: Jim Gauld <email address hidden>
    Change-Id: If634a8f59be3c13bf48612c7c67ca2802a03fc28
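
The effect of the new values from the commit message above can be checked with a short sketch like the following. It is not the stx-puppet update script: the deep_merge and eviction_usage_pct helpers are hypothetical, the defaults dictionary lists only the settings discussed in this report, and the 30705 MiB size is taken from the df output in the bug description.

```python
import copy

ALARM_PCT = 80       # global file-system alarm threshold
FS_SIZE_MIB = 30705  # /var/lib/docker size from the df output in this report

# Upstream defaults cited in this report (other evictionHard keys omitted).
DEFAULTS = {
    "imageGCHighThresholdPercent": 85,
    "imageGCLowThresholdPercent": 80,
    "evictionHard": {"imagefs.available": "15%"},
}

# New values from the commit message above.
OVERRIDES = {
    "imageGCHighThresholdPercent": 79,
    "imageGCLowThresholdPercent": 75,
    "evictionHard": {"imagefs.available": "2Gi"},
}


def deep_merge(base, overrides):
    """Return a copy of base with overrides applied, merging nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def eviction_usage_pct(threshold, fs_size_mib):
    """Usage % at which available space falls to the threshold.

    Handles only the two forms that appear in this report: 'N%' and 'NGi'.
    """
    if threshold.endswith("%"):
        return 100.0 - float(threshold[:-1])
    if threshold.endswith("Gi"):
        reserved_mib = float(threshold[:-2]) * 1024
        return 100.0 * (1 - reserved_mib / fs_size_mib)
    raise ValueError(f"unsupported threshold form: {threshold}")


config = deep_merge(DEFAULTS, OVERRIDES)
low = config["imageGCLowThresholdPercent"]
high = config["imageGCHighThresholdPercent"]
evict_at = eviction_usage_pct(config["evictionHard"]["imagefs.available"], FS_SIZE_MIB)

print(f"imageGC window: {low}% to {high}% (alarm at {ALARM_PCT}%)")
print(f"hard eviction at about {evict_at:.1f}% usage")
print(f"imageGC runs below the alarm threshold: {low < high < ALARM_PCT}")
```

With these values imageGC starts before the 80% alarm, and hard eviction moves well above the imageGC high threshold instead of coinciding with it.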

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/844317
Committed: https://opendev.org/starlingx/config/commit/d57d3a07b8e2897f1cf087c7fd757d67645b0ffe
Submitter: "Zuul (22348)"
Branch: master

commit d57d3a07b8e2897f1cf087c7fd757d67645b0ffe
Author: Jim Gauld <email address hidden>
Date: Wed Jun 1 11:24:14 2022 -0400

    Add runtime reconfiguration of kubelet

    This adds the CLI command 'system kube-config-kubelet'. This invokes
    puppet runtime manifests to reconfigure kubelet-config ConfigMap
    with new parameters, and to upgrade kubernetes nodes with new
    parameters, and restart kubelet. This gives the ability to update
    kubelet parameters with a software patch.

    The specific kubelet-config parameters are provided within the puppet
    manifests and its supporting parameters script. The specific settings
    values and engineering are described in the puppet component.
    Identical settings are also configured at install time in
    ansible-playbooks.

    TESTING:
    PASS - manually fill /var/lib/docker to exceed imageGC and
           verify GC operates
    PASS - AIO-DX fresh install gets updated kubelet config
    PASS - AIO-DX apply/remove designer patch with updated kubelet config
    PASS - 'system kube-config-kubelet' updates K8S nodes kubelet config
    PASS - AIO-DX reinstall controller-1 has updated kubelet config
    PASS - AIO-DX install new worker node gets updated kubelet config
    PASS - build and view REST documentation

    Partial-Bug: 1977754
    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/844298
    Depends-On: https://review.opendev.org/c/starlingx/ansible-playbooks/+/844305

    Signed-off-by: Jim Gauld <email address hidden>
    Change-Id: Iad32a724d3f681bc9854fa663299f8539f70fd2a

Ghada Khalil (gkhalil)
tags: added: stx.containers
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
Jim Gauld (jgauld)
Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

The above fixes pre-date the creation of the r/stx.7.0 branch, so updating the release tag accordingly.

tags: added: stx.7.0