commit 4bbe235ea4ef4d8fd9758d28559d11e5a5e3060c
Author: Steven Webster <email address hidden>
Date: Fri Oct 15 14:21:06 2021 -0400
Implement CNI cache file cleanup for stale files
It has been observed on systems running for months to years
that the CNI cache files (representing attributes of
network attachment definitions of pods) can accumulate in
large numbers in the /var/lib/cni/results/ and
/var/lib/cni/multus/ directories.
The cache files in /var/lib/cni/results/ have a naming signature of:
<type>-<pod id>-<interface name>
The cache files in /var/lib/cni/multus/ have a naming signature
of:
<pod id>
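As an illustration of the two naming signatures, here is a minimal Python sketch of how the pod id could be pulled out of a cache file name. It assumes that neither the pod id nor the interface name contains a hyphen, so only <type> may. The class and function names are hypothetical and are not part of this commit.

```python
from typing import NamedTuple, Optional


class ResultsCacheName(NamedTuple):
    """Fields of a /var/lib/cni/results/ cache file name."""
    cni_type: str
    pod_id: str
    interface: str


def parse_results_name(filename: str) -> Optional[ResultsCacheName]:
    """Split '<type>-<pod id>-<interface name>' from the right.

    Assumption: the pod id and the interface name contain no hyphens,
    so any extra hyphens belong to <type>. Returns None when the name
    does not break into three non-empty fields.
    """
    parts = filename.rsplit("-", 2)
    if len(parts) != 3 or not all(parts):
        return None
    return ResultsCacheName(*parts)


def parse_multus_name(filename: str) -> str:
    """A /var/lib/cni/multus/ cache file name is the pod id itself."""
    return filename
```

Splitting from the right keeps a hyphenated <type> intact: under the stated assumption, parse_results_name("net-a-0123abcd-eth0") yields the pod id "0123abcd".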
Normally these files are cleaned up automatically (I believe
this is the responsibility of containerd). It has been seen
that this happens reliably when one manually deletes a pod.
The issue has been reproduced in the case of a host being manually
rebooted. In this case, the pods are re-created when the host comes
back up, but with a different pod id than was used before.
_Most_ of the time the cache files from the previous
instantiation of the pod are deleted, but occasionally a few are
missed by the internal garbage collection mechanism.
Once a cache file from the previous instantiation of a pod escapes
garbage collection, it seems to be left as a stale file for all
subsequent reboots. Over time, this can cause these stale files
to accumulate and take up disk space unnecessarily.
This commit attempts to alleviate the problem by introducing
a CNI cache cleanup script which runs as a cron job every 24 hours
and deletes files which are over 1 day old.
The cleanup mechanism analyzes the cache files by name and
compares them with the id(s) of the currently running pods. Any
stale files detected are deleted.
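The comparison described above could be sketched in Python roughly as follows. This is not the script added by this commit. The function names are hypothetical, the list of running pod ids is taken as an input (how it is obtained is outside the sketch), and the age threshold is a parameter. File names are assumed to follow the signatures given earlier, with no hyphens in the pod id or interface name.

```python
import time
from pathlib import Path
from typing import Iterable, List, Optional


def pod_id_from_name(filename: str, has_type_prefix: bool) -> Optional[str]:
    """Return the pod id encoded in a cache file name, or None.

    has_type_prefix=True  -> '<type>-<pod id>-<interface name>'
    has_type_prefix=False -> '<pod id>'
    """
    if not has_type_prefix:
        return filename or None
    parts = filename.rsplit("-", 2)
    return parts[1] if len(parts) == 3 and parts[1] else None


def find_stale_files(cache_dir: str,
                     running_pod_ids: Iterable[str],
                     min_age_seconds: float,
                     has_type_prefix: bool,
                     now: Optional[float] = None) -> List[Path]:
    """List cache files whose pod id is not running and which are old enough."""
    now = time.time() if now is None else now
    running = set(running_pod_ids)
    directory = Path(cache_dir)
    stale: List[Path] = []
    if not directory.is_dir():
        return stale
    for entry in sorted(directory.iterdir()):
        if not entry.is_file():
            continue
        pod_id = pod_id_from_name(entry.name, has_type_prefix)
        if pod_id is None or pod_id in running:
            continue  # unparseable names and live pods are left alone
        if now - entry.stat().st_mtime >= min_age_seconds:
            stale.append(entry)
    return stale


def delete_files(paths: Iterable[Path]) -> int:
    """Delete the given files and return how many were actually removed."""
    removed = 0
    for path in paths:
        try:
            path.unlink()
            removed += 1
        except FileNotFoundError:
            pass  # already gone, e.g. cleaned up by the runtime meanwhile
    return removed
```

Keeping detection separate from deletion allows a dry run that only prints candidates. The age check protects files belonging to pods that are still being created when the list of running pods is sampled.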
TEST PLAN:
- Confirm job runs at prescribed time
- Confirm existing pods' cache files are not deleted
- Confirm stale cache files from no longer existing pods are
deleted after the file is 6 hours old.
- Confirm stale cache files from no longer existing pods are
not deleted if the file is younger than 6 hours old.
- Confirm the script does not run if kubelet is not up yet
Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/814441
Committed: https://opendev.org/starlingx/stx-puppet/commit/4bbe235ea4ef4d8fd9758d28559d11e5a5e3060c
Submitter: "Zuul (22348)"
Branch: master
Depends-On: https://review.opendev.org/c/starlingx/integ/+/814439
Closes-Bug: 1947386
Signed-off-by: Steven Webster <email address hidden>
Change-Id: Ife36b48ef97d4a7a9477bbb47bf4b0fc16b8a776