Comment 4 for bug 1947386

OpenStack Infra (hudson-openstack) wrote on 2021-11-05: Fix merged to stx-puppet (master)

Reviewed:  https://review.opendev.org/c/starlingx/stx-puppet/+/814441
Committed: https://opendev.org/starlingx/stx-puppet/commit/4bbe235ea4ef4d8fd9758d28559d11e5a5e3060c
Submitter: "Zuul (22348)"
Branch:    master

commit 4bbe235ea4ef4d8fd9758d28559d11e5a5e3060c
Author: Steven Webster <steven.webster@windriver.com>
Date:   Fri Oct 15 14:21:06 2021 -0400

Implement CNI cache file cleanup for stale files
    
    It has been observed in systems running for months to years
    that the CNI cache files (representing attributes of
    network attachment definitions of pods) can accumulate in
    large numbers in the /var/lib/cni/results/ and
    /var/lib/cni/multus/ directories.
    
    The cache files in /var/lib/cni/results/ have a naming signature of:
    
    <type>-<pod id>-<interface name>
    
    While the cache files in /var/lib/cni/multus have a naming signature
    of:
    
    <pod id>
    
    Normally these files are cleaned up automatically (I believe
    this is the responsibility of containerd).  It has been seen
    that this happens reliably when one manually deletes a pod.
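
The two naming signatures described above can be parsed roughly as follows. This is only an illustrative sketch, not the script from this commit: the function names are hypothetical, and it assumes pod ids are 64-character lowercase hex strings, which the commit message does not state.

```python
import re
from typing import Optional

# Assumption (not from the commit message): a pod id is a 64-character
# lowercase hex string.
_POD_ID = r"[0-9a-f]{64}"
_RESULTS_NAME = re.compile(rf"(?P<type>.+)-(?P<pod_id>{_POD_ID})-(?P<ifname>.+)")
_MULTUS_NAME = re.compile(_POD_ID)


def pod_id_from_results_name(name: str) -> Optional[str]:
    """Return <pod id> from a '<type>-<pod id>-<interface name>' file name."""
    match = _RESULTS_NAME.fullmatch(name)
    return match.group("pod_id") if match else None


def pod_id_from_multus_name(name: str) -> Optional[str]:
    """Return the file name itself if it looks like a bare '<pod id>'."""
    return name if _MULTUS_NAME.fullmatch(name) else None
```

Names that do not fit either signature yield None, so a caller can skip unknown files instead of guessing.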
    
    The issue has been reproduced by manually rebooting a host.  The
    pods are re-created when the host comes back up, but with a
    different pod id than was used before.
    
    _Most_ of the time the cache files from the previous instantiation
    of the pod are then deleted, but occasionally a few are missed by
    the internal garbage collection mechanism.
    
    Once a cache file from the previous instantiation of a pod escapes
    garbage collection, it seems to be left as a stale file for all
    subsequent reboots.  Over time, this can cause these stale files
    to accumulate and take up disk space unnecessarily.
    
    This commit attempts to alleviate the problem by introducing
    a CNI cache cleanup script which runs as a cron job every 24 hours
    and deletes files which are over 1 day old.
    
    The cleanup mechanism analyzes the cache files by name and
    compares them with the id(s) of the currently running pods. Any
    stale files detected are deleted.
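
The comparison described above could look roughly like the following. It is a minimal sketch under stated assumptions, not the script added by this commit: the function names are hypothetical, the caller is expected to supply the ids of the currently running pods (how they are obtained is not covered here), and the minimum file age is a parameter because the commit message does not pin it to one value. Treating an empty id set as "delete nothing" is a conservative choice made for this sketch.

```python
import os
import time
from typing import Iterable, List, Optional


def find_stale_cache_files(cache_dir: str,
                           running_pod_ids: Iterable[str],
                           min_age_seconds: float,
                           now: Optional[float] = None) -> List[str]:
    """List cache files whose name matches no running pod id and which
    are at least min_age_seconds old."""
    ids = set(running_pod_ids)
    if not ids:
        # Sketch-only safeguard: with no known pod ids (for example the
        # pod list could not be read), nothing is considered stale.
        return []
    now = time.time() if now is None else now
    stale = []
    for entry in os.scandir(cache_dir):
        if not entry.is_file(follow_symlinks=False):
            continue
        # Both naming signatures embed the pod id in the file name.
        if any(pod_id in entry.name for pod_id in ids):
            continue
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age >= min_age_seconds:
            stale.append(entry.path)
    return sorted(stale)


def remove_stale_cache_files(cache_dir: str,
                             running_pod_ids: Iterable[str],
                             min_age_seconds: float,
                             now: Optional[float] = None) -> List[str]:
    """Delete the stale files and return the paths that were removed."""
    stale = find_stale_cache_files(cache_dir, running_pod_ids,
                                   min_age_seconds, now)
    for path in stale:
        os.remove(path)
    return stale
```

Splitting the lookup from the deletion makes it possible to log or review the candidate list before anything is removed.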
    
    TEST PLAN:
    
    - Confirm the job runs at the prescribed time
    - Confirm existing pods' cache files are not deleted
    - Confirm stale cache files from no longer existing pods are
      deleted after the file is 6 hours old.
    - Confirm stale cache files from no longer existing pods are
      not deleted if the file is younger than 6 hours old.
    - Confirm the script does not run if kubelet is not up yet
    
    Depends-On: https://review.opendev.org/c/starlingx/integ/+/814439
    Closes-Bug: 1947386
    
    Signed-off-by: Steven Webster <steven.webster@windriver.com>
    Change-Id: Ife36b48ef97d4a7a9477bbb47bf4b0fc16b8a776