Excessive disk space taken up by /var/lib/cni/ files

Bug #1947386 reported by Steven Webster
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Steven Webster

Bug Description

Brief Description
-----------------
When a system with multiple pods has been up for a long time (months to years), the number of files in /var/lib/cni/results/ and /var/lib/cni/multus/, and the disk space they occupy, can grow very large.

These files represent a cache of the network attachments to the pods and should be cleaned up when the pod no longer exists. Usually this cleanup succeeds, but in certain circumstances, such as host reboots, some of the files are not cleaned up properly.

Severity
--------
Major: Given enough time, the stale cache files will fill up all available disk space.

Steps to Reproduce
------------------
Create multiple pods, preferably with multiple network attachments per pod. Continually reboot the host and observe the number of files grow in /var/lib/cni/results/* and /var/lib/cni/multus/*
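
One way to track the growth after each reboot is to compare file counts with the number of running pods (a minimal sketch; the use of crictl pods -q is an assumption, the report itself only mentions crictl ps -v):

    # Compare the number of cache files with the number of running pods
    ls /var/lib/cni/results/ | wc -l
    ls /var/lib/cni/multus/ | wc -l
    crictl pods -q | wc -l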

Expected Behavior
------------------
The files should be cleaned up once the pod they belong to no longer exists.

Actual Behavior
----------------
The number of files can grow continually. If a file is missed by automatic garbage collection, it remains forever unless removed manually.

Reproducibility
---------------
It seems to depend on the network attachment definitions and does not occur every time.

System Configuration
--------------------
N/A

Branch/Pull Time/Commit
-----------------------
master

Last Pass
---------
N/A

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Developer Testing

Workaround
----------
It is possible to clean up the files manually. Each file has the pod-id (given by crictl ps -v) embedded in its name. If a file is no longer associated with a running pod-id, it can be safely deleted; a sketch of this is shown below.
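
A minimal sketch of that manual cleanup, assuming the IDs printed by crictl pods -q are the same pod-ids embedded in the cache file names (keep the echo until the reported list has been reviewed, then swap in rm -f):

    # List cache files whose embedded pod-id no longer matches a running pod
    running=$(crictl pods -q)
    for f in /var/lib/cni/results/* /var/lib/cni/multus/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        stale=yes
        for id in $running; do
            case "$name" in *"$id"*) stale=no; break;; esac
        done
        [ "$stale" = yes ] && echo "stale: $f"
    done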

Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Steven Webster (swebster-wr)
Ghada Khalil (gkhalil)
tags: added: stx.6.0 stx.containers stx.networking
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/814439

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/814441

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/814439
Committed: https://opendev.org/starlingx/integ/commit/5d1a26b89d2ecb7d81c1e45b160e275da2ef11d1
Submitter: "Zuul (22348)"
Branch: master

commit 5d1a26b89d2ecb7d81c1e45b160e275da2ef11d1
Author: Steven Webster <email address hidden>
Date: Fri Oct 15 10:00:20 2021 -0400

    Implement CNI cache file cleanup for stale files

    It has been observed in systems running for months -> years
    that the CNI cache files (representing attributes of
    network attachment definitions of pods) can accumulate in
    large numbers in the /var/lib/cni/results/ and
    /var/lib/cni/multus/ directories.

    The cache files in /var/lib/cni/results/ have a naming signature of:

    <type>-<pod id>-<interface name>

    While the cache files in /var/lib/cni/multus have a naming signature
    of:

    <pod id>

    Normally these files are cleaned up automatically (I believe
    this is the responsibility of containerd). It has been seen
    that this happens reliably when one manually deletes a pod.

    The issue has been reproduced in the case of a host being manually
    rebooted. In this case, the pods are re-created when the host comes
    back up, but with a different pod-id than was used before.

    In this case, _most_ of the time the cache files from the previous
    instantiation of the pod are deleted, but occasionally a few are
    missed by the internal garbage collection mechanism.

    Once a cache file from the previous instantiation of a pod escapes
    garbage collection, it seems to be left as a stale file for all
    subsequent reboots. Over time, this can cause these stale files
    to accumulate and take up disk space unnecessarily.

    The script will be called once by the k8s-pod-recovery service
    on system startup, and then periodically via a cron job installed
    by puppet.

    The cleanup mechanism analyzes the cache files by name and
    compares them with the id(s) of the currently running pods. Any
    stale files detected are deleted.

    Test Plan:

    PASS: Verify existing pods do not have their cache files removed
    PASS: Verify files younger than the specified 'olderthan' time
          are not removed
    PASS: Verify stale cache files for pods that do not exist anymore
          are removed.
    PASS: Verify the script does not run if kubelet is not up yet.

    Failure Path:

    PASS: Verify files not matching the naming signature (pod id
          embedded in file name) are not processed

    Regression:

    PASS: Verify system install
    PASS: Verify feature logging

    Partial-Bug: 1947386

    Signed-off-by: Steven Webster <email address hidden>
    Change-Id: I0ce06646001e52d1cc6d204b924f41d049264b4c

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/814441
Committed: https://opendev.org/starlingx/stx-puppet/commit/4bbe235ea4ef4d8fd9758d28559d11e5a5e3060c
Submitter: "Zuul (22348)"
Branch: master

commit 4bbe235ea4ef4d8fd9758d28559d11e5a5e3060c
Author: Steven Webster <email address hidden>
Date: Fri Oct 15 14:21:06 2021 -0400

    Implement CNI cache file cleanup for stale files

    It has been observed in systems running for months -> years
    that the CNI cache files (representing attributes of
    network attachment definitions of pods) can accumulate in
    large numbers in the /var/lib/cni/results/ and
    /var/lib/cni/multus/ directories.

    The cache files in /var/lib/cni/results/ have a naming signature of:

    <type>-<pod id>-<interface name>

    While the cache files in /var/lib/cni/multus have a naming signature
    of:

    <pod id>

    Normally these files are cleaned up automatically (I believe
    this is the responsibility of containerd). It has been seen
    that this happens reliably when one manually deletes a pod.

    The issue has been reproduced in the case of a host being manually
    rebooted. In this case, the pods are re-created when the host comes
    back up, but with a different pod-id than was used before.

    In this case, _most_ of the time the cache files from the previous
    instantiation of the pod are deleted, but occasionally a few are
    missed by the internal garbage collection mechanism.

    Once a cache file from the previous instantiation of a pod escapes
    garbage collection, it seems to be left as a stale file for all
    subsequent reboots. Over time, this can cause these stale files
    to accumulate and take up disk space unnecessarily.

    This commit attempts to alleviate the problem by introducing
    a CNI cache cleanup script which runs as a cron job every 24 hours
    and deletes files which are over 1 day old.

    The cleanup mechanism analyzes the cache files by name and
    compares them with the id(s) of the currently running pods. Any
    stale files detected are deleted.

    TEST PLAN:

    - Confirm job runs at prescribed time
    - Confirm existing pods cache files are not deleted
    - Confirm stale cache files from no longer existing pods are
      deleted after the file is 6 hours old.
    - Confirm stale cache files from no longer existing pods are
      not deleted if the file is younger than 6 hours old.
    - Confirm the script does not run if kubelet is not up yet

    Depends-On: https://review.opendev.org/c/starlingx/integ/+/814439
    Closes-Bug: 1947386

    Signed-off-by: Steven Webster <email address hidden>
    Change-Id: Ife36b48ef97d4a7a9477bbb47bf4b0fc16b8a776

Changed in starlingx:
status: In Progress → Fix Released