StarlingX

Bug #1865924
Comment #0

Tee Ngo (teewrs) wrote on 2020-03-03:

Brief Description
-----------------
The following issue was observed in a distributed cloud configuration: the /var/log partition filled up because a large number of deleted files were still held open by filebeat, so their disk space was never freed.

Severity
--------
Critical

Steps to Reproduce
------------------
Set up a large distributed cloud with stx-monitor applied and soak it for a few days with some test activities such as deploying, managing/unmanaging and removing subclouds.

Expected Behavior
------------------
Service logs are saved to disk and rotated accordingly.

Actual Behavior
----------------
The logmgmt process was hogging CPU and no logs were flushed to disk. Log files were rotated rapidly with almost no content, and critical alarms were generated.

The problem documented here (courtesy of Al Bailey):
https://www.elastic.co/guide/en/beats/filebeat/master/faq-deleted-files-are-not-freed.html
Filebeat keeping deleted files open, so that their disk space is not freed, might be the cause of this issue.

Reproducibility
---------------
Seen once

System Configuration
--------------------
IPv6 Distributed Cloud

Branch/Pull Time/Commit
-----------------------
Feb 22 master code

Last Pass
---------
N/A

Timestamp/Logs
--------------
As logs were not flushed to disk, there are no service logs to provide.
See the attached list of deleted files, produced by running the command "sudo lsof|grep deleted".
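
To quantify how much space each process is holding, the lsof check above can be approximated by walking /proc directly. The following is only a minimal sketch, not part of the original report: it assumes a Linux host, needs root to see other processes' descriptors, and the function names (find_deleted_open_files, total_by_command, format_report) are hypothetical. It relies on the kernel appending " (deleted)" to the link target of an unlinked file that is still open.

```python
import os
from collections import defaultdict

DELETED_SUFFIX = " (deleted)"


def find_deleted_open_files(proc_root="/proc", prefix="/var/log"):
    """Yield (command, pid, path, size_bytes) for deleted files still held open.

    Hypothetical helper: scans <proc_root>/<pid>/fd for symlinks whose target
    starts with `prefix` and ends with the " (deleted)" marker.
    """
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
            with open(os.path.join(proc_root, pid, "comm")) as f:
                command = f.read().strip()
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        for fd in fds:
            fd_path = os.path.join(fd_dir, fd)
            try:
                target = os.readlink(fd_path)
            except OSError:
                continue  # descriptor was closed while we were scanning
            if not (target.startswith(prefix) and target.endswith(DELETED_SUFFIX)):
                continue
            try:
                # On a real /proc, stat() follows the link to the open file.
                size = os.stat(fd_path).st_size
            except OSError:
                size = 0  # size unknown; still report the entry
            yield command, int(pid), target[: -len(DELETED_SUFFIX)], size


def total_by_command(entries):
    """Sum sizes per command name.

    A file opened through several descriptors is counted once per descriptor,
    so the totals are an upper bound.
    """
    totals = defaultdict(int)
    for command, _pid, _path, size in entries:
        totals[command] += size
    return dict(totals)


def format_report(totals):
    """Return one line per command, largest consumer first."""
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [f"{cmd}: {size / 1024**2:.1f} MiB in deleted files" for cmd, size in ranked]
```

Running print("\n".join(format_report(total_by_command(find_deleted_open_files())))) as root should show whether filebeat accounts for most of the space held by deleted files under /var/log.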

Test Activity
-------------
Evaluation

Workaround
----------
Kill the logmgmt process and delete the filebeat pods.