A large number of files generated by filebeat pod are not removed

Bug #1865924 reported by Tee Ngo
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  High        Kevin Smith

Bug Description

Brief Description
-----------------
The following issue was observed in a distributed cloud configuration: the /var/log partition filled up because of the space held by a large number of deleted files that filebeat still had open.

Severity
--------
Critical

Steps to Reproduce
------------------
Set up a large distributed cloud with stx-monitor applied and soak for a few days with some test activities such as deploying, managing/unmanaging and removing subclouds.

Expected Behavior
------------------
Service logs are saved to disk and rotated accordingly.

Actual Behavior
----------------
The logmgmt process was hogging CPU and no logs were flushed to disk. Log files were rotated rapidly with almost no content, and a filesystem critical alarm was generated.

The problem documented here (courtesy of Al Bailey)
https://www.elastic.co/guide/en/beats/filebeat/master/faq-deleted-files-are-not-freed.html
might be the cause of this issue
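
The mechanism behind that FAQ entry can be reproduced in a few lines. The sketch below is only an illustration on a POSIX system, not part of the original report: it deletes a temporary file while a descriptor is still open and shows that the data remains allocated until the descriptor is closed. The function name and the 1 MiB size are arbitrary choices made for the example.

```python
import os
import tempfile


def deleted_but_open_demo(size=1024 * 1024):
    """Return (link_count, bytes_still_held) for a file deleted while open."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * size)
        os.unlink(path)        # the name disappears from the directory
        info = os.fstat(fd)    # the inode is still alive behind the open fd
        return info.st_nlink, info.st_size
    finally:
        os.close(fd)           # only now can the kernel release the blocks


if __name__ == "__main__":
    links, held = deleted_but_open_demo()
    print(f"links={links} bytes still held={held}")
```

A process that keeps many such descriptors open on rotated or removed logs holds the space of all of them, which is consistent with the symptom described above.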

Reproducibility
---------------
Seen once

System Configuration
--------------------
IPv6 Distributed Cloud

Branch/Pull Time/Commit
-----------------------
Feb 22 master code

Last Pass
---------
N/A

Timestamp/Logs
--------------
As logs were not flushed to disk, there are
See the attached list of deleted files, produced by running the command "sudo lsof|grep deleted".
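
For a per-process summary rather than a raw listing, the same information can be read from /proc on Linux. The following is a rough sketch and not the tool used in this report: the function names are hypothetical, it relies on the kernel appending " (deleted)" to the link target of a descriptor whose file was removed, and it needs root to inspect other users' processes.

```python
import os
from collections import Counter


def deleted_open_files(proc_root="/proc"):
    """Yield (pid, command, target, size) for descriptors on deleted files."""
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            with open(os.path.join(proc_root, pid, "comm")) as f:
                command = f.read().strip()
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            link = os.path.join(fd_dir, fd)
            try:
                target = os.readlink(link)
            except OSError:
                continue
            if not target.endswith(" (deleted)"):
                continue
            try:
                size = os.stat(link).st_size  # follows the fd to the inode
            except OSError:
                size = 0
            yield int(pid), command, target, size


def summarize(entries):
    """Map command name to (open deleted files, total bytes held)."""
    counts, sizes = Counter(), Counter()
    for _pid, command, _target, size in entries:
        counts[command] += 1
        sizes[command] += size
    return {cmd: (counts[cmd], sizes[cmd]) for cmd in counts}


if __name__ == "__main__":
    for cmd, (count, total) in sorted(summarize(deleted_open_files()).items()):
        print(f"{cmd}: {count} deleted files, {total} bytes held")
```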

Test Activity
-------------
Evaluation

Workaround
----------
Kill the logmgmt process and delete the filebeat pods.

Tee Ngo (teewrs) wrote :
description: updated
Ghada Khalil (gkhalil) wrote :

stx.4.0 / high priority - stx-monitor resulting in running out of log space on distributed cloud

tags: added: stx.4.0 stx.distcloud stx.monitor
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Kevin Smith (kevin.smith.wrs)
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/713957
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=241ea2871b15965bd694895f796660f7f1fddbf3
Submitter: Zuul
Branch: master

commit 241ea2871b15965bd694895f796660f7f1fddbf3
Author: Tee Ngo <email address hidden>
Date: Thu Mar 19 13:54:15 2020 -0400

    Set time limit for filebeat open filehandlers

    In a large system, filebeat can harvest a large number of files
    and with the default file closing policies, many deleted files are
    not freed. Over time, this leads to /var/log partition running out
    of space, services not being able to flush their logs to disk and
    logmgmt process continuously rotating logs.

    This commit sets a default time limit for each open file harvester.
    This value can be adjusted as needed via user overrides.

    Closes-Bug: 1865924
    Change-Id: I9dbf9cb2128157834b937357dcc6c4945dc5d2f3
    Signed-off-by: Tee Ngo <email address hidden>
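
The idea of a per-harvester time limit can be sketched as a toy model. This is not Filebeat's implementation and not the code from the commit: the class, its parameters, and the injectable clock are all hypothetical, and the 300-second value in the usage note is an arbitrary example. Filebeat exposes a comparable `close_timeout` setting for its log input; whether the commit uses exactly that option is an assumption, since the commit text does not name it.

```python
import time


class Harvester:
    """Toy stand-in for a log harvester that holds one file open."""

    def __init__(self, path, close_timeout, clock=time.monotonic):
        self._fh = open(path, "rb")
        self._clock = clock
        # The handle is closed once this deadline passes, no matter
        # whether the file is still being written, renamed or deleted.
        self._deadline = clock() + close_timeout

    @property
    def closed(self):
        return self._fh.closed

    def poll(self):
        """Return any new data; close the handle after the time limit."""
        if self._fh.closed:
            return b""
        data = self._fh.read()
        if self._clock() >= self._deadline:
            self._fh.close()  # lets the kernel free a deleted file
        return data
```

With `Harvester(path, close_timeout=300)`, a file that is deleted while being harvested is released at most about 300 seconds after the harvester started, at the next `poll()` call, instead of staying open indefinitely. The trade-off is that data written after the handle is closed is only picked up if the file is discovered and opened again.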

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/716137

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)

Reviewed: https://review.opendev.org/716137
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=cb4cf4299c2ec10fb2eb03cdee3f6d78a6413089
Submitter: Zuul
Branch: f/centos8

commit 16477935845e1c27b4c9d31743e359b0aa94a948
Author: Steven Webster <email address hidden>
Date: Sat Mar 28 17:19:30 2020 -0400

    Fix SR-IOV runtime manifest apply

    When an SR-IOV interface is configured, the platform's
    network runtime manifest is applied in order to apply the virtual
    function (VF) config and restart the interface. This results in
    sysinv being able to determine and populate the puppet hieradata
    with the virtual function PCI addresses.

    A side effect of the network manifest apply is that potentially
    all platform interfaces may be brought down/up if it is determined
    that their configuration has changed. This will likely be the case
    for a system which configures SR-IOV interfaces before initial
    unlock.

    A few issues have been encountered because of this, with some
    services not behaving well when the interface they are communicating
    over suddenly goes down.

    This commit makes the SR-IOV VF configuration much more targeted
    so that only the operation of setting the desired number of VFs
    is performed.

    Closes-Bug: #1868584
    Depends-On: https://review.opendev.org/715669
    Change-Id: Ie162380d3732eb1b6e9c553362fe68cbc313ae2b
    Signed-off-by: Steven Webster <email address hidden>

commit 45c9fe2d3571574b9e0503af108fe7c1567007db
Author: Zhipeng Liu <email address hidden>
Date: Thu Mar 26 01:58:34 2020 +0800

    Add ipv6 support for novncproxy_base_url.

    For ipv6 address, we need url with below format
    [ip]:port

    Partial-Bug: 1859641

    Change-Id: I01a5cd92deb9e88c2d31bd1e16e5bce1e849fcc7
    Signed-off-by: Zhipeng Liu <email address hidden>

commit d119336b3a3b24d924e000277a37ab0b5f93aae1
Author: Andy Ning <email address hidden>
Date: Mon Mar 23 16:26:21 2020 -0400

    Fix timeout waiting for CA cert install during ansible replay

    During ansible bootstrap replay, the ssl_ca_complete_flag file is
    removed. It expects puppet platform::config::runtime manifest apply
    during system CA certificate install to re-generate it. So this commit
    updated conductor manager to run that puppet manifest even if the CA cert
    has already installed so that the ssl_ca_complete_flag file is created
    and makes ansible replay to continue.

    Change-Id: Ic9051fba9afe5d5a189e2be8c8c2960bdb0d20a4
    Closes-Bug: 1868585
    Signed-off-by: Andy Ning <email address hidden>

commit 24a533d800b2c57b84f1086593fe5f04f95fe906
Author: Zhipeng Liu <email address hidden>
Date: Fri Mar 20 23:10:31 2020 +0800

    Fix rabbitmq could not bind port to ipv6 address issue

    When we use Armada to deploy openstack service for ipv6, rabbitmq
    pod could not start listen on [::]:5672 and [::]:15672.
    For ipv6, we need an override for configuration file.

    Upstream patch link is:
    https://review.opendev.org/#/c/714027/

    Test pass for deploying rabbitmq service on both ipv...

tags: added: in-f-centos8