Resource monitoring using InfluxDB can fill the root filesystem

Bug #1905581 reported by Eric MacDonald
This bug affects 2 people

Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Eric MacDonald

Bug Description

The open-source InfluxDB time-series database is used to store data samples for host resources such as memory, CPU, filesystem, and more.

The InfluxDB database currently resides in the root filesystem under /var/lib, and even though the retention policy is set to only 7 days, there are reported cases of its stored samples filling up the root filesystem.

Once the root fs is filled, other host issues begin to occur, including failure of the influxdb process itself, which makes it difficult to manually access the database to drop or delete samples and free occupied space.

There does not appear to be any way of restricting the total database size to a max threshold.
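For context, InfluxDB 1.x limits retention by time (DURATION), not by size, so the closest available control today is shortening the retention policy. A minimal sketch of building such an InfluxQL statement; the database and policy names are illustrative defaults (`autogen` is the usual default policy name), not taken from this system:

```python
def alter_retention_statement(db="collectd", policy="autogen", duration="3d"):
    """Build an InfluxQL statement that shortens a time-based retention policy.

    InfluxQL offers no size-based cap, only DURATION, so this is the nearest
    available knob. The db/policy names here are illustrative defaults.
    """
    return 'ALTER RETENTION POLICY "%s" ON "%s" DURATION %s' % (policy, db, duration)

print(alter_retention_statement())
# ALTER RETENTION POLICY "autogen" ON "collectd" DURATION 3d
```

The resulting statement would be issued through the influx CLI (e.g. `influx -database=collectd -execute '...'`).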

Resolution of this issue should consider the following ...

1. make the collectd database retention policy starlingx configurable
2. move the Influxdb database store location out of the root fs and optionally
   - into another resizable filesystem or
   - into its own resizable filesystem
3. create an audit that monitors the rootfs occupancy and acts on a threshold overage
   - if usage approaches a max threshold then log into influxdb and drop samples
   - the collectd fm_notifier plugin could serve as such audit
     - since it knows when the root filesystem is reaching a major and critical overage threshold
4. reach out to the influxdb support team asking how best to deal with this issue
   - is there a more recent version of influxdb that offers a solution
   - is there a fix plan in their backlog and, if so, what that fix might look like and when
   - can the influxdb process handle the fs-full condition more gracefully than failing outright
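Item 3 above could be sketched roughly as follows. This is an illustrative outline only, not the actual fm_notifier implementation; the threshold value, purge window, and function names are all assumptions:

```python
import shutil
from datetime import datetime, timedelta, timezone

# Illustrative threshold; the real fm filesystem alarm thresholds may differ.
ROOTFS_MAJOR_THRESHOLD = 80.0  # percent

def rootfs_usage_percent(path="/"):
    """Return the occupancy of the filesystem holding 'path' as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def purge_statement(days_to_keep=3, now=None):
    """Build an InfluxQL DELETE that drops samples older than 'days_to_keep'."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days_to_keep)
    return "DELETE WHERE time < '%s'" % cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")

def audit_rootfs(usage_percent, threshold=ROOTFS_MAJOR_THRESHOLD):
    """Return a purge statement when usage crosses the threshold, else None."""
    if usage_percent >= threshold:
        return purge_statement()
    return None
```

An audit like this would run periodically, and the returned statement would be executed against the collectd database via the influx CLI.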

Severity
--------
Major that can escalate to Critical if the root filesystem fills up

Steps to Reproduce
------------------
1. Install and provision a large system
2. Force the root filesystem to 70% occupancy or higher
3. Wait for issue to occur

Expected Behavior
------------------
Root filesystem should not fill up and the influxdb process should not fail

Actual Behavior
----------------
Root filesystem fills up and the influxdb process fails; any attempt by the StarlingX maintenance process monitor (pmond) to restart it also fails.
Major controller alarm is raised due to persistent influxdb process failure
Critical controller alarm is raised due to critical usage threshold overage
Controller is degraded due to critical usage threshold overage

Reproducibility
---------------
100% reproducible once the root filesystem fills up

System Configuration
--------------------
More likely to occur on larger systems because all collectd samples from all surrogate hosts are forwarded to the active controller.
Was seen on 2+2+6 and 2+2+18 storage systems.

Branch/Pull Time/Commit
-----------------------
starlingx/master.

Last Pass
---------
Test escape

Timestamp/Logs
--------------

https://files.starlingx.kube.cengn.ca/ under this LP number.

db30cdf535562ff94d1b5bd94f1b2b33 SELECT_NODES_20201124.191620.tar

[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+---------------------------------------------------------------+-------------------+----------+-------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+---------------------------------------------------------------+-------------------+----------+-------------+
| 200.006 | controller-0 is degraded due to the failure of its 'influxdb' | host=controller-0 | major | 2020-11-21T |

Test Activity
-------------
Feature Testing

Workaround
----------
Manually ...

1. Swact to in-service inactive controller

2. Log into the previously active controller and free some space in its root filesystem

3. Wait for the influxdb process alarm to clear
 - continue to free space until the influxdb process recovers
 - pmond will auto-recover the process on its own

4. Once the influxdb process recovers, log into the influx CLI and drop the collectd database
 > influx -database=collectd -precision=rfc3339
 > drop database collectd

5. Restart the collectd process so that the collectd database is recreated
 > sudo pmon-restart collectd
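Before dropping the database, it can help to confirm that the collectd store is actually what is consuming the space. A small sketch that sums a directory's on-disk file sizes (the example path mirrors the default InfluxDB data layout; adjust as needed):

```python
import os

def dir_size_bytes(path):
    """Sum the sizes of all regular files under 'path' (similar to du -sb)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished mid-walk; skip it
    return total

# Example usage against the default InfluxDB data directory:
# print(dir_size_bytes("/var/lib/influxdb/data/collectd"))
```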

Ghada Khalil (gkhalil)
tags: added: stx.metal
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.5.0 / high - issue results in filling up the root filesystem which can result in operations failing

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.5.0
Changed in starlingx:
importance: Medium → High
assignee: nobody → Eric MacDonald (rocksolidmtce)
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

The following updates reduce the amount of collectd sample data by ~80-85%

update : Avoid loading collectd's default plugins
review : https://review.opendev.org/c/starlingx/monitoring/+/772516
commit : https://opendev.org/starlingx/monitoring/commit/a2a2a88887c6f2761d6f6ec1f339cf949207c0b2

update : Change collectd plugin search path
review : https://review.opendev.org/c/starlingx/stx-puppet/+/773951
commit : https://opendev.org/starlingx/stx-puppet/commit/5a555ad98eb4fb978c7b553d463dfedf4d9b3a25

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Another workaround for this issue is to simply delete the collectd samples database on the active controller using the following commands.

sudo pmon-stop collectd
sudo pmon-stop influxdb
sudo rm -rf /var/lib/influxdb/data/collectd/"collectd samples"
sudo pmon-start influxdb
sudo pmon-start collectd

Of course the last week of sample data is lost, but the root fs space is recovered.

Currently assessing if additional measures need to be taken.

There is no plan to move the influxdb into its own filesystem as an update to address this issue.
That sort of fix would be implemented as an enhancement feature outside the scope of this report.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

The above two updates, which reduce the collectd samples database consumption by 80-85%, implement the fix for this issue.

Real-system measurements extrapolated to a 100-node system indicate that the collectd samples database could grow to ~90 GB in 1 week, compared to less than 10 GB over the same period with the above updates applied.

A week's worth of samples will not amount to more than 10 GB in a 100-node system and will therefore not put the rootfs at risk of filling up.

Changed in starlingx:
status: Triaged → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to monitoring (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/monitoring/+/792244

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to monitoring (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/monitoring/+/792244
Committed: https://opendev.org/starlingx/monitoring/commit/fdc0d099fb0d65cbf8f037fe0cc9ac8125410284
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 2ef5451f442482636db3c0c3641e8412821bd8c5
Author: Takamasa Takenaka <email address hidden>
Date: Thu Apr 22 12:28:37 2021 -0300

    Format 2 lines ntpq data into 1 lines

    The problem was logic expected one line data for
    ntpq result. But it was 2 lines for each ntp server
    entry. When peer server is selected, script checked
    refid if refid is reliable or not but it could not
    find because refid is in the following line.
    This fix formats 2 lines data into 1 line.

    The minor alarm "NTP cannot reach
    external time source; syncing with peer controller
    only" is removed because NTP does not prioritize
    external time source over peer.

    Closes-Bug: 1889101

    Signed-off-by: Takamasa Takenaka <email address hidden>
    Change-Id: Icc8316bb1a7041bf0351165c671ebf35b97fa3bc

commit d37490b81408ca53b1b8fd61992c6c9337dbcaed
Author: Eric MacDonald <email address hidden>
Date: Tue Apr 20 10:03:07 2021 -0400

    Add alarm audit to starlingx collectd fm notifier plugin

    This update adds common plugin support for alarm state auditing.
    The audit is able to detect and correct the following alarm
    state errors:

       Error Case                  Correction Action
       -----------------------     -----------------
       stale alarm                 delete alarm
       missing alarm               assert alarm
       alarm severity mismatch     refresh alarm

    The common audit is enabled for the fm_notifier plugin that supports
    alarm management for the following resources.

     - CPU with alarm id 100.101
     - Memory with alarm id 100.103
     - Filesystem with alarm id 100.104

    Other plugins may use this common audit in the future but only the
    above resources have the audit enabled for them by this update.

    Test Plan:

    PASS: Verify stale alarm detection/correction handling
    PASS: Verify missing alarm detection/correction handling
    PASS: Verify alarm severity mismatch detection/correction handling
    PASS: Verify hosts only audits its own specified alarms
    PASS: Verify success path of monitoring a single and mix
          of base and instance alarms of varying severity while
          such alarm conditions come and go
    PASS: Verify alarm audit of mix of base and instance alarms
          over a collectd process restart
    PASS: Verify audit handling of alarm that migrates from
          major to critical to major to clear
    PASS: Verify audit handling transition between alarm and
          no alarm conditions
    PASS: Verify soak of random cpu, memory and filesystem
          overage alarm assertions and clears that also involve
          manual alarm deletions, assertions and severity changes
          that exercise new audit features

    Regression:

    PASS: Verify alarm and audit handling over Swact with mounted
          filesystem that has active alarm
  ...


tags: added: in-f-centos8