Resource monitoring using Influxdb can fill root filesystem
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | High | Eric MacDonald |
Bug Description
The open source Influx time series database is used to store data samples for host resources such as memory, CPU, filesystem and more.
The Influx database currently resides in the root filesystem under /var/lib. Even though the retention policy is set to only 7 days, there are reported cases of its stored samples filling up the root filesystem.
Once the root fs is full, other host issues begin to occur, including failure of the influxdb process itself. That failure makes it difficult to manually access the database to drop or delete samples and free the occupied space.
There does not appear to be any way of restricting the total database size to a max threshold.
Resolution of this issue should consider the following ...
1. make the collectd database retention policy starlingx configurable
2. move the Influxdb database store location out of the root fs and optionally
- into another resizable filesystem or
- into its own resizable filesystem
3. create an audit that monitors the rootfs occupancy and acts on a threshold overage
- if usage approaches a max threshold then log into influxdb and drop samples
- the collectd fm_notifier plugin could serve as such an audit, since it knows when the root filesystem is reaching a major or critical overage threshold
4. reach out to the influxdb support team asking how best to deal with this issue
- is there a more recent version of influxdb that offers a solution
- is there a fix plan in their backlog and if so what that fix might look like and when
- can the influxdb process deal with the fs full issue more gracefully than by failing the process
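The audit proposed in item 3 could look roughly like the following Python sketch. It only shows the threshold check. The threshold values and function names are hypothetical; the report does not state the values used by the collectd fm_notifier plugin, and this is not the StarlingX implementation.

```python
import shutil

# Hypothetical thresholds (percent of filesystem used). The report does not
# give the actual major/critical values used by collectd fm_notifier.
MAJOR_THRESHOLD = 80.0
CRITICAL_THRESHOLD = 90.0


def percent_used(total_bytes, used_bytes):
    """Return filesystem occupancy as a percentage of total size."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def classify(percent, major=MAJOR_THRESHOLD, critical=CRITICAL_THRESHOLD):
    """Map an occupancy percentage to 'ok', 'major' or 'critical'."""
    if percent >= critical:
        return "critical"
    if percent >= major:
        return "major"
    return "ok"


def audit_rootfs(path="/"):
    """Measure occupancy of the filesystem holding `path` and classify it."""
    usage = shutil.disk_usage(path)
    pct = percent_used(usage.total, usage.used)
    return pct, classify(pct)


if __name__ == "__main__":
    pct, severity = audit_rootfs()
    print(f"rootfs usage {pct:.1f}% -> {severity}")
    # A real audit would drop old samples here when severity is not "ok".
```

Keeping the classification separate from the measurement lets the thresholds be tested without touching a real filesystem.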
Severity
--------
Major that can escalate to Critical if the root filesystem fills up
Steps to Reproduce
------------------
1. Install and provision a large system
2. Force the root filesystem to 70% occupancy or higher
3. Wait for issue to occur
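For step 2, the amount of filler data needed can be estimated with a small calculation. The Python sketch below is only an approximation: the helper name is hypothetical, and it ignores filesystem reserved blocks and any data written while the test runs.

```python
import math
import shutil


def bytes_to_reach(total_bytes, used_bytes, target_percent):
    """Return how many extra bytes must be written to reach target_percent."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    target_used = math.ceil(total_bytes * target_percent / 100.0)
    # Already at or above the target: nothing more to write.
    return max(0, target_used - used_bytes)


if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    needed = bytes_to_reach(usage.total, usage.used, 70)
    print(f"write about {needed} bytes to reach 70% occupancy")
```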
Expected Behavior
------------------
Root filesystem should not fill up and the influxdb process should not fail
Actual Behavior
----------------
Root filesystem fills up and the influxdb process fails. Any attempt by the starlingx maintenance process monitor (pmond) to restart it also fails.
Major controller alarm is raised due to persistent influxdb process failure
Critical controller alarm is raised due to critical usage threshold overage
Controller is degraded due to critical usage threshold overage
Reproducibility
---------------
100% reproducible once the root filesystem fills up
System Configuration
-------
More likely to occur on larger systems due to all collectd samples from all surrogate hosts being forwarded to the active controller.
Seen on 2+2+6 and 2+2+18 storage systems.
Branch/Pull Time/Commit
-------
starlingx/master.
Last Pass
---------
Test escape
Timestamp/Logs
--------------
https:/
db30cdf535562ff
sysadmin@
+------
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+------
| 200.006 | controller-0 is degraded due to the failure of its 'influxdb' | host=controller-0 | major | 2020-11-21T |
Test Activity
-------------
Feature Testing
Workaround
----------
Manually ...
1. Swact to the in-service inactive controller
2. Log into the previously active controller and free some space in its root filesystem
3. Wait for the influxdb process alarm to clear
- continue to free space till the influxdb process recovers
- pmond will auto recover the process on its own
4. Once the influxdb process recovers, log into the influxdb cli and drop the collectd database
> influx -database=collectd -precision=rfc3339
> drop database collectd
5. Restart the collectd process so that the collectd database is recreated
> sudo pmon-restart collectd
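Steps 4 and 5 could be scripted along the lines of the Python sketch below. It assumes the InfluxDB 1.x `influx` CLI, whose `-execute` flag runs a single statement non-interactively. The function names are hypothetical, and the script defaults to a dry run that only prints the commands.

```python
import subprocess

DATABASE = "collectd"  # database name taken from the workaround steps


def drop_database_cmd(database=DATABASE):
    """Build the influx CLI invocation that drops `database`."""
    return ["influx", "-execute", f'DROP DATABASE "{database}"']


def restart_collectd_cmd():
    """Build the pmon-restart invocation from step 5."""
    return ["sudo", "pmon-restart", "collectd"]


def run_workaround(dry_run=True):
    """Print the commands, or run them in order when dry_run is False."""
    commands = [drop_database_cmd(), restart_collectd_cmd()]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the sequence if a command fails.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    run_workaround(dry_run=True)
```

Run it only after the influxdb process has recovered, since the drop needs a running database.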
tags: added: stx.metal
stx.5.0 / high - issue results in filling up the root filesystem which can result in operations failing