collectd cpu plugin does not always initialize
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Jim Gauld |
Bug Description
Brief Description
-----------------
The collectd cpu plugin occasionally reports no values because a plugin initialization problem causes a traceback, after which the plugin stays suspended indefinitely. The problem is a timing issue between when the collectd plugin starts and when the cgroups are created: the docker cgroup is often created much later, or not at all, and the cpu plugin does not handle this. Since this is timing dependent, the behaviour is likely hardware dependent.
A very simple code change that corrects this bug has been tested.
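The race described above can be illustrated with a minimal sketch. It assumes a cgroup v1 style layout in which each group directory exposes a `cpuacct.usage` file; the function name `read_cpuacct_usage` and the default root path are hypothetical and are not taken from cpu.py.

```python
import os


def read_cpuacct_usage(groups, root="/sys/fs/cgroup/cpuacct"):
    """Return {group: usage} for the cgroups that exist right now.

    Hypothetical helper: groups whose cpuacct.usage file is missing or
    unreadable (for example, docker not created yet) are skipped instead
    of raising an exception.
    """
    usage = {}
    for group in groups:
        path = os.path.join(root, group, "cpuacct.usage")
        try:
            with open(path) as f:
                usage[group] = int(f.read().strip())
        except (IOError, OSError, ValueError):
            # The cgroup is absent or its counter is unreadable: leave it
            # out of this sample so the caller can retry on the next read.
            continue
    return usage
```

With this shape, a sample taken before the docker cgroup exists simply lacks a `docker` entry, and any later code that compares two samples has to tolerate keys that appear in only one of them.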
Severity
--------
Critical: Intermittently lose ability to alarm based on platform cpu usage for the affected host.
Steps to Reproduce
------------------
Lock/unlock or reboot AIO host. Problem presents intermittently.
Expected Behavior
------------------
After collectd cpu plugin starts, it should log output to /var/log/
Actual Behavior
----------------
When the problem presents after a reboot, collectd reports that the read-function of plugin 'python.cpu' failed. From that point on the plugin is broken and stays that way. This bug is generic, but the underlying scenario with 'docker' would most likely present on AIO or controller hosts.
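One generic way to keep a single bad sample from surfacing as a failed read is to guard the read callback. The following is a sketch only: the wrapper name `guarded_read` and the logger name are hypothetical, and this is not necessarily how cpu.py handles exceptions.

```python
import functools
import logging


def guarded_read(func, log=None):
    """Wrap a read callback so unexpected exceptions are logged, not raised.

    Hypothetical helper: returns None when the wrapped callback fails, so
    the caller sees a skipped sample rather than an exception.
    """
    log = log or logging.getLogger("cpu-plugin-sketch")

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Record the traceback for debugging, then skip this sample.
            log.exception("read callback failed; skipping this sample")
            return None

    return wrapper
```

A guard like this hides the symptom rather than the cause, so it complements, and does not replace, handling the missing cgroup in the delta calculation itself.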
Reproducibility
---------------
Intermittent but very frequent; seen many times in my QEMU dev environment.
System Configuration
-------
One node system, Two node system.
Branch/Pull Time/Commit
-------
Issue seen in multiple recent loads since this CPU plugin was delivered to stx/monitoring.
This specific exception was missed with LP 1849511:
- Update collectd breakdown of platform cpu
- Correct collectd cpu and memory plugin exceptions.
Last Pass
---------
Not found with specific test case.
Timestamp/Logs
--------------
When the problem presents after a reboot, there is a traceback in daemon.log. At that point the plugin is broken.
2019-11-
2019-11-
2019-11-
2019-11-
2019-11-
2019-11-
2019-11-
2019-11-
2019-11-
Test Activity
-------------
Developer testing of unrelated feature.
Code solution:
Update the cpuacct delta code in cpu.py to the following (a two-line change):
# Calculate cpuacct delta for cgroup hierarchy, dropping transient cgroups
cpuacct = {}
for i in t1_cpuacct.keys():
    cpuacct[i] = {}
    for k, v in t1_cpuacct[
        if i in obj._t0_cpuacct and k in obj._t0_cpuacct[i]:
        else:
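The snippet above is cut off, so the following is only a sketch of how such a guarded delta calculation might be completed, not the delivered fix. The function name `cpuacct_delta` is hypothetical, and the choice to fall back to the raw t1 counter when no t0 baseline exists is an assumption.

```python
def cpuacct_delta(t0_cpuacct, t1_cpuacct):
    """Compute per-cgroup usage deltas between two samples.

    Both arguments are nested dicts of the form {group: {name: counter}}.
    Entries present only in the newer sample (for example, a docker cgroup
    created after the first sample) do not raise KeyError.
    """
    cpuacct = {}
    for i in t1_cpuacct.keys():
        cpuacct[i] = {}
        for k, v in t1_cpuacct[i].items():
            if i in t0_cpuacct and k in t0_cpuacct[i]:
                # Baseline exists: report the change since the last sample.
                cpuacct[i][k] = v - t0_cpuacct[i][k]
            else:
                # ASSUMPTION: no baseline yet, so use the raw counter.
                cpuacct[i][k] = v
    return cpuacct


# Example: 'docker' and 'platform/b' appear only in the second sample.
t0 = {"platform": {"a": 10}}
t1 = {"platform": {"a": 15, "b": 7}, "docker": {"c": 3}}
delta = cpuacct_delta(t0, t1)
```

The membership test before the subtraction is what prevents the KeyError when a cgroup shows up between two reads; what to report for such late arrivals (raw counter, zero, or nothing) is a design choice this sketch does not settle.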
stx.4.0 / medium priority - seems to be tied with the qemu virtual env; workaround exists.