collectd cpu plugin does not always initialize

Bug #1855733 reported by Jim Gauld on 2019-12-09
Affects: StarlingX
Importance: Medium
Assigned to: Jim Gauld

Bug Description

Brief Description
-----------------
The collectd cpu plugin occasionally does not report any values because a plugin initialization problem causes a traceback. The plugin then stays suspended indefinitely. The problem is a timing issue between when the collectd plugin starts and when the cgroups are created: the docker cgroup is often created much later, or not at all, and the cpu plugin does not handle this. Since this is timing dependent, the behaviour is likely hardware dependent.

A very simple, tested code change corrects this bug.

Severity
--------
Critical: Intermittently lose ability to alarm based on platform cpu usage for the affected host.

Steps to Reproduce
------------------
Lock/unlock or reboot AIO host. Problem presents intermittently.

Expected Behavior
------------------
After collectd cpu plugin starts, it should log output to /var/log/daemon.log, and push to collectd.

Actual Behavior
----------------
When the problem presents after reboot, collectd reports that the read-function of plugin 'python.cpu' failed. From that point the plugin is broken and stays that way. This bug is generic, but the underlying scenario with 'docker' would likely present on AIO or controller.

Reproducibility
---------------
Intermittent, seen many times on my QEMU dev environment, very frequent.

System Configuration
--------------------
One node system, Two node system.

Branch/Pull Time/Commit
-----------------------
Issue seen in multiple recent loads since this CPU plugin was delivered to stx/monitoring.
This specific exception was missed with LP 1849511 :
 - Update collectd breakdown of platform cpu
 - Correct collectd cpu and memory plugin exceptions.

Last Pass
---------
Not found with specific test case.

Timestamp/Logs
--------------
When the problem presents after reboot, a traceback appears in daemon.log. From that point the plugin is broken.
2019-11-15T20:32:56.490 controller-0 collectd[102537]: info platform cpu usage plugin cputime initialization error
2019-11-15T20:32:56.490 controller-0 collectd[102537]: info Unhandled python exception in read callback: KeyError: 'pods'
2019-11-15T20:32:56.491 controller-0 collectd[102537]: info Traceback (most recent call last):
2019-11-15T20:32:56.491 controller-0 collectd[102537]: info File "/opt/collectd/extensions/python/cpu.py", line 482, in read_func
2019-11-15T20:32:56.491 controller-0 collectd[102537]: info update_cpu_data()
2019-11-15T20:32:56.491 controller-0 collectd[102537]: info File "/opt/collectd/extensions/python/cpu.py", line 330, in update_cpu_data
2019-11-15T20:32:56.491 controller-0 collectd[102537]: info if k in obj._t0_cpuacct[i]:
2019-11-15T20:32:56.491 controller-0 collectd[102537]: info KeyError: 'pods'
2019-11-15T20:32:56.491 controller-0 collectd[102537]: info read-function of plugin `python.cpu' failed. Will suspend it for 60.000 seconds.

Test Activity
-------------
Developer testing of unrelated feature.

Code solution:
Update the cpu.py cpuacct delta code to this (a 2-line change):
    # Calculate cpuacct delta for cgroup hierarchy, dropping transient cgroups
    cpuacct = {}
    for i in t1_cpuacct.keys():
        cpuacct[i] = {}
        for k, v in t1_cpuacct[i].items():
            if i in obj._t0_cpuacct and k in obj._t0_cpuacct[i]:
                cpuacct[i][k] = v - obj._t0_cpuacct[i][k]
            else:
                cpuacct[i][k] = v
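
The effect of the added guard can be checked in isolation with a minimal sketch like the one below. It assumes the two samples are plain nested dicts of the form {group: {cgroup: cputime}}. The function name cpuacct_delta, the sample group and cgroup names, and the values are all hypothetical and chosen only for illustration; in the plugin the earlier sample lives on obj._t0_cpuacct.

```python
def cpuacct_delta(t0_cpuacct, t1_cpuacct):
    """Return per-cgroup cputime deltas between two samples.

    A group or cgroup that is missing from the earlier sample (for
    example, because it was created after the plugin started) is
    reported with its raw t1 value instead of raising KeyError.
    """
    cpuacct = {}
    for i in t1_cpuacct:
        cpuacct[i] = {}
        for k, v in t1_cpuacct[i].items():
            # Check the outer key first so a missing group is tolerated.
            if i in t0_cpuacct and k in t0_cpuacct[i]:
                cpuacct[i][k] = v - t0_cpuacct[i][k]
            else:
                cpuacct[i][k] = v
    return cpuacct


# Hypothetical samples: the 'pods' group does not exist yet at t0.
t0 = {"overall": {"docker": 100}}
t1 = {"overall": {"docker": 150, "k8s-infra": 20}, "pods": {"pod-a": 7}}

# The unguarded lookup from the traceback fails on the missing group.
try:
    "pod-a" in t0["pods"]
    old_check_failed = False
except KeyError:
    old_check_failed = True

delta = cpuacct_delta(t0, t1)
print(old_check_failed, delta)
```

With these inputs the unguarded lookup raises KeyError: 'pods', matching the log above, while the guarded version returns a delta for 'docker' and passes through the raw values for the entries that were absent at t0.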

Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - seems to be tied with the qemu virtual env; workaround exists.

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Jim Gauld (jgauld)
tags: added: stx.4.0 stx.config