collectd cpu plugin does not always initialize

Bug #1855733 reported by Jim Gauld
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      Jim Gauld

Bug Description

Brief Description
-----------------
The collectd cpu plugin occasionally reports no values because a plugin initialization problem causes a traceback, after which the plugin stays suspended indefinitely. The problem is a timing issue between when the collectd plugin starts and when cgroups are created. The docker cgroup is often created much later, or not at all, and the cpu plugin does not handle this. Because the problem is timing dependent, the behaviour is likely hardware dependent.

A very simple, tested code change corrects this bug.

Severity
--------
Critical: The ability to alarm based on platform cpu usage is intermittently lost for the affected host.

Steps to Reproduce
------------------
Lock/unlock or reboot AIO host. Problem presents intermittently.

Expected Behavior
------------------
After the collectd cpu plugin starts, it should log output to /var/log/daemon.log and push values to collectd.

Actual Behavior
----------------
When the problem presents after reboot, collectd reports that the read-function of plugin 'python.cpu' failed. From that point the plugin is broken and stays that way. This bug is generic, but the underlying scenario with 'docker' would likely present on AIO or controller hosts.

Reproducibility
---------------
Intermittent but very frequent; seen many times on my QEMU dev environment.

System Configuration
--------------------
One node system, Two node system.

Branch/Pull Time/Commit
-----------------------
Issue seen in multiple recent loads since this CPU plugin was delivered to stx/monitoring.
This specific exception was missed with LP 1849511 :
 - Update collectd breakdown of platform cpu
 - Correct collectd cpu and memory plugin exceptions.

Last Pass
---------
Not found with specific test case.

Timestamp/Logs
--------------
When the problem presents after reboot, a traceback appears in daemon.log. At that point the plugin is broken.
2019-11-15T20:32:56.490 controller-0 collectd[102537]: info platform cpu usage plugin cputime initialization error
2019-11-15T20:32:56.490 controller-0 collectd[102537]: info Unhandled python exception in read callback: KeyError: 'pods'
2019-11-15T20:32:56.491 controller-0 collectd[102537]: info Traceback (most recent call last):
2019-11-15T20:32:56.491 controller-0 collectd[102537]: info File "/opt/collectd/extensions/python/cpu.py", line 482, in read_func
2019-11-15T20:32:56.491 controller-0 collectd[102537]: info update_cpu_data()
2019-11-15T20:32:56.491 controller-0 collectd[102537]: info File "/opt/collectd/extensions/python/cpu.py", line 330, in update_cpu_data
2019-11-15T20:32:56.491 controller-0 collectd[102537]: info if k in obj._t0_cpuacct[i]:
2019-11-15T20:32:56.491 controller-0 collectd[102537]: info KeyError: 'pods'
2019-11-15T20:32:56.491 controller-0 collectd[102537]: info read-function of plugin `python.cpu' failed. Will suspend it for 60.000 seconds.
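
The failure in the traceback can be reproduced outside collectd with a minimal sketch. Only the failing expression (`k in obj._t0_cpuacct[i]`) and the key 'pods' come from the log above; the dictionary layout, the function name `unguarded_delta`, and the sample values are assumptions made for illustration and are not taken from cpu.py.

```python
# Hypothetical snapshots of per-cgroup cpu accounting counters.
# Assumption: the 'pods' group did not exist when the plugin took its
# first sample (t0) but is present in a later sample (t1).
t0_cpuacct = {'platform': {'docker': 100}}
t1_cpuacct = {'platform': {'docker': 250}, 'pods': {'pod-a': 40}}


def unguarded_delta(t0, t1):
    """Mimic the failing lookup: index t0 by group without checking it exists."""
    delta = {}
    for i in t1.keys():
        delta[i] = {}
        for k, v in t1[i].items():
            # t0[i] raises KeyError when group i is missing from the first sample
            if k in t0[i]:
                delta[i][k] = v - t0[i][k]
    return delta


try:
    unguarded_delta(t0_cpuacct, t1_cpuacct)
    error = None
except KeyError as exc:
    error = exc

print(repr(error))
```

In this sketch the membership test itself is what fails: `t0[i]` is evaluated before `in` is applied, so a group that is missing from the first sample raises before any comparison happens.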

Test Activity
-------------
Developer testing of unrelated feature.

Code solution:
Update the cpuacct delta code in cpu.py to the following (e.g., a two-line change):
    # Calculate cpuacct delta for cgroup hierarchy, dropping transient cgroups
    cpuacct = {}
    for i in t1_cpuacct.keys():
        cpuacct[i] = {}
        for k, v in t1_cpuacct[i].items():
            if i in obj._t0_cpuacct and k in obj._t0_cpuacct[i]:
                cpuacct[i][k] = v - obj._t0_cpuacct[i][k]
            else:
                cpuacct[i][k] = v
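
The guarded loop above can be exercised in isolation. The following is a sketch only: it wraps the same logic in a hypothetical function `cpuacct_delta` that takes both snapshots as arguments instead of reading `obj._t0_cpuacct`, and the sample dictionaries are invented for illustration.

```python
def cpuacct_delta(t0, t1):
    """Per-cgroup delta between two snapshots, tolerating late-created cgroups.

    Hypothetical standalone wrapper around the loop proposed for cpu.py.
    """
    delta = {}
    for i in t1.keys():
        delta[i] = {}
        for k, v in t1[i].items():
            if i in t0 and k in t0[i]:
                # Both samples have this cgroup: report the difference.
                delta[i][k] = v - t0[i][k]
            else:
                # Group or cgroup absent from the first sample: use the raw value.
                delta[i][k] = v
    return delta


# Assumed sample data: 'pods' and 'platform/new' appear only in the second sample.
t0 = {'platform': {'docker': 100}}
t1 = {'platform': {'docker': 250, 'new': 5}, 'pods': {'pod-a': 40}}

result = cpuacct_delta(t0, t1)
print(result)
```

Checking `i in t0` before indexing short-circuits the second test, so a group that is missing from the first sample no longer raises and its counters are passed through unchanged.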

Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - seems to be tied with the qemu virtual env; workaround exists.

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Jim Gauld (jgauld)
tags: added: stx.4.0 stx.config
OpenStack Infra (hudson-openstack) wrote : Fix proposed to monitoring (master)

Fix proposed to branch: master
Review: https://review.opendev.org/738279

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to monitoring (master)

Reviewed: https://review.opendev.org/738279
Committed: https://git.openstack.org/cgit/starlingx/monitoring/commit/?id=1bdd9200bb83d89d5f70457a51788a6e043b43ec
Submitter: Zuul
Branch: master

commit 1bdd9200bb83d89d5f70457a51788a6e043b43ec
Author: Jim Gauld <email address hidden>
Date: Fri Jun 26 16:56:04 2020 -0400

    collectd cpu plugin does not always initialize

    This changes the initialization of per cgroup cpuacct timings
    to account for cgroup directories that may not be present at the time
    the plugin starts. As an example, the docker cgroup is created often
    much later or not at all.

    Change-Id: Iaf279e650cc16966b40c24a9f55f53fa4696a92b
    Closes-Bug: 1855733
    Signed-off-by: Jim Gauld <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released