Collectd platform cpu breakdown has missing time
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Fix Released | Medium | Jim Gauld | | |
Bug Description
Brief Description
-----------------
There is a large discrepancy between the overall Platform cpu occupancy and the platform time reported as the sum of Base + k8s-system. This unaccounted time is exaggerated on some systems when there are many health probes, and is greatly exaggerated on Debian (vs CentOS). It is unclear where all the time on the Platform cores is going, which makes it hard to act on platform cputime alarms.
A better breakdown of the Platform cputime is required so that we understand where the time goes, and so that the parts add up to 100 percent in a way that makes sense. The breakdown must account for containerization overheads, systemd overheads, and kernel threads.
For system engineering tracking and debugging purposes, it is also desirable to have some level of cputime breakdown at the system.slice per-services level.
E.g., a proposed new breakdown might look something like this:
2022-08-
2022-08-
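The proposed log output above is truncated in this report, so the following is only a minimal sketch of the accounting idea: convert per-group cputime deltas into percent occupancy and report the remainder explicitly, so that the parts always add up to the overall figure. The function name, the group names, and the sample numbers are hypothetical and are not taken from the collectd plugin.

```python
def occupancy_breakdown(delta_ns, elapsed_s, n_cpus, overall_pct):
    """Return percent occupancy (avg per cpu) per group plus a residual.

    delta_ns: dict of group name -> cputime consumed in the interval (ns).
    overall_pct: overall occupancy measured independently for the same cpus.
    """
    capacity_ns = elapsed_s * n_cpus * 1e9  # total cpu time available
    breakdown = {name: 100.0 * ns / capacity_ns for name, ns in delta_ns.items()}
    # Whatever the named groups do not explain is reported explicitly,
    # so the parts always add up to the overall figure.
    breakdown["unaccounted"] = overall_pct - sum(breakdown.values())
    return breakdown


# Hypothetical sample: 30 s interval on 2 platform cpus.
sample = {
    "base": 12.0e9,
    "k8s-system": 3.0e9,
    "kernel-threads": 1.5e9,
}
result = occupancy_breakdown(sample, elapsed_s=30, n_cpus=2, overall_pct=32.0)
for name, pct in result.items():
    print(f"{name}: {pct:.1f}")
```

Reporting the residual as its own entry keeps any missing time visible instead of silently dropping it.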
Severity
--------
Major: in some cases we are not alarming on platform cpu even when a lot of time is actually consumed. Platform performance is hard to diagnose and debug when we don't understand where the time goes.
Steps to Reproduce
------------------
Fresh install Debian ISO. Monitor /var/log/
Expected Behavior
------------------
Expect the overall Usage reported by the collectd cpu plugin to match the Platform usage (e.g., in the following line both are 32.2%).
platform cpu usage plugin Usage: 32.2% (avg per cpu); cpus: 2, Platform: 32.2% (Base: 23.7, k8s-system: 5.4)
Actual Behavior
----------------
Huge unexplained deviation between overall Usage and Platform Usage for the same cpus, especially on Debian and especially with a large number of K8S health probes.
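To quantify the deviation from the logs, a small script can parse the plugin's log line and report both gaps: overall Usage vs Platform, and Platform vs the sum of its listed parts. This is a sketch only. The regular expression is written against the single sample line quoted in this report and is assumed, not confirmed, to match other variants of the plugin output; the function name is hypothetical.

```python
import re

# Sample line copied from the Expected Behavior section of this report.
LINE = ("platform cpu usage plugin Usage: 32.2% (avg per cpu); cpus: 2, "
        "Platform: 32.2% (Base: 23.7, k8s-system: 5.4)")

# Assumption: the log format matches the sample line above.
PATTERN = re.compile(
    r"Usage: (?P<usage>[\d.]+)% \(avg per cpu\); cpus: (?P<cpus>\d+), "
    r"Platform: (?P<platform>[\d.]+)% "
    r"\(Base: (?P<base>[\d.]+), k8s-system: (?P<k8s>[\d.]+)\)"
)


def cpu_gaps(line):
    """Return the two discrepancies in percentage points, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    usage = float(m["usage"])
    platform = float(m["platform"])
    parts = float(m["base"]) + float(m["k8s"])
    return {
        "usage_minus_platform": round(usage - platform, 1),
        "platform_minus_parts": round(platform - parts, 1),
    }


gaps = cpu_gaps(LINE)
print(gaps)
```

For the sample line, Usage and Platform agree, while Base + k8s-system sums to 29.1, leaving 3.1 percentage points of the Platform figure not attributed to either part.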
Reproducibility
---------------
100%
System Configuration
-------
AIO-SX, AIO-DX
Branch/Pull Time/Commit
-------
BUILD_DATE=
Last Pass
---------
Day one issue. The discrepancy has been exaggerated on Debian since Jun 8, 2022.
Timestamp/Logs
--------------
Example when the discrepancy is small:
/var/log/
2022-06-
Test Activity
-------------
System engineering.
Workaround
----------
none available
Changed in starlingx:
assignee: nobody → Jim Gauld (jgauld)
tags: added: stx.8.0 stx.monitor
Changed in starlingx:
importance: Undecided → Medium
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/monitoring/+/852653