App power-metrics: telegraf pod hangs after some time running on a stressed preempt-rt system
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
StarlingX | Fix Released | Medium | Alyson Deives Pereira | |
Bug Description
Brief Description
-----------------
When a preempt-rt system is stressed by running pods on many isolated
cores (using stress-ng, for instance [1]), the telegraf pod hangs after
a long period of time and becomes unresponsive.
[1] https:/
Severity
--------
Critical: System/Feature is not usable due to the defect
Steps to Reproduce
------------------
- Isolate all application cores available
- Stress each isolated core with tasks (stress-ng for instance)
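The steps above could be scripted roughly as follows. This is only a sketch: the report does not say how stress-ng was invoked, so the `taskset -c <cpu> stress-ng --cpu 1` command line and the helper names `parse_cpu_list` and `stress_commands` are assumptions for illustration. On a StarlingX system the load would more likely come from pods pinned to the isolated cores. The input string is in the CPU-list format that the kernel reports in /sys/devices/system/cpu/isolated.

```python
from typing import List


def parse_cpu_list(text: str) -> List[int]:
    """Expand a kernel CPU list such as "2-5,8" into [2, 3, 4, 5, 8]."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty input means no isolated CPUs
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def stress_commands(isolated: str) -> List[List[str]]:
    """Build one pinned stress-ng command per isolated CPU.

    The commands are only constructed, not executed. The exact
    stress-ng options are an assumption, not taken from the report.
    """
    return [
        ["taskset", "-c", str(cpu), "stress-ng", "--cpu", "1"]
        for cpu in parse_cpu_list(isolated)
    ]
```

Each returned command could then be started with `subprocess.Popen` so that every isolated core carries one busy worker.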
Expected Behavior
------------------
Telegraf should collect metrics without interruption.
Actual Behavior
----------------
Telegraf stops collecting metrics and the pod becomes unresponsive (kubectl cannot manage it).
Reproducibility
---------------
Reproducible
System Configuration
-------
AIO-SX
Branch/Pull Time/Commit
-------
Current master branch as of October 9th 2023
Last Pass
---------
N/A
Timestamp/Logs
--------------
A crash dump analysis indicates that telegraf goroutines stall while
waiting for a kernel response to MSR read requests:
PID: 55638 TASK: ff466bf3a7ffde00 CPU: 0 COMMAND: "telegraf"
#0 [ff85787a26767c70] __schedule at ffffffffa30ae0c6
#1 [ff85787a26767d00] schedule at ffffffffa30ae7f7
#2 [ff85787a26767d18] schedule_timeout at ffffffffa30b15a4
#3 [ff85787a26767d70] wait_for_completion at ffffffffa30afbc4
#4 [ff85787a26767db8] rdmsr_safe_on_cpu at ffffffffa2afbda8
#5 [ff85787a26767e78] msr_read at ffffffffa2640e55
#6 [ff85787a26767ec8] vfs_read at ffffffffa2903208
#7 [ff85787a26767f00] __x64_sys_pread64 at ffffffffa2904ea1
#8 [ff85787a26767f40] do_syscall_64 at ffffffffa30a6b60
#9 [ff85787a26767f50] entry_SYSCALL_
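For context, the trace corresponds to a positional read on the per-CPU MSR device: `pread64` enters `msr_read`, which asks the target CPU to perform the read and waits for it to complete (the `wait_for_completion` frame). A minimal sketch of such a read from user space is shown below. It assumes the standard `/dev/cpu/<n>/msr` device provided by the Linux msr driver; the function name `read_msr` and the `dev_root` parameter are hypothetical, and this is not telegraf's actual code (telegraf is written in Go).

```python
import os
import struct


def read_msr(cpu: int, register: int, dev_root: str = "/dev/cpu") -> int:
    """Read one 64-bit MSR value via pread(2) on the per-CPU msr device.

    The file offset selects the MSR address. If the kernel never
    completes the request, this call blocks inside the syscall,
    as in the trace above.
    """
    path = os.path.join(dev_root, str(cpu), "msr")
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.pread(fd, 8, register)
    finally:
        os.close(fd)
    if len(data) != 8:
        raise OSError(f"short MSR read from {path}: {len(data)} bytes")
    # MSR values are returned as a little-endian unsigned 64-bit integer.
    return struct.unpack("<Q", data)[0]
```

On a real system this needs the msr kernel module loaded and sufficient privileges; `dev_root` exists only so the function can be exercised against an ordinary file.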
Test Activity
-------------
Feature Testing
Workaround
-------------
There is no workaround. The system must be rebooted.
Changed in starlingx: | |
assignee: | nobody → Alyson Deives Pereira (adeivesp) |
Changed in starlingx: | |
importance: | Undecided → Medium |
tags: | added: stx.9.0 stx.apps |
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/app-power-metrics/+/897839