App power-metrics: telegraf pod hangs after some time running on a stressed preempt-rt system

Bug #2038927 reported by Alyson Deives Pereira
This bug affects 1 person
Affects    Status        Importance  Assigned to            Milestone
StarlingX  Fix Released  Medium      Alyson Deives Pereira

Bug Description

Brief Description
-----------------
When a preempt-rt system is stressed by running pods on many isolated
cores (using stress-ng, for instance [1]), the telegraf pod hangs after
a long period of time and becomes unresponsive.

[1] https://github.com/ColinIanKing/stress-ng

Severity
--------
Critical: System/Feature is not usable due to the defect

Steps to Reproduce
------------------
- Isolate all application cores available
- Stress each isolated core with tasks (stress-ng for instance)

Expected Behavior
------------------
Telegraf should collect metrics without interruption.

Actual Behavior
----------------
Telegraf stops collecting metrics and the pod becomes unresponsive (kubectl cannot manage it).

Reproducibility
---------------
Reproducible

System Configuration
--------------------
AIO-SX

Branch/Pull Time/Commit
-----------------------
Current master branch as of October 9th 2023

Last Pass
---------
N/A

Timestamp/Logs
--------------
A crash dump analysis indicates that telegraf goroutines stall while
waiting for a kernel response to MSR read requests:

PID: 55638 TASK: ff466bf3a7ffde00 CPU: 0 COMMAND: "telegraf"
#0 [ff85787a26767c70] __schedule at ffffffffa30ae0c6
#1 [ff85787a26767d00] schedule at ffffffffa30ae7f7
#2 [ff85787a26767d18] schedule_timeout at ffffffffa30b15a4
#3 [ff85787a26767d70] wait_for_completion at ffffffffa30afbc4
#4 [ff85787a26767db8] rdmsr_safe_on_cpu at ffffffffa2afbda8
#5 [ff85787a26767e78] msr_read at ffffffffa2640e55
#6 [ff85787a26767ec8] vfs_read at ffffffffa2903208
#7 [ff85787a26767f00] __x64_sys_pread64 at ffffffffa2904ea1
#8 [ff85787a26767f40] do_syscall_64 at ffffffffa30a6b60
#9 [ff85787a26767f50] entry_SYSCALL_64_after_hwframe at ffffffffa3200099

Test Activity
-------------
Feature Testing

Workaround
-------------
There is no workaround. The system must be rebooted.

Changed in starlingx:
assignee: nobody → Alyson Deives Pereira (adeivesp)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to app-power-metrics (master)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to app-power-metrics (master)

Reviewed: https://review.opendev.org/c/starlingx/app-power-metrics/+/897839
Committed: https://opendev.org/starlingx/app-power-metrics/commit/1078ecbb7b6f8d7318e3e7c740d36dfdc5182985
Submitter: "Zuul (22348)"
Branch: master

commit 1078ecbb7b6f8d7318e3e7c740d36dfdc5182985
Author: Alyson Deives Pereira <email address hidden>
Date: Tue Oct 10 11:07:27 2023 -0300

    telegraf: Add MSR read timeout

    When a preempt-rt system is stressed by running pods on many isolated
    cores (using stress-ng, for instance [1]), the telegraf pod hangs after
    a long period of time and becomes unresponsive.

    A crash dump analysis indicates that telegraf goroutines stall while
    waiting for a kernel response to MSR read requests:

    PID: 55638 TASK: ff466bf3a7ffde00 CPU: 0 COMMAND: "telegraf"
    #0 [ff85787a26767c70] __schedule at ffffffffa30ae0c6
    #1 [ff85787a26767d00] schedule at ffffffffa30ae7f7
    #2 [ff85787a26767d18] schedule_timeout at ffffffffa30b15a4
    #3 [ff85787a26767d70] wait_for_completion at ffffffffa30afbc4
    #4 [ff85787a26767db8] rdmsr_safe_on_cpu at ffffffffa2afbda8
    #5 [ff85787a26767e78] msr_read at ffffffffa2640e55
    #6 [ff85787a26767ec8] vfs_read at ffffffffa2903208
    #7 [ff85787a26767f00] __x64_sys_pread64 at ffffffffa2904ea1
    #8 [ff85787a26767f40] do_syscall_64 at ffffffffa30a6b60
    #9 [ff85787a26767f50] entry_SYSCALL_64_after_hwframe at ffffffffa3200099

    This change adds a timeout (default value of 100ms) for MSR reads,
    which prevents telegraf from becoming unresponsive.

    [1] https://github.com/ColinIanKing/stress-ng

    NOTE: This issue was reported on upstream:
    https://github.com/influxdata/telegraf/issues/14088

    Closes-Bug: 2038927

    TEST PLAN (preempt-rt ISO):
    PASS: Build custom telegraf image with this change
    PASS: Override and apply power-metrics app with custom telegraf image
    PASS: Launch stress pods and confirm the telegraf pod is still stable
          after a long period of time.

    Change-Id: I145da09f5a967e219d0aa2e588d4323e8a2eb1e0
    Signed-off-by: Alyson Deives Pereira <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.apps