StarlingX

Bug #2038927
Comment #2

OpenStack Infra (hudson-openstack) wrote on 2023-10-10: Fix merged to app-power-metrics (master)

Reviewed: https://review.opendev.org/c/starlingx/app-power-metrics/+/897839
Committed: https://opendev.org/starlingx/app-power-metrics/commit/1078ecbb7b6f8d7318e3e7c740d36dfdc5182985
Submitter: "Zuul (22348)"
Branch: master

commit 1078ecbb7b6f8d7318e3e7c740d36dfdc5182985
Author: Alyson Deives Pereira <email address hidden>
Date: Tue Oct 10 11:07:27 2023 -0300

telegraf: Add MSR read timeout

When a preempt-rt system is stressed by running pods on many isolated
cores (for instance, using stress-ng [1]), the telegraf pod hangs after
a long period of time and becomes unresponsive.

A crash dump analysis indicates that telegraf goroutines stall while
waiting for a kernel response to MSR read requests:

    PID: 55638 TASK: ff466bf3a7ffde00 CPU: 0 COMMAND: "telegraf"
    #0 [ff85787a26767c70] __schedule at ffffffffa30ae0c6
    #1 [ff85787a26767d00] schedule at ffffffffa30ae7f7
    #2 [ff85787a26767d18] schedule_timeout at ffffffffa30b15a4
    #3 [ff85787a26767d70] wait_for_completion at ffffffffa30afbc4
    #4 [ff85787a26767db8] rdmsr_safe_on_cpu at ffffffffa2afbda8
    #5 [ff85787a26767e78] msr_read at ffffffffa2640e55
    #6 [ff85787a26767ec8] vfs_read at ffffffffa2903208
    #7 [ff85787a26767f00] __x64_sys_pread64 at ffffffffa2904ea1
    #8 [ff85787a26767f40] do_syscall_64 at ffffffffa30a6b60
    #9 [ff85787a26767f50] entry_SYSCALL_64_after_hwframe at ffffffffa3200099

This change adds a timeout (default value of 100ms) to MSR reads,
which prevents telegraf from becoming unresponsive.

[1] https://github.com/ColinIanKing/stress-ng

NOTE: This issue was reported upstream:
https://github.com/influxdata/telegraf/issues/14088

Closes-Bug: 2038927

    TEST PLAN (preempt-rt ISO):
    PASS: Build custom telegraf image with this change
    PASS: Override and apply power-metrics app with custom telegraf image
    PASS: Launch stress pods and confirm the telegraf pod is still stable
          after a long period of time.

Change-Id: I145da09f5a967e219d0aa2e588d4323e8a2eb1e0
Signed-off-by: Alyson Deives Pereira <email address hidden>
