Platform cpu alarm triggered for too short a duration

Bug #1848580 reported by Frank Miller
This bug affects 2 people
| Affects | Status | Importance | Assigned to | Milestone |
| StarlingX | Fix Released | Medium | Eric MacDonald | |

Bug Description

Brief Description
-----------------
When performing a significant amount of activity (e.g., launching 100 pods or locking/unlocking a large number of hosts), a platform CPU alarm is often raised and then cleared almost immediately.

| 2019-10-17T16:12:54.516468 | set | 100.101 | Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00% | host=compute-6 | critical |
| 2019-10-17T16:13:34.736361 | clear | 100.101 | Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00% | host=compute-6 | critical |
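
As a rough check on how long the alarm was held, the set and clear timestamps above can be subtracted directly. This is a small Python sketch; the variable names are illustrative only.

```python
from datetime import datetime

# Timestamps copied from the set and clear alarm events above.
set_time = datetime.fromisoformat("2019-10-17T16:12:54.516468")
clear_time = datetime.fromisoformat("2019-10-17T16:13:34.736361")

# Duration the alarm stayed asserted, in seconds.
alarm_duration = (clear_time - set_time).total_seconds()
print(f"alarm held for {alarm_duration:.1f} s")  # prints "alarm held for 40.2 s"
```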

Severity
--------
Major

Steps to Reproduce
------------------
Launch a large # of pods

Expected Behavior
------------------
Platform cpu alarm should not be seen if duration is brief (eg: < 2 minutes)

Actual Behavior
----------------
Platform cpu alarm is raised and then cleared shortly after (eg: ~30 seconds)

Reproducibility
---------------
Reproducible

System Configuration
--------------------
All configs

Branch/Pull Time/Commit
-----------------------
Branch/build as of 2019-10-09_20-00-00

Last Pass
---------
unknown

Timestamp/Logs
--------------
n/a

Test Activity
-------------
System level testing

Frank Miller (sensfan22)
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil) wrote :

stx.3.0 / medium priority - alarm seems to be triggered too soon; not working as intended.

tags: added: stx.metal
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.3.0
Eric MacDonald (rocksolidmtce) wrote :

Going with a 3 minute debounce.

What alarm severity assertion behavior do you want to see if the values are bouncing between major and critical over that 3 minute period?

I'll go with major unless ALL the samples in that time frame are critical.

I assume you also want a debounce on the alarm deassertion: 3 minutes of no overage before the alarm is removed.

Eric MacDonald (rocksolidmtce) wrote :

Manage Alarm Debounce Algorithm
-------------------------------
General: Alarm state change at DEBOUNCE_THRESHOLD count.
         Effectively 3 mins of persistent severity assertion.

Debounce Conditions:

1. Need notifications at or above a given severity for
   DEBOUNCE_THRESHOLD consecutive counts to qualify for an
   alarm assertion.
2. One ok notification prior to alarm assertion will clear
   all accumulating severity level debounce counts.

Debounce Actions:

Clear: Once alarm is asserted (condition 1 above met) ...
       Every ok reading subtracts 1 from each severity level
       Every severity reading adds 1 to that severity count
       up to a max of DEBOUNCE_THRESHOLD.

Major: Add 1 to major debounce for every reading with a
       severity above ok until count reaches
       DEBOUNCE_THRESHOLD and then raise major alarm.
       Any ok reading before alarm assertion clears counts

Critl: Add 1 to critical debounce for every critical
       severity notification reading until count reaches
       DEBOUNCE_THRESHOLD and then raise critical alarm.
       Any ok reading before alarm assertion clears counts
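
The rules above could be sketched in Python roughly as follows. This is an illustrative reading of the algorithm, not the merged fm_notifier code: the `Debouncer` class, the `replay` helper, and the threshold value of 6 are hypothetical (the actual DEBOUNCE_THRESHOLD value and sample interval are not stated in this report), and the way a major reading affects the critical count is an assumption based on the earlier "major unless ALL the samples in that time frame are critical" comment.

```python
from dataclasses import dataclass

OK, MAJOR, CRITICAL = "ok", "major", "critical"
DEBOUNCE_THRESHOLD = 6  # assumed value; the real count is not given in this report


@dataclass
class Debouncer:
    """Hypothetical per-resource debounce state."""
    threshold: int = DEBOUNCE_THRESHOLD
    major_count: int = 0
    critical_count: int = 0
    alarm: str = OK  # severity currently asserted

    def update(self, reading: str) -> bool:
        """Feed one reading; return True if the asserted alarm state changed."""
        previous = self.alarm
        if reading == OK:
            if self.alarm == OK:
                # Condition 2: an ok reading before assertion clears all counts.
                self.major_count = 0
                self.critical_count = 0
            else:
                # Clear action: each ok reading subtracts 1 from each level.
                self.major_count = max(0, self.major_count - 1)
                self.critical_count = max(0, self.critical_count - 1)
        else:
            # Every reading above ok counts toward the major debounce.
            self.major_count = min(self.threshold, self.major_count + 1)
            if reading == CRITICAL:
                self.critical_count = min(self.threshold, self.critical_count + 1)
            elif self.alarm == CRITICAL:
                # Assumption: a major reading debounces an asserted critical down.
                self.critical_count = max(0, self.critical_count - 1)
            else:
                # Assumption: critical needs an unbroken run of critical readings.
                self.critical_count = 0

        if self.critical_count >= self.threshold:
            self.alarm = CRITICAL
        elif self.alarm == CRITICAL and self.critical_count > 0:
            pass  # critical still asserted while its count drains
        elif self.major_count >= self.threshold:
            self.alarm = MAJOR
        elif self.alarm != OK and self.major_count > 0:
            self.alarm = MAJOR  # alarm not yet debounced all the way to ok
        else:
            self.alarm = OK
        return self.alarm != previous


def replay(readings, threshold=DEBOUNCE_THRESHOLD):
    """Return the asserted alarm state after each reading."""
    debouncer = Debouncer(threshold=threshold)
    states = []
    for reading in readings:
        debouncer.update(reading)
        states.append(debouncer.alarm)
    return states
```

Under these assumptions, a single critical spike followed by an ok reading never asserts an alarm, six consecutive critical readings assert critical, and a run that mixes major and critical readings asserts major.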

Eric MacDonald (rocksolidmtce) wrote :

The maintenance degrade notifier (mtce_notifier) is being merged with the alarm notifier (fm_notifier) to ensure straightforward and proper alarm/degrade accounting.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/690791

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to monitoring (master)

Fix proposed to branch: master
Review: https://review.opendev.org/690794

OpenStack Infra (hudson-openstack) wrote : Fix merged to monitoring (master)

Reviewed: https://review.opendev.org/690794
Committed: https://git.openstack.org/cgit/starlingx/monitoring/commit/?id=b330b1bd5c2356ed210b01f1b562379a85ced5cf
Submitter: Zuul
Branch: master

commit b330b1bd5c2356ed210b01f1b562379a85ced5cf
Author: Eric MacDonald <email address hidden>
Date: Thu Oct 24 00:06:14 2019 -0400

    Add alarm debounce support to collectd alarm notifier

    This update implements a 3 minute alarm debounce feature
    to the existing alarm notifier.

    To ensure proper alarm/degrade accounting the mtce degrade
    notifier was merged with the alarm notifier.

    This update changes the existing 'update_alarm' function
    to 'debounce' which returns True once the resource has
    been debounced and the alarm/degrade settings need to be
    updated with the current notification severity.

    Test Plan:

    PASS: Verify debounce from ok to major
    PASS: Verify debounce from ok to critical
    PASS: Verify debounce from major to ok
    PASS: Verify debounce from major to critical
    PASS: Verify debounce from critical to ok
    PASS: Verify debounce from critical to major
    PASS: Verify major to major alarm persists
    PASS: Verify critical to critical alarm persists
    PASS: Verify handling of major startup alarm that escalates to critical
    PASS: Verify handling of critical startup alarm that drops to major threshold
    PASS: Verify handling of critical startup alarm that drops below alarming threshold
    PASS: Verify clear of major alarmed fs over swact
    PASS: Verify clear of critical alarmed/degraded fs over swact
    PASS: Verify end to end degrade handling with single source
    PASS: Verify end to end degrade handling with multiple sources
    PASS: Verify end to end filesystem alarm/degrade management
    PASS: Verify end to end interface alarm/degrade management
    PASS: Verify debounce handling with random value/wait script loop

    Change-Id: Ibb9461ce027c5ab5accb64507c7141f10f0d1a88
    Partial-Bug: 1848580
    Signed-off-by: Eric MacDonald <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/690791
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=aa50a1a9ef5f516145d93f931ffe666c851f4208
Submitter: Zuul
Branch: master

commit aa50a1a9ef5f516145d93f931ffe666c851f4208
Author: Eric MacDonald <email address hidden>
Date: Wed Oct 23 23:36:56 2019 -0400

    Remove mtce_notifier from collectd configuration

    The alarm debounce support addition to the fm_notifier
    plugin required that the mtce degrade notifier be
    merged into the fm_notifier to ensure proper
    alarm/degrade accounting.

    This merge obsoletes the mtce_notifier.

    Unnecessary mtce port config is dropped.

    Test Plan:

    PASS: Verify collectd configuration after file removal

    Change-Id: I560c093faba5c8e9804d51ac625d5c71db662a36
    Partial-Bug: 1848580
    Depends-On: https://review.opendev.org/#/c/690794/
    Signed-off-by: Eric MacDonald <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released