StarlingX

Platform cpu alarm triggered for too short a duration

Bug #1848580 reported by Frank Miller on 2019-10-17

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	Medium	Eric MacDonald

Bug Description

Brief Description
-----------------
When performing a significant amount of activity (e.g. launching 100 pods, or locking/unlocking a large number of hosts), a platform CPU alarm is often raised and then cleared almost immediately.

| 2019-10-17T16:12:54.516468 | set | 100.101 | Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00% | host=compute-6 | critical |
| 2019-10-17T16:13:34.736361 | clear | 100.101 | Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00% | host=compute-6 | critical |

Severity
--------
Major

Steps to Reproduce
------------------
Launch a large number of pods (e.g. 100)

Expected Behavior
------------------
The platform CPU alarm should not be raised if the threshold is exceeded only briefly (e.g. for less than 2 minutes)

Actual Behavior
----------------
The platform CPU alarm is raised and then cleared shortly afterwards (e.g. ~30 seconds later)

Reproducibility
---------------
Reproducible

System Configuration
--------------------
All configs

Branch/Pull Time/Commit
-----------------------
Branch/build as of 2019-10-09_20-00-00

Last Pass
---------
unknown

Timestamp/Logs
--------------
n/a

Test Activity
-------------
System level testing

Tags:

Frank Miller (sensfan22) on 2019-10-17

Changed in starlingx:
assignee:	nobody → Eric MacDonald (rocksolidmtce)

Ghada Khalil (gkhalil) wrote on 2019-10-18:

stx.3.0 / medium priority - alarm seems to be triggered too soon; not working as intended.

tags:	added: stx.metal
Changed in starlingx:
importance:	Undecided → Medium
status:	New → Triaged
tags:	added: stx.3.0

Eric MacDonald (rocksolidmtce) wrote on 2019-10-21:

Going with a 3 minute debounce.

What alarm severity assertion behavior do you want to see if the values are bouncing between major and critical over that 3 minute period?

I'll go with major unless ALL the samples in that time frame are critical.

I assume you also want a debounce on the alarm deassertion: 3 minutes of no overage before the alarm is removed.
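
The severity rule proposed in this comment can be sketched roughly as follows. This is an illustration only, not the fm_notifier code: the function name `window_severity`, the string severity labels, and the choice to treat any ok sample in the window as "no alarm" are assumptions made for the example.

```python
def window_severity(samples):
    """Pick the severity to assert for one full debounce window.

    `samples` is a hypothetical list of per-reading severities
    ("ok", "major" or "critical") covering the 3 minute window.
    """
    # Assumption: an empty window or any ok reading means no alarm.
    if not samples or "ok" in samples:
        return "ok"
    # Critical only if ALL samples in the window are critical.
    if all(s == "critical" for s in samples):
        return "critical"
    # Otherwise the window bounced between major and critical.
    return "major"


print(window_severity(["critical", "major", "critical"]))
print(window_severity(["critical"] * 6))
```

A window that bounces between major and critical yields major; only an all-critical window yields critical.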

Eric MacDonald (rocksolidmtce) wrote on 2019-10-22:

Manage Alarm Debounce Algorithm
-------------------------------
General: Alarm state change at DEBOUNCE_THRESHOLD count.
Effectively 3 mins of persistent severity assertion.

Debounce Conditions:

1. Need >= severity notification for DEBOUNCE_THRESHOLD
consecutive counts to qualify for an alarm assertion.
2. One ok notification prior to alarm assertion will clear
all accumulating severity level debounce counts.

Debounce Actions:

Clear: Once alarm is asserted (condition 1 above met) ...
       Every ok reading subtracts 1 from each severity level.
       Every severity reading adds 1 to that severity count
       up to a max of DEBOUNCE_THRESHOLD.

Major: Add 1 to major debounce for every reading with a
       severity above ok until count reaches
       DEBOUNCE_THRESHOLD and then raise major alarm.
       Any ok reading before alarm assertion clears counts.

Critl: Add 1 to critical debounce for every critical
       severity notification reading until count reaches
       DEBOUNCE_THRESHOLD and then raise critical alarm.
       Any ok reading before alarm assertion clears counts.
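
A minimal sketch of the counting scheme described above, for illustration only. It is not the merged fm_notifier code. The class `Debouncer`, the helper `run`, the numeric value of `DEBOUNCE_THRESHOLD`, and the handling of a major reading with respect to the critical count (reset before a critical alarm is asserted, decay by 1 afterwards) are all assumptions; the description does not specify them.

```python
from dataclasses import dataclass

OK, MAJOR, CRITICAL = "ok", "major", "critical"

# Assumption: the real DEBOUNCE_THRESHOLD value is not given in this
# report; 6 is chosen here only for illustration.
DEBOUNCE_THRESHOLD = 6


@dataclass
class Debouncer:
    """Hypothetical per-resource debounce state (not the plugin's class)."""
    threshold: int = DEBOUNCE_THRESHOLD
    major: int = 0      # count of readings above ok
    critical: int = 0   # count of critical readings
    state: str = OK     # severity of the currently asserted alarm

    def update(self, reading: str) -> str:
        """Feed one severity reading; return the asserted alarm state."""
        if reading == OK:
            if self.state == OK:
                # An ok reading before assertion clears all counts.
                self.major = self.critical = 0
            else:
                # Once asserted, each ok subtracts 1 from each level.
                self.major = max(0, self.major - 1)
                self.critical = max(0, self.critical - 1)
        else:
            # Any reading above ok counts toward major, capped at threshold.
            self.major = min(self.threshold, self.major + 1)
            if reading == CRITICAL:
                self.critical = min(self.threshold, self.critical + 1)
            elif self.state == CRITICAL:
                # Assumption: a major reading decays an asserted critical.
                self.critical = max(0, self.critical - 1)
            else:
                # Assumption: critical readings must be consecutive.
                self.critical = 0
        self.state = self._next_state()
        return self.state

    def _next_state(self) -> str:
        if self.critical >= self.threshold:
            return CRITICAL
        if self.state == CRITICAL and self.critical > 0:
            return CRITICAL  # downgrade not yet debounced
        if self.major >= self.threshold:
            return MAJOR
        if self.state != OK and self.major > 0:
            return MAJOR     # clear not yet debounced
        return OK


def run(readings, threshold=DEBOUNCE_THRESHOLD):
    """Replay a list of readings and return the state after each one."""
    d = Debouncer(threshold=threshold)
    return [d.update(r) for r in readings]
```

With this sketch, a single critical reading followed by an ok reading (the scenario in this report) never raises an alarm, because the ok reading clears the accumulating counts before the threshold is reached.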

Eric MacDonald (rocksolidmtce) wrote on 2019-10-24:

The maintenance degrade notifier (mtce_notifier) is being merged with the alarm notifier (fm_notifier) to ensure straightforward and proper alarm/degrade accounting.

OpenStack Infra (hudson-openstack) wrote on 2019-10-24: Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/690791

Changed in starlingx:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2019-10-24: Fix proposed to monitoring (master)

Fix proposed to branch: master
Review: https://review.opendev.org/690794

OpenStack Infra (hudson-openstack) wrote on 2019-10-25: Fix merged to monitoring (master)

Reviewed: https://review.opendev.org/690794
Committed: https://git.openstack.org/cgit/starlingx/monitoring/commit/?id=b330b1bd5c2356ed210b01f1b562379a85ced5cf
Submitter: Zuul
Branch: master

commit b330b1bd5c2356ed210b01f1b562379a85ced5cf
Author: Eric MacDonald <email address hidden>
Date: Thu Oct 24 00:06:14 2019 -0400

Add alarm debounce support to collectd alarm notifier

    This update implements a 3 minute alarm debounce feature
    to the existing alarm notifier.

    To ensure proper alarm/degrade accounting the mtce degrade
    notifier was merged with the alarm notifier.

    This update changes the existing 'update_alarm' function
    to 'debounce' which returns True once the resource has
    been debounced and the alarm/degrade settings need to be
    updated with the current notification severity.

    Test Plan:

    PASS: Verify debounce from ok to major
    PASS: Verify debounce from ok to critical
    PASS: verify debounce from major to ok
    PASS: Verify debounce from major to critical
    PASS: verify debounce from critical to ok
    PASS: Verify debounce from critical to major
    PASS: Verify major to major alarm persists
    PASS: Verify critical to critical alarm persists
    PASS: Verify handling of major startup alarm that escalates to critical
    PASS: Verify handling of critical startup alarm that drops to major threshold
    PASS: Verify handling of critical startup alarm that drops below alarming threshold
    PASS: Verify clear of major alarmed fs over swact
    PASS: Verify clear of critical alarmed/degraded fs over swact
    PASS: Verify end to end degrade handling with single source
    PASS: Verify end to end degrade handling with multiple sources
    PASS: Verify end to end filesystem alarm/degrade management
    PASS: Verify end to end interface alarm/degrade management
    PASS: Verify debounce handling with random value/wait script loop

    Change-Id: Ibb9461ce027c5ab5accb64507c7141f10f0d1a88
    Partial-Bug: 1848580
    Signed-off-by: Eric MacDonald <email address hidden>

OpenStack Infra (hudson-openstack) wrote on 2019-10-25: Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/690791
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=aa50a1a9ef5f516145d93f931ffe666c851f4208
Submitter: Zuul
Branch: master

commit aa50a1a9ef5f516145d93f931ffe666c851f4208
Author: Eric MacDonald <email address hidden>
Date: Wed Oct 23 23:36:56 2019 -0400

Remove mtce_notifier from collectd configuration

    The alarm debounce support addition to the fm_notifier
    plugin required that the mtce degrade notifier be
    merged into the fm_notifier to ensure proper
    alarm/degrade accounting.

    This merge obsoletes the mtce_notifier.

    Unnecessary mtce port config is dropped.

    Test Plan:

    PASS: Verify collectd configuration after file removal

    Change-Id: I560c093faba5c8e9804d51ac625d5c71db662a36
    Partial-Bug: 1848580
    Depends-On: https://review.opendev.org/#/c/690794/
    Signed-off-by: Eric MacDonald <email address hidden>

Eric MacDonald (rocksolidmtce) on 2019-10-28

Changed in starlingx:
status:	In Progress → Fix Released

Duplicates of this bug

Bug #1848541
