platform memory usage over 100% usage not alarmed

Bug #1940875 reported by Eric MacDonald
This bug affects 1 person
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Low         Eric MacDonald

Bug Description

Brief Description
-----------------
It is possible to oversubscribe platform memory.
If platform memory usage exceeds 100%, the platform memory usage alarm is not raised.
The sample's type attribute is set to 'percent', which only accepts values of 100 or less.
Values greater than 100 are reported as 'nan' (Not a Number).
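
The failure mode described above can be sketched in a few lines of Python. This is only an illustration, not collectd's implementation: the bounds table, the `sanitize` and `severity` helpers, and the 80/90 thresholds are all hypothetical names and values chosen for the example. It shows how a range-checked 'percent' sample becomes NaN, and why a NaN value never satisfies a threshold comparison.

```python
import math

# Hypothetical data-set bounds for illustration only; these are not
# read from collectd's types.db.
DATASET_BOUNDS = {
    "percent": (0.0, 100.0),
    "memory": (0.0, math.inf),
}


def sanitize(sample, ds_type):
    """Return the sample, or NaN if it falls outside the type's range."""
    low, high = DATASET_BOUNDS[ds_type]
    return sample if low <= sample <= high else math.nan


def severity(value, major=80.0, critical=90.0):
    """Map a usage value to an alarm severity (thresholds are assumed)."""
    if value >= critical:
        return "critical"
    if value >= major:
        return "major"
    return None  # also reached for NaN: every comparison with NaN is False


usage = 134.9  # value taken from the log line in this report

print(severity(sanitize(usage, "percent")))  # sample dropped, no alarm
print(severity(sanitize(usage, "memory")))   # sample kept, alarm raised
```

Under these assumptions the 'percent' path yields no alarm while the 'memory' path yields a critical one, which matches the actual and expected behavior reported below.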

Severity
--------
Minor

Steps to Reproduce
------------------
Run a stress-ng pod to oversubscribe platform memory.

Expected Behavior
------------------
A critical platform memory usage alarm is raised.

Actual Behavior
----------------
No alarm is raised at all.

Reproducibility
---------------
100%

System Configuration
--------------------
Any system

Branch/Pull Time/Commit
-----------------------
Aug 23, 2021

Last Pass
---------
Test escape.

Timestamp/Logs
--------------
<date>: controller-1 collectd[108183]: info platform memory usage: Usage: 134.9%; Reserved: 24500.0 MiB, Platform: 33057.2 MiB (Base: 32542.7, k8s-system: 514.5), k8s-addon: 0.0
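
The figures in this log line are consistent with usage being computed as platform memory divided by reserved memory. That formula is inferred from the numbers, not taken from the plugin source, and the variable names are hypothetical. A quick check:

```python
# Values copied from the log line above (MiB).
reserved_mib = 24500.0
base_mib = 32542.7
k8s_system_mib = 514.5

# Assumption: Platform = Base + k8s-system, and
# Usage = 100 * Platform / Reserved.
platform_mib = base_mib + k8s_system_mib
usage_percent = 100.0 * platform_mib / reserved_mib

print(f"Platform: {platform_mib:.1f} MiB")
print(f"Usage: {usage_percent:.1f}%")
```

Both results round to the logged values (33057.2 MiB and 134.9%), so the sample is well above the 100 limit of the 'percent' type.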

Test Activity
-------------
Feature Testing

Workaround
----------
none

Tags: stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to monitoring (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/monitoring/+/805741

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.metal
Changed in starlingx:
importance: Undecided → Low
OpenStack Infra (hudson-openstack) wrote : Fix merged to monitoring (master)

Reviewed: https://review.opendev.org/c/starlingx/monitoring/+/805741
Committed: https://opendev.org/starlingx/monitoring/commit/fcc8ddda66b507e747a6e5f32c2300b84e4f7ad6
Submitter: "Zuul (22348)"
Branch: master

commit fcc8ddda66b507e747a6e5f32c2300b84e4f7ad6
Author: Eric MacDonald <email address hidden>
Date: Mon Aug 23 19:37:13 2021 -0400

    Change platform memory usage instance type to 'memory'

    The platform memory data-set type is currently set to 'percent'.

    It is possible to over subscribe platform memory usage to more
    than 100%.

    Collectd drops sample values that are greater than 100 when its
    data-set type is 'percent'. Collectd considers a percent value
    greater than 100 to be an invalid value.

    This update changes the data-set type for platform memory usage
    from 'percent' to 'memory' to allow memory usage values greater
    than 100 to be handled.

    Test Plan:

    PASS: Verify that platform memory overage alarm value is reported
                 as the 'actual' value in the alarm Reason Text.
    PASS: Verify platform memory usage values that exceed the major
                 threshold are alarmed 'major'.
    PASS: Verify platform memory usage values that exceed the critical
                 threshold are alarmed 'critical', even if the
                 debounced value exceeds 100.
    PASS: Verify ridiculously large values are still alarmed and that
                 value is still included in the alarm Reason Text.

    Change-Id: I7189671e20c92656f820fda74c4871504d89e73a
    Closes-Bug: 1940875
    Signed-off-by: Eric MacDonald <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released