Host can get stuck degraded from critical memory alarm

Bug #1903731 reported by Eric MacDonald
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald
Milestone: (none listed)

Bug Description

The maintenance system provides collectd plugins for resource utilization monitoring and an alarm/degrade management plugin for failure notification.

The 'memory' resource plugin reports memory utilization as samples. The alarm/degrade plugin monitors those samples and manages memory alarm assertion and clear, as well as host degrade state assertion and clear.

A case has been observed where a host gets stuck in the degraded state after the memory plugin reports a critical overage.

Investigation revealed that if the collectd process is restarted while a memory alarm is asserted then the host can get stuck degraded.

Severity
--------
Major: The host gets stuck in the degraded state, which can affect patching or upgrade.

Steps to Reproduce
------------------
Step 1: Consume host memory to produce a critical memory overage alarm.
Step 2: Restart the collectd process.
Step 3: Free host memory so that the memory alarm clears.

Expected Behavior
------------------
The degrade state and the alarms should both clear even when the collectd process is restarted while the alarm is asserted.

Actual Behavior
----------------
The degrade state does not clear.

Reproducibility
---------------
100% for instance-based alarms

System Configuration
--------------------
Any system or IP type

Branch/Pull Time/Commit
-----------------------
starlingx/master as of Nov 10, 2020

Last Pass
---------
Unknown, likely a regression test escape.

Timestamp/Logs
--------------

<date> controller-0 collectd[105092]: info degrade notifier:
{"service":"collectd_notifier","hostname":"controller-0","degrade":"assert","resource":"memory"}
 [ repeated ]

<date> controller-0 mtcAgent hbs nodeClass.cpp (5229) collectd_notify_handler : Info : controller-0 collectd degrade state change ; clear -> assert (due to memory)
 [ no clear log ; even after alarms are cleared ]
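When checking logs for this condition, it can help to track the last degrade state reported per host and resource. The following is a minimal sketch, not part of the maintenance code: it parses the notifier JSON payload shown above and lists resources whose most recent message is an assert. The function names are hypothetical, and it assumes that a clear message carries "degrade":"clear", which the excerpt above does not show.

```python
import json

_decoder = json.JSONDecoder()


def last_degrade_states(lines):
    """Map (hostname, resource) to the last degrade value seen in the log lines."""
    states = {}
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue  # no JSON payload on this line
        try:
            msg, _ = _decoder.raw_decode(line, start)
        except ValueError:
            continue  # truncated or non-JSON text after the brace
        if not isinstance(msg, dict) or msg.get("service") != "collectd_notifier":
            continue
        host = msg.get("hostname")
        resource = msg.get("resource")
        degrade = msg.get("degrade")
        if all(isinstance(v, str) for v in (host, resource, degrade)):
            states[(host, resource)] = degrade
    return states


def stuck_resources(lines):
    """Return the (hostname, resource) pairs whose last reported state is 'assert'."""
    # Assumption: a clear is logged with "degrade":"clear".
    return sorted(k for k, v in last_degrade_states(lines).items() if v == "assert")
```

Feeding it the collectd log lines from after the alarms have cleared would, under the assumptions above, still list ("controller-0", "memory") on an affected host.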

Test Activity
-------------
[Other - Acceptance Testing]

Workaround
----------
Restart collectd after the alarms are cleared

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.5.0 stx.metal
OpenStack Infra (hudson-openstack) wrote : Fix proposed to monitoring (master)

Fix proposed to branch: master
Review: https://review.opendev.org/762880

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fault (master)

Fix proposed to branch: master
Review: https://review.opendev.org/762881

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fault (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/fault/+/792254

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/fault/+/793428

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fault (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/fault/+/792254

OpenStack Infra (hudson-openstack) wrote : Fix merged to fault (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/fault/+/793428
Committed: https://opendev.org/starlingx/fault/commit/d17dd2a196d07500797895ebba4adb020b8a3498
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 3280e6cd5b28809b51ea45e369c069f76f165c44
Author: Pedro Henrique Linhares <email address hidden>
Date: Thu May 6 18:41:57 2021 -0300

    Adding Kubernetes alarm type for PV migration errors during AIO-SX to AIO-DX

    This commit adds a new alarm type for Kubernetes Persistent Volume
    patching errors during AIO-SX to AIO-DX migration.

    Partial-Bug: 1927224
    Signed-off-by: Pedro Henrique Linhares <email address hidden>
    Change-Id: I8f64280394999249c829372d1748a9c26fdb9ced

commit a64e88bf43012d5558826442b98b26847370eeb3
Author: Jerry Sun <email address hidden>
Date: Tue May 4 15:46:52 2021 -0400

    Better repair action for alarm 100.104

    This commit adds a better proposed repair action for filesystem
    threshold alarm 100.104.

    Closes-Bug: 1927155
    Signed-off-by: Jerry Sun <email address hidden>
    Change-Id: Id2d1d4c23d343455d1f0c2e359cf380cc23229cd

commit 03090ca2bb77edb8a01c9a08a716aa3d1a5f4595
Author: Charles Short <email address hidden>
Date: Mon Apr 26 10:50:20 2021 -0400

    Fix pep8 gate failures

    Set hacking to < 4.0.1 in test-requirements.txt so that
    the pep8 gate passes again.

    Test:
    Ran tox -e pep8 command to validate the flake8 job and result.

    Related-Bug: 1926172

    Signed-off-by: Charles Short <email address hidden>
    Change-Id: I5b27a89d0e078912814ca2999bf28e6602980fd0

commit 581495082a5a0a9456065b3d3bb8b5f015747fd8
Author: Eric MacDonald <email address hidden>
Date: Tue Apr 6 09:02:39 2021 -0400

    Make small modification to fm's logrotation configuration file

    This update makes the following changes to the fm logrotation config file

     - add 'create' with permissions to each tuple
     - add 'delaycompress' as a local setting to each log entry
     - remove 'nodateext' global and local setting

    Test Plan:

    PASS: Verify fm logs rotation behavior
    PASS: Verify fm logs delaycompress setting behavior
    PASS: Verify log permissions after rotate

    Change-Id: Ibe8bd8107501df947b5091e928de202378ef4ea8
    Partial-Bug: 1918979
    Depends-On: https://review.opendev.org/c/starlingx/config-files/+/784943
    Signed-off-by: Eric MacDonald <email address hidden>

commit 63fcc33bbca0bc07719c070a8fa7c2a3d3f084b9
Author: Enzo Candotti <email address hidden>
Date: Thu Apr 1 11:37:45 2021 -0300

    Update events.yaml with DM-Monitor alarms

    Add a new alarm definition under the 260.001 id,
    created when resources reconciled status were false.

    Closes-Bug: 1922238

    Signed-off-by: Enzo Candotti <email address hidden>
    Change-Id: I96c05aaaf914bb253f7a71a7bfc79924c8da7857

commit 4639f7dfff972f2b3e2cd61df11ebaf31afc89ee
Author: albailey <email address hidden>
Date: Wed Nov 18 13:36:04 2020 -0600

    Add log and alarm support for vim orchestrated kube-upgrade

    A...

tags: added: in-f-centos8