Host can get stuck degraded from critical memory alarm
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Fix Released | Medium | Eric MacDonald | | |
Bug Description
The maintenance system provides collectd plugins for resource utilization monitoring and an alarm/degrade management plugin for failure notification.
The 'memory' resource plugin reports memory utilization as samples. The alarm/degrade plugin monitors those samples and manages memory alarm assertion and clear, as well as host degrade state assertion and clear.
A case has been observed where a host gets stuck in the degraded state after the memory plugin reports a critical overage.
Investigation revealed that if the collectd process is restarted while a memory alarm is asserted, the host can get stuck degraded.
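The report does not state the root cause, so the following is only a toy model of one plausible way a restart could strand the degrade state. It is a sketch under stated assumptions, not the StarlingX implementation: the class names, the 90% critical threshold, and the idea that the notifier tracks asserted alarms only in process memory are all hypothetical.

```python
class MtcAgent:
    """Hypothetical stand-in for the maintenance agent that owns host degrade state."""

    def __init__(self):
        self.degraded_by = set()  # resources currently requesting degrade

    def handle(self, resource, degrade):
        # Record or drop a degrade request for the named resource.
        if degrade:
            self.degraded_by.add(resource)
        else:
            self.degraded_by.discard(resource)

    @property
    def degraded(self):
        return bool(self.degraded_by)


class DegradeNotifier:
    """Hypothetical notifier whose alarm state lives only in process memory."""

    def __init__(self, agent):
        self.agent = agent
        self.asserted = set()  # assumption: lost whenever the process restarts

    def sample(self, resource, usage_pct, critical_pct=90.0):
        if usage_pct >= critical_pct:
            if resource not in self.asserted:
                self.asserted.add(resource)
                self.agent.handle(resource, degrade=True)
        elif resource in self.asserted:
            self.asserted.discard(resource)
            self.agent.handle(resource, degrade=False)
        # Otherwise this process believes nothing is asserted, so no clear is sent.


agent = MtcAgent()
notifier = DegradeNotifier(agent)

notifier.sample("memory", 95.0)    # critical overage: degrade asserted
assert agent.degraded

notifier = DegradeNotifier(agent)  # simulated collectd restart: state is lost
notifier.sample("memory", 40.0)    # memory freed, but no clear is ever sent

print(agent.degraded)  # prints True in this model: the host stays degraded
```

In this model the agent remembers the degrade request while the restarted notifier does not, so nothing ever sends the matching clear. Any real fix would need the two sides to resynchronize after a restart; how StarlingX does that is not described in this report.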
Severity
--------
Major: Host gets stuck in the degraded state, which can affect patching or upgrade
Steps to Reproduce
------------------
Step 1: Consume host memory to produce a critical memory overage alarm.
Step 2: Restart the collectd process.
Step 3: Free host memory so that the memory alarm clears.
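The steps above could be driven from a small helper like the one below. This is a minimal sketch, not a tool from the StarlingX tree: the function names are hypothetical, the 4096-byte page size is an assumption, and the restart command shown (`systemctl restart collectd`) is only an assumed default that may differ on a StarlingX host, so it is passed as a parameter.

```python
import subprocess

PAGE = 4096  # assumed page size, used only to touch each page once


def hold_memory(total_mib, chunk_mib=64):
    """Allocate roughly total_mib MiB and touch every page so it becomes resident."""
    chunks = []
    remaining = total_mib
    while remaining > 0:
        this_mib = min(chunk_mib, remaining)
        size = this_mib * 1024 * 1024
        buf = bytearray(size)
        for offset in range(0, size, PAGE):
            buf[offset] = 1  # write to the page so the kernel actually backs it
        chunks.append(buf)
        remaining -= this_mib
    return chunks


def restart_collectd(cmd=("systemctl", "restart", "collectd")):
    """Run the restart command; the default command is an assumption."""
    subprocess.run(list(cmd), check=True)


# Intended sequence (not executed here):
#   held = hold_memory(N)   # Step 1: N chosen to exceed the critical threshold
#   restart_collectd()      # Step 2: once the critical memory alarm is raised
#   held.clear()            # Step 3: release the memory so the alarm clears
```

The amount of memory to hold depends on the host and on the configured critical threshold, neither of which is given in this report.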
Expected Behavior
------------------
The degrade state and alarms should clear across a collectd process restart.
Actual Behavior
----------------
The degrade state does not clear.
Reproducibility
---------------
100% for instance-based alarms
System Configuration
-------
Any system or IP type
Branch/Pull Time/Commit
-------
starlingx/master as of Nov 10, 2020
Last Pass
---------
Unknown, likely a regression test escape.
Timestamp/Logs
--------------
<date> controller-0 collectd[105092]: info degrade notifier:
{"service"
[ repeated ]
<date> controller-0 mtcAgent hbs nodeClass.cpp (5229) collectd_
[ no clear log ; even after alarms are cleared ]
Test Activity
-------------
[Other - Acceptance Testing]
Workaround
----------
Restart collectd after the alarms are cleared
Changed in starlingx: | |
assignee: | nobody → Eric MacDonald (rocksolidmtce) |
Changed in starlingx: | |
importance: | Undecided → Medium |
status: | New → Triaged |
tags: | added: stx.5.0 stx.metal |
Changed in starlingx: | |
status: | In Progress → Fix Released |
Fix proposed to branch: master
Review: https://review.opendev.org/762880