collectd alarm clear request failure leads to stuck alarm
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Eric MacDonald |
Bug Description
When the StarlingX collectd alarm notification handler calls FM (the fault manager) to clear an alarm and that request fails, for example because of a concurrent swact operation, the alarm can become stuck.
The alarm remains stuck until the same alarm is asserted again and then successfully cleared.
Severity:
---------
Major
Steps to Reproduce
------------------
Not easy to reproduce. Cause a collectd-monitored alarm condition, then clear that condition so that the next audit and clear attempt coincides with a swact or an FM process restart.
Expected Behavior
------------------
The collectd alarm handler retries the alarm clear on the next audit interval.
Actual Behavior
----------------
The failed clear is not retried. No further clear is attempted until the alarm is asserted again, the condition clears again, and the resulting clear request succeeds.
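
The difference between the expected and actual behavior can be sketched as follows. This is only an illustration, not the collectd plugin code: the class name ClearRetryTracker, the fm_clear callable, and the alarm id string are all hypothetical stand-ins. The idea is that a failed clear request is remembered and tried again on every audit until FM accepts it, instead of being dropped.

```python
from typing import Callable, Set


class ClearRetryTracker:
    """Remembers alarm clears that failed so later audits retry them."""

    def __init__(self, fm_clear: Callable[[str], bool]) -> None:
        # fm_clear is a hypothetical stand-in for the FM clear request.
        # It is assumed to return True on success and to return False
        # or raise on failure (for example during a swact).
        self._fm_clear = fm_clear
        self._pending: Set[str] = set()

    def request_clear(self, alarm_id: str) -> bool:
        """Attempt a clear and record the alarm if the request fails."""
        try:
            ok = bool(self._fm_clear(alarm_id))
        except Exception:
            ok = False
        if ok:
            self._pending.discard(alarm_id)
        else:
            self._pending.add(alarm_id)
        return ok

    def audit(self) -> None:
        """Run once per audit interval: retry every outstanding clear."""
        # sorted() copies the set, so request_clear may modify it safely.
        for alarm_id in sorted(self._pending):
            self.request_clear(alarm_id)

    @property
    def pending(self) -> Set[str]:
        """Alarm ids whose clear has not yet succeeded."""
        return set(self._pending)
```

With a tracker like this, a clear that fails during a swact stays in the pending set, and the next call to audit() issues it again, which matches the expected behavior described above.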
Reproducibility
---------------
Intermittent/Rare. See Steps to Reproduce for why this is hard to trigger.
System Configuration
-------
AIO DX.
Branch/Pull Time/Commit
-------
BUILD_DATE=
Timestamp/Logs
--------------
From /var/log/daemon.log on controller-1
2018-09-
2018-09-
2018-09-
2018-09-
tags: added: stx.1.0; removed: stx.2018.10
There is already a Launchpad bug reporting this issue: https://bugs.launchpad.net/starlingx/+bug/1793314