collectd alarm clear request failure leads to stuck alarm

Bug #1793574 reported by Eric MacDonald
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald

Bug Description

In StarlingX, the collectd alarm notification handler calls FM (the fault manager) to clear an alarm. If that request fails, for example because of a concurrent swact operation, the alarm can be left stuck.

The alarm remains stuck until the same alarm is asserted again and then cleared successfully.

Severity:
---------
Major

Steps to Reproduce
------------------
Not easy. Raise a collectd-monitored alarm condition, then let the condition clear so that the next audit and clear attempt coincides with a swact or an FM process restart.

Expected Behavior
------------------
The collectd alarm handler retries the alarm clear on the next audit interval.
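
The expected retry behavior can be sketched roughly as follows. This is only an illustration of the idea, not the fix that was merged: `AlarmClearTracker`, its methods, and the `clear_fault` callable are hypothetical names, and the callable stands in for whatever FM request the handler actually makes (assumed here to return a true value on success).

```python
class AlarmClearTracker:
    """Remembers alarm clears that failed so a later audit can retry them.

    Hypothetical sketch: clear_fault is an assumed callable taking
    (alarm_id, entity) and returning True on success. It is a stand-in
    for the real FM request, whose API is not shown in this report.
    """

    def __init__(self, clear_fault):
        self._clear_fault = clear_fault
        self._pending = set()  # (alarm_id, entity) pairs still to be cleared

    def on_assert(self, alarm_id, entity):
        # A fresh assertion supersedes any outstanding clear for this alarm.
        self._pending.discard((alarm_id, entity))

    def on_clear(self, alarm_id, entity):
        # Record the clear before attempting it, so a failure is not lost.
        key = (alarm_id, entity)
        self._pending.add(key)
        return self._attempt(key)

    def audit(self):
        # Called once per audit interval: retry every clear that has not
        # succeeded yet. Returns the number of clears still outstanding.
        for key in sorted(self._pending):  # sorted() copies, so discard is safe
            self._attempt(key)
        return len(self._pending)

    def _attempt(self, key):
        try:
            ok = bool(self._clear_fault(*key))
        except Exception:
            # Treat any error (e.g. FM unreachable during a swact) as a failure.
            ok = False
        if ok:
            self._pending.discard(key)
        return ok
```

With a tracker like this, a clear for `100.101` on `host=controller-1` that fails while FM is unreachable stays in the pending set, and the next call to `audit()` attempts it again instead of waiting for a new assertion.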

Actual Behavior
----------------
The clear is not retried. It is only attempted again after the alarm is next asserted and that condition then clears.

Reproducibility
---------------
Intermittent/Rare. See Steps to Reproduce for why this is hard to trigger.

System Configuration
--------------------
AIO DX.

Branch/Pull Time/Commit
-----------------------
BUILD_DATE="2018-09-19 20:19:35 -0400"

Timestamp/Logs
--------------
From /var/log/daemon.log on controller-1

2018-09-20T13:05:10.838 controller-1 collectd[167073]: info WARNING:root:fm_python_extension: Failed to connect to FM manager
2018-09-20T13:05:10.838 controller-1 collectd[167073]: info alarm notifier 100.101:host=controller-1 clear_fault failed
2018-09-20T13:05:10.838 controller-1 collectd[167073]: info alarm notifier cpu no warnings
2018-09-20T13:05:10.838 controller-1 collectd[167073]: info alarm notifier clear alarm 100.101:okay cpu:host=controller-1 thld:90.00 value:69.22

Ghada Khalil (gkhalil) wrote :

There is already a Launchpad bug reporting this issue:
https://bugs.launchpad.net/starlingx/+bug/1793314

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
importance: Undecided → Medium
Ghada Khalil (gkhalil) wrote :

Marking this as a duplicate

Ghada Khalil (gkhalil) wrote :

Duplicate of https://bugs.launchpad.net/starlingx/+bug/1793314
Marking as Fix Released; same as the duplicate bug report. Fix was merged on 2018-09-20

tags: added: stx.2018.10 stx.metal
Changed in starlingx:
status: New → Fix Released
Ken Young (kenyis)
tags: added: stx.1.0
removed: stx.2018.10