add an action 'ack-sensor' to ignore bad sensor record

Bug #1993977 reported by Linda Guo
Affects: hw-health-charm
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

The `ack-sel` action supports filtering out SEL entries older than a specific date. We found that some hardware reports disk failures in ipmi-sensors rather than in the IPMI SEL, for example:

 $ sudo ipmi-sensors |grep DISK12
 149 | DISK12 | Drive Slot | N/A | N/A | 'Drive Presence' 'Predictive Failure'

The hw-health charm currently doesn't support ignoring an IPMI sensor entry on a single unit, so the IPMI alert cannot be cleared until the hardware issue is fixed. If a further hardware failure occurs in the meantime, we won't receive an alert for it. The hw-health config has an `ipmi_check_options` option, and a sensor record ID can be ignored by setting something like:

ipmi_check_options="-O --exclude-record-ids=149"

But this applies to all hw-health units. It would be better to add an action like `ack-sensor <record-id>` to ignore the IPMI sensor record on a specific unit.
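To illustrate the idea, here is a minimal sketch of the string handling such an action might do when adding a record ID to a unit's check options. The function name `add_excluded_record` is hypothetical and not part of the charm, and the sketch assumes multiple record IDs can be given as a comma-separated list, as freeipmi's `ipmi-sensors` accepts.

```python
import shlex

# Option prefix taken from the config example in this report.
PREFIX = "--exclude-record-ids="


def add_excluded_record(check_options, record_id):
    """Return check_options with record_id added to the exclude list.

    Hypothetical helper: assumes a comma-separated ID list and that the
    option is passed through with a preceding -O, as in the report.
    """
    tokens = shlex.split(check_options)
    for i, tok in enumerate(tokens):
        if tok.startswith(PREFIX):
            ids = [x for x in tok[len(PREFIX):].split(",") if x]
            if str(record_id) not in ids:
                ids.append(str(record_id))
            # Update in place so the token stays right after its -O.
            tokens[i] = PREFIX + ",".join(ids)
            break
    else:
        # No exclude option yet: append one.
        tokens += ["-O", PREFIX + str(record_id)]
    return shlex.join(tokens)
```

Acknowledging the same record twice leaves the options unchanged, so the action would be safe to re-run.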

Linda Guo (lihuiguo)
description: updated
summary: - add an action 'ack-sensor'
+ add an action 'ack-sensor' to ignore bad sensor record
Linda Guo (lihuiguo)
description: updated
Andrea Ieri (aieri) wrote :

If I understand correctly, the objective here is to be able to continue receiving alerts for failures even when a specific one is known and acknowledged. The idea of using an `ack-sensor` action is just a suggestion to work around the juju limitation of application-wide configurations.

Considering the impending need to transition to COS and therefore to metrics-based alerting, I could imagine us exporting individual IPMI sensor readings and thereby being able to alert on individual entries, removing the need for more complex workflows.
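As a rough sketch of what exporting individual sensor entries could look like, the snippet below turns `ipmi-sensors` rows, in the six-column format shown in the bug description, into Prometheus-style sample lines with one sample per asserted event. The metric name `ipmi_sensor_event`, its labels, and the function name are assumptions made for illustration, not an existing exporter's schema.

```python
import re


def sensor_metrics(output):
    """Convert ipmi-sensors text output into metric sample lines."""
    lines = []
    for row in output.splitlines():
        fields = [f.strip() for f in row.split("|")]
        if len(fields) != 6 or not fields[0].isdigit():
            continue  # skip header or malformed rows
        record_id, name, sensor_type, _reading, _units, events = fields
        # Events are listed as quoted strings, e.g. 'Predictive Failure'.
        for event in re.findall(r"'([^']*)'", events):
            lines.append(
                f'ipmi_sensor_event{{record_id="{record_id}",'
                f'name="{name}",type="{sensor_type}",event="{event}"}} 1'
            )
    return lines
```

With per-entry samples like these, an alert rule could match on `event` while excluding an acknowledged `record_id` for a single host, which is the per-unit behaviour the report asks for.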

As we need to start this transition as soon as possible, I will tentatively close this as won't fix. Please set it back to new if a workaround is urgently needed.

Changed in charm-hw-health:
status: New → Won't Fix