add an action 'ack-sensor' to ignore bad sensor record
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
hw-health-charm | Won't Fix | Undecided | Unassigned | |
Bug Description
The `ack-sel` action supports filtering out SEL entries older than a specific date. We found that some hardware reports disk failures in ipmi-sensors rather than in the IPMI SEL, for example:
$ sudo ipmi-sensors |grep DISK12
149 | DISK12 | Drive Slot | N/A | N/A | 'Drive Presence' 'Predictive Failure'
The hw-health charm currently doesn't support ignoring an IPMI sensor entry on a single unit, so the IPMI alert cannot be cleared until the hardware issue is fixed. If there is a further hardware failure in the meantime, we won't receive an alert for it. There is `ipmi_check_
ipmi_check_
But this will apply to all hw-health units. It'd be better to add an action like `ack-sensor <record-id>` to ignore the IPMI sensor record.
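To make the request concrete, here is a rough sketch of the per-unit filtering such an action could enable. It is not code from the hw-health charm: the names `SensorRecord`, `parse_sensor_line` and `unacknowledged_alerts` are hypothetical, and treating every event other than `'OK'` or `N/A` as an alert is a simplifying assumption (the real check may classify events differently).

```python
from typing import Iterable, List, NamedTuple, Set


class SensorRecord(NamedTuple):
    """One row of pipe-delimited `ipmi-sensors` output (hypothetical helper type)."""
    record_id: int
    name: str
    sensor_type: str
    reading: str
    units: str
    event: str


# Assumption: these event strings mean "nothing to report".
HEALTHY_EVENTS = {"'OK'", "N/A"}


def parse_sensor_line(line: str) -> SensorRecord:
    """Split one row such as "149 | DISK12 | Drive Slot | N/A | N/A | '...'"."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 6:
        raise ValueError(f"unexpected ipmi-sensors row: {line!r}")
    return SensorRecord(int(fields[0]), *fields[1:])


def unacknowledged_alerts(lines: Iterable[str], acked_ids: Set[int]) -> List[SensorRecord]:
    """Return alerting records whose ID has not been acknowledged on this unit."""
    alerts = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and the header row, which does not start with a digit.
        if not line or not line[0].isdigit():
            continue
        record = parse_sensor_line(line)
        if record.event in HEALTHY_EVENTS:
            continue
        if record.record_id in acked_ids:
            continue
        alerts.append(record)
    return alerts
```

With record 149 acknowledged, the DISK12 row from the example above would be dropped, while a new failure on any other record would still be reported.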
If I understand correctly, the objective here is to keep receiving alerts for new failures even when a specific one is already known and acknowledged. The `ack-sensor` action is only a suggested workaround for the Juju limitation that configuration is application-wide.
Considering the impending need to transition to COS and therefore to metrics-based alerting, I could imagine us exporting individual IPMI sensor readings and alerting on individual entries, which would remove the need for more complex workflows.
As we need to start this transition as soon as possible, I will tentatively close this as Won't Fix. Please set it back to New if a workaround is urgently needed.
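For illustration only, the per-sensor export mentioned above might look roughly like the sketch below, which renders one gauge sample per sensor in the Prometheus text exposition format. The metric name `ipmi_sensor_alert`, its labels, and the rule that anything other than `'OK'` or `N/A` counts as an alert are all assumptions, not an existing exporter's interface.

```python
from typing import Iterable, Tuple

# Assumption: these event strings mean the sensor is healthy.
HEALTHY_EVENTS = {"'OK'", "N/A"}


def escape_label(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(sensors: Iterable[Tuple[int, str, str, str]]) -> str:
    """Render (record_id, name, sensor_type, event) tuples as gauge samples."""
    lines = [
        "# HELP ipmi_sensor_alert 1 if the sensor reports a non-healthy event, else 0.",
        "# TYPE ipmi_sensor_alert gauge",
    ]
    for record_id, name, sensor_type, event in sensors:
        value = 0 if event in HEALTHY_EVENTS else 1
        lines.append(
            f'ipmi_sensor_alert{{id="{record_id}",name="{escape_label(name)}",'
            f'type="{escape_label(sensor_type)}"}} {value}'
        )
    return "\n".join(lines) + "\n"
```

Because each sensor becomes its own time series, an alert rule could in principle exclude a known-bad entry with a label matcher such as `ipmi_sensor_alert{id!="149"} == 1`, without muting the other sensors on that unit or on other units.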