Comment 1 for bug 1876931

Alvaro Uria (aluria) wrote :

Thank you, Drew. I agree this issue causes alert fatigue, and operators could decide to change the alert threshold at their discretion.

When such an error occurs, it seems wrong to return either:
1) OK: there is something wrong if it lasts forever
2) WARNING: it is not related to the hardware but to the IPMI interface

I would suggest implementing a clock (with a threshold in seconds, configurable via Juju) that tracks how long the "internal IPMI error" message has been returned. For example, the check could keep returning OK while the error has persisted for up to 2 hours in a row, but anything longer would trigger:
"""
UNKNOWN: Repeated for {{time}} seconds. ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-${hostname_fqdn}: internal IPMI error
"""

Whenever a different message is returned, the clock is reset.
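
To make the idea concrete, here is a rough Python sketch of the clock logic. It is only an illustration, not the existing check code: the function name `evaluate`, the `first_seen` state value, and the OK message wording are all made up for this example, and the caller would still need to persist `first_seen` between check runs (for instance in a small state file) and pass in the threshold from the Juju config.

```python
import time

# Standard Nagios plugin exit codes.
OK, UNKNOWN = 0, 3

# Substring identifying the error this clock applies to.
ERROR_MARKER = "internal IPMI error"


def evaluate(message, first_seen, threshold, now=None):
    """Return (exit_code, text, first_seen) for one check run.

    message    -- output of the IPMI check (hypothetical input)
    first_seen -- timestamp when the error was first observed, or None
    threshold  -- seconds the error may persist before UNKNOWN
    now        -- current timestamp (injectable for testing)
    """
    now = time.time() if now is None else now

    if ERROR_MARKER not in message:
        # A different message was returned: reset the clock and let
        # the caller handle the message as usual (exit_code is None).
        return None, message, None

    if first_seen is None:
        # First occurrence of the error: start the clock.
        first_seen = now

    elapsed = int(now - first_seen)
    if elapsed > threshold:
        text = "UNKNOWN: Repeated for {} seconds. {}".format(elapsed, message)
        return UNKNOWN, text, first_seen

    text = "OK: {} (seen for {} seconds)".format(message, elapsed)
    return OK, text, first_seen
```

With a threshold of 7200, the check would stay OK for the first two hours of consecutive errors and switch to UNKNOWN afterwards.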

Would you agree with this approach?