Intermittent Lenovo IPMI access issues should be silenceable

Bug #1876931 reported by Drew Freiberger
Affects          Status     Importance  Assigned to
Charm Helpers    New        Undecided   Unassigned
hw-health-charm  Won't Fix  Medium      Unassigned

Bug Description

On some Lenovo hardware, we are seeing temporary outages of 30 minutes to 2 hours for IPMI connectivity from the host. The resultant service output is:

UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-${hostname_fqdn}: internal IPMI error

We might want to find a way to allow the charm a setting to ignore "internal IPMI error" responses related to ipmi_sdr_cache_open calls. This is not an issue determined through IPMI log/query, but an issue querying the IPMI interface of the BMC and can be noisy depending on hardware/firmware combinations for some environments.
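A setting like the one described could be implemented by classifying the check output before emitting a Nagios state. The sketch below is an assumption about how such a filter might look, not the charm's actual interface; the function name and option are hypothetical.

```python
import re

# Matches the transient BMC-side failure described in this bug.
IPMI_CACHE_ERROR = re.compile(r"ipmi_sdr_cache_open: .* internal IPMI error")


def classify_output(output, ignore_cache_errors):
    """Map raw check output to a Nagios state (0=OK, 3=UNKNOWN).

    When ignore_cache_errors is enabled (hypothetically via a charm
    config option), "internal IPMI error" responses are downgraded to
    OK instead of raising UNKNOWN.
    """
    if IPMI_CACHE_ERROR.search(output):
        if ignore_cache_errors:
            return 0, "OK: transient IPMI SDR cache error ignored"
        return 3, "UNKNOWN: " + output
    return 0, output
```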

description: updated
Revision history for this message
Alvaro Uria (aluria) wrote :

Thank you, Drew. I agree this issue causes alert fatigue, and the operator could be given a way to change the alert threshold.

When such an error occurs, it seems wrong to return either:
1) OK: something really is wrong if the error lasts forever
2) WARNING: the problem is not with the hardware but with the IPMI interface

I would suggest implementing a clock (with the time in seconds configurable via Juju) to track how long the "internal IPMI error" message has been returned. For example, up to 2 hours in a row the check could still return OK, but anything longer would trigger:
"""
UNKNOWN: Repeated for {{time}} seconds. ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-${hostname_fqdn}: internal IPMI error
"""

Whenever a different message is returned, the clock is reset.

Would you agree on this approach?
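The clock described above could be sketched as follows. This is a minimal illustration under assumptions: the state-file path, function name, and message formats are hypothetical, not part of the charm.

```python
import json
import os
import time

# Hypothetical location for persisting the first-seen timestamp
# between check runs.
STATE_FILE = "/tmp/ipmi_error_first_seen.json"


def evaluate(output, grace_seconds, now=None, state_file=STATE_FILE):
    """Return OK while "internal IPMI error" has persisted for less
    than grace_seconds; escalate to UNKNOWN after that. Any other
    output resets the clock."""
    now = time.time() if now is None else now
    if "internal IPMI error" not in output:
        if os.path.exists(state_file):
            os.remove(state_file)  # different message: reset the clock
        return 0, output
    if os.path.exists(state_file):
        with open(state_file) as f:
            first_seen = json.load(f)["first_seen"]
    else:
        first_seen = now
        with open(state_file, "w") as f:
            json.dump({"first_seen": first_seen}, f)
    elapsed = now - first_seen
    if elapsed >= grace_seconds:
        return 3, "UNKNOWN: repeated for %d seconds. %s" % (elapsed, output)
    return 0, "OK: transient IPMI error (%ds < %ds grace)" % (elapsed, grace_seconds)
```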

Changed in charm-hw-health:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Drew Freiberger (afreiberger) wrote :

I think if we're going to have an "alarm after X hours" setting, we should just try to set that as the alert threshold within nagios when we add the check rather than coding it into the check script itself.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

max_check_attempts and retry_interval should be tuned for this check: setting a 15-minute retry interval with max_check_attempts=8 would give a 2-hour window before a notification is sent.
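In a generated Nagios service definition, the tuning proposed above would look roughly like this (the service and command names are placeholders; exact directive values would come from charm config):

```
define service {
    use                  generic-service
    service_description  check_ipmi
    check_interval       5      ; minutes between checks while OK
    retry_interval       15     ; minutes between rechecks while non-OK
    max_check_attempts   8      ; ~2 h of consecutive failures before a HARD state
    check_command        check_nrpe!check_ipmi
}
```

Notifications are only sent once the service reaches a HARD state, so tuning these two directives delays alerting without changing the check itself.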

Revision history for this message
Drew Freiberger (afreiberger) wrote :

Here is the part of charmhelpers that sets up the service template. It would need to be updated to support passing these additional variables through a call to charmhelpers.contrib.charmsupport.nrpe.add_check. Maybe add some additional template_kwargs for future expandability into charmhelpers, for use by various charms that might need to set up such tuning?

https://github.com/juju/charm-helpers/blob/master/charmhelpers/contrib/charmsupport/nrpe.py#L131-L143
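The suggested extension could be sketched like this. Note this is an illustration only: the `template_kwargs` parameter, function name, and template fields are assumptions, not the current charmhelpers API.

```python
# Sketch: letting extra tuning directives flow into the rendered
# Nagios service block. Field names mirror Nagios directives.
SERVICE_TEMPLATE = """define service {{
    use                  active-service
    host_name            {hostname}
    service_description  {description}
    check_command        check_nrpe!{command}
{extra_directives}}}
"""


def render_service(hostname, description, command, template_kwargs=None):
    """Render a Nagios service block, appending any tuning directives
    (e.g. max_check_attempts, retry_interval) supplied by the charm."""
    extra = ""
    for key, value in sorted((template_kwargs or {}).items()):
        extra += "    %-20s %s\n" % (key, value)
    return SERVICE_TEMPLATE.format(
        hostname=hostname,
        description=description,
        command=command,
        extra_directives=extra,
    )
```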

Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

+1 for adding template kwargs to charmhelpers; this would be useful for other checks as well. Going to add an "affects" entry for charmhelpers here.

Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

For the record, I also got this on a Cisco system (for just a few minutes):

Manufacturer: Cisco Systems Inc
Product Name: UCSC-C240-M5S

Revision history for this message
Drew Freiberger (afreiberger) wrote :

This will require charm-nagios to grow a new feature. See bug https://bugs.launchpad.net/charm-nrpe/+bug/1877400, which will need to be worked and closed before this specific check can gain these options.

Revision history for this message
Eric Chen (eric-chen) wrote :

This charm is no longer being actively maintained. Please consider using the new hardware-observer-operator instead (https://github.com/canonical/hardware-observer-operator).
I am therefore marking this issue as Won't Fix.

Changed in charm-hw-health:
status: Confirmed → Won't Fix