hw-health charm can hold onto bad sensor data

Bug #1942261 reported by Adam Dyess
This bug affects 1 person
Affects: hw-health-charm
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

I found a case where sensor data for a disk was reporting a fault:

$ sudo ipmi-sensors | egrep -i 'DISK45|DISK44'
52 | SSD Disk45 Temp | Temperature | 0.00 | C | 'At or Below (<=) Lower Non-Recoverable Threshold'
53 | SSD Disk44 Temp | Temperature | 78.00 | C | 'OK'
176 | DISK44 | Drive Slot | N/A | N/A | 'Drive Presence'
177 | DISK45 | Drive Slot | N/A | N/A | 'Drive Presence'

Notice the Disk45 temperature at an icy 0.00 C. The BMC actually reported a more moderate 55 C for both drives, and it turned out that the IPMI sensor data is cached at

/root/.freeipmi/sdr-cache/sdr-cache-$(hostname)

The charm should check whether this sensor data changes over time, similar to the way it detects stale data in /var/lib/nagios/ipmi_sensors.out, and clear the cache if necessary.
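As a rough illustration of the check requested here, the sketch below parses the pipe-delimited `ipmi-sensors` output shown earlier and flags sensors that stay in a non-OK state with an identical reading across several samples. The names `Reading`, `parse_ipmi_sensors` and `suspect_sensors` are hypothetical and not part of the charm, and "unchanged across samples" is only a heuristic for a stale cache, not a confirmed detection method.

```python
from typing import Dict, List, NamedTuple


class Reading(NamedTuple):
    name: str
    value: str
    state: str


def parse_ipmi_sensors(output: str) -> Dict[str, Reading]:
    """Parse pipe-delimited `ipmi-sensors` output into {sensor name: Reading}."""
    readings: Dict[str, Reading] = {}
    for line in output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Assumed columns: ID, name, type, reading, units, event/state.
        # Header lines and anything else without a numeric ID are skipped.
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        readings[fields[1]] = Reading(fields[1], fields[3], fields[5].strip("'"))
    return readings


def suspect_sensors(samples: List[Dict[str, Reading]]) -> List[str]:
    """Return sensors that are non-OK with the same reading in every sample."""
    if len(samples) < 2:
        return []  # a single sample cannot show whether a value is changing
    suspects = []
    for name, reading in samples[0].items():
        # Sensors without a numeric reading (e.g. drive slots) are ignored.
        if reading.value == "N/A" or reading.state == "OK":
            continue
        if all(name in s and s[name].value == reading.value for s in samples[1:]):
            suspects.append(name)
    return sorted(suspects)
```

A caller would collect a few samples some minutes apart and, if `suspect_sensors` returns anything, flush the cache and re-read before alerting.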

Secondly, the charm should provide an action to clear this cache on demand.

Workaround:
`sudo ipmi-sensors -f`
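The on-demand action requested above could be little more than a wrapper around this workaround command. The following is a minimal sketch, not the charm's actual code: `flush_sdr_cache` and `FLUSH_COMMAND` are hypothetical names, and it assumes the action already runs as root, so no `sudo` is needed.

```python
import subprocess
from typing import Callable, List, Sequence

# Same command as the workaround, minus sudo (assumes the caller is root).
FLUSH_COMMAND: List[str] = ["ipmi-sensors", "-f"]


def flush_sdr_cache(run: Callable[[Sequence[str]], int] = subprocess.call) -> bool:
    """Flush the freeipmi SDR cache; return True if the command exited with 0.

    `run` is injectable so the function can be exercised without IPMI hardware.
    """
    try:
        return run(FLUSH_COMMAND) == 0
    except OSError:
        # Raised, for example, when ipmi-sensors is not installed.
        return False
```

An action handler would call `flush_sdr_cache()` and report failure to the operator when it returns `False`.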

Andrea Ieri (aieri) wrote :

This is odd: the check_ipmi_sensor script we rely on already invokes ipmi-sensors with the --sdr-cache-recreate option (see https://github.com/thomas-krenn/check_ipmi_sensor_v3/blob/master/check_ipmi_sensor#L753), so a stale cache should be detected and handled automatically.

Could this be a bug in freeipmi?
Is the cache recreated if `sudo ipmi-sensors --sdr-cache-recreate` is run manually?
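One way to answer that question would be to compare the modification times of the files in the cache directory before and after running the command manually. This is only a sketch: `snapshot_cache` and `recreated_files` are hypothetical helpers, and the directory to pass in is assumed to be the /root/.freeipmi/sdr-cache path mentioned in the description.

```python
import os
from typing import Dict, List


def snapshot_cache(cache_dir: str) -> Dict[str, int]:
    """Map each regular file in cache_dir to its mtime in nanoseconds."""
    if not os.path.isdir(cache_dir):
        return {}
    return {
        entry.name: entry.stat().st_mtime_ns
        for entry in os.scandir(cache_dir)
        if entry.is_file()
    }


def recreated_files(before: Dict[str, int], after: Dict[str, int]) -> List[str]:
    """Return files that are new or whose mtime changed between two snapshots."""
    return sorted(name for name, mtime in after.items() if before.get(name) != mtime)
```

Taking one snapshot, running `sudo ipmi-sensors --sdr-cache-recreate`, and taking a second snapshot shows whether the cache file was rewritten: an empty result from `recreated_files` would suggest it was left untouched.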

Changed in charm-hw-health:
status: New → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for hw-health-charm because there has been no activity for 60 days.]

Changed in charm-hw-health:
status: Incomplete → Expired