hw-health charm can hold onto bad sensor data
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
hw-health-charm | Expired | Undecided | Unassigned | | |
Bug Description
I found a case where the sensor data for a disk was reporting a fault:
$ sudo ipmi-sensors | egrep -i 'DISK45|DISK44'
52 | SSD Disk45 Temp | Temperature | 0.00 | C | 'At or Below (<=) Lower Non-Recoverable Threshold'
53 | SSD Disk44 Temp | Temperature | 78.00 | C | 'OK'
176 | DISK44 | Drive Slot | N/A | N/A | 'Drive Presence'
177 | DISK45 | Drive Slot | N/A | N/A | 'Drive Presence'
Notice the Disk45 temperature at an icy 0.00 C. The BMC actually reported a more moderate 55 C for both drives, and it turned out that the IPMI sensor data is cached at /root/.
First, the charm should evaluate whether this sensor data is changing over time, similar to the way it detects stale data in /var/lib/
Second, the charm should provide an action to clear this cache on demand.
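A rough sketch of what the "is this reading changing over time" check could look like, assuming the charm samples `ipmi-sensors` output periodically. The names `parse_reading` and `StaleSensorDetector` and the window size are hypothetical and not part of the charm. A temperature that is legitimately constant would also trip this, so it is a hint to investigate, not proof of a stale cache.

```python
from collections import defaultdict, deque


def parse_reading(line):
    """Parse one pipe-separated ipmi-sensors row into (name, value).

    Returns None for rows without a numeric reading (e.g. 'N/A').
    """
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 6:
        return None
    try:
        return fields[1], float(fields[3])
    except ValueError:
        return None


class StaleSensorDetector:
    """Flag sensors whose last `window` samples are all identical."""

    def __init__(self, window=5):
        self.window = window
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, name, value):
        self.history[name].append(value)

    def stale_sensors(self):
        # Only judge sensors that already have a full window of samples.
        return sorted(
            name
            for name, values in self.history.items()
            if len(values) == self.window and len(set(values)) == 1
        )
```

Each polling cycle would feed the parsed rows into `record()` and then alert on whatever `stale_sensors()` returns.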
Workaround:
`sudo ipmi-sensors -f`
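The on-demand action requested above could be a thin wrapper around the same flush. This is only a sketch: `flush_sdr_cache` is a hypothetical name, the charm's actual action wiring is not shown, and the process is assumed to run as root already. It uses `--flush-cache`, the long form of `-f`.

```python
import subprocess


def flush_sdr_cache(run=subprocess.run):
    """Flush the cached SDR data; return True if the command succeeded.

    `run` is injectable so the function can be exercised without IPMI hardware.
    """
    result = run(
        ["ipmi-sensors", "--flush-cache"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```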
This is odd: the check_ipmi_sensor script we rely on already invokes ipmi-sensors with the --sdr-cache-recreate option (see https://github.com/thomas-krenn/check_ipmi_sensor_v3/blob/master/check_ipmi_sensor#L753), so a stale cache should be detected and handled automatically.
Could this be a bug in freeipmi?
Is the cache recreated if `sudo ipmi-sensors --sdr-cache-recreate` is run manually?