check_ipmi for huawei iBMC doesn't show all SEL entries
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
hw-health-charm | Expired | Undecided | Unassigned | | |
Bug Description
On a customer cloud with Huawei machines, the nrpe check check_ipmi is incorrectly alerting for SEL entries. From the check we see the following response showing one SEL entry:
<user>@
CRITICAL: IPMI Status: Critical [1 system event log (SEL) entry present] | 'Inlet Temp'=29.
However, checking the iBMC directly, we see two entries, the latest of which is from twelve days ago. It doesn't make sense that we are only now getting an alert for it (around 2021-07-27 19:12:07 UTC):
iBMC security log has reached 90% space capacity. 2021-07-15 10:58:18 Deasserted
iBMC event records are cleared. 2021-07-09 09:27:22 Asserted
Note that I verified via Nagios that we don't have any downtimes that recently expired that could have caused this.
In summary:
1) our ipmi configuration currently doesn't retrieve all SEL events from the BMC
2) it's incorrectly alerting well after SEL entries are being generated on the BMC.
It might also be nice, while we're working here, to increase the cron job frequency so that alerts aren't fired until we have multiple checks to compare.
Are the extra SEL entries visible when running ipmi-sel manually on the host?
This could be a bug in the kernel driver or in freeipmi.
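To help answer the question above, here is a minimal sketch for dumping what freeipmi itself reports on the host, so the count and timestamps can be compared with the iBMC web UI. It assumes `ipmi-sel` is installed, that it is run with enough privileges to reach the BMC, and that its comma-separated output has the usual six columns (ID, Date, Time, Name, Type, Event) with dates like `Jul-15-2021`. The function names `parse_sel_csv` and `read_sel` are made up for illustration and are not part of the charm or the check.

```python
import csv
import io
import subprocess
from datetime import datetime


def parse_sel_csv(text):
    """Parse comma-separated ipmi-sel output into a list of dicts.

    Assumed column order: ID, Date, Time, Name, Type, Event.
    Lines whose first field is not a numeric record ID are skipped.
    """
    entries = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 6 or not row[0].strip().isdigit():
            continue  # header, blank line, or malformed line
        record_id, date, time_, name, type_ = (c.strip() for c in row[:5])
        # The event text may itself contain commas, so rejoin the remainder.
        event = ",".join(row[5:]).strip()
        try:
            # Assumed date format; timestamp is None if it does not match.
            ts = datetime.strptime(f"{date} {time_}", "%b-%d-%Y %H:%M:%S")
        except ValueError:
            ts = None
        entries.append(
            {
                "id": int(record_id),
                "timestamp": ts,
                "name": name,
                "type": type_,
                "event": event,
            }
        )
    return entries


def read_sel():
    """Run ipmi-sel on the local host and return the parsed entries."""
    result = subprocess.run(
        ["ipmi-sel", "--comma-separated-output", "--no-header-output"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_sel_csv(result.stdout)
```

Calling `read_sel()` on an affected machine and printing `len(entries)` together with each entry's timestamp would show whether freeipmi sees both iBMC entries or only one. If it also returns a single entry, the gap is below the nrpe check (kernel driver, freeipmi, or the BMC itself) rather than in the check's configuration.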