check_ipmi for Huawei iBMC doesn't show all SEL entries

Bug #1938216 reported by Garrett Neugent
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
hw-health-charm | Expired | Undecided | Unassigned |

Bug Description

On a customer cloud with Huawei machines, the NRPE check check_ipmi is alerting incorrectly on SEL entries. Running the check manually returns the following response, which reports one SEL entry:

<user>@<Redacted>:/etc/nagios/nrpe.d$ sudo -u nagios /usr/local/lib/nagios/plugins/check_ipmi.py
CRITICAL: IPMI Status: Critical [1 system event log (SEL) entry present] | 'Inlet Temp'=29.00;~:46.00;~:48.00 'Outlet Temp'=40.00;~:75.00; 'PCH Temp'=61.00;~:86.00; 'CPU1 Core Rem'=50.00 'CPU2 Core Rem'=47.00 'CPU1 DTS'=-45.00;~:-1.00; 'CPU2 DTS'=-48.00;~:-1.00; 'Cpu1 Margin'=-36.00 'Cpu2 Margin'=-38.00 'CPU1 MEM Temp'=40.00;~:95.00; 'CPU2 MEM Temp'=39.00;~:95.00; 'SYS 3.3V'=3.28;;2.96:3.62 'SYS 5V'=5.16;;4.50:5.49 'SYS 12V_1'=12.18;;10.80:13.20 'SYS 12V_2'=12.18;;10.80:13.20 'CPU1 DDR VPP1'=2.54;;2.24:2.74 'CPU1 DDR VPP2'=2.52;;2.24:2.74 'CPU2 DDR VPP1'=2.54;;2.24:2.74 'CPU2 DDR VPP2'=2.54;;2.24:2.74 'FAN1 Speed'=6240.00 'FAN2 Speed'=6240.00 'FAN3 Speed'=6120.00 'FAN4 Speed'=6120.00 'Power'=216.00 'Disks Temp'=35.00 'RAID Temp'=63.00;~:105.00; 'Raid BBU Temp'=33.00;~:65.00; 'Power1'=72.00 'PS1 VIN'=52.00 'PS1 Inlet Temp'=40.00 'Power2'=144.00 'PS2 VIN'=52.00 'PS2 Inlet Temp'=36.00 'CPU1 VCore'=1.78;;1.23:2.04 'CPU2 VCore'=1.78;;1.23:2.04 'CPU1 DDR VDDQ'=1.22;;1.14:1.26 'CPU1 DDR VDDQ2'=1.22;;1.14:1.26 'CPU2 DDR VDDQ'=1.22;;1.14:1.26 'CPU2 DDR VDDQ2'=1.22;;1.14:1.26 'CPU1 VDDQ Temp'=39.00;~:120.00; 'CPU2 VDDQ Temp'=39.00;~:120.00; 'CPU1 VRD Temp'=46.00;~:120.00; 'CPU2 VRD Temp

However, checking the iBMC directly, we see two entries, the latest from twelve days ago. It doesn't make sense that we're only now getting an alert for it (around 2021-07-27 19:12:07 UTC):

iBMC security log has reached 90% space capacity. 2021-07-15 10:58:18 Deasserted
iBMC event records are cleared. 2021-07-09 09:27:22 Asserted

Note that I verified via Nagios that no recently expired downtime could have caused this.

In summary:

1) our IPMI configuration currently doesn't retrieve all SEL events from the BMC
2) it's alerting incorrectly, well after the SEL entries were generated on the BMC.

While we're working here, it might also be nice to increase the cron job frequency so that alerts aren't fired until we have multiple checks to compare.
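
The "multiple checks to compare" idea could look roughly like the sketch below. It is only an illustration, not how the charm works today: the class name `ConsecutiveAlert` and the default threshold of 3 are made up for the example. It reports an alert only once the last `threshold` check results were all critical.

```python
from collections import deque


class ConsecutiveAlert:
    """Hypothetical helper: alert only after N consecutive critical results."""

    def __init__(self, threshold=3):
        if threshold < 1:
            raise ValueError("threshold must be >= 1")
        self.threshold = threshold
        # Keep only the most recent `threshold` results.
        self.recent = deque(maxlen=threshold)

    def update(self, is_critical):
        """Record one check result; return True if an alert should fire."""
        self.recent.append(bool(is_critical))
        # Fire only when the window is full and every result in it is critical.
        return len(self.recent) == self.threshold and all(self.recent)
```

With a threshold of 3, a single critical result from one cron run would not alert on its own; three in a row would.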

Andrea Ieri (aieri) wrote :

Are the extra SEL entries visible when running ipmi-sel manually on the host?
This could be a bug in the kernel driver or in freeipmi.
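
One way to answer this could be to compare the count reported by the check with the number of rows `ipmi-sel` prints on the host. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the plural "entries" wording in the regex is a guess based on the singular form in the check output above, and the row parser assumes `ipmi-sel` prints one entry per line with a numeric record ID as the first `|`-separated field.

```python
import re
import subprocess

# Assumption: the plugin reports the SEL count in this form; only the
# singular "entry" wording is confirmed by the output in this report.
SEL_COUNT_RE = re.compile(
    r"\[(\d+) system event log \(SEL\) entr(?:y|ies) present\]"
)


def reported_sel_count(check_output):
    """Return the SEL count from the check's status line, or None."""
    match = SEL_COUNT_RE.search(check_output)
    return int(match.group(1)) if match else None


def count_sel_rows(ipmi_sel_output):
    """Count lines whose first '|'-separated field is a numeric ID.

    Assumption: header and blank lines do not start with a number.
    """
    count = 0
    for line in ipmi_sel_output.splitlines():
        first_field = line.split("|", 1)[0].strip()
        if first_field.isdigit():
            count += 1
    return count


def fetch_sel_output():
    """Run ipmi-sel on the host (typically needs root) and return stdout."""
    result = subprocess.run(
        ["ipmi-sel"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

If `count_sel_rows(fetch_sel_output())` also returns 1 while the iBMC shows two entries, the discrepancy is below the check itself (kernel driver, freeipmi, or the BMC); if it returns 2, the check is the more likely culprit.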

Changed in charm-hw-health:
status: New → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for hw-health-charm because there has been no activity for 60 days.]

Changed in charm-hw-health:
status: Incomplete → Expired