IPMI check never goes into Critical status in Nagios

Bug #1882978 reported by Márton Kiss
This bug affects 1 person
Affects: hw-health-charm
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

The refactor in https://git.launchpad.net/charm-hw-health/commit/?id=6c29dde4a3f29fc7137619dea3cfe399f196dd79 breaks the creation of the /var/lib/nagios/ipmi_sensors.out file by the /usr/local/lib/nagios/plugins/cron_ipmi_sensors.py scheduled job.

As a result, Nagios always shows the OK status, even when the content of ipmi_sensors.out reports a Critical status:

$ /usr/local/lib/nagios/plugins/check_ipmi_sensor
IPMI Status: Critical [149 system event log (SEL) entries present] | 'Current Power'=216 'CPU1 Temp'=43.00;10.00:91.00;5.00:96.00 'CPU2 Temp'=47.00;10.00:91.00;5.00:96.00 'PCH Temp'=54.00;10.00:85.00;5.00:90.00 'System Temp'=24.00;10.00:80.00;5.00:85.00 'Peripheral Temp'=40.00;...

$ /usr/local/lib/nagios/plugins/cron_ipmi_sensors.py --noentityabsent   # generates the ipmi_sensors.out file

$ /usr/local/lib/nagios/plugins/check_ipmi.py
OK: IPMI Status: Critical [149 system event log (SEL) entries present] | 'Current Power'=253 'CPU1...

For the last command we would expect the following output:
CRITICAL: IPMI Status: Critical [149 system event log (SEL) entries present] | 'Current Power'=253 'CPU1...

The root cause of the problem is the following logic in cron_ipmi_sensors.py:
1. output = subprocess.check_output(cmdline).decode('utf8') raises an exception because the command exits with return code 2.
2. The exception handler writes fd.write('{}: {}'.format(NAGIOS_ERRORS[error.returncode], output)).
3. The exception block then ends without leaving the script (no sys.exit or similar).
4. The file is written out *again*, this time without the Nagios error string derived from the return code.
See line 48 in the original code: https://pastebin.canonical.com/p/F4gHYGMmgS/
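
For anyone who wants to reproduce the flow without IPMI hardware, here is a minimal sketch of the control flow described above. It is not the actual charm code: the NAGIOS_ERRORS mapping, the buggy_write function, the use of error.output, and the fake_check stand-in command are all assumptions made for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Assumed mapping; the real NAGIOS_ERRORS in cron_ipmi_sensors.py may differ.
NAGIOS_ERRORS = {1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def buggy_write(cmdline, out_path):
    """Mimic the reported flow: the except block neither returns nor exits."""
    try:
        output = subprocess.check_output(cmdline).decode("utf8")
    except subprocess.CalledProcessError as error:
        # check_output raises on a non-zero exit; stdout is kept in error.output.
        output = error.output.decode("utf8")
        with open(out_path, "w") as fd:
            fd.write("{}: {}".format(NAGIOS_ERRORS[error.returncode], output))
    # This write also runs on the error path, so the prefixed content is lost.
    with open(out_path, "w") as fd:
        fd.write(output)


# Hypothetical stand-in for check_ipmi_sensor: print a status, exit with code 2.
fake_check = [
    sys.executable,
    "-c",
    "import sys; print('IPMI Status: Critical'); sys.exit(2)",
]

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "ipmi_sensors.out"
    buggy_write(fake_check, out_file)
    buggy_result = out_file.read_text()

# The file content has no "CRITICAL: " prefix even though the command exited with 2.
print(repr(buggy_result))
```

With this flow the second write always wins, which would explain why the file never carries the status prefix.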

I suggest writing the ipmi_sensors.out content without the error prefix only when no error happened.
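
A rough sketch of what that suggestion could look like follows. This is not the merged patch, only one possible shape: write_sensor_output, the NAGIOS_ERRORS mapping, the fallback to "UNKNOWN", and the fake commands are assumptions for illustration. The file is written exactly once, and the prefix is added only on the error path.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Assumed mapping; the real NAGIOS_ERRORS in cron_ipmi_sensors.py may differ.
NAGIOS_ERRORS = {1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def write_sensor_output(cmdline, out_path):
    """Build the content first, then write the file a single time."""
    try:
        content = subprocess.check_output(cmdline).decode("utf8")
    except subprocess.CalledProcessError as error:
        # Non-zero exit: prefix the captured output with the Nagios status name.
        prefix = NAGIOS_ERRORS.get(error.returncode, "UNKNOWN")
        content = "{}: {}".format(prefix, error.output.decode("utf8"))
    with open(out_path, "w") as fd:
        fd.write(content)


def fake_check(status, code):
    """Hypothetical stand-in for check_ipmi_sensor with a chosen exit code."""
    script = "import sys; print('IPMI Status: {}'); sys.exit({})".format(status, code)
    return [sys.executable, "-c", script]


with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "ipmi_sensors.out"

    write_sensor_output(fake_check("Critical", 2), out_file)
    failing_result = out_file.read_text()

    write_sensor_output(fake_check("OK", 0), out_file)
    ok_result = out_file.read_text()

print(repr(failing_result))
print(repr(ok_result))
```

Building the string before opening the file keeps a single write path, so the error and success branches cannot overwrite each other.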

Márton Kiss (marton-kiss) wrote :

This is the patch to fix the file write logic:
https://pastebin.canonical.com/p/5R58vjsNNh/

I'm going to create a proper merge request.

Márton Kiss (marton-kiss) wrote :

Paul Goins (vultaire) wrote :

I think the intended merge request link was: https://code.launchpad.net/~marton-kiss/charm-hw-health/+git/charm-hw-health/+merge/385516

The fix should now be available via the promulgated charm, i.e. cs:hw-health-2.

Changed in charm-hw-health:
status: New → Fix Released