Comment 37 for bug 1318551

Whenever you see NMIs on ProLiant servers, check the NMI code that ILO records in the Integrated Management Log (IML):

Translate the code using this table:

00h (0x00000000) No source found
01h (0x00000001) Uncorrectable Memory Error
1Bh (0x0000001B) ASR NMI
20h (0x00000020) PCI Parity Error
27h (0x00000027) NMI Button Press
28h (0x00000028) SB_BUS_NMI
29h (0x00000029) ILO Doorbell NMI
2Ah (0x0000002A) ILO IOP NMI
2Bh (0x0000002B) ILO Watchdog NMI
2Ch (0x0000002C) Proc Throt NMI
2Dh (0x0000002D) Front Side Bus NMI
2Fh (0x0000002F) PCI Express Error
30h (0x00000030) DMA controller NMI
31h (0x00000031) Hypertransport/CSI Error
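If you check these codes often, the table can be wrapped in a small lookup helper. This is a minimal bash sketch; the function name `nmi_desc` is hypothetical, and it only knows the codes listed above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate a ProLiant NMI source code, as reported by
# ILO, into its meaning. Accepts 0x0000002B, 2Bh or 2b style input.
# Bash is required because of the $((16#...)) base conversion.
nmi_desc() {
    local code=$1
    code=${code#0[xX]}   # drop a leading 0x / 0X
    code=${code%[hH]}    # drop a trailing h / H
    # Normalise to two upper-case hex digits (0000002b -> 2B).
    code=$(printf '%02X' "$((16#$code))")
    case $code in
        00) echo "No source found" ;;
        01) echo "Uncorrectable Memory Error" ;;
        1B) echo "ASR NMI" ;;
        20) echo "PCI Parity Error" ;;
        27) echo "NMI Button Press" ;;
        28) echo "SB_BUS_NMI" ;;
        29) echo "ILO Doorbell NMI" ;;
        2A) echo "ILO IOP NMI" ;;
        2B) echo "ILO Watchdog NMI" ;;
        2C) echo "Proc Throt NMI" ;;
        2D) echo "Front Side Bus NMI" ;;
        2F) echo "PCI Express Error" ;;
        30) echo "DMA controller NMI" ;;
        31) echo "Hypertransport/CSI Error" ;;
        *)  echo "Unknown NMI code ($code)" ;;
    esac
}
```

For example, `nmi_desc 0x0000002B` prints `ILO Watchdog NMI`. Input that is not valid hex makes bash report an arithmetic error.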

For example, if you get an IML entry like:

"76  Critical  System Error  03/12/2015 12:42  03/12/2015 12:07  2  An Unrecoverable System Error (NMI) has occurred (System error code 0x0000002B, 0x00000000)"

You are facing an ILO Watchdog NMI (code 2Bh): the ILO watchdog countdown was started, and nothing refreshed the timer before it expired.
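If you have IML entries as text (for example copied from the ILO web page), the first system error code, which is the one to look up in the table, can be pulled out with sed. A minimal sketch, assuming the message wording matches the entry above; the function name `iml_error_code` is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read IML text on stdin and print the first
# "System error code" value (e.g. 0x0000002B). Lines without one are ignored.
iml_error_code() {
    sed -n 's/.*System error code \(0x[0-9A-Fa-f]\{8\}\).*/\1/p' | head -n 1
}
```

For the entry above this prints `0x0000002B`.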

The hpwdt driver starts the ILO watchdog countdown whenever /dev/watchdog is opened (corosync/pacemaker do this, for example), and ILO sends an NMI once the countdown reaches zero, for example because the process holding /dev/watchdog stops refreshing the ILO timer properly.
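Before blacklisting anything, it may help to confirm that hpwdt is actually loaded. A minimal sketch, assuming a Linux host where hpwdt is built as a module and listed in /proc/modules; the function name `hpwdt_loaded` is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the hpwdt module appears in the module list.
# The list file defaults to /proc/modules; another path can be passed in.
hpwdt_loaded() {
    local modules=${1:-/proc/modules}
    # The first column of /proc/modules is the module name.
    awk '$1 == "hpwdt" { found = 1 } END { exit !found }' "$modules"
}
```

If `hpwdt_loaded && echo loaded` prints `loaded`, running `fuser -v /dev/watchdog` as root (fuser comes from psmisc) shows which process currently has the watchdog device open.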

Apart from running the HP-ASRD daemon, which refreshes the counter regularly, the workaround is to blacklist the hpwdt module:

# echo "blacklist hpwdt" >> /etc/modprobe.d/blacklist-hp.conf
# update-initramfs -k all -u
# update-grub
# reboot
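Running the `echo ... >>` step twice adds a duplicate line. A minimal idempotent variant, as a sketch; the function name `add_blacklist` is hypothetical, and the file path defaults to the one used above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: append "blacklist <module>" to a modprobe.d file only if
# that exact line is not already there, so re-running it is harmless.
add_blacklist() {
    local module=$1
    local conf=${2:-/etc/modprobe.d/blacklist-hp.conf}
    # grep exits non-zero if the line is missing or the file does not exist yet.
    grep -qxF "blacklist $module" "$conf" 2>/dev/null ||
        echo "blacklist $module" >> "$conf"
}
```

Run `add_blacklist hpwdt` as root in place of the echo, then continue with the initramfs and grub steps. After the reboot, `lsmod | grep hpwdt` should normally print nothing.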

Please give feedback.