HP Proliant Servers - Kernel Panic - NMI - DL360 & DL380 - HPWDT module loaded
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
linux (Ubuntu) | | High | Andy Whitcroft | |
Precise | | High | Andy Whitcroft | |
Trusty | | High | Andy Whitcroft | |
Utopic | | High | Andy Whitcroft | |
Bug Description
Several situations were brought to my attention in which users were facing kernel panics while the machine was apparently idle (on some HP ProLiant servers such as the DL360 and DL380).
ILO:
"76 | Critical | System Error | 03/12/2015 12:42 | 03/12/2015 12:07 | 2 | An Unrecoverable System Error (NMI) has occurred (System error code 0x0000002B, 0x00000000)"
Examples:
PID: 0 TASK: ffffffff81c1a480 CPU: 0 COMMAND: "swapper/0"
#0 [ffff88085fc05c88] machine_kexec at ffffffff8104eac2
#1 [ffff88085fc05cd8] crash_kexec at ffffffff810f26a3
#2 [ffff88085fc05da0] panic at ffffffff8175b3f2
#3 [ffff88085fc05e20] sched_clock at ffffffff8101c3b9
#4 [ffff88085fc05e30] nmi_handle at ffffffff810170e8
#5 [ffff88085fc05e90] io_check_error at ffffffff8101758e
#6 [ffff88085fc05eb0] default_do_nmi at ffffffff810176a9
#7 [ffff88085fc05ed8] do_nmi at ffffffff810177d8
#8 [ffff88085fc05ef0] end_repeat_nmi at ffffffff8176da21
[exception RIP: native_safe_halt+6]
RIP: ffffffff81055186 RSP: ffffffff81c03e90 RFLAGS: 00000246
RAX: 0000000000000010 RBX: 0000000000000010 RCX: 0000000000000246
RDX: ffffffff81c03e90 RSI: 0000000000000018 RDI: 0000000000000001
RBP: ffffffff81055186 R8: ffffffff81055186 R9: 0000000000000018
R10: ffffffff81c03e90 R11: 0000000000000246 R12: ffffffffffffffff
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 0000000000000000 CS: 0010 SS: 0018
--- <DOUBLEFAULT exception stack> ---
#9 [ffffffff81c03e90] native_safe_halt at ffffffff81055186
#10 [ffffffff81c03e98] default_idle at ffffffff8101d37f
#11 [ffffffff81c03eb8] arch_cpu_idle at ffffffff8101dcaf
#12 [ffffffff81c03ec8] cpu_startup_entry at ffffffff810b5325
#13 [ffffffff81c03f40] rest_init at ffffffff81751a37
#14 [ffffffff81c03f50] start_kernel at ffffffff81d320b7
#15 [ffffffff81c03f90] x86_64_
#16 [ffffffff81c03fa0] x86_64_start_kernel at ffffffff81d31733
OR
PID: 0 TASK: ffffffff81c14440 CPU: 0 COMMAND: "swapper/0"
#0 [ffff880fffa07c40] machine_kexec at ffffffff8104b391
#1 [ffff880fffa07cb0] crash_kexec at ffffffff810d5fb8
#2 [ffff880fffa07d80] panic at ffffffff81730335
#3 [ffff880fffa07e00] hpwdt_pretimeout at ffffffffa02378b5 [hpwdt]
#4 [ffff880fffa07e20] nmi_handle at ffffffff8174a76a
#5 [ffff880fffa07ea0] default_do_nmi at ffffffff8174aacd
#6 [ffff880fffa07ed0] do_nmi at ffffffff8174abe0
#7 [ffff880fffa07ef0] end_repeat_nmi at ffffffff81749c81
[exception RIP: intel_idle+204]
RIP: ffffffff813f07ec RSP: ffffffff81c01d88 RFLAGS: 00000046
RAX: 0000000000000010 RBX: 0000000000000010 RCX: 0000000000000046
RDX: ffffffff81c01d88 RSI: 0000000000000018 RDI: 0000000000000001
RBP: ffffffff813f07ec R8: ffffffff813f07ec R9: 0000000000000018
R10: ffffffff81c01d88 R11: 0000000000000046 R12: ffffffffffffffff
R13: 0000000001c0d000 R14: ffffffff81c01fd8 R15: 0000000000000000
ORIG_RAX: 0000000000000000 CS: 0010 SS: 0018
--- <NMI exception stack> ---
#8 [ffffffff81c01d88] intel_idle at ffffffff813f07ec
#9 [ffffffff81c01dc0] cpuidle_enter_state at ffffffff815e76cf
After investigating all of the idle-time panics and various kernel dump files, in which most of the CPUs were either in MWAIT or "relaxing", we discovered that the hpwdt module was loaded and that corosync was opening the /dev/watchdog file. Opening the file armed the iLO watchdog timer, and corosync was then not updating it as frequently as iLO expected.
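To make the failure mode concrete: under the Linux watchdog API, opening the device starts the timer, every write counts as a keepalive, and writing the character 'V' just before closing asks the driver to disarm it ("magic close"). The sketch below only illustrates that contract. The function name pet_watchdog and its arguments are made up for this example, and whether magic close is honoured depends on the driver and on its nowayout setting.

```shell
# pet_watchdog DEVICE INTERVAL COUNT
# Hypothetical helper: hold DEVICE open, send COUNT keepalives that are
# INTERVAL seconds apart, then request a clean disarm before closing.
pet_watchdog() {
    dev=$1 interval=$2 count=$3
    {
        i=0
        while [ "$i" -lt "$count" ]; do
            printf '.' >&3        # any write is treated as a keepalive
            sleep "$interval"
            i=$((i + 1))
        done
        printf 'V' >&3            # magic close: ask the driver to stop the timer
    } 3>"$dev"                    # fd 3 stays open for the whole group
}
```

Something like `pet_watchdog /dev/watchdog 10 6` would keep the timer fed for about a minute. A program that opens the device and then stops writing, or writes less often than the configured timeout, ends up with the NMI and panic shown in the traces above.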
As described in /etc/modprobe.
"""
# Watchdog drivers should not be loaded automatically, but only if a
# watchdog daemon is installed.
"""
We should blacklist module "hpwdt" by default for all Ubuntu versions.
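To see whether a given system already blacklists the module, a check along these lines may help. It is a minimal sketch: the function name is_blacklisted is made up for illustration, and the default search directories are an assumption about a typical Ubuntu layout, so pass the directories explicitly if yours differ.

```shell
# is_blacklisted MODULE [DIR...]
# Hypothetical helper: succeed if any *.conf file in the given
# modprobe.d-style directories contains a "blacklist MODULE" line.
is_blacklisted() {
    module=$1
    shift
    # Assumed default locations; override by passing directories.
    [ "$#" -gt 0 ] || set -- /etc/modprobe.d /lib/modprobe.d
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        # -s keeps grep quiet when the glob matches no files.
        if grep -Eqs "^[[:space:]]*blacklist[[:space:]]+${module}[[:space:]]*\$" "$dir"/*.conf; then
            return 0
        fi
    done
    return 1
}
```

For example, `is_blacklisted hpwdt && echo "hpwdt is blacklisted"`.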
tags: | added: cts |
Changed in linux (Ubuntu): | |
assignee: | nobody → Rafael David Tinoco (inaddy) |
assignee: | Rafael David Tinoco (inaddy) → nobody |
status: | New → Incomplete |
status: | Incomplete → Confirmed |
summary: |
- HP Proliant Servers should not have HPWDT module loaded automatically
+ HP Proliant Servers - Kernel Panic NMI - DL360 & DL380 - HPWDT module loaded |
summary: |
- HP Proliant Servers - Kernel Panic NMI - DL360 & DL380 - HPWDT module loaded
+ HP Proliant Servers - Kernel Panic - NMI - DL360 & DL380 - HPWDT module loaded |
description: | updated |
description: | updated |
Changed in linux (Ubuntu): | |
status: | Confirmed → In Progress |
importance: | Undecided → High |
assignee: | nobody → Andy Whitcroft (apw) |
milestone: | none → ubuntu-15.03 |
Changed in linux (Ubuntu Precise): | |
status: | New → In Progress |
Changed in linux (Ubuntu Trusty): | |
status: | New → In Progress |
Changed in linux (Ubuntu Utopic): | |
status: | New → In Progress |
Changed in linux (Ubuntu Precise): | |
importance: | Undecided → High |
Changed in linux (Ubuntu Trusty): | |
importance: | Undecided → High |
Changed in linux (Ubuntu Utopic): | |
importance: | Undecided → High |
Changed in linux (Ubuntu Precise): | |
assignee: | nobody → Andy Whitcroft (apw) |
Changed in linux (Ubuntu Trusty): | |
assignee: | nobody → Andy Whitcroft (apw) |
Changed in linux (Ubuntu Utopic): | |
assignee: | nobody → Andy Whitcroft (apw) |
Changed in linux (Ubuntu): | |
status: | In Progress → Fix Committed |
Rafael David Tinoco (inaddy) wrote : | #2 |
Andy Whitcroft (apw) wrote : | #3 |
Put together a generic solution which blacklists all WDT modules by default, as they all suffer from the same issue with respect to NMIs if the user does not know exactly what they are doing. Pushed patches to kernel-team@ for review.
Changed in linux (Ubuntu Utopic): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Trusty): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Precise): | |
status: | In Progress → Fix Committed |
Edward Bustos (edward-bustos) wrote : | #4 |
Per Linda Knippers (HP) :
Why are you blacklisting the watchdog timers? It seems like if corosync wants to use them, which is why it would open /dev/watchdog, then there's either a corosync bug or there's something in the configuration that isn't right. Is anyone looking into that?
For cluster configurations, you probably really do want a watchdog so that hung systems can crash, reboot and rejoin the cluster.
And what about non-corosync configurations? Other distros run the watchdog timers just fine. Blacklisting the watchdog timer just hides underlying problems.
Edward Bustos (edward-bustos) wrote : | #5 |
Per Dan Zink (HP FW/BIOS):
I agree with Linda. There is a problem with the application or how the application uses the watchdog. Removing the watchdog is not a proper solution. I would think this issue is for Canonical to investigate.
In addition, I think there is a second problem here. When the watchdog fires, I expect that we should provide the end user good diagnostic information so they know it was a watchdog timeout. That did not happen here which has delayed root cause. This seems to be a kernel/
Rafael David Tinoco (inaddy) wrote : | #6 |
Sorry, there is a misunderstanding regarding the case and this bug.
This is not the ANSWER for the reported bug, just a clarification of
what the kernel team had decided to do well before this case. All
watchdogs are blacklisted by default in Ubuntu and can be enabled if
needed (for example, in a case where corosync wants to rely on a HW
watchdog to make sure that there are no split brains and things
like that).
Per kernel team comments (on kernel-team mailing list):
"""
We have been seeing random crashes from various HP systems, this has
been tracked to loading of the hpwdt watchdog modules. Basically these
modules are a loaded gun and unless you know exactly what you are doing
you are likely to take off your own head. For this reason we already
blacklist "all" of these modules in kmod/module-
Unfortunately these have not been kept in sync with the kernel leading to
the module loading.
"""
This is actually not a resolution for this particular case, but a bug
fix (for the earlier decision to blacklist them all).
Of course we shall recommend the HW watchdog interface for 2 node
cluster setups, for example, when we can't rely on quorum policies and
fencing mechanisms are not available (like external network for
powering nodes down and things like that).
Regarding the usage of watchdog on top of corosync and
synchronization, yes I agree... this is something I'll pursue.
Launchpad Janitor (janitor) wrote : | #7 |
This bug was fixed in the package linux - 3.19.0-10.10
---------------
linux (3.19.0-10.10) vivid; urgency=low
[ Andy Whitcroft ]
* [Packaging] control -- make element ordering deterministic
* [Config] allow dracult to support initramfs as well
- LP: #1109029
* [Packaging] generate live watchdog blacklists
- LP: #1432837
[ Leann Ogasawara ]
* [Config] CONFIG_
- LP: #1397860
* rebase to v3.19.2
[ Upstream Kernel Changes ]
* thinkpad_acpi: support new BIOS version string pattern
- LP: #1417915
* arm64: Invalidate the TLB corresponding to intermediate page table
levels
- LP: #1432546
* perf tools: Support parsing parameterized events
- LP: #1430341
* perf tools: Extend format_alias() to include event parameters
- LP: #1430341
* perf Documentation: Add event parameters
- LP: #1430341
* perf tools: Document parameterized and symbolic events
- LP: #1430341
* perf: provide sysfs_show for struct perf_pmu_
- LP: #1430341
* perf: add PMU_EVENT_
- LP: #1430341
* perf: define EVENT_DEFINE_
- LP: #1430341
* powerpc/
- LP: #1430341
* powerpc/
annotated
- LP: #1430341
* powerpc/
- LP: #1430341
* powerpc/
- LP: #1430341
* powerpc/iommu: Remove IOMMU device references via bus notifier
- LP: #1425202
* powerpc/pseries: Fix endian problems with LE migration
- LP: #1428351
* intel_idle: support additional Broadwell model
- LP: #1400970
* tools/power turbostat: support additional Broadwell model
- LP: #1400970
* KVM: x86: flush TLB when D bit is manually changed.
- LP: #1397860
* Optimize TLB flush in kvm_mmu_
- LP: #1397860
* KVM: Add generic support for dirty page logging
- LP: #1397860
* KVM: x86: switch to kvm_get_
- LP: #1397860
* KVM: Rename kvm_arch_
log dirty
- LP: #1397860
* KVM: MMU: Add mmu help functions to support PML
- LP: #1397860
* KVM: MMU: Explicitly set D-bit for writable spte.
- LP: #1397860
* KVM: x86: Change parameter of kvm_mmu_
- LP: #1397860
* KVM: x86: Add new dirty logging kvm_x86_ops for PML
- LP: #1397860
* KVM: VMX: Add PML support in VMX
- LP: #1397860
* HID: multitouch: add support of clickpads
* HID: multitouch: Add support for button type usage
[ Upstream Kernel Changes ]
* rebase to v3.19.2
- LP: #1428947
-- Andy Whitcroft <email address hidden> Mon, 23 Mar 2015 15:28:16 +0000
Changed in linux (Ubuntu): | |
status: | Fix Committed → Fix Released |
Brad Figg (brad-figg) wrote : | #8 |
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-
If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.
See https:/
tags: | added: verification-needed-precise |
tags: | added: verification-needed-trusty |
tags: | added: verification-needed-utopic |
Rafael David Tinoco (inaddy) wrote : | #11 |
Doing verification right now... will update the case soon.
Rafael David Tinoco (inaddy) wrote : | #12 |
Checked /lib/modprobe.
Checking on a ProLiant server:
root@hertz:~# dmidecode | grep -i proliant
Product Name: ProLiant DL360e Gen8
Family: ProLiant
root@hertz:~# uname -a
Linux hertz 3.16.0-31-generic #41-Ubuntu SMP Tue Feb 10 15:24:04 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
root@hertz:~# lsmod | grep hpwdt
hpwdt 14257 0
root@hertz:~# uname -a
Linux hertz 3.16.0-34-generic #45-Ubuntu SMP Mon Mar 23 17:21:27 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
root@hertz:~# lsmod | grep -i hpwdt
Awesome job. Thank you!!!
tags: |
added: verification-done removed: verification-needed-precise verification-needed-trusty verification-needed-utopic |
Launchpad Janitor (janitor) wrote : | #13 |
This bug was fixed in the package linux - 3.13.0-49.81
---------------
linux (3.13.0-49.81) trusty; urgency=low
[ Kamal Mostafa ]
* Release Tracking Bug
- LP: #1436016
[ Alex Hung ]
* SAUCE: ACPI / blacklist: blacklist Win8 OSI for HP Pavilion dv6
- LP: #1416940
[ Andy Whitcroft ]
* [Packaging] generate live watchdog blacklists
- LP: #1432837
[ Ben Widawsky ]
* SAUCE: i915_bdw: drm/i915/bdw: enable eDRAM.
- LP: #1430855
[ Chris J Arges ]
* [Config] Add ibmvfc to d-i
- LP: #1416001
[ Seth Forshee ]
* [Config] updateconfigs - enable X86_UP_APIC_MSI
[ Upstream Kernel Changes ]
* net: add sysfs helpers for netdev_adjacent logic
- LP: #1410852
* net: Mark functions as static in core/dev.c
- LP: #1410852
* net: rename sysfs symlinks on device name change
- LP: #1410852
* btrfs: fix null pointer dereference in clone_fs_devices when name is
null
- LP: #1429804
* cdc-acm: add sanity checks
- LP: #1413992
* x86: thinkpad_acpi.c: fixed spacing coding style issue
- LP: #1417915
* thinkpad_acpi: support new BIOS version string pattern
- LP: #1417915
* net: sctp: fix slab corruption from use after free on INIT collisions
- LP: #1416506
- CVE-2015-1421
* ipv4: try to cache dst_entries which would cause a redirect
- LP: #1420027
- CVE-2015-1465
* x86, mm/ASLR: Fix stack randomization on 64-bit systems
- LP: #1423757
- CVE-2015-1593
* net: llc: use correct size for sysctl timeout entries
- LP: #1425271
- CVE-2015-2041
* net: rds: use correct size for max unacked packets and bytes
- LP: #1425274
- CVE-2015-2042
* Btrfs: clear compress-force when remounting with compress option
- LP: #1434183
* ext4: merge uninitialized extents
- LP: #1430184
* btrfs: filter invalid arg for btrfs resize
- LP: #1435441
* Bluetooth: Add firmware update for Atheros 0cf3:311f
* Bluetooth: btusb: Add IMC Networks (Broadcom based)
* Bluetooth: sort the list of IDs in the source code
* Bluetooth: append new supported device to the list [0b05:17d0]
* Bluetooth: Add support for Intel bootloader devices
* Bluetooth: Ignore isochronous endpoints for Intel USB bootloader
* Bluetooth: Add support for Acer [13D3:3432]
* Bluetooth: Add support for Broadcom device of Asus Z97-DELUXE
motherboard
* Add a new PID/VID 0227/0930 for AR3012.
* Bluetooth: Add support for Acer [0489:e078]
* Bluetooth: Add USB device 04ca:3010 as Atheros AR3012
* x86: mm: move mmap_sem unlock from mm_fault_error() to caller
* vm: add VM_FAULT_SIGSEGV handling support
* vm: make stack guard page errors return VM_FAULT_SIGSEGV rather than
SIGBUS
* spi/pxa2xx: Clear cur_chip pointer before starting next message
* spi: dw: Fix detecting FIFO depth
* spi: dw-mid: fix FIFO size
* ASoC: wm8960: Fix capture sample rate from 11250 to 11025
* regulator: core: fix race condition in regulator_put()
* ASoC: omap-mcbsp: Correct CBM_CFS dai format configuration
* can: c_can: end pending transmission on network stop (ifdown)
* nfs: fix dio deadlock when O_DIRECT flag is flipped
* NFSv4.1: Fix an Oops in nfs41_...
Changed in linux (Ubuntu Trusty): | |
status: | Fix Committed → Fix Released |
Launchpad Janitor (janitor) wrote : | #14 |
This bug was fixed in the package linux - 3.2.0-80.116
---------------
linux (3.2.0-80.116) precise; urgency=low
[ Brad Figg ]
* Release Tracking Bug
- LP: #1435392
[ Andy Whitcroft ]
* [Packaging] generate live watchdog blacklists
- LP: #1432837
[ Upstream Kernel Changes ]
* Drivers: hv: vmbus: incorrect device name is printed when child device
is unregistered
- LP: #1417313
* x86, mm/ASLR: Fix stack randomization on 64-bit systems
- LP: #1423757
- CVE-2015-1593
* net: llc: use correct size for sysctl timeout entries
- LP: #1425271
- CVE-2015-2041
* net: rds: use correct size for max unacked packets and bytes
- LP: #1425274
- CVE-2015-2042
* PCI: quirks: Fix backport of quirk_io()
- LP: #1434639
* MIPS: IRQ: Fix disable_irq on CPU IRQs
- LP: #1434639
* ASoC: atmel_ssc_dai: fix start event for I2S mode
- LP: #1434639
* ALSA: ak411x: Fix stall in work callback
- LP: #1434639
* lib/checksum.c: fix carry in csum_tcpudp_nofold
- LP: #1434639
* lib/checksum.c: fix build for generic csum_tcpudp_nofold
- LP: #1434639
* caif: remove wrong dev_net_set() call
- LP: #1434639
* MIPS: Fix kernel lockup or crash after CPU offline/online
- LP: #1434639
* gpio: sysfs: fix memory leak in gpiod_export_link
- LP: #1434639
* gpio: sysfs: fix memory leak in gpiod_sysfs_
- LP: #1434639
* net: sctp: fix passing wrong parameter header to param_type2af in
sctp_
- LP: #1434639
* mm: pagewalk: call pte_hole() for VM_PFNMAP during walk_page_range
- LP: #1434639
* nilfs2: fix deadlock of segment constructor over I_SYNC flag
- LP: #1434639
* staging: comedi: cb_pcidas64: fix incorrect AI range code handling
- LP: #1434639
* media/rc: Send sync space information on the lirc device
- LP: #1434639
* sched/rt: Reduce rq lock contention by eliminating locking of
non-feasible target
- LP: #1434639
* time: adjtimex: Validate the ADJ_FREQUENCY values
- LP: #1434639
* ntp: Fixup adjtimex freq validation on 32-bit systems
- LP: #1434639
* ipv6: fib: fix fib dump restart
- LP: #1434639
* ipv6: fib: fix fib dump restart
- LP: #1434639
* Bluetooth: ath3k: workaround the compatibility issue with xHCI
controller
- LP: #1400215, #1434639
* Linux 3.2.68
- LP: #1434639
* KVM: nVMX: Fix content of MSR_IA32_
- LP: #1431473
-- Brad Figg <email address hidden> Mon, 23 Mar 2015 08:41:45 -0700
Changed in linux (Ubuntu Precise): | |
status: | Fix Committed → Fix Released |
Launchpad Janitor (janitor) wrote : | #15 |
This bug was fixed in the package linux - 3.16.0-34.45
---------------
linux (3.16.0-34.45) utopic; urgency=low
[ Luis Henriques ]
* Release Tracking Bug
- LP: #1435400
[ Andy Whitcroft ]
* [Packaging] generate live watchdog blacklists
- LP: #1432837
[ Chris J Arges ]
* [Config] Add ibmvfc to d-i
- LP: #1416001
[ John Johansen ]
* SAUCE: (no-up): apparmor: fix mediation of fs unix sockets
- LP: #1408833
[ Seth Forshee ]
* [Config] updateconfigs - enable X86_UP_APIC_MSI
[ Upstream Kernel Changes ]
* cdc-acm: add sanity checks
- LP: #1413992
* x86: thinkpad_acpi.c: fixed spacing coding style issue
- LP: #1417915
* thinkpad_acpi: support new BIOS version string pattern
- LP: #1417915
* powernv: Use _GLOBAL_TOC for opal wrappers
- LP: #1431196
* Btrfs: clear compress-force when remounting with compress option
- LP: #1434183
* Btrfs: send, don't delay dir move if there's a new parent inode
- LP: #1434223
* [media] em28xx: fix em28xx-input removal
- LP: #1434595
* [media] em28xx: ensure "closing" messages terminate with a newline
- LP: #1434595
* [media] em28xx-input: fix missing newlines
- LP: #1434595
* [media] em28xx-core: fix missing newlines
- LP: #1434595
* [media] em28xx-audio: fix missing newlines
- LP: #1434595
* [media] em28xx-audio: fix missing newlines
- LP: #1434595
* [media] em28xx-dvb: fix missing newlines
- LP: #1434595
* [media] em28xx-video: fix missing newlines
- LP: #1434595
* ARM: pxa: add regulator_
- LP: #1434595
* ARM: pxa: add regulator_
- LP: #1434595
* ARM: pxa: add regulator_
- LP: #1434595
* hx4700: regulator: declare full constraints
- LP: #1434595
* HID: input: fix confusion on conflicting mappings
- LP: #1434595
* HID: fixup the conflicting keyboard mappings quirk
- LP: #1434595
* ARM: dts: tegra20: fix GR3D, DSI unit and reg base addresses
- LP: #1434595
* megaraid_sas: disable interrupt_mask before enabling hardware
interrupts
- LP: #1434595
* PCI: Generate uppercase hex for modalias var in uevent
- LP: #1434595
* usb: core: buffer: smallest buffer should start at ARCH_DMA_MINALIGN
- LP: #1434595
* tty/serial: at91: enable peripheral clock before accessing I/O
registers
- LP: #1434595
* tty/serial: at91: fix error handling in atmel_serial_
- LP: #1434595
* axonram: Fix bug in direct_access
- LP: #1434595
* btrfs: fix leak of path in btrfs_find_item
- LP: #1434595
* ksoftirqd: Enable IRQs and call cond_resched() before poking RCU
- LP: #1434595
* TPM: Add new TPMs to the tail of the list to prevent inadvertent change
of dev
- LP: #1434595
* char: tpm: Add missing error check for devm_kzalloc
- LP: #1434595
* tpm_tis: verify interrupt during init
- LP: #1434595
* tpm: Fix NULL return in tpm_ibmvtpm_
- LP: #1434595
* tpm/tpm_
- LP: #1434595
* tpm/tpm_
Changed in linux (Ubuntu Utopic): | |
status: | Fix Committed → Fix Released |
Fixing typo from previous comment:
I developed a small tool based on inotify to help users to check if their watchdog is being used.
Anyone can find instructions on how to run it here:
https://github.com/inaddy/notifymydog
Small Example:
inaddy@host:~$ wget https://raw.githubusercontent.com/inaddy/notifymydog/master/notifymydog.c
inaddy@host:~/notifymydog$ gcc -Wall -D_DEBUG=0 -D_SYSLOG=1 notifymydog.c -o notifymydog
inaddy@host:~/notifymydog$ sudo ./notifymydog &
inaddy@host:~$ sudo tail -f /var/log/syslog
Mar 16 17:36:26 inaddygueto WATCHMYDOG[15766]: OK: WATCHDOG UPDATED
Mar 16 17:36:40 inaddygueto WATCHMYDOG[15766]: OK: WATCHDOG UPDATED
Mar 16 17:36:44 inaddygueto WATCHMYDOG[15766]: WARNING: WATCHDOG WAS CLOSED
Mar 16 17:36:49 inaddygueto WATCHMYDOG[15766]: WARNING: WATCHDOG WAS OPENED
So if you ever get a kernel panic on an HP ProLiant DL360 or DL380 server for no apparent reason, and the stack trace shows that an NMI was generated, check whether any of your userland programs has opened /dev/watchdog, either on purpose (but without updating it frequently enough) or by accident. In both cases the watchdog hardware is triggered and panics the machine after some time.
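A quick way to do that check without extra tooling is to walk the per-process descriptor links under /proc. This is only a sketch: the function name watchdog_openers is invented for this example, it normally needs root to see other users' processes, and the device may appear under a different name (for instance /dev/watchdog0) on some kernels, so the target path is a parameter.

```shell
# watchdog_openers [TARGET] [PROCROOT]
# Hypothetical helper: print the PID of every process holding TARGET open,
# by reading the fd symlinks under PROCROOT (defaults to /proc).
watchdog_openers() {
    target=${1:-/dev/watchdog}
    proc=${2:-/proc}
    for fd in "$proc"/[0-9]*/fd/*; do
        # Each fd entry is a symlink to the file the process has open.
        [ "$(readlink "$fd" 2>/dev/null)" = "$target" ] || continue
        pid=${fd#"$proc"/}     # strip the proc root prefix
        pid=${pid%%/*}         # keep only the PID component
        echo "$pid"
    done | sort -un
}
```

Running `sudo sh -c '. ./openers.sh; watchdog_openers'` (the file name is arbitrary) lists the PIDs, which can then be looked up with ps.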
Workaround:
# echo "blacklist hpwdt" >> /etc/modprobe.d/blacklist-hp.conf
# update-initramfs -k all -u
# update-grub
# reboot
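After the reboot it is worth confirming that the module really stayed out, much like the lsmod check done during verification earlier in this bug. The helper below is a small sketch that reads the same data lsmod does; the name module_loaded is made up for this example, and the optional second argument exists only so the function can be pointed at a saved copy of /proc/modules.

```shell
# module_loaded NAME [MODULES_FILE]
# Hypothetical helper: succeed if NAME appears as a loaded module.
# MODULES_FILE defaults to /proc/modules, whose first field is the module name.
module_loaded() {
    name=$1
    modules=${2:-/proc/modules}
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$modules"
}
```

For example, `module_loaded hpwdt && echo "hpwdt is still loaded" || echo "hpwdt is not loaded"`.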