After analyzing the incomplete dump we obtained for this particular case, reviewing kernel code changes and Intel firmware errata, and talking with HP ROM engineers (to whom we also provided the erratum), we believe* that, for this stack trace, we triggered the following microcode problem:

###

http://www.intel.com.br/content/dam/www/public/us/en/documents/specification-updates/xeon-e7-v2-spec-update.pdf (Intel® Xeon® Processor E7 v2 Product Family Specification Update January 2015)

CF140 Performance Monitoring IA32_PERF_GLOBAL_STATUS.CondChgd Bit Not Cleared by Reset

Problem: The IA32_PERF_GLOBAL_STATUS MSR (38EH) should be cleared by reset. Due to this erratum, CondChgd (bit 63) of the IA32_PERF_GLOBAL_STATUS MSR may not be cleared.

Implication: When this erratum occurs, performance monitoring software may behave unexpectedly.

Workaround: It is possible for the BIOS to contain a workaround for this erratum. (Our note: HP is probably working on this.)

###

* We say "believe" because we cannot check the PMU registers from the core dump we got, but everything points in that direction.

This matters because, on x86 Linux, the NMI (Non-Maskable Interrupt) watchdog (the hard lockup detector) relies on PMU (Performance Monitoring Unit) registers to signal what was responsible for generating an NMI.

Note: Our intention when talking to HP was to make sure their power management firmware was not touching those registers. They said the firmware only reads the registers, and that there is no such thing as a "clear on read" when the reads are made by firmware.

The NMI handler (the kernel function responsible for handling NMIs) identifies the source of an NMI by looking at PMU registers. Intel microcode does not clear bit 63 (CondChgd) when the CPU is reset, which makes the NMI handler misbehave: it tries to handle NMIs that should not be handled by this particular kernel code.
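To make the bit in question concrete, here is a minimal Python sketch of how CondChgd could be tested on a live system. The MSR address (38EH) and bit position (63) come from the erratum quoted above; everything else is an assumption for illustration: the helper names are hypothetical, and `read_msr` assumes the Linux `msr` driver is loaded and the caller is root. This is not how the kernel NMI handler itself is implemented.

```python
import os
import struct

IA32_PERF_GLOBAL_STATUS = 0x38E  # MSR address given in the erratum (38EH)
COND_CHGD_BIT = 63               # CondChgd bit position, per the erratum


def cond_chgd_set(msr_value: int) -> bool:
    """Return True if CondChgd (bit 63) is set in a raw 64-bit MSR value."""
    return bool((msr_value >> COND_CHGD_BIT) & 1)


def read_msr(cpu: int, address: int) -> int:
    """Read one 64-bit MSR through /dev/cpu/<cpu>/msr.

    Assumption: the Linux msr driver is loaded and we run as root.
    The file offset selects the MSR address; each read returns 8 bytes.
    """
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        data = os.pread(fd, 8, address)
    finally:
        os.close(fd)
    return struct.unpack("<Q", data)[0]
```

On an affected machine, `cond_chgd_set(read_msr(0, IA32_PERF_GLOBAL_STATUS))` would report the state of the bit on CPU 0. This kind of check is only possible on a running system, not from the core dump we have.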

This misbehavior was seen recently by a kernel developer, in the following commit:

commit b292d7a10487aee6e74b1c18b8d95b92f40d4a4f

It is also described in the Intel erratum quoted above.

The following commit is applied in the Trusty kernel from version 3.13.0-35 up to the latest one:

inaddy@workstation:/kernel/ubuntu-trusty$ git tag --contains=ffb4bbaa2bf1ad9d79cf4d62d625499a7271f88e
Ubuntu-3.13.0-35.61
...
Ubuntu-3.13.0-45.74

The user was running kernel 3.13.0-34, which does not contain this fix.
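As a rough aid for checking other machines, the version comparison above can be sketched in Python. The threshold (35) is taken from the first tag in the `git tag --contains` output; the function name is hypothetical, and the check deliberately covers only the Trusty 3.13.0 series, since nothing here says which builds of other kernel series carry the fix.

```python
import re

# First Trusty build that contains the fix, per the tag list above
# (Ubuntu-3.13.0-35.61).
FIRST_FIXED_ABI = 35


def trusty_kernel_has_fix(release: str) -> bool:
    """Return True if a 3.13.0-N kernel release string has N >= 35.

    `release` is a string such as the output of `uname -r`,
    e.g. "3.13.0-34-generic". Other kernel series are rejected
    because this note gives no information about them.
    """
    match = re.match(r"^3\.13\.0-(\d+)", release)
    if match is None:
        raise ValueError(f"not a Trusty 3.13.0 kernel release: {release!r}")
    return int(match.group(1)) >= FIRST_FIXED_ABI
```

For the kernel from this case, `trusty_kernel_has_fix("3.13.0-34-generic")` returns `False`. On a running system the input could come from `platform.release()`.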

STEP 1) Upgrade all HP ProLiant servers to the latest Ubuntu Trusty kernel version.

STEP 2) Set the kernel command line below. Together with HP we concluded that, for now, this is the best option for the HP ProLiant servers:

" ... intremap=no_x2apic_optout intel_idle.max_cstate=0 nmi_watchdog=0 ..."

intremap=no_x2apic_optout -> tells the OS that, although the firmware asks the kernel to opt out of using x2apic, it may use it anyway. Gen8 and later servers support this feature and benefit from the advantages of x2apic over xapic, such as support for more CPUs and IRQ remapping.

intel_idle.max_cstate=0 -> tells the OS to disable the intel_idle driver and use the acpi_idle driver instead. HP relies heavily on ACPI for its firmware power management features, and intel_idle might put CPUs in a deeper C-state than the firmware would like, causing higher latencies and NMIs.

nmi_watchdog=0 -> tells the OS to disable the kernel NMI watchdog, leaving the job to the HP watchdog driver. Because this problem is intermittent, HP feels your systems should be more stable with this option. They do not recommend it for all setups, only for those with a workload similar to the one that suffered from the NMIs.
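To verify that a server actually booted with the three options described above, something along these lines could be used. It is only a sketch: the function name is hypothetical, and it does a plain token comparison against the boot command line, so it does not detect a parameter that is given twice with conflicting values.

```python
from typing import List

# The three options recommended for these servers in this note.
RECOMMENDED_PARAMS = [
    "intremap=no_x2apic_optout",
    "intel_idle.max_cstate=0",
    "nmi_watchdog=0",
]


def missing_params(cmdline: str) -> List[str]:
    """Return the recommended parameters that are absent from `cmdline`.

    `cmdline` is the kernel command line as one string, for example
    the contents of /proc/cmdline on a running Linux system.
    """
    present = set(cmdline.split())
    return [param for param in RECOMMENDED_PARAMS if param not in present]
```

On a running server, `missing_params(open("/proc/cmdline").read())` returns an empty list when all three options are in effect.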

This configuration should solve the NMI problems we have seen so far on these servers.

PS: We are still working together with HP on providing feedback regarding NMIs and their firmware behavior.