Trusty + Intel E5-26xx + NMI handler (perf_event_nmi_handler) took too long to run
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Undecided | Rafael David Tinoco |
Bug Description
The following case was brought to my attention:
Hardware name: HP ProLiant DL380p Gen8, BIOS P70 12/20/2013
Kernel: 3.13.0-34
Stack trace:
[2189823.168958] INFO: NMI handler (perf_event_
[2189823.168974] Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details.
[2189823.184283] CPU: 0 PID: 60396 Comm: ceph-osd Not tainted 3.13.0-34-generic #60-Ubuntu
[2189823.194371] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 12/20/2013
[2189823.202794] 0007c7a1f01b74a3 ffff88081fa06dd0 ffffffff8171bd94 ffffffffa01672d8
[2189823.212421] ffff88081fa06e48 ffffffff81714f95 0000000000000008 ffff88081fa06e58
[2189823.221889] ffff88081fa06df8 ffffffff81c1c4c0 ffffc90006278072 0000000000000001
[2189823.231361] Call Trace:
[2189823.234597] <NMI> [<ffffffff8171b
[2189823.241996] [<ffffffff81714
[2189823.248152] [<ffffffffa0166
[2189823.256251] [<ffffffff8101b
[2189823.263054] [<ffffffff81725
[2189823.270500] [<ffffffff81725
[2189823.276867] [<ffffffff81724
[2189823.283888] [<ffffffff810d7
[2189823.291874] [<ffffffff810d7
[2189823.299966] [<ffffffff810d7
After analyzing the incomplete dump we got for this particular case, reviewing kernel code changes and Intel firmware errata, and talking with HP ROM engineers (to whom we also provided the errata), we believe that this stack trace was triggered by the following microcode problem:
###
http://www.intel.com.br/content/dam/www/public/us/en/documents/specification-updates/xeon-e7-v2-spec-update.pdf (Intel® Xeon® Processor E7 v2 Product Family Specification Update January 2015)
CF140 Performance Monitoring IA32_PERF_GLOBAL_STATUS.CondChgd Bit Not Cleared by Reset
Problem: The IA32_PERF_GLOBAL_STATUS MSR (38EH) should be cleared by reset. Due to this erratum, CondChgd (bit 63) of the IA32_PERF_GLOBAL_STATUS MSR may not be cleared.
Implication: When this erratum occurs, performance monitoring software may behave unexpectedly.
Workaround: It is possible for the BIOS to contain a workaround for this erratum. --> HP is probably working on this.
###
* We say "believe" because we cannot check the PMU registers in the core dump we got, but everything points in that direction.
On x86 Linux, the NMI (non-maskable interrupt) watchdog (the hard-lockup detector) uses the PMU (performance counter) registers to signal which source was responsible for generating the NMI.
Note: Our intention when talking to HP was to make sure their power management firmware was not touching those registers. They said the firmware only reads registers, and that there is no such thing as a "clear" after read when the firmware does the reading.
The NMI handler (the kernel function responsible for handling NMIs) identifies the source of an NMI by looking at the PMU registers. Intel microcode does not clear bit 63 (CondChgd) when the CPU is reset, which makes the NMI handler misbehave: it tries to handle NMIs that should not be handled by this particular kernel code.
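The bit test described above can be illustrated with a small Python sketch. It only decodes a raw IA32_PERF_GLOBAL_STATUS value; the MSR address (38EH) and bit position (63) come from the erratum text, while the function names are hypothetical. The optional `read_msr` helper assumes a Linux host with the `msr` driver loaded and root privileges. This is a diagnostic illustration, not the kernel's handler code.

```python
import os
import struct

IA32_PERF_GLOBAL_STATUS = 0x38E  # MSR address quoted in the erratum (38EH)
COND_CHGD_BIT = 63               # CondChgd bit position, per the erratum


def cond_chgd_set(status: int) -> bool:
    """Return True if CondChgd (bit 63) is set in a raw status value."""
    return bool((status >> COND_CHGD_BIT) & 1)


def status_without_cond_chgd(status: int) -> int:
    """Return the status with bit 63 masked off (illustrative only)."""
    return status & ~(1 << COND_CHGD_BIT) & 0xFFFFFFFFFFFFFFFF


def read_msr(cpu: int, msr: int) -> int:
    """Read a 64-bit MSR through /dev/cpu/<cpu>/msr.

    Assumption: the Linux 'msr' driver is loaded and the caller is root.
    """
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr device uses the file offset as the MSR address.
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)
```

On an affected machine, something like `cond_chgd_set(read_msr(0, IA32_PERF_GLOBAL_STATUS))` could show whether the stale bit survived a reset, which is the check we could not perform from the core dump.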
This was seen recently by a kernel developer and addressed in the following commit:
commit b292d7a10487aee6e74b1c18b8d95b92f40d4a4f
It is also described in the Intel errata document above.
The following commit is applied in the Trusty kernel from version 3.13.0-35 up to the latest one:
inaddy@workstation:/kernel/ubuntu-trusty$ git tag --contains=ffb4bbaa2bf1ad9d79cf4d62d625499a7271f88e
Ubuntu-3.13.0-35.61
...
Ubuntu-3.13.0-45.74
The user was running kernel 3.13.0-34, which does not contain this fix.
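A quick way to check a host against that tag list is sketched below in Python. The threshold (ABI 35) is taken from the first tag shown, Ubuntu-3.13.0-35.61. The function name is hypothetical, and the sketch assumes every later 3.13.0 ABI also carries the fix, since the middle of the tag list is elided.

```python
import re

FIRST_FIXED_ABI = 35  # Ubuntu-3.13.0-35.61 is the first tag in the list


def trusty_kernel_has_fix(release: str) -> bool:
    """Return True if a 3.13.0-NN release string is at or past ABI 35.

    Assumption: 'release' looks like '3.13.0-34-generic', as printed
    by 'uname -r' on Trusty.
    """
    match = re.match(r"^3\.13\.0-(\d+)", release)
    if match is None:
        raise ValueError(f"not a Trusty 3.13.0 kernel release: {release!r}")
    return int(match.group(1)) >= FIRST_FIXED_ABI
```

On a live system the argument could come from `platform.release()`.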
STEP 1) Upgrade all HP ProLiant servers to the latest Ubuntu Trusty kernel version.
STEP 2)
Together with HP, we concluded that, for now, the best configuration for HP ProLiant servers is the following kernel command line:
" ... intremap=no_x2apic_optout intel_idle.max_cstate=0 nmi_watchdog=0 ..."
intremap=no_x2apic_optout -> tells the OS that, although the firmware asks the kernel to opt out of using x2apic, it may use it anyway (Gen8 and later support the feature and benefit from the advantages of x2apic over xapic, such as support for more CPUs and IRQ remapping).
intel_idle.max_cstate=0 -> tells the OS to disable the intel_idle driver and use acpi_idle instead. (HP relies heavily on ACPI for its firmware power management features, and intel_idle might put CPUs into a deeper C-state than the firmware expects, causing higher latencies and NMIs.)
nmi_watchdog=0 -> tells the OS to use HP w...
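To verify that a booted server actually carries the three parameters from STEP 2, a check along these lines could be run against the contents of /proc/cmdline. This is a minimal Python sketch: the helper names are hypothetical, and the parser splits on whitespace only, so it does not handle quoted values that contain spaces.

```python
# Parameters recommended in STEP 2 of this report.
RECOMMENDED = {
    "intremap": "no_x2apic_optout",
    "intel_idle.max_cstate": "0",
    "nmi_watchdog": "0",
}


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into a {name: value} mapping."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value  # flags without '=' map to an empty string
    return params


def missing_params(cmdline: str) -> list:
    """List recommended 'name=value' pairs absent from the command line."""
    params = parse_cmdline(cmdline)
    return [f"{k}={v}" for k, v in RECOMMENDED.items() if params.get(k) != v]
```

For example, `missing_params(open("/proc/cmdline").read())` returns an empty list when all three settings are in effect.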