HP Proliant Servers Advices for Ubuntu Linux (cmdline, panics, firmware options)
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
linux (Ubuntu) | Invalid | Undecided | Rafael David Tinoco | |
Bug Description
This bug tries to consolidate all HP Proliant bugs related to kernel panics. Please do not attach cores or other files here; use it only to provide feedback on the recommended cmdline options and their explanations.
Over the last 2 weeks we had several talks with the HP ROM Engineering Team regarding NMIs (non-maskable interrupts) being generated in some situations:
1 - NMIs caused during MWAIT instruction (caused by intel_idle module):
(https:/
HP relies heavily on ACPI for its power management features. HP is one of the most active members of the ACPI specification group, and several features its servers expose through firmware are heavily ACPI dependent. While working on this and other bugs we discovered that the intel_idle module did not use the ACPI tables (the firmware's way of telling the OS which p-state/c-state values are available) but queried the processor directly for the available c-states. This probably leads the OS to set a c-state (or sub-state) that the firmware is not "prepared" to handle. We have provided the following cmdline to be used: " intel_idle.
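To confirm which boot parameters and which idle driver a given machine is actually running with, something like the following could be used. It is only a sketch: the function names are illustrative, and it assumes nothing beyond the standard /proc/cmdline file and the cpuidle sysfs entry /sys/devices/system/cpu/cpuidle/current_driver. The parser splits on whitespace, so quoted values containing spaces are not handled.

```python
from pathlib import Path


def parse_cmdline(text):
    """Split a kernel command line into {name: value}.

    Bare flags such as "ro" map to None. Only the first "=" separates
    name and value, so "root=UUID=abc" keeps "UUID=abc" as the value.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def current_idle_driver(path="/sys/devices/system/cpu/cpuidle/current_driver"):
    """Return the active cpuidle driver name, or None if the file is absent."""
    p = Path(path)
    return p.read_text().strip() if p.exists() else None
```

On a live system, `parse_cmdline(Path("/proc/cmdline").read_text())` shows whether an intel_idle option was really applied at boot, and `current_idle_driver()` shows which driver ended up in charge of the c-states.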
Update:
1.1 - This case might not be related to MWAIT and/or C-states after all. It looks like some users were using /dev/watchdog (from the HPWDT module) by accident or without even knowing it, for example when running corosync and using the watchdog.
Check: https:/
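To check whether something on a system is holding the watchdog without the admin knowing, a rough sketch like the one below may help. The function names are illustrative. It assumes the usual /proc/modules layout (name, size, use count, ...) and that the watchdog is opened through /dev/watchdog; a process using /dev/watchdog0 instead would need that path passed in. Reading other processes' fd directories normally requires root.

```python
import os


def module_use_count(modules_text, name="hpwdt"):
    """Return a module's use count from /proc/modules text, or None if not loaded."""
    for line in modules_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == name:
            return int(fields[2])
    return None


def watchdog_holders(proc_root="/proc", device="/dev/watchdog"):
    """Return PIDs that have `device` open, found by scanning <proc_root>/<pid>/fd."""
    holders = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            for fd in os.listdir(fd_dir):
                if os.readlink(os.path.join(fd_dir, fd)) == device:
                    holders.append(int(entry))
                    break
        except OSError:
            # Process exited in the meantime, or we lack permission.
            continue
    return holders
```

A loaded hpwdt module together with a non-empty `watchdog_holders()` result suggests that some daemon has armed the watchdog.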
2 - Recently discovered NMIs caused by a BUG in Intel microcode
(https:/
I have discovered a recent problem in some Intel Ivy Bridge microcode: a specific bit is not being cleared in a PMU (performance monitoring unit, i.e. performance counter) register. This can lead to an NMI being handled wrongly (as if the PMU counter had overflowed when it had not) and to a kernel panic. We have backported the fix to Ubuntu-
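When reporting feedback it helps to state which microcode revision the CPUs are running. A minimal sketch, assuming the "microcode" field that x86 kernels print in /proc/cpuinfo; the function name is illustrative, and which revisions are affected is not something this snippet knows.

```python
def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revision strings found in /proc/cpuinfo text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            revisions.add(value.strip())
    return revisions
```

Called as `microcode_revisions(open("/proc/cpuinfo").read())`, it normally returns a single value; more than one would mean the cores are not all running the same microcode.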
3 - X2APIC support for HP Proliant Servers
(https:/
During this investigation we had to clarify with the HP ROM Engineering Team whether these servers support X2APIC or not. HP told us that all Gen8 (and more recent) servers do support X2APIC, but that they still "ask" the OS to opt out of X2APIC (not to use it). Running the interrupt controller (APIC) in XAPIC mode might not be compatible with the firmware if the CPU supports X2APIC, because of one of the few features that distinguish X2APIC from XAPIC: IRQ remapping (used mostly for virtualization). So on all HP Proliant servers Gen8 or newer it is recommended to use the following cmdline: " intremap=
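Whether the CPU itself advertises X2APIC can be read from the "flags" line of /proc/cpuinfo. The sketch below (illustrative function name) checks only that CPU capability; it does not tell whether the kernel actually switched the APIC into X2APIC mode, for which the boot log has to be inspected.

```python
def cpu_has_flag(cpuinfo_text, flag="x2apic"):
    """Return True if the first "flags" line in /proc/cpuinfo text lists `flag`."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return flag in value.split()
    return False
```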
4 - Proliant servers can fail to initialize cpufreq for some cores when collaborative power control is enabled in firmware
(https:/
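To see which cores ended up without cpufreq, one can look for the per-CPU cpufreq directory in sysfs. This is a sketch under the assumption that a CPU with a working cpufreq policy has /sys/devices/system/cpu/cpuN/cpufreq; the function name is illustrative, and offline CPUs may show up in the result as well.

```python
import os
import re


def cpus_without_cpufreq(cpu_root="/sys/devices/system/cpu"):
    """Return the sorted CPU numbers under cpu_root that lack a cpufreq directory."""
    missing = []
    for entry in os.listdir(cpu_root):
        # Only cpu0, cpu1, ...; skip siblings such as "cpufreq" and "cpuidle".
        if re.fullmatch(r"cpu\d+", entry):
            if not os.path.isdir(os.path.join(cpu_root, entry, "cpufreq")):
                missing.append(int(entry[3:]))
    return sorted(missing)
```

An empty list means every core listed in sysfs has a cpufreq policy.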
=======
Anyone affected, please give feedback in this bug on the use of these cmdlines (including the kernel version), and tell me whether new kernel panics (related to NMIs and/or APIC) happened on this server family. The feedback we are getting from the community is that these options are enough to keep the Proliant server family from panicking, and they might soon be released as a "public recommendation" for HP HW compatibility.
Thank you
Rafael Tinoco
Changed in linux (Ubuntu): | |
assignee: | nobody → Rafael David Tinoco (inaddy) |
tags: | added: cts |
description: | updated |
description: | updated |
summary: |
- HP Proliant Servers should use proper cmdline to avoid kernel panics
+ HP Proliant Servers Advices for Ubuntu Linux (cmdline, panics, firmware options) |
description: | updated |
Changed in linux (Ubuntu): | |
status: | Confirmed → Invalid |
Changed in linux (Ubuntu): | |
status: | Invalid → Incomplete |
Changed in linux (Ubuntu): | |
status: | Incomplete → Invalid |
Status changed to 'Confirmed' because the bug affects multiple users.