HP Proliant Servers - DL360 and DL380 Gen8 - Precise Kernel Panic - General Protection Fault and X2APIC/XAPIC boot parameters

Bug #1398497 reported by Rafael David Tinoco
This bug affects 1 person
Affects                  Status    Importance  Assigned to  Milestone
linux (Ubuntu)           Expired   High        Unassigned
linux (Ubuntu Precise)   Expired   High        Unassigned

Bug Description

The following situation was brought to my attention:

"""
We massively upgraded our Ubuntu 12.04 servers (most of them are HP
DL360p Gen8 or DL380 Gen8) to the 3.2.0-67 kernel, and in the last 2-3
days we have already had to reboot 5 of them because they hung completely.

Some of them had the following message in syslog:
kernel: [384707.675479] general protection fault: 0000 [#5666] SMP

Others had:
kernel: [950725.612724] BUG: unable to handle kernel paging request

All of them also have this:
your BIOS is broken and requested that x2apic be disabled
"""
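
The three messages quoted above can be counted in a log file to see which symptom a given server shows. The following is a minimal Python sketch; the names `SIGNATURES` and `scan_syslog` are illustrative and not part of any Ubuntu tool, and the search strings are simply substrings of the messages quoted in this report.

```python
# Substrings taken from the kernel messages quoted in this report.
SIGNATURES = {
    "gpf": "general protection fault",
    "paging": "BUG: unable to handle kernel paging request",
    "x2apic_optout": "requested that x2apic be disabled",
}


def scan_syslog(lines):
    """Count how many lines contain each signature (case-insensitive)."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        lowered = line.lower()
        for name, needle in SIGNATURES.items():
            if needle.lower() in lowered:
                counts[name] += 1
    return counts
```

A file object can be passed directly, for example `scan_syslog(open("/var/log/syslog", errors="replace"))`, assuming the log lives at that path.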

Comments below

tags: added: cts precise
Changed in linux (Ubuntu):
assignee: nobody → Rafael David Tinoco (inaddy)
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1398497

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Precise):
status: New → Incomplete
Chris J Arges (arges)
Changed in linux (Ubuntu Precise):
assignee: nobody → Rafael David Tinoco (inaddy)
Changed in linux (Ubuntu):
assignee: Rafael David Tinoco (inaddy) → nobody
assignee: nobody → Rafael David Tinoco (inaddy)
Chris J Arges (arges)
Changed in linux (Ubuntu):
status: Incomplete → In Progress
Changed in linux (Ubuntu Precise):
status: Incomplete → In Progress
tags: added: bot-stop-nagging kernel-key
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Precise):
importance: Undecided → High
Rafael David Tinoco (rafaeldtinoco) wrote : Re: HP Proliant Serverrs - DL360 and DL380 Gen8 - Precise Kernel Panic - General Protection Fault

Just got confirmation from HP regarding X2APIC on ProLiant servers running Ubuntu Linux:

Before Gen8, no ProLiant server supported X2APIC. Their firmware told the OS NOT to use X2APIC (the kernel honors this request since commit 41750d3, already included in kernel 3.2). The catch is that, on these servers, opting out of X2APIC made them use XAPIC IRQ remapping, which is not supported.

-----

So, for ProLiant servers BEFORE Gen8 the recommended cmdline is:

"nox2apic intremap=off"

Note: nox2apic might not be needed, since the firmware already tells Linux to opt out of x2apic. I still prefer to recommend this flag so that kernels older than 3.2 keep working (not the case here).

-----

From Gen8 onward, the firmware STILL says that X2APIC must not be used, even though Gen8 servers (DL360, DL380) do support it. So the proper cmdlines are:

"intremap=no_x2apic_optout" # keep X2APIC enabled, with IRQ remapping

OR

"nox2apic intremap=off" # disable X2APIC AND IRQ remapping

On these machines X2APIC differs from XAPIC ONLY in IRQ remapping (easier to implement). The other difference, x2apic being able to address more CPUs, does not matter here because the number of CPUs is low enough.
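
The per-generation advice above can be written down as a small lookup. This is only a sketch of the recommendation in this comment, not an HP or Ubuntu tool; the function name `recommended_apic_params` is made up for illustration, and the parameter is spelled `intremap=`, which is the name the kernel documents.

```python
def recommended_apic_params(generation):
    """Return the alternative kernel parameter sets suggested in this comment.

    `generation` is the ProLiant generation as an integer (6, 7, 8, ...).
    Each inner list is one complete alternative; pick exactly one of them.
    """
    disable_both = ["nox2apic", "intremap=off"]  # no X2APIC, no IRQ remapping
    if generation >= 8:
        # Gen8 supports X2APIC, so overriding the firmware opt-out is an option.
        return [["intremap=no_x2apic_optout"], disable_both]
    return [disable_both]
```

For example, `recommended_apic_params(8)` yields two alternatives, while `recommended_apic_params(6)` yields only the "disable both" set.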

This leads to my latest finding, regarding:

*** cdcd629869fabcd38ebd24a03b0a05ec1cbcafb0 x86: Fix and improve cmpxchg_double{,_local}()
|__> fixes several problems related to cmpxchg and 64 bits

In the analyzed dump, the instruction pointer appears to be at cmpxchg.

The problem is that the debug symbols for the kernel version that produced the core were removed from ddebs.ubuntu, so I'm using the ddeb of the "next" version. That is why I call this only the "most likely" cause: I can't objdump the binary that the analyzed system was running.

Things to be done:

1) Provide a hotfix with this fix: cdcd629869f
2) Wait for observations on the intel_idle problem (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1318551)
3) Recommend that users use either:

"intremap=no_x2apic_optout" OR "nox2apic intremap=off"

TOGETHER with "intel_idle.max_cstate=0"

until we fix intel_idle.
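
Whether a running system actually carries the recommended combination can be checked against its kernel command line. Below is a minimal Python sketch under the assumption that the recommendation in item 3 is the whole rule; `check_cmdline` is a hypothetical helper name. It also flags `intermap=`, which is an easy transposition of `intremap=` and is not the parameter the kernel documents.

```python
def check_cmdline(cmdline):
    """Return a list of problems; an empty list means the advice is followed."""
    params = cmdline.split()
    problems = []

    # Catch the easy-to-make transposition of "intremap".
    if any(p.startswith("intermap=") for p in params):
        problems.append("'intermap=' looks like a misspelling of 'intremap='")

    # Exactly the two alternatives named in the recommendation.
    keep_x2apic = "intremap=no_x2apic_optout" in params
    disable_both = "nox2apic" in params and "intremap=off" in params
    if not (keep_x2apic or disable_both):
        problems.append(
            "need 'intremap=no_x2apic_optout' or 'nox2apic intremap=off'"
        )

    if "intel_idle.max_cstate=0" not in params:
        problems.append("missing 'intel_idle.max_cstate=0'")

    return problems
```

On a live system the input would typically be the contents of `/proc/cmdline`, for example `check_cmdline(open("/proc/cmdline").read())`.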

Thank you

Rafael Tinoco

summary: HP Proliant Serverrs - DL360 and DL380 Gen8 - Precise Kernel Panic -
- General Protection Fault
+ General Protection Fault and X2APIC/XAPIC boot parameters
Esel (glumpad) wrote : apport information

ApportVersion: 2.14.1-0ubuntu3.6
Architecture: amd64
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 14.04
HibernationDevice: RESUME=UUID=30721826-134d-4032-8d87-9213511577b5
InstallationDate: Installed on 2014-11-07 (34 days ago)
InstallationMedia: Ubuntu-Server 14.04.1 LTS "Trusty Tahr" - Release amd64 (20140722.3)
MachineType: HP ProLiant ML330 G6
Package: linux (not installed)
ProcEnviron:
 LANGUAGE=de_AT:de
 TERM=xterm
 PATH=(custom, no user)
 LANG=de_AT.UTF-8
 SHELL=/bin/bash
ProcFB: 0 radeondrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-39-generic root=UUID=ac0b8a1b-5d1a-4043-98c2-82819f497f51 ro nomdmonddf nomdmonisw
ProcVersionSignature: Ubuntu 3.13.0-39.66-generic 3.13.11.8
PulseList:
 Error: command ['pacmd', 'list'] failed with exit code 1: Home directory not accessible: Permission denied
 No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-39-generic N/A
 linux-backports-modules-3.13.0-39-generic N/A
 linux-firmware 1.127.10
RfKill:

Tags: trusty trusty
Uname: Linux 3.13.0-39-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 07/02/2013
dmi.bios.vendor: HP
dmi.bios.version: W07
dmi.chassis.type: 7
dmi.chassis.vendor: HP
dmi.modalias: dmi:bvnHP:bvrW07:bd07/02/2013:svnHP:pnProLiantML330G6:pvr:cvnHP:ct7:cvr:
dmi.product.name: ProLiant ML330 G6
dmi.sys.vendor: HP

tags: added: apport-collected trusty
Esel (glumpad) wrote : apport information (attachments)

AlsaInfo.txt
BootDmesg.txt
CurrentDmesg.txt
IwConfig.txt
Lspci.txt
Lsusb.txt
ProcCpuinfo.txt
ProcInterrupts.txt
ProcModules.txt
UdevDb.txt
UdevLog.txt
WifiSyslog.txt

Esel (glumpad) wrote :

In my case it doesn't happen when I boot the server but while it's already running. From time to time it freezes with the NMI error and I can't do anything except a hard power-off and power-on. We're running VirtualBox on it with 2 guests.

Rafael David Tinoco (rafaeldtinoco) wrote :
Esel (glumpad) wrote :

Thanks for your comment. I added only "nox2apic intermap=off" and the error occurred again. I have now also added the intel_idle parameter, so the cmdline looks like "nox2apic intermap=off intel_idle.max_cstate=0". Is this right for my DL360 Gen6?

Esel (glumpad) wrote :

Hi Rafael,

sorry, but this is not my server; it belongs to a customer who has an important program running on it, so I can't experiment with it. I'm now running the right cmdline (entered in /etc/default/grub under GRUB_CMDLINE_LINUX=).
Because the edit needed a restart, I installed linux-crashdump beforehand, so if the error shows up again I'd like to accept your offer to have a look at it! FYI: I haven't dealt much with things this deep in the kernel, so keywords like "linux-crashdump" help me find the right thing :).

Greetings from Austria

Dave (xwebsubs) wrote :

Hello,

Sorry if my posting is not appropriate... I'm very new to Linux, but inquisitive.
I have just installed an Ubuntu 14.04 based desktop, Linux Lite 2.2, kernel 3.13.?
The hardware is an Intel i7-4790S CPU (4 cores + HT) and I have VT enabled so I can run VirtualBox.

I have found the following in dmesg log file:

[ 0.022633] Your BIOS is broken and requested that x2apic be disabled.
[ 0.022633] This will slightly decrease performance.
[ 0.022633] Use 'intremap=no_x2apic_optout' to override BIOS request.
[ 0.022745] Enabled IRQ remapping in xapic mode
[ 0.022746] x2apic not enabled, IRQ remapping is in xapic mode

Is this also related to the issue in this thread?
If it is, I can provide more information if you tell me what I need to do.

David
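
The dmesg lines quoted in this comment carry two facts: whether the firmware asked for the x2apic opt-out, and which mode IRQ remapping ended up in. A small Python sketch that extracts both is shown here; `apic_summary` is an illustrative name, and the patterns are based only on the message wording quoted above.

```python
import re

# Matches e.g. "Enabled IRQ remapping in xapic mode" and captures the mode word.
_REMAP_MODE = re.compile(r"Enabled IRQ remapping in (\w+) mode")


def apic_summary(dmesg_lines):
    """Summarize the x2apic-related messages found in dmesg output."""
    summary = {"bios_optout": False, "irq_remapping": None}
    for line in dmesg_lines:
        if "requested that x2apic be disabled" in line:
            summary["bios_optout"] = True
        match = _REMAP_MODE.search(line)
        if match:
            summary["irq_remapping"] = match.group(1)
    return summary
```

Feeding it the five lines quoted above reports the firmware opt-out as present and the remapping mode as `xapic`.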

Esel (glumpad) wrote :

So with the correct cmdline the system has now been running for 6 days. I will report back if anything happens.

Rafael David Tinoco (rafaeldtinoco) wrote :

Dave,

This message tells you that your BIOS (EFI/firmware/..) is "asking" the OS NOT to use x2apic (this happens on HP ProLiant servers). The reason is that, before Gen8, ProLiant servers did not support X2APIC (a new Programmable Interrupt Controller) and had to operate in "XAPIC"/"APIC" compatible mode (even if their CPU supported X2APIC).

For ProLiant servers >= Gen8 you can tell Linux to "ignore" this request from the firmware. To do that, change

GRUB_CMDLINE_LINUX_DEFAULT="..." to
GRUB_CMDLINE_LINUX_DEFAULT="... intremap=no_x2apic_optout"

inside /etc/default/grub, then run the "update-grub" command and reboot the server.
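
The edit to /etc/default/grub described above can also be scripted. The following Python sketch works on the file contents as a string and assumes the variable is written as a single double-quoted assignment on one line; `add_kernel_param` is a hypothetical helper, not part of GRUB or Ubuntu.

```python
import re


def add_kernel_param(grub_text, param, key="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Append `param` to the quoted value of `key` unless it is already there."""
    pattern = re.compile(
        r'^(%s=")([^"]*)(")[ \t]*$' % re.escape(key), re.MULTILINE
    )

    def _append(match):
        current = match.group(2).split()
        if param not in current:
            current.append(param)
        return match.group(1) + " ".join(current) + match.group(3)

    new_text, count = pattern.subn(_append, grub_text)
    if count == 0:
        raise ValueError(f"{key} not found as a double-quoted assignment")
    return new_text
```

The function only returns the new text; writing it back to /etc/default/grub, running "update-grub" and rebooting remain manual steps, as described in the comment.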

This is because Gen8 servers are X2APIC capable but the firmware still asks the OS not to enable it (as I was told by HP).

If you use HP ProLiant servers < Gen8 you have to use:

"nox2apic intremap=off"

to avoid problems with IRQ handling by deactivating x2apic.

Note that if you are using HP ProLiant servers you also have to be aware of the following bug:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1318551

and add "intel_idle.max_cstate=0" to the cmdline as well.

Dave (xwebsubs) wrote :

Hello Rafael,

Many thanks, O.K.
So I can ignore this if I am "not" using a ProLiant server?

Dave

Esel (glumpad) wrote :

Hi Rafael,

The error happened again after about 3 weeks without a kernel panic. Can you please tell me how to get the crashdump off the machine so I can post it here?

Thanks!

Esel (glumpad) wrote :

Or do you have a link where I can look? I had a look at crashdump, but I'm not familiar with it and don't know what you need...

Rafael David Tinoco (rafaeldtinoco) wrote :

Esel,

Could you please follow instructions from:

https://sites.google.com/site/inaddyorg/texts/installing-ubuntu-crash-dump

Please also take the opportunity to leave comments if you find anything wrong in the procedure.

Thanks, looking forward to receiving the core from you.

Changed in linux (Ubuntu):
status: In Progress → Incomplete
Changed in linux (Ubuntu Precise):
status: In Progress → Incomplete
tags: removed: kernel-key
Changed in linux (Ubuntu):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Changed in linux (Ubuntu Precise):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu Precise) because there has been no activity for 60 days.]

Changed in linux (Ubuntu Precise):
status: Incomplete → Expired
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired