mcelog errors and server freeze with qemu-kvm 0.12.3 and linux-image-2.6.32-32-server
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
linux (Ubuntu) | Fix Released | High | Unassigned | | |
Bug Description
Hello,
we run several hundred servers with Ubuntu 10.04 as virtualisation nodes (about 20-30 virtual machines each) with qemu-kvm 0.12.3, and had to find out the hard way that a kernel regression was introduced in linux-image-
Once every few days, machines froze at random and showed nothing but a black screen. We then changed back to linux-image-
After rebooting a crashed machine, we found entries like the following in syslog. Contrary to what the message says, I'm quite sure that it's not a hardware bug, as the same machines run fine with the 2.6.32-31 kernel.
We also tried several upstream kernels from the 2.6.36, 2.6.37, 2.6.38 and even 2.6.39 series, all with the same problem.
------------
mcelog: failed to prefill DIMM database from DMI data
mcelog: Kernel does not support page offline interface
mcelog: HARDWARE ERROR. This is *NOT* a software problem!
mcelog: Please contact your hardware vendor
mcelog: MCE 0
mcelog: CPU 0 BANK 5
mcelog: MISC 7fff ADDR 3fff81024ae8
mcelog: TIME 1310033176 Thu Jul 7 12:06:16 2011
mcelog: MCG status:
mcelog: MCi status:
mcelog: Error overflow
mcelog: Uncorrected error
mcelog: Error enabled
mcelog: MCi_MISC register valid
mcelog: MCi_ADDR register valid
mcelog: Processor context corrupt
mcelog: MCA: Internal Timer error
mcelog: STATUS fe00000000800400 MCGSTATUS 0
mcelog: MCGCAP 1c09 APICID 0 SOCKETID 0
mcelog: CPUID Vendor Intel Family 6 Model 26
mcelog: HARDWARE ERROR. This is *NOT* a software problem!
mcelog: Please contact your hardware vendor
mcelog: MCE 1
mcelog: CPU 1 BANK 5
mcelog: MISC 7fff ADDR 3fffa003b652
mcelog: TIME 1310033176 Thu Jul 7 12:06:16 2011
mcelog: MCG status:
mcelog: MCi status:
mcelog: Error overflow
mcelog: Uncorrected error
mcelog: Error enabled
mcelog: MCi_MISC register valid
mcelog: MCi_ADDR register valid
mcelog: Processor context corrupt
mcelog: MCA: Internal Timer error
mcelog: STATUS fe00000000800400 MCGSTATUS 0
mcelog: MCGCAP 1c09 APICID 2 SOCKETID 0
mcelog: CPUID Vendor Intel Family 6 Model 26
------------
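For anyone comparing their own logs: the flag lines mcelog prints can be checked against the raw STATUS value. The following is a minimal sketch, assuming the architectural bit layout of IA32_MCi_STATUS from the Intel SDM (bits 63 down to 57 for the flags, low 16 bits for the MCA error code). The function name `decode_mci_status` is made up for illustration and is not part of mcelog.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM layout).
FLAGS = [
    (63, "Valid"),
    (62, "Error overflow"),
    (61, "Uncorrected error"),
    (60, "Error enabled"),
    (59, "MCi_MISC register valid"),
    (58, "MCi_ADDR register valid"),
    (57, "Processor context corrupt"),
]


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flag names and error codes."""
    flags = [name for bit, name in FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        # Bits 15:0 hold the architectural MCA error code.
        "mca_error_code": status & 0xFFFF,
        # Bits 31:16 hold the model-specific error code.
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


# STATUS value taken from the syslog excerpt above.
decoded = decode_mci_status(0xFE00000000800400)
for name in decoded["flags"]:
    print(name)
print(hex(decoded["mca_error_code"]), hex(decoded["model_specific_code"]))
```

The MCA error code 0x0400 is the simple error code the SDM lists as "Internal Timer", which matches the "MCA: Internal Timer error" line in the log.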
Cheers,
David
ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-image-
Regression: Yes
Reproducible: No
ProcVersionSign
Uname: Linux 2.6.32-32-server x86_64
NonfreeKernelMo
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.21.
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/by-path', '/dev/snd/
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info: Error: [Errno 2] No such file or directory
Card0.Amixer.
Date: Tue Jul 12 14:02:14 2011
Frequency: Once every few days.
HibernationDevice: RESUME=
InstallationMedia:
IwConfig: Error: [Errno 2] No such file or directory
MachineType: MSI MS-7522
ProcCmdLine: root=/dev/
ProcEnviron:
LANG=en_US.UTF-8
SHELL=/bin/bash
RelatedPackageV
RfKill: Error: [Errno 2] No such file or directory
SourcePackage: linux
WifiSyslog:
dmi.bios.date: 11/02/2010
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: V8.14
dmi.board.
dmi.board.name: MSI X58 Pro-E (MS-7522)
dmi.board.vendor: MSI
dmi.board.version: 3.0
dmi.chassis.
dmi.chassis.type: 3
dmi.chassis.vendor: MICRO-STAR INTERNATIONAL CO.,LTD
dmi.chassis.
dmi.modalias: dmi:bvnAmerican
dmi.product.name: MS-7522
dmi.product.
dmi.sys.vendor: MSI
Changed in linux (Ubuntu): status: New → Confirmed
Changed in linux (Ubuntu): importance: Medium → High
Changed in linux (Ubuntu): status: Confirmed → Fix Released
We are also affected by this.
In a pool of 100 servers running
2.6.38-10-server #46-Ubuntu SMP Tue Jun 28 16:31:00 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
we constantly see random machines crashing (a different machine each time) with a black screen and no entries in kern.log.
mcelog shows the following error:
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 0
CPU 0 BANK 5
MISC 7fff ADDR 3fff81035b70
TIME 1313834489 Sat Aug 20 12:01:29 2011
MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS fe00000000800400 MCGSTATUS 0
MCGCAP 1c09 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 26
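Since the same record shows up on different machines, it may help to group records by signature when collecting logs from a pool. This is a rough sketch, assuming the text layout seen in the excerpts in this report (with or without the "mcelog: " syslog prefix). The helper `parse_mce_records` and the abbreviated `sample` text are illustrative only.

```python
import re
from collections import Counter


def parse_mce_records(text):
    """Extract CPU, bank, MCA description and STATUS from mcelog text."""
    records, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("mcelog: "):
            line = line[len("mcelog: "):]
        if re.fullmatch(r"MCE \d+", line):
            # A line like "MCE 0" starts a new record.
            current = {}
            records.append(current)
            continue
        if current is None:
            continue
        m = re.match(r"CPU (\d+) BANK (\d+)", line)
        if m:
            current["cpu"], current["bank"] = int(m.group(1)), int(m.group(2))
        m = re.match(r"MCA: (.+)", line)
        if m:
            current["mca"] = m.group(1)
        m = re.match(r"STATUS ([0-9a-f]+)", line)
        if m:
            current["status"] = m.group(1)
    return records


# Abbreviated copies of two records quoted in this report.
sample = """\
mcelog: MCE 1
mcelog: CPU 1 BANK 5
mcelog: MCA: Internal Timer error
mcelog: STATUS fe00000000800400 MCGSTATUS 0
mcelog: CPUID Vendor Intel Family 6 Model 26
MCE 0
CPU 0 BANK 5
MCA: Internal Timer error
STATUS fe00000000800400 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 26
"""

records = parse_mce_records(sample)
signatures = Counter((r["bank"], r["mca"], r["status"]) for r in records)
for signature, count in signatures.items():
    print(count, signature)
```

Identical (bank, MCA, STATUS) signatures across many different hosts would support the view that this is a kernel regression rather than failing hardware on individual machines.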