Bug #809313 “mcelog errors and server freeze with qemu-kvm 0.12....” : Bugs : linux package : Ubuntu

David Mayr (n-launchpad-davey-de) wrote on 2011-07-12:

#1

AlsaDevices.txt (287 bytes, text/plain; charset="utf-8")
BootDmesg.txt (55.1 KiB, text/plain; charset="utf-8")
Card0.Codecs.codec.0.txt (761 bytes, text/plain; charset="utf-8")
CurrentDmesg.txt (31.2 KiB, text/plain; charset="utf-8")
Lspci.txt (32.8 KiB, text/plain; charset="utf-8")
Lsusb.txt (503 bytes, text/plain; charset="utf-8")
PciMultimedia.txt (1.5 KiB, text/plain; charset="utf-8")
ProcCpuinfo.txt (6.3 KiB, text/plain; charset="utf-8")
ProcInterrupts.txt (3.1 KiB, text/plain; charset="utf-8")
ProcModules.txt (3.9 KiB, text/plain; charset="utf-8")
UdevDb.txt (117.4 KiB, text/plain; charset="utf-8")
UdevLog.txt (231.9 KiB, text/plain; charset="utf-8")

Brad Figg (brad-figg) on 2011-07-19

Changed in linux (Ubuntu):
status:	New → Confirmed

Thibaut (t-britz) wrote on 2011-08-20:

#2

We are also affected by this:

In a pool of 100 servers running

2.6.38-10-server #46-Ubuntu SMP Tue Jun 28 16:31:00 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

we constantly see random machines crashing (a different machine each time) with a black screen and no entries in kern.log.

Mcelog shows the following error:

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 0
CPU 0 BANK 5
MISC 7fff ADDR 3fff81035b70
TIME 1313834489 Sat Aug 20 12:01:29 2011
MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS fe00000000800400 MCGSTATUS 0
MCGCAP 1c09 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 26
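
For readers comparing the mcelog dumps in this thread, here is a minimal sketch of how the STATUS value maps to the flag lines mcelog prints. It assumes the architectural IA32_MCi_STATUS bit layout (bits 63..57 for VAL, OVER, UC, EN, MISCV, ADDRV, PCC, and the low 16 bits for the MCA error code). The function names are made up for illustration and are not part of mcelog.

```python
# Sketch only: decode the upper flag bits and the error codes of an
# IA32_MCi_STATUS value as printed by mcelog ("STATUS fe00000000800400").
# The helper names below are hypothetical, not an mcelog API.
FLAGS = [
    (63, "VAL"),    # status register contains valid data
    (62, "OVER"),   # "Error overflow"
    (61, "UC"),     # "Uncorrected error"
    (60, "EN"),     # "Error enabled"
    (59, "MISCV"),  # "MCi_MISC register valid"
    (58, "ADDRV"),  # "MCi_ADDR register valid"
    (57, "PCC"),    # "Processor context corrupt"
]

INTERNAL_TIMER_ERROR = 0x0400  # simple MCA error code 0000 0100 0000 0000


def decode_status(status_hex):
    """Return the set flag names plus the MCA and model-specific codes."""
    status = int(status_hex, 16)
    return {
        "flags": [name for bit, name in FLAGS if (status >> bit) & 1],
        "mca_code": status & 0xFFFF,             # bits 15:0
        "model_code": (status >> 16) & 0xFFFF,   # bits 31:16
    }


if __name__ == "__main__":
    for value in ("fe00000000800400", "be00000000800400"):
        decoded = decode_status(value)
        timer = decoded["mca_code"] == INTERNAL_TIMER_ERROR
        print(value, decoded["flags"], "internal timer error:", timer)
```

Under that layout, the two STATUS values seen in this thread (fe... and be...) differ only in the OVER bit, which matches the presence or absence of the "Error overflow" line in the dumps.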

Thibaut (t-britz) wrote on 2011-08-20:

#3

We don't run any virtualisation, but heavy IO.

Sebastian Nickel (sebastian-nickel) wrote on 2011-11-14:

#4

This issue is still present in "linux-image-2.6.32-34-server". Is there any update to this bug?

Vladimir Popovski (vladimir.p) wrote on 2012-01-18:

#5

Has anybody found the root cause of this issue? We are running Natty 2.6.38-8 and have experienced it as well.
Are there any applicable workarounds? We tried to force Linux to generate a core dump in such situations, but were unsuccessful.

Dan Kegel (dank) wrote on 2012-02-01:

#6

I can reproduce a very similar panic at will by running a tiny test program that just retrieves the manufacturer field of the inserted DVD using a SCSI passthrough command in wine (with either win32 or win64). This is on an up-to-date Ubuntu 11.10 x86_64.
I only have a photo of my kernel panic, but here's the full mcelog:

mcelog: failed to prefill DIMM database from DMI data
Kernel does not support page offline interface
mcelog: mcelog read: No such device
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 5
MISC 7fff ADDR 3fff812fa524
TIME 1328052557 Tue Jan 31 15:29:17 2012
MCG status:
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 0
MCGCAP 1c09 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 26
Hardware event. This is not a software error.
MCE 1
CPU 2 BANK 5
MISC 7fff ADDR 3fff8102fbe0
TIME 1328052557 Tue Jan 31 15:29:17 2012
MCG status:
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 0
MCGCAP 1c09 APICID 4 SOCKETID 0
CPUID Vendor Intel Family 6 Model 26

Dan Kegel (dank) wrote on 2012-02-01:

#7

See also bug 924596, which is somebody else's similar but less reproducible bug.

Dan Kegel (dank) wrote on 2012-02-01:

#8

Sigh. Bug 924596 is my bug report with more info about the mcelog I pasted above. I was distracted in my previous post.

Joseph Salisbury (jsalisbury) wrote on 2012-02-01:

#9

Would it be possible for you to test the latest upstream kernel? It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.3 kernel[1] (not a kernel in the daily directory). Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag (only that one tag; please leave the other tags). This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

If this bug is fixed by the mainline kernel, please add the following tag 'kernel-fixed-upstream-KERNEL-VERSION'. For example, if kernel version 3.3-rc2 fixed the issue, the tag would be: 'kernel-fixed-upstream-v3.3-rc2'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.

Thanks in advance.

[1] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.3-rc2-precise/

Changed in linux (Ubuntu):
importance:	Undecided → Medium
tags:	added: kernel-da-key

David Mayr (n-launchpad-davey-de) wrote on 2012-02-03:

#10

Thanks for your time, Joseph. Since yesterday we have been running kernel 3.3-rc2 on four of our nodes for testing. I will report any instability here.

Markus Schade (lp-markusschade) wrote on 2012-02-16:

#11

Hi Joseph,

David asked me to follow up on this bug.
Previously we had also tested the oneiric-lts-backports-kernel (3.0.0-15.26~lucid1-server). This had the same issue, crashing within a day or two.

We have been running 3.3-rc2 on these four nodes for 13 days now without any incident, so it seems this kernel is stable. The downside is that after a managedsave/restore cycle, the VMs are dead. Although the VM is restored from the saved state, it no longer responds to any input from the console or via the network.
There was an issue where a restored VM crashed immediately after being restored, but that was fixed in 3.0. Now the VM does not crash, but is simply stuck.

http://permalink.gmane.org/gmane.linux.kernel.commits.head/304494

Markus Schade (lp-markusschade) wrote on 2012-03-30:

#12

Just to follow up on this.

The 3.3-rc2 ran for almost 2 months without any issues.
We have also re-tested the more recent lucid kernels on a couple of machines, and as of 2.6.32-38-server #83-Ubuntu the crashes have not occurred anymore. We will give it some more time on more machines, but right now it looks like this bug could be closed as fixed.

Joseph Salisbury (jsalisbury) wrote on 2012-03-30:

#13

Hi Markus,

So does the issue not happen with the officially supported Ubuntu Linux kernels? Or just the 3.3-rc2 upstream kernel?

Markus Schade (lp-markusschade) wrote on 2012-03-30:

#14

We re-tested the regular Ubuntu kernels. With so many changes since -32, we may never know which commit triggered this and was reverted or fixed in later versions. It also may or may not have been an Ubuntu change.

Joseph Salisbury (jsalisbury) wrote on 2012-03-30:

#15

Are the Ubuntu kernels not exhibiting this bug now?

Markus Schade (lp-markusschade) wrote on 2012-04-05:

#16

dmesg.txt (4.0 KiB, text/plain)

They were. But as of -40, the crashes are back. -38 was very stable. -39 did not get much field testing, but was showing some crashes.
I have attached the dmesg error we have been getting.

In parallel, we are also testing the -oneiric-lts-backport kernel. That is also stable in version 3.0.0-15-server #26~lucid1-Ubuntu. But as of 3.0.0-17-server #30~lucid1-Ubuntu the crashes and MCE errors are back. The systems barely run for a couple of hours.

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 0
CPU 0 BANK 5
MISC 7fff ADDR 3fff81031d60
TIME 1333609850 Thu Apr 5 09:10:50 2012
MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS fe00000000800400 MCGSTATUS 0
MCGCAP 1c09 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 26
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 1
CPU 1 BANK 5
MISC 7fff ADDR 3fff81031d60
TIME 1333609850 Thu Apr 5 09:10:50 2012
MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS fe00000000800400 MCGSTATUS 0
MCGCAP 1c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 26
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 2
CPU 2 BANK 5
MISC 7fff ADDR 3fff81031d60
TIME 1333609850 Thu Apr 5 09:10:50 2012
MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS fe00000000800400 MCGSTATUS 0
MCGCAP 1c09 APICID 4 SOCKETID 0
CPUID Vendor Intel Family 6 Model 26
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 3
CPU 3 BANK 5
MISC 7fff ADDR 3fff81031d60
TIME 1333609850 Thu Apr 5 09:10:50 2012
MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS fe00000000800400 MCGSTATUS 0
MCGCAP 1c09 APICID 6 SOCKETID 0
CPUID Vendor Intel Family 6 Model 26

Joseph Salisbury (jsalisbury) wrote on 2012-04-05:

#17

Can you run a memory check on that system?

Markus Schade (lp-markusschade) wrote on 2012-04-05:

#18

We always do a burn-in and memory test before putting systems into production. Should a system exhibit multiple crashes, we revert to a known stable kernel version. If that doesn't solve the problem, we move the virtual machines off and re-test the hardware. We are also talking about dozens of identical systems which exhibit the same behaviour, which is also why this is so frustrating: we never know whether a new version will be stable or not.
I know that memory errors are a common cause of such problems and I would completely agree to run memtest86 again, but the point is that before we upgraded to the kernels mentioned above, the systems had been stable, some for more than a year (if you discount the necessary reboot after 208 days because of the overflow of the sched clock).
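
As a side note on the 208-day figure: a rough sketch of the arithmetic, under the assumption (not stated in this thread) that the scheduler clock overflow occurs once the nanosecond result exceeds 2^54, i.e. a 64-bit intermediate value with a 10-bit scaling shift. The function name is hypothetical.

```python
# Sketch only: how long a nanosecond counter can run before reaching 2**54.
# The 2**54 limit is an assumption about the sched clock overflow, used here
# just to show where a "208 days" figure could come from.
NS_PER_DAY = 86_400 * 10**9


def days_until_overflow(result_bits=54):
    """Days of uptime before a nanosecond count reaches 2**result_bits."""
    return 2**result_bits / NS_PER_DAY


if __name__ == "__main__":
    print(f"{days_until_overflow():.1f} days")
```

Under that assumption the limit works out to roughly 208.5 days, which is consistent with the reboot interval mentioned above.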

I haven't given up on -40 yet, but the best uptime we have achieved so far is 8 days, and we can only have so many crashes before we have to go back to a stable version (currently 2.6.32-38 or 3.0.0-15) in order to keep customers happy.

Markus Schade (lp-markusschade) wrote on 2012-04-05:

#19

The overflow bug I mentioned is this one: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/805341
It is fixed in 2.6.32-38.

Joseph Salisbury (jsalisbury) on 2012-04-05

Changed in linux (Ubuntu):
importance:	Medium → High

Markus Schade (lp-markusschade) wrote on 2012-04-10:

#20

We have traced most of the 2.6.32-40 crashes to memory errors, but I will wait a couple of days to be sure.
Also, we are slowly moving towards the -oneiric-backports kernel in preparation for the precise release. The 3.0.0-17 kernel has almost certainly introduced a regression (MCE error in comment #16). I think the uptimes of the systems speak for themselves:

     #               Uptime | System                    Boot up
----------------------------+---------------------------------------------------
     1    60 days, 09:08:35 | Linux 3.0.0-15-server     Tue Jan 31 09:43:58 2012
->   2     5 days, 05:23:23 | Linux 3.0.0-15-server     Thu Apr  5 09:22:03 2012
     3     3 days, 14:32:26 | Linux 3.0.0-17-server     Sun Apr  1 18:47:57 2012
     4     0 days, 10:05:16 | Linux 3.0.0-17-server     Sat Mar 31 21:12:03 2012
     5     0 days, 09:48:18 | Linux 3.0.0-17-server     Sun Apr  1 07:57:30 2012

The picture is similar on one of the systems where we had tested the generic kernel:

     #               Uptime | System                    Boot up
----------------------------+---------------------------------------------------
     1    56 days, 01:37:30 | Linux 3.3.0-030300rc2-ge  Fri Feb  3 10:59:03 2012
     2     5 days, 23:50:19 | Linux 3.0.0-17-server     Fri Mar 30 13:37:49 2012
     3     3 days, 13:07:19 | Linux 3.0.0-17-server     Thu Apr  5 15:05:29 2012
->   4     1 day , 00:14:27 | Linux 3.0.0-15-server     Mon Apr  9 14:34:48 2012
     5     0 days, 09:40:52 | Linux 3.0.0-17-server     Mon Apr  9 04:52:40 2012

To give a better impression of the kernel versions and number of affected systems, I have put together some stats:

Kernel           | max uptime (days) | #hosts
-----------------+-------------------+-------
2.6.32-31-server | 208               | 100++
2.6.32-39-server | 28                | >10
3.0.0-15-server  | 70                | <50
3.0.0-17-server  | 12                | <10
2.6.32-38-server | 48                | <50
2.6.32-40-server | 15                | >10
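
For anyone who wants to collect the same per-kernel numbers from many hosts, here is a minimal sketch that parses uptime records in the layout pasted above and keeps the longest uptime per kernel. The regular expression and function name are illustrative assumptions based only on the rows shown in this comment, not on any tool's documented output format.

```python
# Sketch only: extract the longest uptime per kernel from records shaped like
# the tables in this comment. The pattern is fitted to those rows only.
import re
from collections import defaultdict
from datetime import timedelta

ROW = re.compile(
    r"^\s*(?:->)?\s*\d+\s+(\d+) days?\s*,\s*(\d+):(\d+):(\d+)\s*\|\s*Linux (\S+)"
)


def longest_uptime_per_kernel(text):
    """Map kernel version -> longest uptime seen in the given records."""
    best = defaultdict(timedelta)  # missing kernels start at zero uptime
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, separator, or unrelated line
        days, hours, minutes, seconds = (int(g) for g in match.groups()[:4])
        uptime = timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)
        kernel = match.group(5)
        best[kernel] = max(best[kernel], uptime)
    return dict(best)


SAMPLE = """\
     1    60 days, 09:08:35 | Linux 3.0.0-15-server     Tue Jan 31 09:43:58 2012
->   2     5 days, 05:23:23 | Linux 3.0.0-15-server     Thu Apr  5 09:22:03 2012
     3     3 days, 14:32:26 | Linux 3.0.0-17-server     Sun Apr  1 18:47:57 2012
     4     0 days, 10:05:16 | Linux 3.0.0-17-server     Sat Mar 31 21:12:03 2012
     5     0 days, 09:48:18 | Linux 3.0.0-17-server     Sun Apr  1 07:57:30 2012
"""

if __name__ == "__main__":
    for kernel, uptime in sorted(longest_uptime_per_kernel(SAMPLE).items()):
        print(kernel, uptime)
```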

Markus Schade (lp-markusschade) wrote on 2012-05-14:

#21

Although we have moved to the oneiric backports kernel, we are still seeing these MCE errors. In some cases replacing RAM seems to solve the issue. But we are still not quite certain that this is actually the cause, because we have found the following:

May 13 15:03:39 node8 kernel: [260532.465745] BUG: unable to handle kernel paging request at ffff8806514603d0
May 13 15:03:39 node8 kernel: [260532.465803] IP: [<ffffffff81156327>] remove_rmap_item_from_tree+0xe7/0x150
May 13 15:03:39 node8 kernel: [260532.465845] PGD 1c04063 PUD 0 
May 13 15:03:39 node8 kernel: [260532.465875] Oops: 0002 [#1] SMP 
May 13 15:03:39 node8 kernel: [260532.465906] CPU 1 
May 13 15:03:39 node8 kernel: [260532.465913] Modules linked in: cls_u32 sch_sfq sch_htb xt_physdev xt_mac ip6table_filter ip6_tables ib_iser rdma_cm ib_cm ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 iw_cm ib_sa xt_state nf_conntrack ib_mad ib_core ipt_REJECT xt_tcpudp ib_addr iptable_filter ip_tables iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi x_tables nouveau ttm drm_kms_helper drm i2c_algo_bit mxm_wmi lp kvm_intel wmi kvm speedstep_lib serio_raw parport i7core_edac edac_core video bridge stp multipath linear 3w_9xxx 3w_xxxx raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 e1000 r8169 aacraid ahci libahci sata_nv sata_sil sata_via
May 13 15:03:39 node8 kernel: [260532.466384] 
May 13 15:03:39 node8 kernel: [260532.466407] Pid: 50, comm: ksmd Not tainted 3.0.0-19-server #33~lucid1-Ubuntu MSI MS-7522/MSI X58 Pro-E (MS-7522)
May 13 15:03:39 node8 kernel: [260532.466470] RIP: 0010:[<ffffffff81156327>]  [<ffffffff81156327>] remove_rmap_item_from_tree+0xe7/0x150
May 13 15:03:39 node8 kernel: [260532.466527] RSP: 0018:ffff88061aaafdd0  EFLAGS: 00010202
May 13 15:03:39 node8 kernel: [260532.466557] RAX: 0000000000022200 RBX: ffff88034830a6c0 RCX: 0000000000000034
May 13 15:03:39 node8 kernel: [260532.466607] RDX: 0000000000000000 RSI: ffffea001523e498 RDI: ffff8806514603a8
May 13 15:03:39 node8 kernel: [260532.466659] RBP: ffff88061aaafdf0 R08: 0000000000000000 R09: 24c0000000000000
May 13 15:03:39 node8 kernel: [260532.466710] R10: ffff8805fc06f000 R11: 0000000000000001 R12: ffff8806173c79d8
May 13 15:03:39 node8 kernel: [260532.466758] R13: ffffea001523e498 R14: ffff880617775b68 R15: 0000000000000000
May 13 15:03:39 node8 kernel: [260532.466807] FS:  0000000000000000(0000) GS:ffff88063fc20000(0000) knlGS:0000000000000000
May 13 15:03:39 node8 kernel: [260532.466858] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
May 13 15:03:39 node8 kernel: [260532.466889] CR2: ffff8806514603d0 CR3: 0000000001c03000 CR4: 00000000000026e0
May 13 15:03:39 node8 kernel: [260532.466938] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 13 15:03:39 node8 kernel: [260532.466986] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
May 13 15:03:39 node8 kernel: [260532.467036] Process ksmd (pid: 50, threadinfo ffff88061aaae000, task ffff8806174eae40)
May 13 15:03:39 node8 kernel: [260532.467085] Stack:
May 13 15:03:39 node8 kernel: [260532.467107]  ffffea001512c9b8 ffff8802f6f7c180 ffff88034830a6c0 ffff88061147c480
May 13 15:03:39 node8 kernel: [260532.467166]  ffff88061aaafe10 ffffffff811563bf ffff88061aaafea8 ffff880617775b00
May 13 15:03:39 node8 kernel: [260532.467225]  ffff88061aaafe60 ffffffff811567d1 ffff88061aaafe60 ffff880617775b58
May 13 15:03:39 node8 kernel: [260532.467283] Call Trace:
May 13 15:03:39 node8 kernel: [260532.467310]  [<ffffffff811563bf>] remove_trailing_rmap_items+0x2f/0x60
May 13 15:03:39 node8 kernel: [260532.467343]  [<ffffffff811567d1>] scan_get_next_rmap_item+0x81/0x450
May 13 15:03:39 node8 kernel: [260532.467376]  [<ffffffff8115702e>] ksm_scan_thread+0xfe/0x2d0
May 13 15:03:39 node8 kernel: [260532.467409]  [<ffffffff81084c30>] ? wake_up_bit+0x40/0x40
May 13 15:03:39 node8 kernel: [260532.467440]  [<ffffffff81156f30>] ? cmp_and_merge_page+0x390/0x390
May 13 15:03:39 node8 kernel: [260532.467473]  [<ffffffff810846d6>] kthread+0x96/0xa0
May 13 15:03:39 node8 kernel: [260532.467505]  [<ffffffff8160e064>] kernel_thread_helper+0x4/0x10
May 13 15:03:39 node8 kernel: [260532.467537]  [<ffffffff81084640>] ? kthread_worker_fn+0x190/0x190
May 13 15:03:39 node8 kernel: [260532.467569]  [<ffffffff8160e060>] ? gs_change+0x13/0x13
May 13 15:03:39 node8 kernel: [260532.467599] Code: 89 ef 48 89 53 30 48 89 43 38 e8 05 91 fb ff 4c 89 ef e8 2d 56 fc ff 49 83 7c 24 18 00 74 4d 48 83 2d bd 51 d9 00 01 48 8b 7b 08 <f0> ff 4f 28 0f 94 c0 84 c0 74 05 e8 69 95 fe ff 48 81 63 18 00 
May 13 15:03:39 node8 kernel: [260532.467850] RIP  [<ffffffff81156327>] remove_rmap_item_from_tree+0xe7/0x150
May 13 15:03:39 node8 kernel: [260532.467888]  RSP <ffff88061aaafdd0>
May 13 15:03:39 node8 kernel: [260532.467913] CR2: ffff8806514603d0
May 13 15:03:39 node8 kernel: [260532.468195] ---[ end trace f2b245c86376149e ]---
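
A small sketch for reading the oops above, assuming the standard x86 page-fault error-code bits (bit 0 present, bit 1 write, bit 2 user mode) and reading the marked bytes in the Code line (`<f0> ff 4f 28`) as a locked decrement at offset 0x28 from RDI. The helper names are made up for illustration.

```python
# Sketch only: interpret "Oops: 0002" and relate CR2 to RDI for the trace above.
# Bit meanings follow the x86 page-fault error code; helper names are hypothetical.


def decode_page_fault_error(code):
    """Split an x86 page-fault error code into its documented flag bits."""
    return {
        "protection_violation": bool(code & 0x01),  # 0 means page not present
        "write": bool(code & 0x02),
        "user_mode": bool(code & 0x04),
        "reserved_bit": bool(code & 0x08),
        "instruction_fetch": bool(code & 0x10),
    }


def decode_modrm(byte):
    """Return (mod, reg, rm) fields of an x86 ModRM byte."""
    return (byte >> 6) & 0x3, (byte >> 3) & 0x7, byte & 0x7


# Values copied from the oops in this comment.
RDI = 0xFFFF8806514603A8
CR2 = 0xFFFF8806514603D0
FAULTING_BYTES = bytes.fromhex("f0ff4f28")  # lock prefix, opcode FF, ModRM, disp8

if __name__ == "__main__":
    print(decode_page_fault_error(0x0002))
    mod, reg, rm = decode_modrm(FAULTING_BYTES[2])
    # mod=1 means an 8-bit displacement follows; reg=1 selects DEC for opcode FF;
    # rm=7 selects RDI when no REX prefix is present.
    print("mod/reg/rm:", mod, reg, rm, "disp8:", hex(FAULTING_BYTES[3]))
    print("CR2 - RDI =", hex(CR2 - RDI))
```

Read this way, the fault is a kernel-mode write to a non-present page, and the faulting address equals RDI + 0x28, so the pointer held in RDI itself appears to be bad. Whether that points to corrupted memory or to a KSM bug cannot be decided from the trace alone.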

Markus Schade (lp-markusschade) wrote on 2012-06-12:

#22

It doesn't get much better. Here is a recent call trace with 3.0.0-19 from the oneiric backports:

Jun 11 02:10:13 node2 kernel: [2387197.859644] ------------[ cut here ]------------
Jun 11 02:10:13 node2 kernel: [2387197.859653] WARNING: at /build/buildd/linux-lts-backport-oneiric-3.0.0/net/sched/sch_generic.c:255 dev_watchdog+0x24d/0x260()
Jun 11 02:10:13 node2 kernel: [2387197.859656] Hardware name: MS-7522
Jun 11 02:10:13 node2 kernel: [2387197.859658] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Jun 11 02:10:13 node2 kernel: [2387197.859660] Modules linked in: cls_u32 sch_sfq sch_htb xt_physdev xt_mac ip6table_filter ip6_tables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp nouveau iptable_filter ip_tables x_tables kvm_intel kvm ib_iser speedstep_lib rdma_cm ttm ib_cm iw_cm ib_sa ib_mad ib_core drm_kms_helper ib_addr drm iscsi_tcp libiscsi_tcp psmouse libiscsi i2c_algo_bit scsi_transport_iscsi mxm_wmi wmi serio_raw i7core_edac edac_core lp parport video bridge stp multipath linear 3w_9xxx 3w_xxxx raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 e1000 r8169 aacraid ahci libahci sata_nv sata_sil sata_via
Jun 11 02:10:13 node2 kernel: [2387197.859708] Pid: 2285, comm: kvm Not tainted 3.0.0-19-server #33~lucid1-Ubuntu
Jun 11 02:10:13 node2 kernel: [2387197.859710] Call Trace:
Jun 11 02:10:13 node2 kernel: [2387197.859712] <IRQ> [<ffffffff81061ccf>] warn_slowpath_common+0x7f/0xc0
Jun 11 02:10:13 node2 kernel: [2387197.859720] [<ffffffff81061dc6>] warn_slowpath_fmt+0x46/0x50
Jun 11 02:10:13 node2 kernel: [2387197.859724] [<ffffffff81049e89>] ? sched_slice+0x59/0xa0
Jun 11 02:10:13 node2 kernel: [2387197.859728] [<ffffffff8152376d>] dev_watchdog+0x24d/0x260
Jun 11 02:10:13 node2 kernel: [2387197.859731] [<ffffffff81523520>] ? __netdev_watchdog_up+0x80/0x80
Jun 11 02:10:13 node2 kernel: [2387197.859735] [<ffffffff81071bd9>] call_timer_fn+0x49/0x130
Jun 11 02:10:13 node2 kernel: [2387197.859738] [<ffffffff81523520>] ? __netdev_watchdog_up+0x80/0x80
Jun 11 02:10:13 node2 kernel: [2387197.859741] [<ffffffff81071ff9>] run_timer_softirq+0x149/0x280
Jun 11 02:10:13 node2 kernel: [2387197.859745] [<ffffffff812fd9f0>] ? timerqueue_add+0x60/0xb0
Jun 11 02:10:13 node2 kernel: [2387197.859749] [<ffffffff8102911d>] ? lapic_next_event+0x1d/0x30
Jun 11 02:10:13 node2 kernel: [2387197.859753] [<ffffffff81068c0f>] __do_softirq+0xbf/0x200
Jun 11 02:10:13 node2 kernel: [2387197.859757] [<ffffffff81089167>] ? hrtimer_interrupt+0x127/0x210
Jun 11 02:10:13 node2 kernel: [2387197.859761] [<ffffffff8160e15c>] call_softirq+0x1c/0x30
Jun 11 02:10:13 node2 kernel: [2387197.859764] [<ffffffff8100d415>] do_softirq+0x65/0xa0
Jun 11 02:10:13 node2 kernel: [2387197.859767] [<ffffffff81068a0d>] irq_exit+0xbd/0xe0
Jun 11 02:10:13 node2 kernel: [2387197.859770] [<ffffffff8160ea9e>] smp_apic_timer_interrupt+0x6e/0x99
Jun 11 02:10:13 node2 kernel: [2387197.859772] [<ffffffff8160d913>] apic_timer_interrupt+0x13/0x20
Jun 11 02:10:13 node2 kernel: [2387197.859774] <EOI> [<ffffffffa02a1e97>] ? start_apic_timer+0x57/0x80 [kvm]
Jun 11 02:10:13 node2 ker...

mcelog errors and server freeze with qemu-kvm 0.12.3 and linux-image-2.6.32-32-server

Changed in linux (Ubuntu):
status:	Confirmed → Fix Released
