xen dom0 crashes with "BUG: soft lockup - CPU#0 stuck for 11s!"

Bug #238549 reported by Sergio Tosti
This bug affects 6 people
Affects: linux (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

Binary package hint: linux-image-xen

lsb_release -rd
Description: Ubuntu 8.04
Release: 8.04

relevant syslog:
[...]
Jun 9 08:07:21 morpheus kernel: [29522.326860] BUG: soft lockup - CPU#0 stuck for 11s! [savelog:11194]
Jun 9 08:07:21 morpheus kernel: [29522.326875]
Jun 9 08:07:21 morpheus kernel: [29522.326881] Pid: 11194, comm: savelog Tainted: G B D (2.6.24-19-xen #2)
Jun 9 08:07:21 morpheus kernel: [29522.326888] EIP: 0061:[dm_mod:_spin_lock+0x7/0x10] EFLAGS: 00000282 CPU: 0
Jun 9 08:07:21 morpheus kernel: [29522.326903] EIP is at _spin_lock+0x7/0x10
Jun 9 08:07:21 morpheus kernel: [29522.326908] EAX: c1daf2ec EBX: 00000000 ECX: 17097000 EDX: 00000000
Jun 9 08:07:21 morpheus kernel: [29522.326912] ESI: 38297067 EDI: 00000000 EBP: d7097f30 ESP: d70d5e7c
Jun 9 08:07:21 morpheus kernel: [29522.326916] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
Jun 9 08:07:21 morpheus kernel: [29522.326928] CR0: 8005003b CR2: 08062614 CR3: 1927b000 CR4: 00000660
Jun 9 08:07:21 morpheus kernel: [29522.326935] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Jun 9 08:07:21 morpheus kernel: [29522.326942] DR6: ffff0ff0 DR7: 00000400
Jun 9 08:07:21 morpheus kernel: [29522.326947] [copy_page_range+0x4c1/0x950] copy_page_range+0x4c1/0x950
Jun 9 08:07:21 morpheus kernel: [29522.327107] [copy_process+0x8c1/0x1200] copy_process+0x8c1/0x1200
Jun 9 08:07:21 morpheus kernel: [29522.327168] [do_page_fault+0x366/0xe90] do_page_fault+0x366/0xe90
Jun 9 08:07:21 morpheus kernel: [29522.327248] [do_fork+0x40/0x260] do_fork+0x40/0x260
Jun 9 08:07:21 morpheus kernel: [29522.327284] [<c011df90>] default_wake_function+0x0/0x10
Jun 9 08:07:21 morpheus kernel: [29522.327307] [sys_clone+0x36/0x40] sys_clone+0x36/0x40
Jun 9 08:07:21 morpheus kernel: [29522.327331] [syscall_call+0x7/0x0b] syscall_call+0x7/0xb
Jun 9 08:07:21 morpheus kernel: [29522.327372] [vcc_def_wakeup+0x10/0x60] vcc_def_wakeup+0x10/0x60
Jun 9 08:07:21 morpheus kernel: [29522.327410] =======================
[...]

uname -r:
2.6.24-18-xen

/proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping : 1
cpu MHz : 3200.712
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl cid xtpr
bogomips : 6406.44
clflush size : 64

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping : 1
cpu MHz : 3200.712
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl cid xtpr
bogomips : 6401.50
clflush size : 64

lspci:
00:00.0 Host bridge: Intel Corporation 82875P/E7210 Memory Controller Hub (rev 02)
00:03.0 PCI bridge: Intel Corporation 82875P/E7210 Processor to PCI to CSA Bridge (rev 02)
00:06.0 System peripheral: Intel Corporation 82875P/E7210 Processor to I/O Memory Interface (rev 02)
00:1c.0 PCI bridge: Intel Corporation 6300ESB 64-bit PCI-X Bridge (rev 02)
00:1d.0 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02)
00:1d.1 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02)
00:1d.4 System peripheral: Intel Corporation 6300ESB Watchdog Timer (rev 02)
00:1d.5 PIC: Intel Corporation 6300ESB I/O Advanced Programmable Interrupt Controller (rev 02)
00:1d.7 USB Controller: Intel Corporation 6300ESB USB2 Enhanced Host Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 0a)
00:1f.0 ISA bridge: Intel Corporation 6300ESB LPC Interface Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 6300ESB SATA Storage Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 6300ESB SMBus Controller (rev 02)
02:01.0 Ethernet controller: Intel Corporation 82547GI Gigabit Ethernet Controller
04:02.0 VGA compatible controller: ATI Technologies Inc Radeon RV100 QY [Radeon 7000/VE]
04:08.0 Multimedia audio controller: Ensoniq ES1371 [AudioPCI-97] (rev 06)

The system freezes at random times, and after a reboot the syslog contains the information above.

Thanks
--Sergio
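
To count how often the lockup occurs, the syslog can be scanned for the "BUG: soft lockup" message quoted in the description. The following is a minimal sketch, assuming the message format shown in that excerpt; the name `find_soft_lockups` and the `sample` list are illustrative and not part of any existing tool.

```python
import re

# Matches kernel messages of the form seen in the syslog excerpt:
#   BUG: soft lockup - CPU#0 stuck for 11s! [savelog:11194]
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft-lockup message found in the given lines."""
    hits = []
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            hits.append({
                "cpu": int(m.group("cpu")),
                "seconds": int(m.group("secs")),
                "comm": m.group("comm"),
                "pid": int(m.group("pid")),
            })
    return hits


# Two lines copied from the report; only the first is a lockup message.
sample = [
    "Jun 9 08:07:21 morpheus kernel: [29522.326860] BUG: soft lockup - CPU#0 stuck for 11s! [savelog:11194]",
    "Jun 9 08:07:21 morpheus kernel: [29522.326903] EIP is at _spin_lock+0x7/0x10",
]
print(find_soft_lockups(sample))
```

To scan a real log, pass an open file object instead of `sample`. The log path (typically /var/log/syslog on Ubuntu) is an assumption and may differ per system.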

Tags: kj-expired
Sergio Tosti (zeno979) wrote :

By the way, no problem with 2.6.18-generic

Sergio Tosti (zeno979) wrote :

sorry, 2.6.24-18-generic

Darik Horn (dajhorn) wrote :

I can confirm this bug with 2.6.24-18-xen and 2.6.24-19-xen on a late Pentium D. It usually happens within two days of runtime when the system is loaded. My cpuinfo is attached.

This glitch on my computer usually trashes I/O and results in several "faulty removed" components in MD arrays.

Nick Ellery (nick.ellery) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. Unfortunately we can't fix it, because your description does not yet have enough information.

Please include the following additional information, if you have not already done so (pay attention to lspci's additional options), as required by the Ubuntu Kernel Team:
1. Please include the output of the command "uname -a" in your next response. It should be one, long line of text which includes the exact kernel version you're running, as well as the CPU architecture.
2. Please run the command "dmesg > dmesg.log" after a fresh boot and attach the resulting file "dmesg.log" to this bug report.
3. Please run the command "sudo lspci -vvnn > lspci-vvnn.log" and attach the resulting file "lspci-vvnn.log" to this bug report.
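
The three requests above could be gathered in one step. This is only a sketch, assuming the commands are available on the affected machine; the file name `uname-a.log` and the helper names `run_command` and `collect` are hypothetical, and `dmesg` should still be captured after a fresh boot as requested.

```python
import subprocess
from pathlib import Path

# Output file -> command, following the three requests above.
# "uname-a.log" is a made-up file name; the other two are the ones requested.
COMMANDS = {
    "uname-a.log": ["uname", "-a"],
    "dmesg.log": ["dmesg"],
    "lspci-vvnn.log": ["sudo", "lspci", "-vvnn"],
}


def run_command(argv):
    """Run one command and return its standard output as text."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


def collect(outdir, commands=COMMANDS, runner=run_command):
    """Write the output of each command to its file under outdir.

    Returns the list of files written. `runner` is injectable so the
    function can be exercised without running the real commands.
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, argv in commands.items():
        path = outdir / filename
        path.write_text(runner(argv))
        written.append(path)
    return written
```

Calling `collect("bugreport")` would then leave the three files in a `bugreport` directory, ready to attach to the report.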

For your reference, the full description of procedures for kernel-related bug reports is available at https://wiki.ubuntu.com/KernelTeamBugPolicies Thanks in advance!

Changed in linux-meta:
status: New → Incomplete
Alex Dehnert (adehnert) wrote :

I think I may have seen the same bug. At least, the Xen dom0 will occasionally completely freeze up. Moving the mouse does not cause the cursor to move in X11, pressing keys doesn't produce any visible result (in X11 or console), ssh sessions freeze up (to host and guest), other services are not responded to, and so forth. I haven't tried SysRq, though I will next time it goes down.

Nothing appears in the log files (that I've seen).

I haven't figured out a reliable way to reproduce the problem. However, it does seem to happen more often during disk-intensive operations (e.g., rebuilding a RAID array).

I've been seeing this problem fairly frequently since around when I got home for the summer and upgraded to 8.04 and installed ubuntu-desktop (previously, the machine was only serving as a server). I don't recall hitting the issue before the upgrade, but I may be mistaken.

The bug occurs with 2.6.24-18-xen and 2.6.24-19-xen, but does not seem to occur with the non-Xen kernels (admittedly, while I've tried to stress the non-Xen kernels, without a reliable way to trigger the bug I can't be sure it isn't there, and I've used the Xen kernels much more).

Let me know if there is a way to attach several files to one comment. I don't see one, so I'll submit a few more comments.

Also, please let me know if you need anything else.

Alex Dehnert (adehnert) wrote :

One Athlon64.

Leonardo Silva Amaral (leleobhz) wrote :

As in https://bugs.launchpad.net/bugs/240903, I'm reporting my hardware here too.

Bastian Mäuser (mephisto-mephis) wrote :

See https://bugs.launchpad.net/ubuntu/+source/linux-meta/+bug/259487 (a duplicate, I think).

This is a really bad problem and needs to be addressed very soon.

Tim Gardner (timg-tpi) wrote :

James Troup (Canonical IS) reports that using an amd64 dom0 kernel works around the problem.

John Edwards (john-cornerstonelinux) wrote :

I have been getting the same crash on a Pentium D machine running kernel 2.6.24-19-xen when I try to start a domU using 'xm create'. I have just stopped AppArmor and 'xm create' no longer causes the crash.

It may be worth testing with AppArmor not running on the dom0.

Claudinei Matos (claudineimatos) wrote :

In my case it is the domU that is crashing.
It's a fresh Ubuntu server install with kernel 2.6.24-21-xen, and the hardware is a Xeon Core Duo (cpuinfo attached).

I also tried disabling AppArmor, but the problem remains.

Interestingly, I have two domUs running (one for MySQL and the other for Apache), but only the web server machine crashes periodically. The database machine (as well as dom0) has not crashed since last Sunday, while the web server machine has already crashed around 6 times.

Lily (starlily) wrote :

I have a Dell 6650 running whatever the latest xen server image is (and I run update/upgrade/dist-upgrade frequently). One DomU locks up pretty regularly with "BUG: soft lockup - CPU#1 stuck for 11s!", usually during large file transfers. DomU Kernel version is 2.6.24-19. The bug *requires* destroying the DomU and restarting it.

It's pretty clear after reading many bug reports about this that it is in the kernel somewhere (and the kernel team has responded by changing their policy about bug reporting). It is clearly NOT hardware or application specific, as it is reported on many platforms and does not appear to have a consistent trigger.

Potentially, this is related to SMP, or PAE, although I find that listing these as an area of issue is an easy scapegoat, even if it may be true.

I'd really like someone who KNOWS what causes this bug to provide a definitive answer somewhere the public can see it and, if possible, provide a workaround or a target date for the release of a fix.

Thanks!
Lily

Fran Boon (flavour) wrote :

I have the same problem in my DomUs (Dom0 is fine).
I need to destroy/create the VMs to recover.

Host:
# uname -a
Linux sahana 2.6.24-23-xen #1 SMP Mon Jan 26 03:12:59 UTC 2009 i686 GNU/Linux

Guests:
# uname -a
Linux sahana1 2.6.24-23-xen #1 SMP Mon Jan 26 03:12:59 UTC 2009 i686 GNU/Linux

dmesg.log & lspci-vvnn.log from host attached

Fran Boon (flavour) wrote :
Leonardo Silva Amaral (leleobhz) wrote :

This problem does not appear on the same hardware I reported when I use a Debian Lenny dom0 with Xen 3.3.1 and kernel 2.6.27 (hg version). Maybe a version issue?

Fran Boon wrote:
> ** Attachment added: "lspci-vvnn.log.bz2"
> http://launchpadlibrarian.net/23302922/lspci-vvnn.log.bz2

Christiaan Ottow (chris-6core) wrote :

I'm having the same problems with linux 2.6.24-23-xen on a dual P4 Xeon system in my dom0.

Apart from the amd64 kernel, has anybody found a workaround for this bug?

Leonardo Silva Amaral (leleobhz) wrote :

My workaround is: don't use 2.6.24. I'm now using Xen 3.3.1 + Linux 2.6.27-hg (from the HG repository) and I no longer have any kind of problem with Xen (even DMRAID over Xen now works well).

Andy Whitcroft (apw) wrote :

This is not a bug in the linux-meta package, moving to the linux package.

affects: linux-meta (Ubuntu) → linux (Ubuntu)
Darik Horn (dajhorn) wrote :

This bug persists in the latest 2.6.24-24-xen kernel package that was recently released for Hardy running in a domU on an Intel Xeon CPU L5335.

I can provoke the crash by downloading something with wget that pins the ethernet interface.

stiV (stefan-wehinger) wrote :

This bug persists with Ubuntu 8.04.3 running on XenServer 5.5u1.
Interestingly, the bug only happens on i386 machines (systems completely up to date with full-upgrade); the amd64 machines don't crash, or at least they haven't for several days now.
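
Since the reports in this thread point at 32-bit 2.6.24 Xen kernels, a quick check of a `uname -a` line can tell whether a machine matches that pattern. This is a rough sketch based only on the configurations reported here, not a confirmed diagnosis; the function name `matches_reported_setup` is hypothetical.

```python
import re

# Kernel releases reported as affected in this thread look like "2.6.24-19-xen".
XEN_2624_RE = re.compile(r"^2\.6\.24-\d+-xen$")
# 32-bit x86 machine strings as printed by uname; amd64 shows up as "x86_64".
I386_MACHINES = {"i386", "i486", "i586", "i686"}


def matches_reported_setup(uname_a):
    """Return True if a `uname -a` line shows a 32-bit 2.6.24 Xen kernel."""
    fields = uname_a.split()
    if len(fields) < 3:
        return False
    release = fields[2]  # third field of `uname -a` is the kernel release
    is_i386 = any(f in I386_MACHINES for f in fields[3:])
    return bool(XEN_2624_RE.match(release)) and is_i386
```

A match only means the setup resembles the ones reported above. Machines that do not match may still be affected, and matching machines may run without trouble.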

Cinquero (cinquero) wrote :

It seems that the same problem existed for Windows HVM guests:

http://old.nabble.com/BSOD-%22A-clock-interrupt-was-not-recevied-ona-secondary-processor-within-the-allocated-time-interval%22-td21200747.html

A solution could possibly be to disable the cpu watchdog timer, but I have not tested it yet and I also don't know if that can be done without rebuilding the kernel.

Jeremy Foshee (jeremyfoshee) wrote :

This bug report was marked as Incomplete and has not had any updated comments for quite some time. As a result this bug is being closed. Please reopen if this is still an issue in the current Ubuntu release http://www.ubuntu.com/getubuntu/download . Also, please be sure to provide any requested information that may have been missing. To reopen the bug, click on the current status under the Status column and change the status back to "New". Thanks.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kj-expired
Changed in linux (Ubuntu):
status: Incomplete → Expired