Virtual machine soft lockup - CPU gets stuck for XX seconds

Bug #333201 reported by Stephan
This bug affects 3 people
Affects: linux (Ubuntu)
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

I am running a virtual machine using KVM (managed via libvirt, installed with vmbuilder).

For one particular virtual machine on the host (so far), the guest occasionally stops responding at random and needs a reboot. On the screen (when I VNC to it) I get messages like "BUG: soft lockup - CPU#1 stuck for 61s! [fcheck:7246]", but it doesn't have to be fcheck; it can be any process.

The other virtual machine on the host doesn't crash at the same time - it hasn't actually crashed yet at all since it was installed, although it uses the same kernel, was installed with the same commands and has the same versions of everything. The virtual machine that crashes is fine for days, even weeks, but then crashes.

I am graphing both the virtual machine and the host via Munin and nothing unusual is happening around the times of these crashes; the load on both is very low. I have attached the logs (/var/log/syslog) for two example crashes.

Revision history for this message
Stephan (stephan-fishycam) wrote :

Sorry if you got lots of e-mails there, I couldn't see how to add multiple attachments.

I have more information on this.

For the virtual server host where none of our virtual machines suffer from this, we are running the 2.6.27-7-server kernel.
For the virtual server host where just one of our virtual machines suffers from this, we are running the 2.6.27-11-server kernel.

Are there any changes between the two versions that could cause something like this? Would you recommend I try the older kernel?

Revision history for this message
Stephan (stephan-fishycam) wrote :

This problem just happened again. Here is an extract from /var/log/syslog on the machine affected.
As before, the other virtual machine on this host was ok.

Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] BUG: soft lockup - CPU#1 stuck for 61s! [fcheck:21720]
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] Modules linked in: ipv6 evdev psmouse serio_raw button ext3 jbd mbcache sr_mod cdrom sd_mod crc_t10dif sg ata_generic uhci_hcd ata_piix e1000 usbcore libata scsi_mod dock thermal processor fan
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] CPU 1:
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] Modules linked in: ipv6 evdev psmouse serio_raw button ext3 jbd mbcache sr_mod cdrom sd_mod crc_t10dif sg ata_generic uhci_hcd ata_piix e1000 usbcore libata scsi_mod dock thermal processor fan
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] Pid: 21720, comm: fcheck Not tainted 2.6.27-11-server #1
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] RIP: 0010:[<ffffffff802abeb7>] [<ffffffff802abeb7>] find_get_pages+0x77/0x110
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] RSP: 0000:ffff880003d37948 EFLAGS: 00000293
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] RAX: ffff880006aaa5e8 RBX: ffff880003d37988 RCX: ffff880006aaa5e8
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffe20000126900
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] RBP: ffff880003d378f8 R08: 0000000000000002 R09: 0000000000000001
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] R10: 0000000000000002 R11: ffff880003d37a78 R12: ffffffff802b6d84
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] R13: ffff880003d37988 R14: 0000000000000001 R15: 0000000000000800
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] FS: 00007f1991ff36e0(0000) GS:ffff88000f495180(0000) knlGS:0000000000000000
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] CR2: 0000000000dfb190 CR3: 000000000f15e000 CR4: 00000000000006e0
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322]
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] Call Trace:
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] [<ffffffff802abe83>] ? find_get_pages+0x43/0x110
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] [<ffffffff802b6a04>] ? pagevec_lookup+0x24/0x30
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] [<ffffffff802b825b>] ? __invalidate_mapping_pages+0x8b/0x1a0
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] [<ffffffff802b4874>] ? get_dirty_limits+0x14/0x2b0
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] [<ffffffff803024ae>] ? generic_forget_inode+0x4e/0x190
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] [<ffffffff802b8380>] ? invalidate_mapping_pages+0x10/0x20
Mar 3 22:32:25 gla1-mailman1 kernel: [740722.950322] [<ffffffff...
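When comparing several crashes, it can help to condense the "BUG: soft lockup" lines into one line per event. The following is only a sketch: the function name summarise_lockups is made up for illustration, and it assumes the syslog line format shown in the extract above.

```shell
# Summarise "BUG: soft lockup" lines from a syslog-style file.
# Prints one line per event: timestamp, CPU number, stuck time, process name, pid.
# Assumes the "<date> <host> kernel: ... CPU#N stuck for Ns! [comm:pid]" layout
# seen in this report; other kernels may format the message differently.
summarise_lockups() {
    # $1: path to the log file, e.g. /var/log/syslog
    grep 'BUG: soft lockup' "$1" |
        sed -E 's/^(.*) [^ ]+ kernel: .*CPU#([0-9]+) stuck for ([0-9]+)s! \[([^:]+):([0-9]+)\].*$/\1 cpu=\2 stuck=\3s comm=\4 pid=\5/'
}
```

Run as `summarise_lockups /var/log/syslog`, the first line of the extract above would come out as `Mar 3 22:32:25 cpu=1 stuck=61s comm=fcheck pid=21720`.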

Revision history for this message
Stephan (stephan-fishycam) wrote :

Hi, sorry to be a pain, but this is a big deal for us.

It hasn't happened since switching to the other kernel. (so it did happen with 2.6.27-11-server but hasn't happened yet with 2.6.27-7-server)

Can I get any more information to help you out? Please let me know.

Revision history for this message
Ali Ross (gnu2tux) wrote :

confirm error. Seems to be just 2.6.27-7.

Perhaps hardware specific? Dell Poweredge server here.

Changed in linux:
status: New → Confirmed
Revision history for this message
Ali Ross (gnu2tux) wrote :

I meant 2.6.27-11.

Revision history for this message
Stephan (stephan-fishycam) wrote :

The "soft lockup" has now also happened with the previous kernel (2.6.27-7-server).
So it's happening with 2.6.27-11-server and also 2.6.27-7-server now.

I will upgrade the kernel on the host and virtual machine and wait.

Revision history for this message
Stephan (stephan-fishycam) wrote :

This is now fixed, we upgraded the BIOS on the server (Dell Poweredge 1950).

Revision history for this message
Bryan McLellan (btm) wrote :

I occasionally experience this error on a 9.10 guest running 2.6.31-14-server on a 9.10 host with 2.6.31-14-generic and kvm=1:84+dfsg-0ubuntu16+0.11.0+0ubuntu6.3 on an HP DL360 G6.

Revision history for this message
Bryan McLellan (btm) wrote :

Still seeing this issue.

Host:
  Updated to the latest BIOS / Firmware on the DL360 G6 to date
  2.6.31-20-generic
  qemu-kvm=0.11.0-0ubuntu6.3
Guest:
  2.6.31-20-server

Installing cpuburn and running two instances of BurnP6 on the guest produced cpu soft lockups within 24 hours.

Revision history for this message
Marcus Bointon (marcus-synchromedia) wrote :

I'm getting this on 10.04 beta2 with 2.6.32-19-virtual (in the VM, built with vmbuilder) and 2.6.32-19-server on the host on boot of a VM, rendering virtualization completely inoperable, 100% failure rate. I'm running qemu-kvm 0.12.3+noroms-0ubuntu5

Revision history for this message
Bryan McLellan (btm) wrote :

Marcus, what hardware are you experiencing this on?

Revision history for this message
Marcus Bointon (marcus-synchromedia) wrote :

Sorry I didn't see your question earlier.
I'm now running the release version of 10.04:
Linux 2.6.32-21-server #32-Ubuntu SMP Fri Apr 16 09:17:34 UTC 2010 x86_64 GNU/Linux
This particular machine isn't very high powered, but it should be at least usable. It has a single quad-core L5320 Xeon (with vmx), 10 GB RAM, SATA soft RAID-1, no other major processes (not even apache), load average < 0.1. FWIW the host OS reboots in under a minute.

Here's CPUinfo on one core:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU L5320 @ 1.86GHz
stepping : 7
cpu MHz : 1866.966
cache size : 4096 KB
physical id : 1
siblings : 4
core id : 0
cpu cores : 4
apicid : 4
initial apicid : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx tm2 ssse3 cx16 xtpr pdcm dca lahf_lm tpr_shadow
bogomips : 3733.93
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

I've attached a screenshot of the kind of thing it's doing. It takes about 20 minutes to even get as far as this screen though! It's completely unusable: after 50 minutes it reached a login prompt, but typed input either doesn't work or is too slow to tell.

This is the command I gave to vmbuilder (which can't get much more vanilla!)

vmbuilder kvm ubuntu \
--suite lucid \
--flavour virtual \
--arch amd64 \
--libvirt qemu:///system \
--hostname vm1 \
--user user \
--name user \
--pass default \
--ip 192.168.0.100 \
--dest /root/vm1

and kvm is run from its generated run.sh file with this:

exec kvm -m 128 -smp 1 -drive file=tmpkJv9OP.qcow2 "$@"

I don't know if this problem is because vmbuilder built a bad image, the config is bad, or because kvm isn't working right.

Revision history for this message
Marcus Bointon (marcus-synchromedia) wrote :

I managed to find a KVM appliance image (there don't seem to be many around) here:
http://ica-atom.org/docs/index.php?title=ICA-AtoM_virtual_appliance
This VM works with respectable performance and no errors on my server, so it looks like kvm is in the clear. The only difference in the config is the RAM allocation in the run script (128 MB vs 256 MB). I increased it in my generated VM and it still had the same problems, so it looks like vmbuilder (or JeOS itself) is at fault.

Revision history for this message
Stephan (stephan-fishycam) wrote :

I still get this error from time to time. I'm not on the servers I originally reported the issue with as that company went into administration, so I can't say if they are still doing it.

I get the error on Ubuntu 9.10 and also 10.04. The guest and the host are both 10.04 now. The guests are JeOS.
The host server is a quad core 2.4 GHz with 8 GB of RAM and tons of spare capacity. It's an ASUS P5K-VM motherboard.

I've attached the latest log file.

Revision history for this message
Bryan McLellan (btm) wrote :

10.04.01 guests built on 9.10 hosts using vmbuilder=0.12.4-0ubuntu1 (from maverick; I had issues with 0.12.3-0ubuntu1, and I see there is a newer version in -proposed) still show soft lockups. It only affects one type of server, and it doesn't seem to be purely CPU driven, although the guests with higher load do hit it much more often.

Has anyone tried this with a 10.04 (lucid) host or at least kvm=1:84+dfsg-0ubuntu16+0.12.3+noroms+0ubuntu9 backported to 9.10 (karmic)?

I can't imagine it would be vmbuilder or JeOS, as the former is mostly a convenience script around debootstrap, libvirt, and such. The latter shouldn't be able to have any essential packages missing because, given the way it is built, dependencies would be enforced. Thus toolchain issues _should_ affect a greater number of people.

I'm going back to betting on KVM or the kernel. I'll try to narrow it down more, as this is affecting my production systems gravely now.

Revision history for this message
Sergey Svishchev (svs) wrote :

This is a long shot, but try another clocksource (normally kvm-clock). I've had to disable kvm-clock altogether for another reason, and since that time "soft lockups" did not happen on 9.10 systems.
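For reference, a rough sketch of how the guest's clocksource can be inspected and switched at runtime through the kernel's sysfs clocksource files. The function names and the CLOCKSOURCE_DIR variable are made up for this sketch, and which clocksources are offered depends on the guest kernel, so the names in available_clocksource should be checked first.

```shell
# Directory holding the kernel's clocksource controls. The CLOCKSOURCE_DIR
# variable is this sketch's own; it only exists so the path can be overridden.
CLOCKSOURCE_DIR="${CLOCKSOURCE_DIR:-/sys/devices/system/clocksource/clocksource0}"

# Print the clocksource in use and the ones the kernel offers.
show_clocksource() {
    echo "current:   $(cat "$CLOCKSOURCE_DIR/current_clocksource")"
    echo "available: $(cat "$CLOCKSOURCE_DIR/available_clocksource")"
}

# Switch to the clocksource named in $1 (needs root on a real system).
# Refuses names that the kernel does not list as available.
set_clocksource() {
    if tr ' ' '\n' < "$CLOCKSOURCE_DIR/available_clocksource" | grep -qxF -- "$1"; then
        echo "$1" > "$CLOCKSOURCE_DIR/current_clocksource"
    else
        echo "clocksource '$1' is not available" >&2
        return 1
    fi
}
```

On a guest this would be used as `show_clocksource` followed by, for example, `set_clocksource acpi_pm`. A runtime change is lost on reboot; a persistent change would normally go on the guest's kernel command line (the `clocksource=` and `no-kvmclock` boot parameters), which is not covered by this sketch.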

Revision history for this message
Steven Wagner (stevenwagner) wrote :

I am getting the same error, using Ubuntu Server 10.04 64-bit with an Ubuntu Server 10.04 64-bit guest. I am first trying to turn off CPU frequency scaling to see if that makes the issue go away. Right now I can't reproduce it on demand, but it occurs consistently after about 2 weeks of uptime. Sergey, what are the steps to switch off kvm-clock?

Revision history for this message
Stephan (stephan-fishycam) wrote :

I haven't had this problem for months. I don't think anything changed on my side. This is a tricky one!

Revision history for this message
Martin (martin00) wrote :

I have the same problem with an Intel P55 mainboard. Sadly, it's a production machine.
I get it randomly every 2-4 weeks on one VM out of 6. It's always "cpu stuck", and the websites on it go down. I hate it very badly.

Host: 2.6.32-24-server #41-Ubuntu (now installing #42)
VM: 2.6.32-24-server #41-Ubuntu (now installing #42)

syslog crashed VM:
------------------------------------------------------------------------------------------
Sep 6 08:47:51 cluster qmail-smtpd: qmail-smtpd/VC started
Sep 6 08:49:03 cluster kernel: [538123.970006] BUG: soft lockup - CPU#0 stuck for 61s! [swapper:0]
Sep 6 08:49:03 cluster kernel: [538123.993618] Modules linked in: fbcon psmouse tileblit font bitblit i2c_piix4 serio_raw softcursor lp vga16fb vgastate parport floppy
Sep 6 08:49:03 cluster kernel: [538123.993618] CPU 0:
Sep 6 08:49:03 cluster kernel: [538123.993618] Modules linked in: fbcon psmouse tileblit font bitblit i2c_piix4 serio_raw softcursor lp vga16fb vgastate parport floppy
Sep 6 08:49:03 cluster kernel: [538123.993618] Pid: 0, comm: swapper Not tainted 2.6.32-24-server #41-Ubuntu Bochs
Sep 6 08:49:03 cluster kernel: [538123.993618] RIP: 0010:[<ffffffff814abd6b>] [<ffffffff814abd6b>] __inet_lookup_established+0x1ab/0x2c0
Sep 6 08:49:03 cluster kernel: [538123.993618] RSP: 0018:ffff880001c03ba0 EFLAGS: 00000202
Sep 6 08:49:03 cluster kernel: [538123.993618] RAX: 000000000001cb93 RBX: ffff880001c03be0 RCX: 0000000091a53275
Sep 6 08:49:03 cluster kernel: [538123.993618] RDX: ffffc900002c8cb0 RSI: ffffffff81a49000 RDI: 000000000001cb93
Sep 6 08:49:03 cluster kernel: [538123.993618] RBP: ffffffff81013cb3 R08: 00000000e3a9806b R09: 0000000000509bf3
Sep 6 08:49:03 cluster kernel: [538123.993618] R10: 0000000000000002 R11: 0000000000000002 R12: ffff880001c03b20
Sep 6 08:49:03 cluster kernel: [538123.993618] R13: 46c0c8c363174858 R14: ffffffff81a46d80 R15: ffffffff8155fe2c
Sep 6 08:49:03 cluster kernel: [538123.993618] FS: 0000000000000000(0000) GS:ffff880001c00000(0000) knlGS:0000000000000000
Sep 6 08:49:03 cluster kernel: [538123.993618] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Sep 6 08:49:03 cluster kernel: [538123.993618] CR2: 00000000f7056000 CR3: 00000000199f1000 CR4: 00000000000006f0
Sep 6 08:49:03 cluster kernel: [538123.993618] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 6 08:49:03 cluster kernel: [538123.993618] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Sep 6 08:49:03 cluster kernel: [538123.993618] Call Trace:
Sep 6 08:49:03 cluster kernel: [538123.993618] <IRQ> [<ffffffff8146d35d>] ? __skb_checksum_complete_head+0x1d/0x70
Sep 6 08:49:03 cluster kernel: [538123.993618] [<ffffffff814c48bf>] ? tcp_v4_rcv+0x1cf/0x7e0
Sep 6 08:49:03 cluster kernel: [538123.993618] [<ffffffff8146906e>] ? consume_skb+0x1e/0x40
Sep 6 08:49:03 cluster kernel: [538123.993618] [<ffffffff814a2cdd>] ? ip_local_deliver_finish+0xdd/0x2d0
Sep 6 08:49:03 cluster kernel: [538123.993618] [<ffffffff814a2f60>] ? ip_local_deliver+0x90/0xa0
Sep 6 08:49:03 cluster kernel: [538123.993618] [<ffffffff814a241d>] ? ip_rcv_finish+0x12d/0x440
Sep 6 08:49:03 cluster kernel: [538123.993618] [<ffffffff814a29...


Revision history for this message
Steven Wagner (stevenwagner) wrote :

CPU frequency scaling is on by default in lucid.

fix:
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

Once cpu frequency scaling was turned off, I no longer had this issue. This is a cpu timing issue.

Shouldn't this be turned off by default on Ubuntu Server, or at least a warning given to kvm/libvirt users?

Revision history for this message
Stephan (stephan-fishycam) wrote :

Thanks for this.

I've made the change you put above, but also for cpu1, cpu2, cpu3 etc... I also installed rcconf and used it to disable the "ondemand" service from starting, which might also help, I suppose.
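Applying the same setting to every core can be done in a loop rather than one echo per CPU. This is a sketch only: the function names and the CPU_SYSFS variable are made up for illustration, and it assumes each core exposes cpufreq/scaling_governor as in the command quoted above.

```shell
# Base of the per-CPU sysfs tree. CPU_SYSFS is this sketch's own variable,
# present only so the path can be overridden.
CPU_SYSFS="${CPU_SYSFS:-/sys/devices/system/cpu}"

# Write the governor named in $1 (e.g. performance) to every CPU that
# exposes a scaling_governor file. Needs root on a real system.
set_governor_all() {
    for gov in "$CPU_SYSFS"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$gov" ] || continue   # no cpufreq support, or glob did not match
        echo "$1" > "$gov"
    done
}

# Print the governor currently set on each CPU.
show_governors() {
    for gov in "$CPU_SYSFS"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$gov" ] || continue
        echo "$gov: $(cat "$gov")"
    done
}
```

Usage would be `set_governor_all performance` followed by `show_governors` to confirm. The value written this way does not survive a reboot, so it would have to be reapplied at boot (for example from a boot script) in addition to disabling the "ondemand" service.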

Revision history for this message
Stephan (stephan-fishycam) wrote :

Just a thought... You said you changed /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

That goes back to the normal setting on boot.

Have you had the issue since you did this, and did you make it permanent?

Also, was it on the guest, or host, or both that you did this on?

Revision history for this message
Brad Figg (brad-figg) wrote : Unsupported series, setting status to "Won't Fix".

This bug was filed against a series that is no longer supported and so is being marked as Won't Fix. If this issue still exists in a supported series, please file a new bug.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: Confirmed → Won't Fix