Lucid & Natty, KVM, After kernel message hrtimer: interrupt too slow.... the SMP kvm guest becomes slow.

Bug #503138 reported by Ferdinand Hagethorn
This bug affects 16 people
Affects           Status    Importance   Assigned to   Milestone
linux (Ubuntu)    Expired   Undecided    Unassigned
    Declined for Lucid by Christian Ehrhardt
qemu (Ubuntu)     Expired   Medium       Unassigned
    Declined for Lucid by Christian Ehrhardt

Bug Description

KVM Host is running a clean 9.10 server install with just qemu-kvm and virt-manager kernel=2.6.31-17-generic
uname -a: Linux VMMASTER 2.6.31-17-generic #54-Ubuntu SMP Thu Dec 10 16:20:31 UTC 2009 i686 GNU/Linux
version signature: Ubuntu 2.6.31-17.54-generic

KVM Guest is running a clean 9.10 server install with some userland services (apache/postfix/whatnot) kernel=2.6.31-16-generic-pae (from the linux-image-virtual package)
uname -a: Linux VM1 2.6.31-16-generic-pae #53-Ubuntu SMP Tue Dec 8 05:20:21 UTC 2009 i686 GNU/Linux
version signature: Ubuntu 2.6.31-16.53-generic-pae

KVM guest startup command (as invoked by virt-manager):
/usr/bin/kvm -S -M pc-0.11 -m 1024 -smp 2 -name lexx -uuid [UUID] -monitor unix:/var/run/libvirt/qemu/VM1.monitor,server,nowait -boot c -drive file=,if=ide,media=cdrom,index=2 -drive file=/var/lib/libvirt/images/VM1.img,if=virtio,index=0,boot=on -net nic,macaddr=[MAC],vlan=0,model=virtio,name=virtio.0 -net tap,fd=16,vlan=0,name=tap.0 -serial none -parallel none -usb -vnc 127.0.0.1:0 -k en-us -vga cirrus

Problem description:

After a while (and high network IO) I see this pop up in my guest dmesg:
  hrtimer: interrupt too slow, forcing clock min delta to 215997540 ns
After that the guest becomes very slow to respond, sometimes taking seconds to echo my ssh input back on a local LAN. Only a reboot of the KVM guest fixes this, until the dreaded hrtimer message pops up again.

After a lot of googling and trying a lot of things I found this discussion on the kernel patchwork mailing list, which contains a possible solution:
http://patchwork.kernel.org/patch/51561/

Please look into it; perhaps this solves a lot of KVM users' problems.

I'd like to patch my KVM guest's kernel myself to test this hrtimer patch. Do you have the correct procedure so I can create a custom kernel?

Ferdinand Hagethorn (ferdinand-hagethorn) wrote :

I've applied the attached patch to linux-image-2.6.31-16-generic and have been running the patched kernel since late last night. I still see the hrtimer messages, but with much lower values, which even went down a few times (3375->1500 and 5062->3375):

[ 3304.175297] hrtimer: interrupt too slow, forcing clock min delta to 1500 ns
[ 3304.319296] hrtimer: interrupt too slow, forcing clock min delta to 2250 ns
[ 3662.181424] hrtimer: interrupt too slow, forcing clock min delta to 3375 ns
[ 3674.120250] hrtimer: interrupt too slow, forcing clock min delta to 1500 ns
[ 4967.256258] hrtimer: interrupt too slow, forcing clock min delta to 2250 ns
[ 5667.966180] hrtimer: interrupt too slow, forcing clock min delta to 5062 ns
[ 7238.413947] hrtimer: interrupt too slow, forcing clock min delta to 3375 ns
[ 9633.922211] hrtimer: interrupt too slow, forcing clock min delta to 5062 ns
[11509.242590] hrtimer: interrupt too slow, forcing clock min delta to 7593 ns
[17762.000426] hrtimer: interrupt too slow, forcing clock min delta to 7593 ns
[18016.880026] hrtimer: interrupt too slow, forcing clock min delta to 11389 ns

My system is still very responsive, which never happened before after a night of rsync backup runs.
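
For anyone who wants to follow these values over time, here is a minimal sketch (not part of the original report) that pulls the timestamp and the forced min delta out of guest dmesg text and computes the ratio between consecutive values. The function names are hypothetical, and the regular expression assumes the exact message wording shown in the log above.

```python
import re

# Matches guest dmesg lines such as:
# [ 3304.175297] hrtimer: interrupt too slow, forcing clock min delta to 1500 ns
PATTERN = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+hrtimer: interrupt too slow, "
    r"forcing clock min delta to (?P<delta>\d+) ns"
)


def parse_min_deltas(dmesg_text):
    """Return a list of (timestamp_seconds, min_delta_ns) tuples."""
    return [
        (float(m.group("ts")), int(m.group("delta")))
        for m in PATTERN.finditer(dmesg_text)
    ]


def step_ratios(deltas):
    """Ratio of each min delta to the one logged before it, rounded to 2 places."""
    return [round(b / a, 2) for a, b in zip(deltas, deltas[1:])]
```

Feeding it the log above, consecutive increases such as 1500 -> 2250 -> 3375 give a ratio of 1.5, so the patched kernel appears to raise the min delta in much smaller steps than the jump to 215997540 ns seen in the original report.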

I'll keep it running for a couple of days and report back

Ferdinand Hagethorn (ferdinand-hagethorn) wrote :

I applied the patch to the VM guest kernel only; no modifications on the host!

Andy Whitcroft (apw)
tags: added: kernel-series-unknown
Ferdinand Hagethorn (ferdinand-hagethorn) wrote :

The patch doesn't prevent the issue from occurring. It merely delays what seems inevitable: an hrtimer clock min delta of HUGE ns on the guest.

Yesterday I did some stress testing with a patched guest kernel. It seems that high CPU load on the KVM host machine triggers the issue.

The guest ran in SMP mode (2 CPUs). I now have it running with one CPU for over 12 hours, with no hrtimer messages in dmesg so far.

What the * is going on here?

Does lucid have this problem as well?

Ferdinand Hagethorn (ferdinand-hagethorn) wrote :

http://<email address hidden>/msg23491.html has a new patch, since the 1st one obviously didn't work. Going to try it out and report back

Ferdinand Hagethorn (ferdinand-hagethorn) wrote :

The last patch won't apply cleanly to the Ubuntu linux-2.6.31 tree; some hunks fail:

builder@speedmouse:~/build/linux-2.6.31$ patch -p1 <../hrtimer_2.patch
patching file include/linux/hrtimer.h
Hunk #1 succeeded at 247 (offset 2 lines).
patching file kernel/hrtimer.c
Hunk #1 FAILED at 1219.
Hunk #2 FAILED at 1248.
Hunk #3 FAILED at 1334.
3 out of 3 hunks FAILED -- saving rejects to file kernel/hrtimer.c.rej
patching file kernel/sysctl.c
Hunk #1 succeeded at 1524 with fuzz 2 (offset 492 lines).

Perhaps someone could take a look at the changes? I'm not that fluent in C (anymore).

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Ferdinand Hagethorn (ferdinand-hagethorn) wrote :

I've updated the KVM host to 2.6.31-18.55-generic from the proposed branch and also installed the linux-image-virtual package (which seems to install 2.6.31-17.54-generic-pae) on the KVM guest.

I'm running the KVM guest with SMP enabled (2 CPUs; the host has an Intel T8100 with throttling enabled).

It has been running for over a day and hasn't shouted the dreaded "hrtimer: interrupt too slow" message; normally I'd expect it at least once...

Hopefully it was fixed with this kernel update.

Ferdinand Hagethorn (ferdinand-hagethorn) wrote :

I've reinstalled the host system (Intel T8100 CPU) with Ubuntu 9.10 server 64-bit, so moving from i686 to x86_64. The guest that had the issue still reports the error but feels less sluggish. This guest is the original 32-bit Ubuntu 9.10 installation, but now running in an x86_64 virtual machine.

I also have a second guest installed now, this one running 9.10 server 64-bit. So far (2 days uptime) this guest hasn't logged the hrtimer issue, but its load is 0.00. I'm going to 'bonnie it up' for a few days and see what happens.

Olivier d. (olivier-dembour) wrote :

I have exactly the same bug on x86_64 Ubuntu 9.10 (host and guest):

Using library: libvir 0.7.0
Using API: QEMU 0.7.0
Running hypervisor: QEMU 0.11.0

[ 8.432479] hrtimer: interrupt too slow, forcing clock min delta to 120004182 ns

Does the 2.6.31-18.55-generic kernel fix this bug? Where can it be downloaded?

It's really annoying. Is there any known workaround for this issue?

Ferdinand Hagethorn (ferdinand-hagethorn) wrote :

Olivier d.: I don't get the hrtimer message or the slowdown when I run the guest system with only one (1) CPU. This is somewhat of a workaround for low-load systems.
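
Since the guest in this report is started by libvirt (via virt-manager), the vCPU count comes from the domain XML rather than from a hand-written command line. A minimal sketch of the single-CPU workaround, assuming the XML contains the usual `<vcpu>` element and was obtained with something like `virsh dumpxml`; the function name and the sample document are hypothetical, and the sample is a heavily trimmed domain definition for illustration only.

```python
import xml.etree.ElementTree as ET


def set_vcpus(domain_xml, count):
    """Return domain_xml with the text of its <vcpu> element set to count."""
    root = ET.fromstring(domain_xml)
    vcpu = root.find("vcpu")
    if vcpu is None:
        raise ValueError("no <vcpu> element found in domain XML")
    vcpu.text = str(count)
    return ET.tostring(root, encoding="unicode")


# Hypothetical, trimmed-down domain definition; a real one has many more elements.
SAMPLE = "<domain type='kvm'><name>VM1</name><vcpu>2</vcpu></domain>"
print(set_vcpus(SAMPLE, 1))
```

The edited XML would still have to be loaded back into libvirt and the guest restarted before the new CPU count takes effect.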

Olivier d. (olivier-dembour) wrote :

I confirm that running the guest on a single CPU resolves this issue.

In fact, hrtimer had an impact on the whole system. With a single CPU the host now has full 100 Mbit/s throughput.

I've seen some patches on the kernel ML dealing with hrtimer. Will this issue be fixed in Lucid?

Ferdinand Hagethorn (ferdinand-hagethorn) wrote :

You're right, I hadn't benchmarked this before, but the impact on network throughput is significant.

I still don't know if this is a KVM or a kernel issue. I'm also running multiple Ubuntu systems on VMware ESX, all with SMP, and they don't show this issue. So the issue seems to lie in KVM in this case and not in the Linux kernel. But... it could be that VMware has already addressed this with some nifty workaround in their hypervisor, and that it still is a Linux kernel problem.

Ferdinand Hagethorn (ferdinand-hagethorn) wrote :

Hmz, now I'm seeing this in my dmesg:

Host:
[14419.002523] Clocksource tsc unstable (delta = -116170058 ns)

Guest:
[243673.504102] Clocksource tsc unstable (delta = -67373626 ns)

This could be completely unrelated to the hrtimer bug
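
To see which clocksource a host or guest kernel is actually using after such a message, the sysfs files under /sys/devices/system/clocksource/clocksource0 can be read. A minimal sketch with a hypothetical function name; whether the clocksource has anything to do with the hrtimer problem is not established in this report.

```python
from pathlib import Path

# Standard sysfs location of the kernel's clocksource selection.
CLOCKSOURCE_DIR = "/sys/devices/system/clocksource/clocksource0"


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current_clocksource, [available_clocksources])."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available
```

Running it on both host and guest before and after the "Clocksource tsc unstable" message would show whether the kernel switched away from tsc.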

Mark Deneen (mdeneen) wrote :

Is there any resolution to this yet? I have a pretty beefy server (http://www.pastebin.org/100787) which this happens on from time to time.

[520477.016490] hrtimer: interrupt too slow, forcing clock min delta to 204108030 ns

^^^

This really hurts!

starslights (starslights) wrote :

Hello,

I have installed the latest mainline kernel (linux-headers-2.6.34-999-generic_2.6.34-999.201003131003_amd64.deb + linux-image-2.6.34-999-generic_2.6.34-999.201003131003_amd64.deb) and now I get a different message about hrtimer:

2010-03-14 12:25:21 KDELucidTest kernel [ 47.400589] hrtimer: interrupt took 200652886 ns

I don't know whether that means it is fixed. I await your answer :P

Best regards

tags: added: patch
Mark Deneen (mdeneen) wrote :

See: https://patchwork.kernel.org/patch/89734/

According to Michael Tokarev, the guest is swapped out and needs to handle a timer interrupt. It gets swapped in by the host, but this takes some time. The next timer fires BEFORE the previous timer, so the guest thinks that the interrupt is too slow and things go downhill from there.

This does not resolve the problem, but it allows the guest to handle it gracefully.

starslights (starslights) wrote :

Hello,

I run Kubuntu Lucid 10.04 LTS (LVM) on x86_64, fresh install, and it seems to be fixed; I no longer get this warning.

Best regards

Linux xxxxxxxx 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:28:05 UTC 2010 x86_64 GNU/Linux

DJavaBean (djavabean) wrote : Re: [Bug 503138] Re: after kernel message hrtimer: interrupt too slow.... the kvm guest becomes slow [possible solution found!]

Thanks for the update. I installed 10.04 just recently but have not tested
that functionality yet.


Frank Müller (mueller-wave-computer) wrote : Re: after kernel message hrtimer: interrupt too slow.... the kvm guest becomes slow [possible solution found!]

10.04 still seems to have SMP problems.
I don't see any hrtimer messages on host or kvm guest, but the guest performance gets slower with each CPU I add.

Simple netperf with 1 CPU:
657.80 throughput

Writing a file to an attached iSCSI device:
4294967296 Bytes (4,3 GB) 97,0817 s, 44,2 MB/s
real 1m37.215s

Same with 4 CPU:
netperf: 534.28
dd: 4294967296 Bytes (4,3 GB) 127,865 s, 33,6 MB/s
     real 2m8.011s

44,2 MB/s still really sucks (the host gets 100-115 MB/s), but at least it's somewhat faster.
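
As a quick sanity check on these numbers, the dd throughput and the relative slowdown can be recomputed from the byte count and elapsed times quoted above. A small sketch with hypothetical helper names; MB here means 10^6 bytes, which is what dd reports.

```python
def throughput_mb_s(num_bytes, seconds):
    """Throughput in MB/s, with 1 MB = 10**6 bytes as reported by dd."""
    return num_bytes / seconds / 1e6


def percent_drop(before, after):
    """Relative decrease from before to after, in percent."""
    return (before - after) / before * 100.0


# Figures taken from the dd and netperf runs quoted above.
one_cpu = throughput_mb_s(4294967296, 97.0817)
four_cpu = throughput_mb_s(4294967296, 127.865)
print(f"dd: {one_cpu:.1f} MB/s -> {four_cpu:.1f} MB/s "
      f"({percent_drop(one_cpu, four_cpu):.1f}% slower)")
print(f"netperf: {percent_drop(657.80, 534.28):.1f}% lower with 4 CPUs")
```

Going from 1 to 4 vCPUs therefore costs roughly 19% in the netperf run and roughly 24% in the dd run.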

Ferdinand Hagethorn (ferdinand-hagethorn) wrote :

Okay, time for an update:

Host: Now running Ubuntu Natty 11.04 x86_64 on an AMD Phenom II x4 955 platform, linux-image-server as kernel
Guests: 2x Ubuntu 10.04 LTS x86_64, 1x Ubuntu 10.04 LTS 32-bit i686; all three systems use linux-image-virtual as kernel (the issue also occurs on all other installable kernels: -pae, -generic)

If I use >1 CPU for the kvm guests, the hrtimer appears in the dmesg and the systems slow to a crawl. When I only assign 1 CPU to the guests the issue does not occur.

This bug has been known since Sept 2009 and SMP guests are unusable because of it. Is anyone looking into this?

summary: - after kernel message hrtimer: interrupt too slow.... the kvm guest
- becomes slow [possible solution found!]
+ after kernel message hrtimer: interrupt too slow.... the SMP(!) kvm
+ guest becomes slow.
affects: linux (Ubuntu) → kvm (Ubuntu)
Ferdinand Hagethorn (ferdinand-hagethorn) wrote : Re: after kernel message hrtimer: interrupt too slow.... the SMP(!) kvm guest becomes slow.
summary: - after kernel message hrtimer: interrupt too slow.... the SMP(!) kvm
- guest becomes slow.
+ Lucid & Natty, KVM, After kernel message hrtimer: interrupt too
+ slow.... the SMP kvm guest becomes slow.
Changed in kvm (Ubuntu):
status: Triaged → New
iMac (imac-netstatz) wrote :

I am seeing this bug on a Lucid box running on ESX 3.5 Update 4. Running 2.6.32-33-generic-pae 32bit guest (linux-image-generic-pae)

Changed in kvm (Ubuntu):
status: New → Confirmed
Arie Skliarouk (skliarie) wrote :

I am seeing this bug on an Ubuntu 16.04 box running on Proxmox 5.1. Running 4.4.0-72-generic x86_64 guest (linux-image-4.4.0-72-generic).

Under heavy I/O the guest sometimes gets stuck for a couple of seconds and this error is printed:
[538796.529342] hrtimer: interrupt took 37764682 ns

After that the guest becomes unbearably slow and only a reboot helps.

Christian Ehrhardt  (paelzer) wrote :

Hi Arie,
I saw you also posted on some other old bugs with the same symptom (e.g. 1346917).
And that is exactly the problem with this issue: the message only represents the symptom, not the root cause. Fixing the symptom makes no sense, because if your IRQ delivery is stalled then it is stalled; there is very little the system can do other than throttle it down.
If you follow the discussion around [1] you'll see that this was the thinking back then, and AFAIK it hasn't changed since.

There were cases around KSM (bug 1346917), but also others around broken RAM; others had just overloaded their CPUs, others hit a scheduler bug triggering the same, while in other cases there was a thundering herd issue. The only thing all of these cases share is that eventually something (tm) happened which made IRQs/hrtimers stutter, which then got them throttled down.

Many of the underlying issues causing this have been fixed over the years, so it has become rarer nowadays. But for each case still left, a "this happened again here" message does not help. What is needed instead is a way to reproduce, in order to debug the root cause of that exact case and then check for a fix.

Therefore three recommendations for people affected:
1. Try as best you can to get it to reproduce reliably and then outline these steps; this will hopefully help people get a grasp of the root cause. Do -not- just report the "hrtimer: interrupt took" message, which is only the symptom.
2. Always try the same setup you have, but with the very latest virtualization stack available. Quite often things are fixed there, and if that is confirmed it becomes "only" a binary search for what would need to be backported.
3. Open a new bug for these, because until the root cause of a given case is found and identified to be the same, we don't know if it is the same issue.

[1]: http://<email address hidden>/msg23491.html

affects: kvm (Ubuntu) → qemu (Ubuntu)
Changed in qemu (Ubuntu):
status: Confirmed → Incomplete
Changed in linux (Ubuntu):
status: New → Incomplete
Arie Skliarouk (skliarie) wrote :

This happened because the guest used lots of memory on a NUMA machine. KSM might merge similar memory pages of different VMs sitting in different NUMA memory regions, causing the affected processes to crawl.

I disabled KSM merge_across_nodes:

echo 2 > /sys/kernel/mm/ksm/run && sleep 300 && cat /sys/kernel/mm/ksm/pages_shared

If there are no pages shared:

echo 0 > /sys/kernel/mm/ksm/merge_across_nodes && echo 1 > /sys/kernel/mm/ksm/run

Make sure to set merge_across_nodes in /etc/sysctl.d so that it stays across reboots.
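
The same sequence can be sketched as a small script that polls pages_shared instead of sleeping for a fixed 300 seconds. This is an illustration only: the function name and its parameters are hypothetical, it must run as root on a real system, and it relies on the kernel's documented rule that merge_across_nodes can only be changed while no KSM pages are shared.

```python
import time
from pathlib import Path


def disable_merge_across_nodes(ksm_dir="/sys/kernel/mm/ksm",
                               timeout_s=300, poll_s=5, sleep=time.sleep):
    """Unmerge all KSM pages, set merge_across_nodes to 0, then restart KSM.

    Returns True on success, or False if pages were still shared after
    timeout_s seconds (KSM is then left in the unmerge state, run=2).
    """
    ksm = Path(ksm_dir)
    # run=2 stops ksmd and unmerges all currently merged pages.
    (ksm / "run").write_text("2\n")
    waited = 0
    while int((ksm / "pages_shared").read_text()) != 0:
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s
    # Only now may the NUMA merge policy be changed.
    (ksm / "merge_across_nodes").write_text("0\n")
    # run=1 starts KSM again with the new policy.
    (ksm / "run").write_text("1\n")
    return True
```

The ksm_dir and sleep parameters exist only so the logic can be exercised against a scratch directory without touching the real sysfs files.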

Christian Ehrhardt  (paelzer) wrote :

Thanks Arie for providing your configuration fix to the issue for anyone else being affected - that will be really helpful.

In that sense it is much more like bug 1346917, which was about KSM.
Since the fix for that old bug they at least stopped being migrated around, but I can see how a lot of KSM-based overcommit on remote nodes can make memory slow.

But IMHO that is just one of the potential prices to pay with KSM: not so much a bug as a configuration which happens to overload what the system can deliver. It takes a lot of very slow memory to trigger the hrtimer issue. If you think there really is an issue to fix (other than the configuration), then I'd suggest trying the latest stack in Disco (19.04), which is very up to date with regard to kernel and qemu. If it still happens there, you might report it upstream to the kernel ML; maybe there is an idea how to improve MM to cause fewer issues, but I'd expect them to say that this is what the tunable is for.

But as said, if you report that to us as well (maybe once a patch exists to backport), I'd appreciate it if that were in a new bug.

Brad Figg (brad-figg)
tags: added: bjf-tracking
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Launchpad Janitor (janitor) wrote :

[Expired for qemu (Ubuntu) because there has been no activity for 60 days.]

Changed in qemu (Ubuntu):
status: Incomplete → Expired