KVM guests getting slow by time

Bug #1341195 reported by Tamas Papp
This bug affects 3 people

Affects: linux (Ubuntu) | Status: Incomplete | Importance: High | Assigned to: Unassigned | Milestone: (none)

Bug Description

There is a post with very similar symptoms on serverfault:

http://serverfault.com/questions/609881/windows-server-2012-on-kvm-on-ubuntu-14-04-runs-well-for-a-while-and-then-slows/612084#612084

Basically, all kinds of KVM guests get slower over time, both Windows and Linux. The more guests are running, the sooner it happens.
Switching back to the Saucy kernel is a good workaround.

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: qemu-kvm 2.0.0+dfsg-2ubuntu1.1
ProcVersionSignature: Ubuntu 3.13.0-29.53-generic 3.13.11.2
Uname: Linux 3.13.0-29-generic x86_64
NonfreeKernelModules: zfs zunicode zavl zcommon znvpair
ApportVersion: 2.14.1-0ubuntu3.2
Architecture: amd64
Date: Sat Jul 12 22:18:49 2014
SourcePackage: qemu
UpgradeStatus: No upgrade log present (probably fresh install)

Revision history for this message
Tamas Papp (tomposmiko) wrote :
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Since "switching to saucy kernel is a good workaround", marking this as affecting the kernel.

Could you try disabling KSM and see if that avoids the issue?
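
For reference, disabling KSM on Ubuntu 14.04 can be sketched roughly as below (run as root; the `KSM_ENABLED` setting in `/etc/default/qemu-kvm` is what the qemu-kvm init job reads at boot, and the sysfs knob takes effect immediately):

```shell
# Persistent: stop the qemu-kvm init job from enabling KSM at boot
sed -i 's/^KSM_ENABLED=.*/KSM_ENABLED=0/' /etc/default/qemu-kvm

# Immediate: stop KSM in the running kernel (0 = stopped)
echo 0 > /sys/kernel/mm/ksm/run

# Verify it took effect
cat /sys/kernel/mm/ksm/run
```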

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in qemu (Ubuntu):
status: New → Incomplete
Revision history for this message
Tamas Papp (tomposmiko) wrote : Re: [Bug 1341195] Re: KVM guests getting slow by time

On 07/14/2014 05:13 PM, Serge Hallyn wrote:
> Since "switching to saucy kernel is a good workaround", marking this as
> affecting the kernel.
>
> Could you try disabling KSM and see if that avoids the issue?

I set KSM_ENABLED=0 in /etc/default/qemu-kvm and restarted the server.

Also, today, after 3 weeks, the same thing started happening with the Saucy kernel too.

tamas

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Hi Tamas,

To be clear, you have not yet gotten the bug since you've set KSM_ENABLED to 0, or have you?

Interesting about the saucy kernel. Ideally we could reproduce this reliably enough to really bisect.

I wonder if we can reproduce without qemu.

Revision history for this message
Tamas Papp (tomposmiko) wrote :

On 07/16/2014 06:35 PM, Serge Hallyn wrote:

> To be clear, you have not yet gotten the bug since you've set
> KSM_ENABLED to 0, or have you?

I just did it, so no, not yet.

Here is the timeline:

Saucy (everything is fine) -> Trusty (issue happened) -> Trusty with
Saucy kernel 3 weeks ago (no issue) -> today with 3 weeks uptime (issue
happened) -> KSM_ENABLED=0 + reboot with Trusty kernel -> now

I just want to make sure we both understand this the same way.

> Interesting about the saucy kernel. Ideally we could reproduce this
> reliably enough to really bisect.
>
> I wonder if we can reproduce without qemu.

I have some other servers running LXC with no issue. Wherever KVM is used,
the issue happens.
Does it really not show up anywhere else?

sysctl.conf:

kernel.printk = 3 4 1 3
net.ipv4.conf.default.rp_filter=1
net.ipv4.conf.all.rp_filter=1
net.ipv4.tcp_syncookies=1
net.ipv4.conf.all.secure_redirects = 1
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_source_route = 0
net.ipv6.conf.all.accept_source_route = 0
kernel.shmmax=335544320000
net.ipv4.tcp_rmem = 4096 16777216 33554432
net.ipv4.tcp_wmem = 4096 16777216 33554432
net.ipv4.tcp_mem = 4096 16777216 33554432
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.netdev_max_backlog = 30000
net.core.netdev_max_backlog = 30000
net.ipv6.neigh.default.gc_thresh1 = 512
net.ipv6.neigh.default.gc_thresh2 = 2048
net.ipv6.neigh.default.gc_thresh3 = 4096
vm.swappiness = 1

fs.inotify.max_user_watches = 81920
fs.inotify.max_user_instances = 1024

Might that be important?

tamas

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Quoting Tamas Papp (<email address hidden>):
> On 07/16/2014 06:35 PM, Serge Hallyn wrote:
>
>
> > To be clear, you have not yet gotten the bug since you've set
> > KSM_ENABLED to 0, or have you?
>
> I just did it, so no, not yet.
>
> This happened by time:
>
> Saucy (everything is fine) -> Trusty (issue happened) -> Trusty with
> Saucy kernel 3 weeks ago (no issue) -> today with 3 weeks uptime (issue
> happened) -> KSM_ENABLED=0 + reboot with Trusty kernel -> now
>
> I just want to be clear, that both of us understand the same on this.
>
>
> > Interesting about the saucy kernel. Ideally we could reproduce this
> > reliably enough to really bisect.
> >
> > I wonder if we can reproduce without qemu.
>
>
> I have a some other servers with LXC with no issue, If there is KVM,
> then issue happens.
> Does it really not show up anywhere else?

I've personally not seen it, and no one on my team, who use a lot of
kvm instances, has seen it. Our two current theories are that (a) it
has to do with KSM page migration across NUMA nodes, or (b) it has to
do with a race between transparent hugepages and KSM. For (b) there is
already a commit in Linus' tree (f72e7dcd). For (a), you can probably
test by setting /sys/kernel/mm/ksm/merge_across_nodes to 0. (We are
trying these as well, but as the test cases are not 100% conclusive
it'd be good to have you try as well.)
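
For anyone following along, testing theory (a) can be sketched as below. One detail worth knowing: the kernel only allows changing `merge_across_nodes` while KSM holds no merged pages, so everything has to be unmerged first (`run=2`). Run as root; the `numactl` check is only a sanity check that the host actually has more than one NUMA node:

```shell
# merge_across_nodes may only be changed while no KSM pages are merged,
# so unmerge everything first (run=2), flip the knob, then restart KSM.
echo 2 > /sys/kernel/mm/ksm/run
echo 0 > /sys/kernel/mm/ksm/merge_across_nodes
echo 1 > /sys/kernel/mm/ksm/run

# This knob only matters on a multi-node host; check the topology:
numactl --hardware | grep ^available
```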

Revision history for this message
Tamas Papp (tomposmiko) wrote :

On 07/16/2014 09:39 PM, Serge Hallyn wrote
> I've personally not seen it, and noone on my team, who use a lot of
> kvm instances, has seen it. Our two current theories are that (a) it
> has to do with ksm page migration across numa-nodes, or (b) it has to
> do with a race with transparent hugepages versus ksm. For (b) there is
> already a commit in Linus' tree (f72e7dcd). For (a), you can probably

Are you referring to this or a different one?
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1323165

> test by setting /sys/kernel/mm/ksm/merge_across_nodes to 0. (We are
> trying these as well, but as the testcases are not 100% positive
> it'd be good to have you try as well)

It's 1 now.
Do you want me to set it now, or wait for the result of KSM_ENABLED=0?

I'm not familiar enough with these settings.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Quoting Tamas Papp (<email address hidden>):
> On 07/16/2014 09:39 PM, Serge Hallyn wrote
> > I've personally not seen it, and noone on my team, who use a lot of
> > kvm instances, has seen it. Our two current theories are that (a) it
> > has to do with ksm page migration across numa-nodes, or (b) it has to
> > do with a race with transparent hugepages versus ksm. For (b) there is
> > already a commit in Linus' tree (f72e7dcd). For (a), you can probably
>
> Are you referring to this or a different one?
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1323165

No, this commit is

commit f72e7dcdd25229446b102e587ef2f826f76bff28
Author: Hugh Dickins <email address hidden>
Date: Mon Jun 23 13:22:05 2014 -0700

    mm: let mm_find_pmd fix buggy race with THP fault

which seems later than that bug.

> > test by setting /sys/kernel/mm/ksm/merge_across_nodes to 0. (We are
> > trying these as well, but as the testcases are not 100% positive
> > it'd be good to have you try as well)
>
> It's 1 now.
> Do you you want me to set it now or wait to the result of KSM_ENABLED=0?

Oh - I forgot you had KSM_ENABLED=0 :) There's no sense changing
merge_across_nodes in the meantime.

(To be sure - /sys/kernel/mm/ksm/run is in fact set to 0 now right?)

Revision history for this message
Tamas Papp (tomposmiko) wrote :

On 07/16/2014 10:24 PM, Serge Hallyn wrote:
> (To be sure - /sys/kernel/mm/ksm/run is in fact set to 0 now right?)

Right.

Revision history for this message
Tamas Papp (tomposmiko) wrote :

FYI, I see no issue since I set KSM_ENABLED=0 (1 day).
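
A quick way to confirm KSM really is idle is to read the sysfs counters. A minimal sketch (the `ksm_status` helper and its `KSM_SYSFS` override are illustrative, not part of any package; the sysfs file names are the standard KSM knobs):

```shell
# ksm_status: report whether KSM is running and how many pages it is sharing.
# KSM_SYSFS is an override hook for testing; it defaults to the real path.
ksm_status() {
    dir="${KSM_SYSFS:-/sys/kernel/mm/ksm}"
    run=$(cat "$dir/run")
    sharing=$(cat "$dir/pages_sharing")
    if [ "$run" = "0" ]; then
        echo "KSM stopped (pages_sharing=$sharing)"
    else
        echo "KSM running (run=$run, pages_sharing=$sharing)"
    fi
}
```

With the workaround in place, `run` should read 0 and `pages_sharing` should drop to 0 once the previously merged pages are broken up.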

Chris J Arges (arges)
tags: added: ksm-numa-guest-perf
Revision history for this message
Tamas Papp (tomposmiko) wrote :

I think we can be sure that KSM_ENABLED=0 definitely helps.

Revision history for this message
Chris J Arges (arges) wrote :

I believe I've found the fix for this issue for 3.13.
If you can, please test the kernel posted in comment #1 of this bug:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917
Make sure KSM is enabled, and that any workarounds for this bug are disabled.

If this fixes the issue for you, you are welcome to mark this bug as a duplicate of 1346917.

Thanks!
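
Undoing the earlier workaround so the test kernel is exercised with KSM active can be sketched as (run as root, then reboot into the candidate kernel):

```shell
# Re-enable KSM at boot via the qemu-kvm init job
sed -i 's/^KSM_ENABLED=.*/KSM_ENABLED=1/' /etc/default/qemu-kvm

# And start KSM immediately in the running kernel (1 = run)
echo 1 > /sys/kernel/mm/ksm/run
```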

no longer affects: qemu (Ubuntu)
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Marking incomplete until requested testing is complete.

Changed in linux (Ubuntu):
importance: Undecided → High
Revision history for this message
Brooks Warner (brookswarner) wrote :

Was anyone able to test the fix proposed by Chris Arges in comment #12?

Revision history for this message
Tamas Papp (tomposmiko) wrote :

It was just a day ago.
I have to wait until the weekend, when the server becomes somewhat idle and I can do such tests.

Revision history for this message
EAB (adair-boder) wrote :

I have 5 KVM servers here, all running 14.04. Two weeks ago one host started to act up with VMs crashing; then this week another host was suffering from terrible internal network delays (bridged NIC). Rebooting the hosts and guests did nothing to resolve the issue, but disabling KSM seems to have completely stopped the problems. Disabling KSM was done today, so time will tell if it lasts.

Revision history for this message
wolfgang.moch@web.de (wolfgang-moch) wrote :

After updating a 12.04 LTS server, including the new HWE stack, I got the same problems.
The guest (MS Server 2012 R2) ran without problems for about 7 to 10 minutes after startup; after this period the CPU utilisation of the kvm process increased more and more, and the guest became sluggish.
Most significant problem: file and printer services went slow but kept working, while SQL Server applications crashed.
Starting the previous kernel, 3.11.0-26-generic, was my workaround, solving the problem for the moment.

Revision history for this message
Thiago Martins (martinx) wrote :

Hey guys!

I'm pretty sure that the following patch will fix your problems:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917
New kernel for Trusty: http://people.canonical.com/~arges/lp1346917/

My original BUG:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1338277

Cheers!
Thiago

On 28 July 2014 15:43, <email address hidden> <email address hidden> wrote:

> After updating a server 12.04. LTS including the new HWE stack support
> I've got the same problems.
> The guest (MS Server 2012R2) was running after startup for about 7 to 10
> min without problems, after this period the cpu utilisation for the kvm
> process increases more an more, the guest became sluggish.
> Most significant problem: file and printer services went slow but kept
> working, SQL-Server applications have crashed.
> Starting the previous kernel 3.11.0-26-generic was my workaround solving
> the problem for the moment.
