KVM guests getting slow by time

Bug #1341195 reported by Tamas Papp on 2014-07-12
This bug affects 3 people
Affects: linux (Ubuntu)
Importance: High
Assigned to: Unassigned

Bug Description

There is a post with very similar symptoms on serverfault:

http://serverfault.com/questions/609881/windows-server-2012-on-kvm-on-ubuntu-14-04-runs-well-for-a-while-and-then-slows/612084#612084

Basically, all kinds of KVM guests become slow over time, both Windows and Linux. The more guests are running, the sooner it happens.
Switching back to Saucy kernel is a good workaround.

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: qemu-kvm 2.0.0+dfsg-2ubuntu1.1
ProcVersionSignature: Ubuntu 3.13.0-29.53-generic 3.13.11.2
Uname: Linux 3.13.0-29-generic x86_64
NonfreeKernelModules: zfs zunicode zavl zcommon znvpair
ApportVersion: 2.14.1-0ubuntu3.2
Architecture: amd64
Date: Sat Jul 12 22:18:49 2014
SourcePackage: qemu
UpgradeStatus: No upgrade log present (probably fresh install)

Tamas Papp (tomposmiko) wrote :
Serge Hallyn (serge-hallyn) wrote :

Since "switching to saucy kernel is a good workaround", marking this as affecting the kernel.

Could you try disabling KSM and see if that avoids the issue?
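For reference, disabling KSM on an Ubuntu 14.04 KVM host can be sketched roughly as below. The /etc/default/qemu-kvm setting and the /sys path are the ones used later in this thread; the sed invocation is just one way to flip the setting.

```shell
# Persistent: tell the qemu-kvm init script not to enable KSM on boot
sudo sed -i 's/^KSM_ENABLED=.*/KSM_ENABLED=0/' /etc/default/qemu-kvm

# Immediate: stop the KSM scanner at runtime, no reboot needed
echo 0 | sudo tee /sys/kernel/mm/ksm/run

# Verify: should print 0
cat /sys/kernel/mm/ksm/run
```

Note that stopping the scanner (run=0) keeps already-merged pages shared; only run=2 unmerges them.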

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in qemu (Ubuntu):
status: New → Incomplete

On 07/14/2014 05:13 PM, Serge Hallyn wrote:
> Since "switching to saucy kernel is a good workaround", marking this as
> affecting the kernel.
>
> Could you try disabling KSM and see if that avoids the issue?

I set KSM_ENABLED=0 in /etc/default/qemu-kvm and restarted the server.

Also, today, after 3 weeks of uptime, the same thing started with the Saucy kernel too.

tamas

Serge Hallyn (serge-hallyn) wrote :

Hi Tamas,

To be clear, you have not yet gotten the bug since you've set KSM_ENABLED to 0, or have you?

Interesting about the saucy kernel. Ideally we could reproduce this reliably enough to really bisect.

I wonder if we can reproduce without qemu.

Tamas Papp (tomposmiko) wrote :

On 07/16/2014 06:35 PM, Serge Hallyn wrote:

> To be clear, you have not yet gotten the bug since you've set
> KSM_ENABLED to 0, or have you?

I just did it, so no, not yet.

This is the timeline:

Saucy (everything is fine) -> Trusty (issue happened) -> Trusty with
Saucy kernel 3 weeks ago (no issue) -> today with 3 weeks uptime (issue
happened) -> KSM_ENABLED=0 + reboot with Trusty kernel -> now

I just want to make sure that we both understand this the same way.

> Interesting about the saucy kernel. Ideally we could reproduce this
> reliably enough to really bisect.
>
> I wonder if we can reproduce without qemu.

I have some other servers running LXC with no issues. Wherever KVM is
involved, the issue happens.
Does it really not show up anywhere else?

sysctl.conf:

kernel.printk = 3 4 1 3
net.ipv4.conf.default.rp_filter=1
net.ipv4.conf.all.rp_filter=1
net.ipv4.tcp_syncookies=1
net.ipv4.conf.all.secure_redirects = 1
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_source_route = 0
net.ipv6.conf.all.accept_source_route = 0
kernel.shmmax=335544320000
net.ipv4.tcp_rmem = 4096 16777216 33554432
net.ipv4.tcp_wmem = 4096 16777216 33554432
net.ipv4.tcp_mem = 4096 16777216 33554432
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.netdev_max_backlog = 30000
net.core.netdev_max_backlog = 30000
net.ipv6.neigh.default.gc_thresh1 = 512
net.ipv6.neigh.default.gc_thresh2 = 2048
net.ipv6.neigh.default.gc_thresh3 = 4096
vm.swappiness = 1

fs.inotify.max_user_watches = 81920
fs.inotify.max_user_instances = 1024

Could that be relevant?

tamas

Serge Hallyn (serge-hallyn) wrote :

Quoting Tamas Papp (<email address hidden>):
> Does it really not show up anywhere else?

I've personally not seen it, and no one on my team, which uses a lot of
kvm instances, has seen it. Our two current theories are that (a) it
has to do with KSM page migration across NUMA nodes, or (b) it has to
do with a race between transparent hugepages and KSM. For (b) there is
already a commit in Linus' tree (f72e7dcd). For (a), you can probably
test by setting /sys/kernel/mm/ksm/merge_across_nodes to 0. (We are
trying these as well, but as the test cases are not 100% conclusive,
it would be good to have you try as well.)
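Testing theory (a) might look like the sketch below. Per the kernel's KSM documentation, merge_across_nodes can only be changed while no KSM pages are shared, so the scanner is stopped and existing pages unmerged first (run=2) before the knob is flipped.

```shell
# merge_across_nodes is only writable when pages_shared is 0,
# so unmerge everything first (run = 2), then flip the knob.
echo 2 | sudo tee /sys/kernel/mm/ksm/run                  # stop KSM and unmerge all shared pages
echo 0 | sudo tee /sys/kernel/mm/ksm/merge_across_nodes   # keep merging within a NUMA node only
echo 1 | sudo tee /sys/kernel/mm/ksm/run                  # restart the scanner
```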

Tamas Papp (tomposmiko) wrote :

On 07/16/2014 09:39 PM, Serge Hallyn wrote:
> I've personally not seen it, and no one on my team, which uses a lot of
> kvm instances, has seen it. Our two current theories are that (a) it
> has to do with KSM page migration across NUMA nodes, or (b) it has to
> do with a race between transparent hugepages and KSM. For (b) there is
> already a commit in Linus' tree (f72e7dcd). For (a), you can probably

Are you referring to this or a different one?
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1323165

> test by setting /sys/kernel/mm/ksm/merge_across_nodes to 0. (We are
> trying these as well, but as the test cases are not 100% conclusive,
> it'd be good to have you try as well.)

It's 1 now.
Do you want me to set it now, or wait for the result of KSM_ENABLED=0?

I'm not familiar enough with these settings.

Serge Hallyn (serge-hallyn) wrote :

Quoting Tamas Papp (<email address hidden>):
> Are you referring to this or a different one?
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1323165

No, this commit is

commit f72e7dcdd25229446b102e587ef2f826f76bff28
Author: Hugh Dickins <email address hidden>
Date: Mon Jun 23 13:22:05 2014 -0700

    mm: let mm_find_pmd fix buggy race with THP fault

which seems later than that bug.

> It's 1 now.
> Do you want me to set it now, or wait for the result of KSM_ENABLED=0?

Oh, I forgot you had KSM_ENABLED=0 :) There's no sense changing
merge_across_nodes in the meantime.

(To be sure - /sys/kernel/mm/ksm/run is in fact set to 0 now right?)
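That check, along with the counters that show whether KSM was actually merging anything, can be read like this. A sketch; all of these files live under /sys/kernel/mm/ksm/ on a KSM-enabled kernel.

```shell
cat /sys/kernel/mm/ksm/run             # 0 = stopped, 1 = running, 2 = stop and unmerge
cat /sys/kernel/mm/ksm/pages_shared    # shared pages currently in use by KSM
cat /sys/kernel/mm/ksm/pages_sharing   # additional mappings of those pages (memory actually saved)
cat /sys/kernel/mm/ksm/full_scans      # how many times all mergeable areas have been scanned
```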

Tamas Papp (tomposmiko) wrote :

On 07/16/2014 10:24 PM, Serge Hallyn wrote:
> (To be sure - /sys/kernel/mm/ksm/run is in fact set to 0 now right?)

Right.

Tamas Papp (tomposmiko) wrote :

FYI, I see no issue since I set KSM_ENABLED=0 (1 day).

Chris J Arges (arges) on 2014-07-21
tags: added: ksm-numa-guest-perf
Tamas Papp (tomposmiko) wrote :

I think we can be sure that KSM_ENABLED=0 definitely helps.

Chris J Arges (arges) wrote :

I believe I've found the fix for this issue for 3.13.
If you can, please test the kernel posted in comment #1 of this bug:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917
Make sure KSM is enabled and that any workarounds for this bug are disabled.

If this fixes the issue for you, you are welcome to mark this bug as a duplicate of 1346917.

Thanks!
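Undoing the workaround before testing the candidate kernel might look like this, assuming KSM was disabled via /etc/default/qemu-kvm as described earlier in the thread:

```shell
# Re-enable KSM persistently and at runtime before running the test kernel
sudo sed -i 's/^KSM_ENABLED=.*/KSM_ENABLED=1/' /etc/default/qemu-kvm
echo 1 | sudo tee /sys/kernel/mm/ksm/run
```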

no longer affects: qemu (Ubuntu)
Joseph Salisbury (jsalisbury) wrote :

Marking incomplete until requested testing is complete.

Changed in linux (Ubuntu):
importance: Undecided → High
Brooks Warner (brookswarner) wrote :

Was anyone able to test the fix proposed by Chris Arges in comment #12?

Tamas Papp (tomposmiko) wrote :

It was just a day ago.
I have to wait until the weekend, when the server becomes somewhat idle and I can do such tests.

EAB (adair-boder) wrote :

I have 5 KVM servers here, all running 14.04. Two weeks ago one host started to act up, with VMs crashing; then this week another host suffered terrible internal network delays (bridged NIC). Rebooting the hosts and guests did nothing to resolve the issue, but disabling KSM seems to have completely stopped it. I disabled KSM today, so time will tell if it lasts.

After updating a 12.04 LTS server to the new HWE stack, I got the same problems.
The guest (MS Server 2012 R2) ran without problems for about 7 to 10 minutes after startup; after that, the CPU utilisation of the kvm process increased more and more and the guest became sluggish.
Most significant problem: file and printer services slowed down but kept working, while SQL Server applications crashed.
Booting the previous kernel, 3.11.0-26-generic, was my workaround for the moment.

Thiago Martins (martinx) wrote :

Hey guys!

I'm pretty sure that the following patch will fix your problems:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917
New kernel for Trusty: http://people.canonical.com/~arges/lp1346917/

My original BUG:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1338277

Cheers!
Thiago

