Using KSM on NUMA capable machines can cause KVM guest performance and stability issues

Bug #1346917 reported by Chris J Arges on 2014-07-22
This bug affects 40 people
Affects           Status         Importance   Assigned to     Milestone
linux (Ubuntu)    Fix Released   High         Unassigned
Trusty            Fix Released   High         Chris J Arges   trusty-updates

Bug Description

[Impact]

When KSM is enabled on a NUMA-capable KVM host, both Linux and Windows guests can exhibit very poor performance and may crash. Disabling KSM is a known workaround for this issue.
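
For anyone applying the workaround, KSM can be stopped at runtime through the kernel's sysfs interface under /sys/kernel/mm/ksm. The following is only a sketch and was not part of the original report: the function names and the KSM_DIR override are made up for illustration, and it assumes a root shell on the host. Writing 2 to the run file stops ksmd and unmerges pages that are already shared; writing 0 would only stop further merging.

```shell
#!/bin/sh
# Sketch: disable KSM on the KVM host via sysfs (needs root).
# KSM_DIR is a hypothetical override so the functions can be tried on a scratch directory.
KSM_DIR="${KSM_DIR:-/sys/kernel/mm/ksm}"

# Print the KSM run flag (0 = stopped, 1 = running, 2 = stopped and unmerged).
ksm_run_state() {
    cat "$KSM_DIR/run"
}

# Stop ksmd and unmerge all pages that are currently shared.
ksm_disable() {
    echo 2 > "$KSM_DIR/run"
}
```

After sourcing the snippet, `ksm_disable && ksm_run_state` should print 2. The setting does not persist across reboots, and distribution packages may turn KSM back on at boot, so treat this as a temporary measure.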

[Fix]

The following patch fixes the issue in our testing:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=64a9a34e22896dad430e21a28ad8cb00a756fefc

This patch is present in v3.14-rc1 and onwards.

[Test Case]

General test case:
1) On a NUMA-capable machine, set up the machine as a KVM hypervisor
  - lscpu should show more than 1 NUMA node
2) Install 4 KVM VMs
3) Run the following in another terminal to ensure that pages_shared and pages_sharing are increasing
  - watch 'tail /sys/kernel/mm/ksm/*'
4) In another terminal, run a program that continually pings each VM and alerts on high latencies
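
The preconditions in steps 1 and 3 can also be checked from a script. This is a rough sketch, not something from the original test case: it counts the nodeN directories under /sys/devices/system/node instead of parsing lscpu output, and the function names and the NODE_DIR and KSM_DIR overrides are hypothetical.

```shell
#!/bin/sh
# Sketch: check the test-case preconditions on the host.
# NODE_DIR and KSM_DIR are hypothetical overrides for trying this on a scratch directory.
NODE_DIR="${NODE_DIR:-/sys/devices/system/node}"
KSM_DIR="${KSM_DIR:-/sys/kernel/mm/ksm}"

# Count NUMA nodes by counting nodeN directories in sysfs.
numa_node_count() {
    count=0
    for d in "$NODE_DIR"/node[0-9]*; do
        [ -d "$d" ] && count=$((count + 1))
    done
    echo "$count"
}

# Print the two KSM counters that step 3 watches.
ksm_counters() {
    for f in pages_shared pages_sharing; do
        printf '%s=%s\n' "$f" "$(cat "$KSM_DIR/$f")"
    done
}
```

For example, `[ "$(numa_node_count)" -gt 1 ] || echo "host is not NUMA"` flags a machine that cannot reproduce the problem, and calling `ksm_counters` a few minutes apart shows whether sharing is increasing.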

What we've observed is that in Linux guests the ping latencies can climb into the ~2 second range for a few pings, then return to the < 1 ms range. (This is machine dependent.) In addition, we occasionally observe BSODs when running this test with Windows guests.
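
The report does not include the ping program used in step 4. A minimal stand-in might look like the sketch below; it assumes the Linux iputils ping, the guest addresses are supplied by the caller, and the 100 ms default threshold is an arbitrary placeholder chosen to sit between the < 1 ms and ~2 second figures above.

```shell
#!/bin/sh
# Sketch: ping each guest about once per second and report slow or missing replies.
# THRESHOLD_MS is a placeholder value, not taken from the bug report.
THRESHOLD_MS="${THRESHOLD_MS:-100}"

# Extract the round-trip time in ms from ping output ("... time=0.345 ms").
parse_rtt() {
    sed -n 's/.*time=\([0-9.][0-9.]*\) *ms.*/\1/p'
}

# Succeed if the RTT in $1 exceeds the limit in $2 (awk handles the decimals).
is_high() {
    awk -v rtt="$1" -v max="$2" 'BEGIN { exit !(rtt + 0 > max + 0) }'
}

# Loop forever over the hosts given as arguments, printing an alert line when needed.
monitor() {
    while :; do
        for host in "$@"; do
            rtt=$(ping -c 1 -W 5 "$host" | parse_rtt)
            if [ -z "$rtt" ]; then
                echo "$(date '+%F %T') $host: no reply"
            elif is_high "$rtt" "$THRESHOLD_MS"; then
                echo "$(date '+%F %T') $host: high latency ${rtt} ms"
            fi
        done
        sleep 1
    done
}
```

It would be started with the guest addresses as arguments, for example `monitor 192.0.2.11 192.0.2.12` (documentation addresses used here as stand-ins), and left running while KSM merges pages.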

Chris J Arges (arges) on 2014-07-22
description: updated
Changed in linux (Ubuntu Trusty):
assignee: nobody → Chris J Arges (arges)
Changed in linux (Ubuntu):
assignee: Chris J Arges (arges) → nobody
Changed in linux (Ubuntu Trusty):
importance: Undecided → High
status: New → In Progress
Changed in linux (Ubuntu):
status: In Progress → Fix Released
importance: High → Undecided
description: updated
Chris J Arges (arges) wrote :

A test build for this patch is provided here:
http://people.canonical.com/~arges/lp1346917/

For most servers, linux-image-3.13.0-33-generic_3.13.0-33.58~lp1346917v201407220903_amd64.deb should be sufficient; however, if you have DKMS packages you may need to install the linux-headers* packages. The linux-image-extra package has additional modules if necessary.
In addition dbgsym and tools packages are provided for debugging.

Changed in linux (Ubuntu):
importance: Undecided → High
Chris J Arges (arges) wrote :

SRU Patch to Trusty (3.13) submitted to ubuntu-kernel-team ML.

Tim Gardner (timg-tpi) wrote :
Changed in linux (Ubuntu Trusty):
status: In Progress → Fix Committed

The verification of the Stable Release Update for linux-lts-trusty has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-33.58

---------------
linux (3.13.0-33.58) trusty; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1349897

  [ Upstream Kernel Changes ]

  * mm: numa: do not automatically migrate KSM pages
    - LP: #1346917
  * net: fix UDP tunnel GSO of frag_list GRO packets
    - LP: #1331219
  * auditsc: audit_krule mask accesses need bounds checking
    - LP: #1347088
  * n_tty: Fix buffer overruns with larger-than-4k pastes
    - LP: #1208740
 -- Tim Gardner <email address hidden> Fri, 18 Jul 2014 14:57:50 +0000

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released
Jay Janardhan (jay-janardhan) wrote :

I'm seeing the same issue on kernel 3.13.0-40-generic. Per the comment above, the bug is fixed in 3.13.0-33.58; is the fix not included in later versions?

Dec 8 16:14:11 vm1 kernel: [ 109.084235] random: nonblocking pool is initialized
Dec 8 16:49:40 vm1 kernel: [ 2237.458245] hrtimer: interrupt took 42733372 ns

Sushitha (sushi-ajay) wrote :

I am seeing the same issue in 3.13.0-46-generic

hrtimer: interrupt took 4352551231 ns

Mohammed Naser (mnaser) wrote :

I am seeing the issue in 3.13.0-46-generic as well.

Dave Chiluk (chiluk) wrote :

For those seeing this issue after 3.13.0-33.58, please ensure that the virtual machine's host kernel is running 3.13.0-33.58 or newer. The VM kernel itself does not matter.
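
One way to run that check on the host is sketched below. It is an illustration only: the has_fix helper is hypothetical, and it relies on GNU sort -V ordering, which matches Debian version ordering for simple versions like these (dpkg --compare-versions is the more precise tool where dpkg is available).

```shell
#!/bin/sh
# Sketch: check whether a host kernel package version is at least the fixed version.
FIXED_VERSION="3.13.0-33.58"

# Succeed if version $1 sorts at or after FIXED_VERSION (GNU sort -V ordering).
has_fix() {
    lowest=$(printf '%s\n%s\n' "$1" "$FIXED_VERSION" | sort -V | head -n 1)
    [ "$lowest" = "$FIXED_VERSION" ]
}
```

On an Ubuntu host the installed version of the running kernel can be passed in with `has_fix "$(dpkg-query -W -f='${Version}' "linux-image-$(uname -r)")" && echo "host kernel has the fix"`. Run it on the hypervisor, not inside a guest.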

Jim (8-6buntuone-d) wrote :

This is NOT fixed by 3.13.0-33.58. It continues to persist even with 3.13.0-65.106 (and 3.13.0-63.103).

I have around 10 VMs running but ONE in particular disconnects from the network every hour or so.

I had this issue previously. It initially went away on Ubuntu 14.04 LTS but has come back recently. Perhaps a kernel regression?

dmesg shows

[42524.196629] kvm: zapping shadow pages for mmio generation wraparound
[42538.140013] br0: port 2(vnet0) entered learning state
[42538.268017] br1: port 2(vnet1) entered learning state
[42553.180008] br0: topology change detected, propagating
[42553.180015] br0: port 2(vnet0) entered forwarding state
[42553.308008] br1: topology change detected, propagating
[42553.308014] br1: port 2(vnet1) entered forwarding state

(and NIC connection is gone)

It's not clear if this is just coincidence or if this is a pointer to the issue.

This VM is unusual among my VMs because it is the only one with 2 NIC connections, to br1 and br0. All the others connect to just br0. Those others work OK.

Happy to try suggestions to track this down.

Chris J Arges (arges) wrote :

This could be a different issue. For now can you file a new bug with additional information?
Running: 'apport-bug linux' in a terminal would be best as it collects dmesg output and package versions.
Any description or examples of how to reproduce would be very helpful in tracking this down. Thanks!

Mohammed Naser (mnaser) wrote :

I think you're describing a different issue. With this bug, connectivity doesn't drop completely; it becomes very flaky, with large latency spikes and so on. It would never drop 100%.

Why not go with Linux 3.19?

You can install it by running:

sudo apt-get install linux-generic-lts-vivid

It is by far more stable than Linux 3.13 (especially the network stack)... ;-)

On 6 October 2015 at 11:36, Jim <email address hidden> wrote:
> This is NOT fixed by 3.13.0-33.58. It continues to persist even with
> 3.13.0-65.106 (and 3.13.0-63.103).

to-hiro (to-hiro) on 2016-12-09
Changed in linux (Ubuntu Trusty):
milestone: none → trusty-updates