Using KSM on NUMA capable machines can cause KVM guest performance and stability issues
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| linux (Ubuntu) | | High | Unassigned | |
| Trusty | | High | Chris J Arges | |
Bug Description
[Impact]
When KSM is enabled on a NUMA-capable KVM host, both Linux and Windows guests can exhibit very poor performance and may crash. Disabling KSM is a known workaround for this issue.
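To check whether the workaround is in effect, a minimal sketch like the following reads KSM's `run` control file. It assumes the standard sysfs directory `/sys/kernel/mm/ksm` described in the kernel's KSM documentation (this report does not spell the path out), and the function names are hypothetical.

```python
from pathlib import Path

# Assumption: the standard KSM sysfs directory from the kernel documentation.
KSM_DIR = Path("/sys/kernel/mm/ksm")

# Meaning of the values in the `run` file, per the kernel KSM documentation.
RUN_STATES = {
    0: "stopped (already merged pages stay merged)",
    1: "running",
    2: "stopped and all merged pages unmerged",
}


def ksm_run_state(ksm_dir=KSM_DIR):
    """Return the integer stored in <ksm_dir>/run."""
    return int((Path(ksm_dir) / "run").read_text().strip())


def describe_ksm(ksm_dir=KSM_DIR):
    """Return a human-readable description of the current KSM state."""
    state = ksm_run_state(ksm_dir)
    return RUN_STATES.get(state, f"unknown state {state}")


if __name__ == "__main__":
    if KSM_DIR.is_dir():
        print("KSM is", describe_ksm())
    else:
        print("KSM sysfs directory not found on this machine")
```

Writing `0` to the same `run` file as root stops KSM, and writing `2` also unmerges the pages that are currently shared. The sketch only reads the state, so it is safe to run on a production host.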
[Fix]
The following patch fixes the issue in our testing:
http://
This patch is present upstream in v3.14-rc1 and later.
[Test Case]
General test case:
1) On a NUMA-capable machine, set up the machine as a KVM hypervisor
- lscpu should show more than 1 NUMA node
2) Install 4 KVM VMs
3) Run the following in another terminal to ensure that pages_shared and pages_sharing are increasing
- watch 'tail /sys/kernel/
4) In another terminal, run a program that continually pings each VM and alerts on high latencies
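Steps 1 and 3 can be scripted roughly as follows. This is a sketch under stated assumptions: the KSM counters are read from the standard sysfs directory `/sys/kernel/mm/ksm` (the path in step 3 is truncated in this report, so which directory was meant is an assumption), `lscpu` is assumed to print a `NUMA node(s):` line, and all function names are hypothetical.

```python
import re
import shutil
import subprocess
from pathlib import Path

# Assumption: the standard KSM sysfs directory from the kernel documentation.
KSM_DIR = Path("/sys/kernel/mm/ksm")


def numa_node_count(lscpu_output):
    """Parse the 'NUMA node(s):' line of lscpu output; return 0 if absent."""
    match = re.search(r"^[ \t]*NUMA node\(s\):\s*(\d+)", lscpu_output, re.MULTILINE)
    return int(match.group(1)) if match else 0


def ksm_counters(ksm_dir=KSM_DIR):
    """Read pages_shared and pages_sharing as integers."""
    ksm_dir = Path(ksm_dir)
    return {
        name: int((ksm_dir / name).read_text().strip())
        for name in ("pages_shared", "pages_sharing")
    }


def is_increasing(before, after):
    """True if every counter grew between two samples."""
    return all(after[name] > before[name] for name in before)


if __name__ == "__main__":
    if shutil.which("lscpu"):
        out = subprocess.run(["lscpu"], capture_output=True, text=True).stdout
        print("NUMA nodes:", numa_node_count(out))
    if KSM_DIR.is_dir():
        print("KSM counters:", ksm_counters())
```

Taking two `ksm_counters()` samples a few seconds apart and passing them to `is_increasing()` gives the same signal as watching the files by hand.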
In Linux guests, we've observed ping latencies jump to around 2 seconds for a few pings and then return to under 1 ms. (This is machine dependent.) In addition, Windows guests occasionally hit BSODs while this test is running.
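The latency check from step 4 could look something like the sketch below. It assumes the reply-line format printed by Linux `ping` (`icmp_seq=N ... time=X ms`); the 100 ms threshold and the function names are hypothetical choices for illustration, not values from this report.

```python
import re

# Matches reply lines such as:
#   64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.412 ms
PING_REPLY = re.compile(r"icmp_seq=(\d+).*?time=([\d.]+) ms")


def parse_ping(output):
    """Return a list of (icmp_seq, latency_ms) pairs found in ping output."""
    return [(int(seq), float(ms)) for seq, ms in PING_REPLY.findall(output)]


def spikes(samples, threshold_ms=100.0):
    """Return the samples whose latency exceeds threshold_ms (hypothetical default)."""
    return [(seq, ms) for seq, ms in samples if ms > threshold_ms]


if __name__ == "__main__":
    # Made-up sample output illustrating a spike like the one described above.
    sample = (
        "64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.412 ms\n"
        "64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=2013 ms\n"
        "64 bytes from 10.0.0.2: icmp_seq=3 ttl=64 time=0.398 ms\n"
    )
    for seq, ms in spikes(parse_ping(sample)):
        print(f"high latency: icmp_seq={seq} took {ms} ms")
```

Feeding the output of one `ping` process per guest through `parse_ping()` and `spikes()` is enough to flag the multi-second outliers while ignoring the normal sub-millisecond replies.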
| description: | updated |
| Changed in linux (Ubuntu Trusty): | |
| assignee: | nobody → Chris J Arges (arges) |
| Changed in linux (Ubuntu): | |
| assignee: | Chris J Arges (arges) → nobody |
| Changed in linux (Ubuntu Trusty): | |
| importance: | Undecided → High |
| status: | New → In Progress |
| Changed in linux (Ubuntu): | |
| status: | In Progress → Fix Released |
| importance: | High → Undecided |
| description: | updated |
| Chris J Arges (arges) wrote : | #1 |
| Changed in linux (Ubuntu): | |
| importance: | Undecided → High |
| Chris J Arges (arges) wrote : | #2 |
SRU Patch to Trusty (3.13) submitted to ubuntu-kernel-team ML.
| Tim Gardner (timg-tpi) wrote : | #3 |
| Changed in linux (Ubuntu Trusty): | |
| status: | In Progress → Fix Committed |
| Adam Conrad (adconrad) wrote : Update Released | #4 |
The verification of the Stable Release Update for linux-lts-trusty has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
| Launchpad Janitor (janitor) wrote : | #5 |
This bug was fixed in the package linux - 3.13.0-33.58
---------------
linux (3.13.0-33.58) trusty; urgency=low
[ Brad Figg ]
* Release Tracking Bug
- LP: #1349897
[ Upstream Kernel Changes ]
* mm: numa: do not automatically migrate KSM pages
- LP: #1346917
* net: fix UDP tunnel GSO of frag_list GRO packets
- LP: #1331219
* auditsc: audit_krule mask accesses need bounds checking
- LP: #1347088
* n_tty: Fix buffer overruns with larger-than-4k pastes
- LP: #1208740
-- Tim Gardner <email address hidden> Fri, 18 Jul 2014 14:57:50 +0000
| Changed in linux (Ubuntu Trusty): | |
| status: | Fix Committed → Fix Released |
| Jay Janardhan (jay-janardhan) wrote : | #6 |
I'm seeing the same issue on kernel 3.13.0-40-generic. Per the comment above, the bug is fixed in 3.13.0-33.58; is the fix not included in later versions?
Dec 8 16:14:11 vm1 kernel: [ 109.084235] random: nonblocking pool is initialized
Dec 8 16:49:40 vm1 kernel: [ 2237.458245] hrtimer: interrupt took 42733372 ns
| Sushitha (sushi-ajay) wrote : | #7 |
I am seeing the same issue in 3.13.0-46-generic
hrtimer: interrupt took 4352551231 ns
| Mohammed Naser (mnaser) wrote : | #8 |
I am seeing the issue in 3.13.0-46-generic as well.
| Dave Chiluk (chiluk) wrote : | #9 |
For those seeing this issue after 3.13.0-33.58, please ensure that the host (hypervisor) is running kernel 3.13.0-33.58 or newer. The kernel inside the VM does not matter.
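A quick way to apply this check on the host is sketched below. It assumes an Ubuntu-style release string as printed by `uname -r` (for example `3.13.0-40-generic`), which carries the ABI number but not the upload number, so only `3.13.0-33` is compared. The helper names are hypothetical.

```python
import platform
import re

# Upstream version 3.13.0 with Ubuntu ABI 33, taken from the fixed package 3.13.0-33.58.
FIXED = (3, 13, 0, 33)


def parse_kernel(release):
    """Turn '3.13.0-40-generic' or '3.13.0-33.58' into (3, 13, 0, 40) or (3, 13, 0, 33)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(release, fixed=FIXED):
    """True if the release is at or after the first fixed Trusty kernel."""
    return parse_kernel(release) >= fixed


if __name__ == "__main__":
    release = platform.release()  # same string as `uname -r`
    try:
        print(release, "has the fix" if has_fix(release) else "predates the fix")
    except ValueError as err:
        print(err)
```

This must be run on the hypervisor itself, not inside a guest. The comparison is only meaningful for the Trusty 3.13 series and newer kernels; release strings from other distributions may not parse.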
| Jim (8-6buntuone-d) wrote : | #10 |
This is NOT fixed by 3.13.0-33.58. It persists even with 3.13.0-65.106 (and 3.13.0-63.103).
I have around 10 VMs running but ONE in particular disconnects from the network every hour or so.
I had this issue previously; it initially went away on Ubuntu 14.04 LTS but has come back recently. Perhaps some kernel regression?
dmesg shows
[42524.196629] kvm: zapping shadow pages for mmio generation wraparound
[42538.140013] br0: port 2(vnet0) entered learning state
[42538.268017] br1: port 2(vnet1) entered learning state
[42553.180008] br0: topology change detected, propagating
[42553.180015] br0: port 2(vnet0) entered forwarding state
[42553.308008] br1: topology change detected, propagating
[42553.308014] br1: port 2(vnet1) entered forwarding state
(and NIC connection is gone)
It's not clear if this is just coincidence or if this is a pointer to the issue.
This VM is unusual among my VMs because it is the only one with 2 NIC connections, to br1 and br0. All the others connect to just br0. Those others work OK.
Happy to try suggestions to track this down.
| Chris J Arges (arges) wrote : | #11 |
This could be a different issue. For now can you file a new bug with additional information?
Running 'apport-bug linux' in a terminal would be best, as it collects dmesg output and package versions.
Any description or examples of how to reproduce would be very helpful in tracking this down. Thanks!
| Mohammed Naser (mnaser) wrote : | #12 |
I think you're describing a different issue. With this bug, connectivity doesn't drop; it becomes very flaky, with large latency spikes, etc. It would never drop 100%.
| Thiago Martins (martinx) wrote : Re: [Bug 1346917] Re: Using KSM on NUMA capable machines can cause KVM guest performance and stability issues | #13 |
Why not go with Linux 3.19?
You can just install it by running:
sudo apt-get install linux-generic-
It is by far more stable than Linux 3.13 (especially the network
stack)... ;-)
| Changed in linux (Ubuntu Trusty): | |
| milestone: | none → trusty-updates |


A test build for this patch is provided here: people.canonical.com/~arges/lp1346917/
http://
For most servers linux-image-3.13.0-33-generic_3.13.0-33.58~lp1346917v201407220903_amd64.deb should be sufficient; however, if you have DKMS packages you may need to install the linux-headers* packages. The linux-image-extra package has additional modules if necessary.
In addition dbgsym and tools packages are provided for debugging.