KSM causing performance and instability issues
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| linux (Ubuntu) | | High | Unassigned | |
Bug Description
This appears to be a regression; I have encountered the same issue described in 2 other reports:
Running kernel: 3.13.0-46-generic
This is replicated over many compute nodes (KVM) running OpenStack. The workaround is to disable KSM:
echo 2 > /sys/kernel/
This fixes the issue temporarily.
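The path in the command above is truncated in this report. As a minimal sketch, assuming the standard KSM sysfs directory documented by the kernel (`/sys/kernel/mm/ksm`), the workaround could be wrapped as below. The function names and the `KSM_DIR` override are hypothetical and exist only for illustration. Per the kernel documentation, writing `2` to `run` stops ksmd and unmerges all currently merged pages.

```shell
#!/bin/sh
# Assumption: KSM_DIR is the kernel's KSM sysfs directory (overridable for testing).
KSM_DIR="${KSM_DIR:-/sys/kernel/mm/ksm}"

# Print the current KSM run state:
#   0 = ksmd stopped (merged pages kept), 1 = ksmd running,
#   2 = ksmd stopped and all merged pages unmerged.
ksm_state() {
    cat "$KSM_DIR/run"
}

# Stop ksmd and unmerge all merged pages (requires root on a real host).
ksm_disable() {
    echo 2 > "$KSM_DIR/run"
}
```

Calling `ksm_disable` followed by `ksm_state` should print `2`. A sysfs write does not persist across reboots, so the setting has to be reapplied after each boot.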
---
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 Mar 22 14:30 seq
crw-rw---- 1 root audio 116, 33 Mar 22 14:30 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.7
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 14.04
InstallationDate: Installed on 2014-12-14 (99 days ago)
InstallationMedia: Ubuntu-Server 14.04.1 LTS "Trusty Tahr" - Release amd64 (20140722.3)
MachineType: Dell Inc. PowerEdge R620
Package: linux (not installed)
PciMultimedia:
ProcFB: 0 EFI VGA
ProcKernelCmdLine: BOOT_IMAGE=
ProcVersionSign
RelatedPackageV
linux-
linux-
linux-firmware 1.127.11
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty
Uname: Linux 3.13.0-46-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:
_MarkForUpload: True
dmi.bios.date: 01/16/2014
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 2.2.2
dmi.board.name: 01W23F
dmi.board.vendor: Dell Inc.
dmi.board.version: A05
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.
dmi.product.name: PowerEdge R620
dmi.sys.vendor: Dell Inc.
| Changed in linux (Ubuntu): | |
| status: | New → Incomplete |
| tags: | added: apport-collected trusty |
| description: | updated |
| Mohammed Naser (mnaser) wrote : BootDmesg.txt | #13 |
apport information
| description: | updated |
| Mohammed Naser (mnaser) wrote : IwConfig.txt | #15 |
apport information
| Mohammed Naser (mnaser) wrote : Lspci.txt | #16 |
apport information
| Mohammed Naser (mnaser) wrote : Lsusb.txt | #17 |
apport information
| Mohammed Naser (mnaser) wrote : UdevDb.txt | #22 |
apport information
| Mohammed Naser (mnaser) wrote : UdevLog.txt | #23 |
apport information
| Mohammed Naser (mnaser) wrote : WifiSyslog.txt | #24 |
apport information
| Changed in linux (Ubuntu): | |
| status: | Incomplete → Confirmed |
| Mohammed Naser (mnaser) wrote : | #25 |
I'll be setting up a new server in the next few days. I'll attempt to use -48 and see whether the issue is present.
| Changed in linux (Ubuntu): | |
| importance: | Undecided → High |
| tags: | added: kernel-da-key |
| Chris J Arges (arges) wrote : | #26 |
Can you also test with /sys/kernel/
Another way to potentially cause this issue to occur faster would be to set:
/sys/kernel/
or
/sys/kernel/
Thanks,
--chris j arges
| Changed in linux (Ubuntu): | |
| assignee: | nobody → Chris J Arges (arges) |
| Chris J Arges (arges) wrote : | #27 |
In addition, could you explain in more depth the performance and instability issues you've observed? How do you detect them or test for them? Thanks
| description: | updated |
| Chris J Arges (arges) wrote : | #28 |
I've attempted to reproduce bug 1346917 again on a NUMA machine and was unable to do so with the latest 3.13 kernel. Could I have more details on how you're reproducing this issue to assist with debugging? Thanks
| Mohammed Naser (mnaser) wrote : | #29 |
Hi Chris,
Thanks for the help so far. I'm deploying a new machine right now and I'll be trying to replicate it on -48.
The way I detected it was that I'd see messages in "dmesg" on the guest similar to this:
hrtimer: interrupt took 4352551231 ns
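A check for messages like the one above could be sketched as follows. The function name and the one-second default threshold are hypothetical choices for illustration, not something stated in this report.

```shell
# Print hrtimer stall messages whose duration is at least MIN_NS nanoseconds.
# Reads kernel log text on stdin; the first argument is the threshold.
hrtimer_stalls() {
    min_ns="${1:-1000000000}"   # hypothetical default: 1 second
    awk -v min="$min_ns" '
        /hrtimer: interrupt took [0-9]+ ns/ {
            # Find the field after "took" and compare it numerically.
            for (i = 1; i < NF; i++)
                if ($i == "took" && $(i + 1) + 0 >= min + 0) { print; break }
        }'
}
```

Inside a guest this could be run as `dmesg | hrtimer_stalls 1000000000` to list only stalls of one second or longer.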
In addition, when pinging the machine, you'd get a few seconds of stable replies, then no response for 2-3s, and then it would start responding again with a latency of 3s to 4s because of the stall.
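The ping pattern described above could be flagged automatically with a filter along these lines. This is only a sketch: it assumes the Linux iputils output format (`icmp_seq=N ... time=X ms`), and the function name and default threshold are hypothetical.

```shell
# Read `ping` output on stdin; report gaps in icmp_seq and replies slower than MAX_MS.
ping_anomalies() {
    max_ms="${1:-1000}"   # hypothetical threshold in milliseconds
    awk -v max="$max_ms" '
        {
            have_seq = 0; have_ms = 0
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^icmp_seq=/) { seq = substr($i, 10) + 0; have_seq = 1 }
                if ($i ~ /^time=/)     { ms  = substr($i, 6) + 0;  have_ms = 1 }
            }
            if (!have_seq || !have_ms) next      # skip header and summary lines
            if (have_prev && seq > prev + 1)
                printf "gap: %d missing before icmp_seq=%d\n", seq - prev - 1, seq
            if (ms > max + 0)
                printf "slow: icmp_seq=%d time=%s ms\n", seq, ms
            prev = seq; have_prev = 1
        }'
}
```

Piping `ping` output for the affected guest through `ping_anomalies 1000` would then print one line per missing run of replies and one per reply slower than a second.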
I will keep running this machine, monitor it closely, and report the output. However, I'd like to note that these machines make heavy use of KSM: before I turned it off, one had roughly 45-50GB of deduplicated memory on a 256GB node, so I'm not sure whether that is a factor.
I'll report back on -48 and see what I can check.
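For reference, a figure like the deduplicated memory mentioned above can be estimated from KSM's own counters. This is a minimal sketch, assuming the standard sysfs directory `/sys/kernel/mm/ksm`; the kernel documentation describes `pages_sharing` as the number of additional sites sharing merged pages, i.e. how much is saved. The function name and the `KSM_DIR` override are hypothetical.

```shell
# Assumption: KSM_DIR is the kernel's KSM sysfs directory (overridable for testing).
KSM_DIR="${KSM_DIR:-/sys/kernel/mm/ksm}"

# Approximate memory saved by KSM, in MiB: pages_sharing * page size.
ksm_saved_mib() {
    pages=$(cat "$KSM_DIR/pages_sharing")
    page_size=$(getconf PAGESIZE)
    echo $(( pages * page_size / 1024 / 1024 ))
}
```

Running `ksm_saved_mib` on a compute node before and after disabling KSM gives a rough view of how much memory had been merged.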
Thank you,
Mohammed
| Mohammed Naser (mnaser) wrote : | #30 |
Hi,
I installed a new machine from the 14.04.2 media, which gave me the HWE stack with kernel 3.16.0-33-generic, and it's running with no problems. The machine is now loaded with 125GB worth of VMs and shows the following memory stats:
# free -m
             total       used       free     shared    buffers     cached
Mem:        257599     253046       4553          3        218     161499
-/+ buffers/cache:      91328     166271
Swap:            0          0          0
So, the server is quite loaded and I haven't seen any hiccups. We'll see from there on.
Thanks,
Mohammed
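To make the `-/+ buffers/cache` row in the output above easier to read: it is the `used` column minus `buffers` and `cached`. A small sketch, assuming the older procps column layout shown in this comment (total, used, free, shared, buffers, cached); the function name is hypothetical.

```shell
# Given `free -m` output on stdin (older procps layout with buffers and cached
# columns), print used memory excluding buffers and page cache, in MB.
used_without_cache() {
    awk '$1 == "Mem:" { print $3 - $6 - $7 }'
}
```

Fed the `Mem:` line above, this prints 91329 (253046 - 218 - 161499). That is one off from the 91328 that `free` reported, presumably because `free` rounds each value to megabytes separately.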
| Changed in linux (Ubuntu): | |
| assignee: | Chris J Arges (arges) → nobody |


This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:
apport-collect 1435363
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.