KSM causing performance and instability issues

Bug #1435363 reported by Mohammed Naser on 2015-03-23
This bug affects 1 person
Affects: linux (Ubuntu)
Importance: High
Assigned to: Unassigned

Bug Description

This appears to be a regression. I have run into the same issue described in two earlier reports:

LP: #1346917
LP: #1349897

Running kernel: 3.13.0-46-generic

This reproduces on many KVM compute nodes running OpenStack. The workaround is to disable KSM:

echo 2 > /sys/kernel/mm/ksm/run

This fixes the issue temporarily.
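
For reference, a minimal sketch of checking the KSM state before and after applying the workaround. The kernel treats the values in the run file as 0 (stop ksmd but keep merged pages), 1 (run ksmd) and 2 (stop ksmd and unmerge all pages). The KSM_DIR variable and the function names are illustrative assumptions, not part of the original report.

```shell
#!/bin/sh
# KSM_DIR is an assumed override (handy for dry runs against a scratch
# directory); the real control files live under /sys/kernel/mm/ksm.
KSM_DIR="${KSM_DIR:-/sys/kernel/mm/ksm}"

# Print what the current value of the "run" file means.
ksm_run_state() {
    case "$(cat "$KSM_DIR/run")" in
        0) echo "stopped (merged pages kept)" ;;
        1) echo "running" ;;
        2) echo "stopped (merged pages unmerged)" ;;
        *) echo "unknown" ;;
    esac
}

# Apply the workaround from the report: stop ksmd and unmerge all pages.
# Writing to the real sysfs file requires root.
ksm_disable() {
    echo 2 > "$KSM_DIR/run"
}
```

Calling ksm_run_state, then ksm_disable as root, then ksm_run_state again should show the state change. The setting does not survive a reboot.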
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Mar 22 14:30 seq
 crw-rw---- 1 root audio 116, 33 Mar 22 14:30 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.7
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 14.04
InstallationDate: Installed on 2014-12-14 (99 days ago)
InstallationMedia: Ubuntu-Server 14.04.1 LTS "Trusty Tahr" - Release amd64 (20140722.3)
MachineType: Dell Inc. PowerEdge R620
Package: linux (not installed)
PciMultimedia:

ProcFB: 0 EFI VGA
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-46-generic.efi.signed root=UUID=68d30a86-3c67-4691-b142-d27a459986e8 ro
ProcVersionSignature: Ubuntu 3.13.0-46.79-generic 3.13.11-ckt15
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-46-generic N/A
 linux-backports-modules-3.13.0-46-generic N/A
 linux-firmware 1.127.11
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty
Uname: Linux 3.13.0-46-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 01/16/2014
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 2.2.2
dmi.board.name: 01W23F
dmi.board.vendor: Dell Inc.
dmi.board.version: A05
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr2.2.2:bd01/16/2014:svnDellInc.:pnPowerEdgeR620:pvr:rvnDellInc.:rn01W23F:rvrA05:cvnDellInc.:ct23:cvr:
dmi.product.name: PowerEdge R620
dmi.sys.vendor: Dell Inc.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1435363

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Mohammed Naser (mnaser) on 2015-03-23
tags: added: apport-collected trusty
description: updated

apport information

description: updated

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Mohammed Naser (mnaser) wrote :

I'll be setting up a new server in the next few days. I'll try -48 on it and see whether the issue is still present.

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key
Chris J Arges (arges) wrote :

Can you also test with /sys/kernel/mm/ksm/merge_across_nodes set to 0 (and KSM enabled normally) and confirm this is a KSM+NUMA issue?

Another way to potentially cause this issue to occur faster would be to set:
/sys/kernel/mm/ksm/sleep_millisecs to a value lower than 200
or
/sys/kernel/mm/ksm/pages_to_scan to a much larger number than 100

Thanks,
--chris j arges
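
The suggestions above can be sketched as a pair of shell helpers. This is only an illustration: the KSM_DIR variable, the function names and the example values 20 and 1000 are assumptions, not values from this thread. One detail from the kernel's KSM documentation matters here: merge_across_nodes can only be changed while no KSM-shared pages exist, so the sketch unmerges first (run = 2) and restarts ksmd afterwards.

```shell
#!/bin/sh
# KSM_DIR is an assumed override for dry runs; the real files are in
# /sys/kernel/mm/ksm and writing to them requires root.
KSM_DIR="${KSM_DIR:-/sys/kernel/mm/ksm}"

# Set merge_across_nodes to $1 (0 or 1). The kernel rejects the change
# while shared pages exist, so unmerge first and restart ksmd after.
ksm_set_merge_across_nodes() {
    echo 2 > "$KSM_DIR/run" &&
    echo "$1" > "$KSM_DIR/merge_across_nodes" &&
    echo 1 > "$KSM_DIR/run"
}

# Make ksmd scan more aggressively to try to trigger the issue sooner.
# $1: sleep_millisecs (hypothetical default 20, below the 200 mentioned above)
# $2: pages_to_scan (hypothetical default 1000, above the 100 mentioned above)
ksm_scan_aggressively() {
    echo "${1:-20}" > "$KSM_DIR/sleep_millisecs" &&
    echo "${2:-1000}" > "$KSM_DIR/pages_to_scan"
}
```

For example, running ksm_set_merge_across_nodes 0 followed by ksm_scan_aggressively as root would apply both suggestions at once. Unmerging a large amount of shared memory can take a while and needs enough free RAM to hold the unshared pages.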

Changed in linux (Ubuntu):
assignee: nobody → Chris J Arges (arges)
Chris J Arges (arges) wrote :

In addition, could you explain in more depth the performance and instability issues you've observed? How do you detect them or test for them? Thanks

Chris J Arges (arges) on 2015-03-23
description: updated
Chris J Arges (arges) wrote :

I've attempted to reproduce bug 1346917 again on a NUMA machine and was unable to do so with the latest 3.13 kernel. Could you give more details on how you're reproducing this issue, to assist with debugging? Thanks

Mohammed Naser (mnaser) wrote :

Hi Chris,

Thanks for the help so far. I'm deploying a new machine right now and I'll be trying to replicate it on -48.

The way I detected it was that I'd see messages in "dmesg" on the guest similar to this:

hrtimer: interrupt took 4352551231 ns

In addition, when pinging the machine, you'd see a few seconds of stable pings, then no response for 2-3s, after which it starts responding again with latencies of 3s to 4s caused by the stall.
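
As an illustration of the detection method described above, the following sketch filters kernel log lines for hrtimer messages of the quoted form and prints each reported stall in seconds. The function name is hypothetical, and the sketch assumes the message format matches the sample line exactly.

```shell
#!/bin/sh
# Read kernel log lines on stdin and print the duration of every
# "hrtimer: interrupt took N ns" message, converted to seconds.
hrtimer_stalls() {
    awk '/hrtimer: interrupt took [0-9]+ ns/ {
        # The nanosecond count is the field right after "took"; scanning
        # the fields keeps this working with dmesg timestamp prefixes.
        for (i = 1; i <= NF; i++)
            if ($i == "took") printf "%.2f s\n", $(i + 1) / 1e9
    }'
}
```

Inside a guest this could be used as dmesg | hrtimer_stalls. For the sample message above it prints 4.35 s, which is in line with the multi-second ping stalls.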

I will be running this machine, monitoring it closely, and reporting on the output. However, I'd like to note that these machines use KSM heavily: before I turned it off, one had roughly 45-50GB of deduplicated memory on a 256GB node, so I'm not sure whether that is a factor.
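
A deduplication figure like the one above can be estimated from the KSM counters. The sketch below multiplies pages_sharing (which the kernel documentation describes as how much is saved) by the page size. The KSM_DIR variable, the function name and the optional page-size argument are assumptions made for illustration, and the result is an approximation rather than an exact accounting.

```shell
#!/bin/sh
# KSM_DIR is an assumed override for dry runs; the real counters live
# under /sys/kernel/mm/ksm.
KSM_DIR="${KSM_DIR:-/sys/kernel/mm/ksm}"

# Print the approximate memory saved by KSM in MiB.
# $1: page size in bytes (optional, defaults to the system page size)
ksm_saved_mib() {
    page_size="${1:-$(getconf PAGESIZE)}"
    pages=$(cat "$KSM_DIR/pages_sharing")
    echo $(( pages * page_size / 1024 / 1024 ))
}
```

With 4096-byte pages, a pages_sharing value of 12582912 works out to 49152 MiB (48 GiB), which is the range mentioned above.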

I'll report back on -48 once I've seen what I can check.

Thank you,
Mohammed

Mohammed Naser (mnaser) wrote :

Hi,

I installed a new machine from the 14.04.2 media which gave me the HWE stack with kernel 3.16.0-33-generic and it's running with no problems. The machine is now loaded up to 125GB worth of VMs and with the following memory stats:

# free -m
             total       used       free     shared    buffers     cached
Mem:        257599     253046       4553          3        218     161499
-/+ buffers/cache:      91328     166271
Swap:            0          0          0

So, the server is quite loaded and I haven't seen any hiccups. We'll see how it goes from here.

Thanks,
Mohammed

Chris J Arges (arges) on 2016-09-21
Changed in linux (Ubuntu):
assignee: Chris J Arges (arges) → nobody