"Out of memory" errors after upgrade to 4.4.0-59

Bug #1655842 reported by Mike Williams
This bug affects 101 people
Affects                  Status         Importance  Assigned to                     Milestone
linux (Ubuntu)           Fix Released   High        Thadeu Lima de Souza Cascardo
  Xenial                 Fix Released   High        Thadeu Lima de Souza Cascardo
linux-aws (Ubuntu)       Confirmed      Undecided   Unassigned
  Xenial                 Confirmed      Undecided   Unassigned
linux-raspi2 (Ubuntu)    Fix Committed  Undecided   Paolo Pisati
  Xenial                 Fix Committed  Undecided   Unassigned

Bug Description

After the fix for LP #1647400 (a bug that caused freezes under some workloads), some users started noticing regular OOMs. Those OOMs were reported under this bug and were fixed after some releases.

Some of the affected kernels are documented below. To check your particular kernel, read its changelog and look for 1655842 and 1647400 (see the example below). If it has the fix for 1647400 but not the fix for 1655842, then it is affected.
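
For example, a quick check along those lines (a sketch, assuming the changelog for the installed image package is available through apt, as in the apt-get changelog example later in this report):

$ apt-get changelog linux-image-$(uname -r) | grep -E '1655842|1647400'
# affected if the output mentions 1647400 but not 1655842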

You may still notice regressions compared to kernels that did not have the fixes for either bug. However, reverting all the fixes would bring the freeze bug back, so that is not a viable way forward.

If you see any regressions, mainly in the form of OOMs, please report a new bug. Different workloads may require different solutions, and further fixes may be needed, whether upstream or not. The best way to get such fixes applied is to report the problem under a new bug that can be verified: being able to reproduce the bug makes it possible to verify that the fixes really fix it.

Kernels affected:

linux 4.4.0-58, 4.4.0-59, 4.4.0-60, 4.4.0-61, 4.4.0-62.
linux-raspi2 4.4.0-1039 to 4.4.0-1042 and 4.4.0-1044 to 4.4.0-1071

Particular kernels NOT affected by THIS bug:

linux-aws

To reiterate, if you find an OOM with an affected kernel, please upgrade.
If you find an OOM with a non-affected kernel, please report a new bug. We want to investigate it and fix it.
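
A hedged note on filing that new bug: the standard Ubuntu apport tooling collects the kernel logs automatically, for example:

$ ubuntu-bug linux
# (apport-bug linux is equivalent); attach the full dmesg output from after the OOM if it is not picked up automatically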

===================
I recently replaced some Xenial servers, and started experiencing "Out of memory" problems with the default kernel.

We bake Amazon AMIs based on an official Ubuntu-provided image (ami-e6b58e85, in ap-southeast-2, from https://cloud-images.ubuntu.com/locator/ec2/). Previous versions of our AMI included "4.4.0-57-generic", but the latest version picked up "4.4.0-59-generic" as part of a "dist-upgrade".

Instances booted using the new AMI have been using more memory, and experiencing OOM issues - sometimes during boot, and sometimes a while afterwards. An example from the system log is:

[ 130.113411] cloud-init[1560]: Cloud-init v. 0.7.8 running 'modules:final' at Wed, 11 Jan 2017 22:07:53 +0000. Up 29.28 seconds.
[ 130.124219] cloud-init[1560]: Cloud-init v. 0.7.8 finished at Wed, 11 Jan 2017 22:09:35 +0000. Datasource DataSourceEc2. Up 130.09 seconds
[29871.137128] Out of memory: Kill process 2920 (ruby) score 107 or sacrifice child
[29871.140816] Killed process 2920 (ruby) total-vm:675048kB, anon-rss:51184kB, file-rss:2164kB
[29871.449209] Out of memory: Kill process 3257 (splunkd) score 97 or sacrifice child
[29871.453282] Killed process 3258 (splunkd) total-vm:66272kB, anon-rss:6676kB, file-rss:0kB
[29871.677910] Out of memory: Kill process 2647 (fluentd) score 51 or sacrifice child
[29871.681872] Killed process 2647 (fluentd) total-vm:117944kB, anon-rss:23956kB, file-rss:1356kB

I have a hunch that this may be related to the fix for https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1647400, introduced in linux (4.4.0-58.79).

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-59-generic 4.4.0-59.80
ProcVersionSignature: User Name 4.4.0-59.80-generic 4.4.35
Uname: Linux 4.4.0-59-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jan 12 06:29 seq
 crw-rw---- 1 root audio 116, 33 Jan 12 06:29 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.1-0ubuntu2.4
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Thu Jan 12 06:38:45 2017
Ec2AMI: ami-0f93966c
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: ap-southeast-2a
Ec2InstanceType: t2.nano
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb: Error: command ['lsusb'] failed with exit code 1:
MachineType: Xen HVM domU
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 cirrusdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-59-generic root=UUID=fb0fef08-f3c5-40bf-9776-f7ba00fe72be ro console=tty1 console=ttyS0
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-59-generic N/A
 linux-backports-modules-4.4.0-59-generic N/A
 linux-firmware 1.157.6
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 12/09/2016
dmi.bios.vendor: Xen
dmi.bios.version: 4.2.amazon
dmi.chassis.type: 1
dmi.chassis.vendor: Xen
dmi.modalias: dmi:bvnXen:bvr4.2.amazon:bd12/09/2016:svnXen:pnHVMdomU:pvr4.2.amazon:cvnXen:ct1:cvr:
dmi.product.name: HVM domU
dmi.product.version: 4.2.amazon
dmi.sys.vendor: Xen

Revision history for this message
Mike Williams (mdub) wrote :
Revision history for this message
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → High
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built a Xenial test kernel with the following two commits reverted:

c630ec12d831 mm, oom: rework oom detection
57e9ef475661 mm: throttle on IO only when there are too many dirty and writeback pages

This test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1655842/

Can you test this kernel and see if it resolves this bug?

Thanks in advance!
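
A minimal sketch of installing such a test kernel, assuming the image and extra .debs for your architecture have been downloaded from the link above (the filename globs are placeholders for whatever is actually published there):

$ sudo dpkg -i ./linux-image-4.4.0-*.deb ./linux-image-extra-4.4.0-*.deb
$ sudo reboot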

Revision history for this message
Fabian Grünbichler (f-gruenbichler) wrote :

you could also try cherry-picking https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=6b4e3181d7bd5ca5ab6f45929e4a5ffa7ab4ab7f , but that will probably need some more inbetween patches as well..

reverting the two commits fixed the issue for our users (Proxmox VE, which uses a kernel based on the 4.4.x one from 16.04)

Revision history for this message
David F. (malteworld) wrote :

@f-gruenbichler: I already tried to cherry-pick that patch a while ago and it doesn't work because that patch is based on work that isn't in the 4.4.* kernel branch, not even including Canonical's backports from later branches.

Revision history for this message
Mike Williams (mdub) wrote :

Thanks jsalisbury. We have deployed using your test kernel (from http://kernel.ubuntu.com/~jsalisbury/lp1655842/), and experienced no OOM issues.

Revision history for this message
Allen Wild (aswild) wrote :

I manage a set of build servers for CPU/IO intensive builds using Yocto/OpenEmbedded. Ubuntu 14.04.5 with the 4.4 Xenial kernel. After updating to 4.4.0-59 the builds started failing because of the OOM killer.

Rolling back to 4.4.0-57 fixed the OOMs for me.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

Can you try the kernel at [1], which includes the patches that are also at [1]?

[1] http://people.canonical.com/~cascardo/lp1655842/

Thanks.
Cascardo.

Changed in linux (Ubuntu):
assignee: nobody → Thadeu Lima de Souza Cascardo (cascardo)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Thadeu Lima de Souza Cascardo (cascardo)
Revision history for this message
Stéphane Graber (stgraber) wrote :

Just a note that Joe's armhf kernel has been working well for me.

I can't test cascardo's kernel as it's not built for armhf.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

I will upload armhf binaries for those kernels and let you know. It's important to try those because they include an alternative solution that we would rather use instead of the one with the reverted patches.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :
Revision history for this message
Danny B (danny.b) wrote :

Using Cascardo's kernel fixes the problem for me.

It was a bit of a hassle to install though because there's no linux-headers-4.4.0-62_4.4.0-62.83_all.deb at the link and linux-headers-generic depends on it.

Here's where to find it:
amd64: https://launchpad.net/ubuntu/xenial/amd64/linux-headers-4.4.0-62/4.4.0-62.83
armhf: https://launchpad.net/ubuntu/xenial/armhf/linux-headers-4.4.0-62/4.4.0-62.83

Ben French (octoamit)
Changed in linux (Ubuntu):
status: Triaged → In Progress
Revision history for this message
Stéphane Graber (stgraber) wrote :

I've had a few armhf systems running cascardo's kernel and so far no sign of the OOM or any other problem with it.

Revision history for this message
Mike Williams (mdub) wrote :

Cascardo: we've tried your test kernel, and it looks good - we've seen no OOM problems.

Revision history for this message
Cris (cristianpeguero25) wrote :

Hi, I'd like to install Cascardo's kernel since I've been having the same issue, though strangely not on all of
the xenial machines running 4.4.0-59-generic.
Could someone tell me how to install Cascardo's kernel without completely messing up my machine?

Thanks

Revision history for this message
xb5i7o (xb5i7o) wrote :

Hi, I am having the exact same issue on a PC with 18GB of RAM, kernel 4.4.0-59-generic.

Please can this be fixed as soon as possible in the next kernel update?

It's killing processes such as Firefox and VirtualBox for no good reason while only about 4GB is actually in use.

I hope this can be fixed soon; it's becoming worse as time passes.

Revision history for this message
Eric Desrochers (slashd) wrote :

The patchset[1] for bug "LP #1655842" was submitted on Jan 24th 2017 and acked by the kernel team the same day[2].

The patch should be part of the following kernel release cycle:

cycle: 27-Jan through 18-Feb[3]
====
27-Jan Last day for kernel commits for this cycle
30-Jan - 04-Feb Kernel prep week.
05-Feb - 17-Feb Bug verification & Regression testing..
20-Feb Release to -updates.
====

[1] - "Xenial PATCH 00/11] Fixes OOM for LP #1655842"
[2] - "ACK: [Xenial PATCH 00/11] Fixes OOM for LP #1655842"
[3] - https://wiki.ubuntu.com/KernelTeam/Newsletter

- Eric

Changed in linux (Ubuntu Xenial):
status: Triaged → In Progress
Revision history for this message
Eric Desrochers (slashd) wrote :

Additional note:

Applied in master-next on Jan 26th 2017[1]

[1] - "APPLIED: [Xenial PATCH 00/11] Fixes OOM for LP #1655842"

- Eric

Eric Desrochers (slashd)
tags: added: sts
Revision history for this message
Gaudenz Steinlin (gaudenz-debian) wrote :

@slashd It sounds really strange to me that I should wait until 20-Feb for a fix for this bug when this is clearly a regression introduced with the latest kernel upgrade. Is there no way to speed things up to fix this regression?

For now we have had to downgrade all our xenial systems to linux-image-4.4.0-57-generic to avoid this bug.

Gaudenz

Revision history for this message
Eric Desrochers (slashd) wrote :

@Gaudenz Steinlin (gaudenz-debian),

It will take 3 weeks to land in the -updates pocket, but you can expect a call for testing of a proposed package by EOW.

- Eric

Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu):
status: In Progress → Fix Released
Revision history for this message
Luk van den Borne (luk-vandenborne) wrote :

This is a severe bug. It should be treated as a high-priority bugfix that cannot wait 3 weeks.

Revision history for this message
Nate Eldredge (nate-thatsmathematics) wrote :

Just as a note for newcomers reading this, I can confirm the bug is NOT fixed in the officially released 4.4.0-62.83.

Revision history for this message
Krzysztof Dryja (cih997) wrote :

I could not reboot my machine and the ugly workaround for this issue was to login as root and clear system caches:

echo 3 > /proc/sys/vm/drop_caches

This made my machine stable again, at least for the time I needed.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

This is fixed in 4.4.0-63.84, which will be available in proposed soon.

Revision history for this message
Shelby Cain (alyandon) wrote :

@nate Thank you! You just saved me a lot of hassle as I was about to unpin the 4.4.0-57 kernel and update a bunch of machines on the assumption the fix was in that version.
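
For reference, one way to keep apt from pulling a newer (still affected) kernel until the fix lands is to hold the kernel meta-packages (a sketch, not an official recommendation; remember to unhold once the fixed kernel is released):

$ sudo apt-mark hold linux-generic linux-image-generic linux-headers-generic
$ sudo apt-mark unhold linux-generic linux-image-generic linux-headers-generic   # later, to resume kernel updates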

Revision history for this message
Sebastian Unger (sebunger44) wrote :

As a note: I believe this also affects the armhf kernel 4.4.0-1040-raspi2 for the Raspberry Pi.

Revision history for this message
David Glasser (glasser) wrote :

I've been struggling with this bug for nearly a week and only now found this issue. Thanks for fixing it!

For the sake of others finding it, here's the stack trace part of the oom-killer log, which contains some terms I searched for a while ago that aren't mentioned here yet.

docker invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=-1000
docker cpuset=/ mems_allowed=0
CPU: 11 PID: 4472 Comm: docker Tainted: G W 4.4.0-62-generic #83-Ubuntu
Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
 0000000000000286 0000000057f64c94 ffff880dfb5efaf0 ffffffff813f7c63
 ffff880dfb5efcc8 ffff880fbfda0000 ffff880dfb5efb60 ffffffff8120ad4e
 ffffffff81cd2d7f 0000000000000000 ffffffff81e67760 0000000000000206
Call Trace:
 [<ffffffff813f7c63>] dump_stack+0x63/0x90
 [<ffffffff8120ad4e>] dump_header+0x5a/0x1c5
 [<ffffffff811926c2>] oom_kill_process+0x202/0x3c0
 [<ffffffff81192ae9>] out_of_memory+0x219/0x460
 [<ffffffff81198a5d>] __alloc_pages_slowpath.constprop.88+0x8fd/0xa70
 [<ffffffff81198e56>] __alloc_pages_nodemask+0x286/0x2a0
 [<ffffffff81198f0b>] alloc_kmem_pages_node+0x4b/0xc0
 [<ffffffff8107ea5e>] copy_process+0x1be/0x1b70
 [<ffffffff8139225c>] ? apparmor_file_alloc_security+0x5c/0x220
 [<ffffffff811ed04a>] ? kmem_cache_alloc+0x1ca/0x1f0
 [<ffffffff81348263>] ? security_file_alloc+0x33/0x50
 [<ffffffff810caeb1>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
 [<ffffffff810805a0>] _do_fork+0x80/0x360
 [<ffffffff81080929>] SyS_clone+0x19/0x20
 [<ffffffff818385f2>] entry_SYSCALL_64_fastpath+0x16/0x71

Revision history for this message
Hajo Locke (hajo-locke) wrote :

When will this new kernel be released? This bug is killing our MySQL servers. Booting old kernels is only a poor workaround. I think a lot of people with busy servers will have a problem.

This is the 2nd time we have been hit by a big bug within a short time. In October 2016 our nameservers had problems because of bug 1634892.
Is LTS Ubuntu still the right system for servers?

Revision history for this message
Luk van den Borne (luk-vandenborne) wrote :

This bug also appears to affect linux-image-4.8.0-34-generic in 16.04.1 Xenial.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

Hi, Luk.

linux-image-4.8.0-34-generic should not be affected by this. If you see unexpected OOM problems, please open a new bug report and attach the kernel logs.

Thanks.
Cascardo.

Revision history for this message
xb5i7o (xb5i7o) wrote :

Just by the way, 4.4.0-62-generic has the exact same problem. Even after uninstalling 4.4.0-59-generic, my system at some point auto-updated to 4.4.0-62-generic. Only 4.4.0-57-generic is safe for now.

Revision history for this message
Nick Maynard (nick-maynard) wrote :

LTS Ubuntu with -updates shouldn't have this sort of issue - this is, frankly, unforgivable.

We need a new kernel urgently in -updates, and I'd expect serious discussions within the kernel team to understand what has caused this issue and avoid it reoccurring.

Revision history for this message
Anton Piatek (anton-piatek) wrote :

If this kernel is not going to hit -updates shortly (i.e. days), can something be done to pull or downgrade the broken kernel? At least revert linux-image-generic to depend on linux-image-4.4.0-57-generic, which doesn't have the issue; that would stop more people from upgrading to a broken kernel.

Having this sort of break in an LTS kernel is not inspiring at all.

Revision history for this message
Eric Desrochers (slashd) wrote :

The fix is now available for testing in kernel version 4.4.0-63.84, if you enable proposed[1]

$ apt-cache policy linux-image-4.4.0-63-generic
linux-image-4.4.0-63-generic:
  Installed: (none)
  ==> Candidate: 4.4.0-63.84
  Version table:
     4.4.0-63.84 500
        500 http://archive.ubuntu.com/ubuntu ==>xenial-proposed/main amd64 Packages

$ apt-get changelog linux-image-4.4.0-63-generic | egrep "1655842"
 ==> * "Out of memory" errors after upgrade to 4.4.0-59 (LP: #1655842)

[1] - https://wiki.ubuntu.com/Testing/EnableProposed

- Eric
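
A sketch of enabling -proposed for this test (see the wiki page in [1] for the recommended pinning so that only the packages you select come from -proposed):

$ echo 'deb http://archive.ubuntu.com/ubuntu xenial-proposed restricted main multiverse universe' | sudo tee /etc/apt/sources.list.d/xenial-proposed.list
$ sudo apt-get update
$ sudo apt-get install -t xenial-proposed linux-image-4.4.0-63-generic linux-image-extra-4.4.0-63-generic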

Revision history for this message
Oliver O. (oliver-o456i) wrote :

Testing...

Enabled proposed (https://wiki.ubuntu.com/Testing/EnableProposed).

Installed kernel packages:

# apt-get install -s -t xenial-proposed 'linux-headers-4.4.0.63$' 'linux-headers-4.4.0.63-generic$' 'linux-image-4.4.0.63-generic$' 'linux-image-extra-4.4.0.63-generic$'

Rebooted.

# cat /proc/version_signature
Ubuntu 4.4.0-63.84-generic 4.4.44

Revision history for this message
kulwinder singh (kulwinder-careers) wrote :

Which one is safer now, 4.4.0-57-generic or 4.4.0-63-generic?

Revision history for this message
David Glasser (glasser) wrote :

kulwinder singh: Either one, but nothing in between.

-57 will reintroduce a few (unrelated) security bugs as well as the bug https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1647400 whose fix caused this one, but is easier to enable and has been tested for longer.

-63 should fix this bug, the older bug, and the intermediary security bugs, but requires you to enable the "proposed" repository, and hasn't been tested for quite as long.

Anything in between has this bug.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Revision history for this message
David Glasser (glasser) wrote :

Cascardo: Just to be clear, are you looking for verification from anyone in the world, or from specific kernel testers?

(I'd like to help, but I'm only able to reproduce the issue in production, and the process of debugging this issue when we ran into it was already more restarts than is good for my service right now (we settled on downgrading for the moment).)

Revision history for this message
David F. (malteworld) wrote :

@nick-maynard: Why is such a bug unforgivable? You can just boot a previous kernel instead. If you're concerned about availability then don't reboot in the first place unless there's an important security patch.

Revision history for this message
David Glasser (glasser) wrote :

To be fair, there have been multiple USN-reported kernel security patches fixed in post-57 kernels.

Revision history for this message
Travisgevans (travisgevans) wrote :

Don't forget that the earlier kernels are affected by Bug #1647400, which does something even worse (hang the system). I've verified that it affected my particular system before 4.4.0-59, and it may explain a couple of lockups I had previously experienced during normal operation when using previous kernels. -59 fixes the bug but introduces the premature OOM kill issue; if it weren't for the kernels currently in proposed (assuming they indeed fix this bug), I wouldn't really have a reliable kernel at all to use.

With the 4.4.0-59 kernel, I got hit with two unexplained OOM kills, each occurring within about 3 days of uptime. I then tested the -62 kernel in proposed for just under 14 days and didn't see any OOM kills, and I've now been testing -63 for a couple of days and haven't seen any issues yet. However, it might help if anyone has an idea how the OOM kill bug might be reliably reproduced. “5 working days” isn't very long to reliably be sure the problem is solved otherwise; it took more than half that time upon upgrading to -59 for me to hit the bug by chance.

Revision history for this message
Nate Eldredge (nate-thatsmathematics) wrote : Re: [Bug 1655842] Re: "Out of memory" errors after upgrade to 4.4.0-59

On Fri, 10 Feb 2017, Travisgevans wrote:

> However, it might help if anyone has an idea how the OOM kill bug might
> be reliably reproduced. “5 working days” isn't very long to reliably be
> sure the problem is solved otherwise; it took more than half that time
> upon upgrading to -59 for me to hit the bug by chance.

I had a job (duplicity) that would oom every time under -59 and -62. With
-63 from proposed, it doesn't.

--
Nate Eldredge
<email address hidden>

Revision history for this message
Otto Wayne (ottowayne) wrote :

I see this bug on Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-1042-raspi2 armv7l) as described here: https://superuser.com/questions/1176773/ubuntu-on-rpi-starts-killing-processes-when-ram-is-filled-up-by-cache

The workaround by Krzysztof Dryja (cih997) works for me as well but is very ugly and temporary.

Revision history for this message
wurlyfan (wurlyfan) wrote :

Firefox and Insync were killed pretty reliably for me, but other programs were as well. I was getting half a dozen OOM kills a day before I switched back to -57. My second workstation is fully updated and doesn't show any sign of this issue.

Revision history for this message
Ivan Kozik (ludios) wrote :

I've been using -63 for a while now (even before it was in proposed, via an sbuild setup) on a machine that had OOM problems with -59, and I haven't noticed any issues.

Revision history for this message
Serge Victor (ser) wrote :

-63 works for me as well, thank you!

Oliver O. (oliver-o456i)
tags: added: verification-done-xenial
removed: verification-needed-xenial
Revision history for this message
Oliver O. (oliver-o456i) wrote :

Tested Ubuntu 4.4.0-63.84-generic 4.4.44 on a desktop system with a workload which previously led to Chrome processes being OOM-killed.

Situation with 4.4.0-62-generic: between 8 and 54 processes OOM-killed per 24-hour period
Situation with 4.4.0-63-generic: no OOM-kills during 46 hours of testing

Looks solved. No negative side-effects encountered.

Revision history for this message
VSHN (vshn) wrote :

you should re-release 4.4.0-62 as linux-image-chaosmonkey-virtual

Revision history for this message
Javier Bernal (javierbernal) wrote :

Like Luk (#29), I upgraded to 4.8.0-34, but the problem disappeared for me. I ran the system for two days without any OOM kills. Before that, simply copying a big file (9GB+) would trigger it. My system has 16GB of RAM and runs 16.04.1.

Revision history for this message
kulwinder singh (kulwinder-careers) wrote :

To throw some more light on the matter: two machines were upgraded to 4.4.0-59 on 16th Jan with the same load, but only one of them is reporting OOM kills. Is anybody experiencing the same scenario?

Revision history for this message
Sridhar Chandramouli (ridsharc) wrote :

Thanks to those who reported/fixed this bug.

Out of curiosity, was this a bug in the 4.4.0-59 kernel itself or in Ubuntu's packaging of the kernel, i.e. were other (non-Ubuntu) Linux users impacted?

Revision history for this message
kulwinder singh (kulwinder-careers) wrote :

Did anybody also notice a pattern in what "invoked oom-killer"? I see a pattern where cron jobs or scripts invoke the oom-killer daily at almost the same time.

Revision history for this message
Charles Wright (wrighrc) wrote :

I was curious if I could answer Sridhar's question as I had the same question.

The introduction of the problem appears to be in Ubuntu's packaging of select upstream commits from 4.7 to address bug #1647400.

From the comments in that case, it appears 6b4e3181d7bd5ca5ab6f45929e4a5ffa7ab4ab7f should have also been selected.

From looking at the history I gather that if I had been running stock 4.4 kernels I would not have been affected by the OOM issue.

I'm basing this on tracking down 0a0337e0d1d134465778a16f5cbea95086e8e9e0 in the mainline kernel.

description: updated
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-63.84

---------------
linux (4.4.0-63.84) xenial; urgency=low

  [ Thadeu Lima de Souza Cascardo ]

  * Release Tracking Bug
    - LP: #1660704

  * Backport Dirty COW patch to prevent wineserver freeze (LP: #1658270)
    - SAUCE: mm: Respect FOLL_FORCE/FOLL_COW for thp

  * Kdump through NMI SMP and single core not working on Ubuntu16.10
    (LP: #1630924)
    - x86/hyperv: Handle unknown NMIs on one CPU when unknown_nmi_panic
    - SAUCE: hv: don't reset hv_context.tsc_page on crash

  * [regression 4.8.0-14 -> 4.8.0-17] keyboard and touchscreen lost on Acer
    Chromebook R11 (LP: #1630238)
    - [Config] CONFIG_PINCTRL_CHERRYVIEW=y

  * Call trace when testing fstat stressor on ppc64el with virtual keyboard and
    mouse present (LP: #1652132)
    - SAUCE: HID: usbhid: Quirk a AMI virtual mouse and keyboard with ALWAYS_POLL

  * VLAN SR-IOV regression for IXGBE driver (LP: #1658491)
    - ixgbe: Force VLNCTRL.VFE to be set in all VMDq paths

  * "Out of memory" errors after upgrade to 4.4.0-59 (LP: #1655842)
    - mm, page_alloc: convert alloc_flags to unsigned
    - mm, compaction: change COMPACT_ constants into enum
    - mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED
    - mm, compaction: simplify __alloc_pages_direct_compact feedback interface
    - mm, compaction: distinguish between full and partial COMPACT_COMPLETE
    - mm, compaction: abstract compaction feedback to helpers
    - mm, oom: protect !costly allocations some more
    - mm: consider compaction feedback also for costly allocation
    - mm, oom, compaction: prevent from should_compact_retry looping for ever for
      costly orders
    - mm, oom: protect !costly allocations some more for !CONFIG_COMPACTION
    - mm, oom: prevent premature OOM killer invocation for high order request

  * Backport 3 patches to fix bugs with AIX clients using IBMVSCSI Target Driver
    (LP: #1657194)
    - SAUCE: ibmvscsis: Fix max transfer length
    - SAUCE: ibmvscsis: fix sleeping in interrupt context
    - SAUCE: ibmvscsis: Fix srp_transfer_data fail return code

  * NVMe: adapter is missing after abnormal shutdown followed by quick reboot,
    quirk needed (LP: #1656913)
    - nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too

  * Ubuntu 16.10 KVM SRIOV: if enable sriov while ping flood is running ping
    will stop working (LP: #1625318)
    - PCI: Do any VF BAR updates before enabling the BARs
    - PCI: Ignore BAR updates on virtual functions
    - PCI: Update BARs using property bits appropriate for type
    - PCI: Separate VF BAR updates from standard BAR updates
    - PCI: Don't update VF BARs while VF memory space is enabled
    - PCI: Remove pci_resource_bar() and pci_iov_resource_bar()
    - PCI: Decouple IORESOURCE_ROM_ENABLE and PCI_ROM_ADDRESS_ENABLE
    - PCI: Add comments about ROM BAR updating

  * Linux rtc self test fails in a VM under xenial (LP: #1649718)
    - kvm: x86: Convert ioapic->rtc_status.dest_map to a struct
    - kvm: x86: Track irq vectors in ioapic->rtc_status.dest_map
    - kvm: x86: Check dest_map->vector to match eoi signals for rtc

  * Xenial update to v4.4.44 stable releas...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Revision history for this message
Iain Buclaw (iainb) wrote :

The bug is present on 4.8 kernels too.

---
[251529.693133] CPU: 3 PID: 1547 Comm: icinga2 Not tainted 4.8.0-34-generic #36~16.04.1-Ubuntu
[251529.693134] Hardware name: MSI MS-7823/B85M-G43 (MS-7823), BIOS V3.14B3 06/23/2014
[251529.693135] 0000000000000286 00000000e9fa7ede ffff95c3f774bb38 ffffffffab62d7b3
[251529.693137] ffff95c3f774bcc8 ffff95c3f23bd700 ffff95c3f774bba0 ffffffffab42e9bb
[251529.693138] ffff95c3f774bb40 0000000000000000 0000000000000000 0000000000000000
[251529.693140] Call Trace:
[251529.693145] [<ffffffffab62d7b3>] dump_stack+0x63/0x90
[251529.693148] [<ffffffffab42e9bb>] dump_header+0x5c/0x1dc
[251529.693151] [<ffffffffab3a5836>] oom_kill_process+0x226/0x3f0
[251529.693153] [<ffffffffab3a5daa>] out_of_memory+0x35a/0x3f0
[251529.693155] [<ffffffffab3ab06b>] __alloc_pages_slowpath+0x9fb/0xa20
[251529.693157] [<ffffffffab3ab34a>] __alloc_pages_nodemask+0x2ba/0x300
[251529.693160] [<ffffffffab280726>] copy_process.part.30+0x146/0x1b50
[251529.693162] [<ffffffffab95c66d>] ? sock_recvmsg+0x3d/0x50
[251529.693163] [<ffffffffab95c8aa>] ? SYSC_recvfrom+0xda/0x150
[251529.693164] [<ffffffffab282327>] _do_fork+0xe7/0x3f0
[251529.693166] [<ffffffffab95e171>] ? __sys_recvmsg+0x51/0x90
[251529.693168] [<ffffffffab2826d9>] SyS_clone+0x19/0x20
[251529.693170] [<ffffffffab203bae>] do_syscall_64+0x5e/0xc0
[251529.693174] [<ffffffffaba96625>] entry_SYSCALL64_slow_path+0x25/0x25
[251529.693174] Mem-Info:
[251529.693177] active_anon:339565 inactive_anon:133615 isolated_anon:0
                 active_file:3938458 inactive_file:328087 isolated_file:0
                 unevictable:8 dirty:200 writeback:37 unstable:0
                 slab_reclaimable:3365424 slab_unreclaimable:16102
                 mapped:9114 shmem:1459 pagetables:2462 bounce:0
                 free:49449 free_pcp:32 free_cma:0
---

Had 5 servers knocked out over the weekend.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

4.8 kernels are not affected by this bug. If you have OOMs on 4.8 kernels, please file a new bug with all the revelant details and logs.

Thanks.
Cascardo.

Revision history for this message
Iain Buclaw (iainb) wrote :

Yes they are. I'm seeing the same exorbitant memory usage on 4.8.0-36 that we had on 4.4.0-58.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

This bug is not about memory use. It's about the Linux kernel triggering the OOM killer when higher-order (order 2) allocations are requested and progress cannot be made. This affected Linux 4.7 on mainline and was fixed on Linux 4.7 stable and Linux 4.8. When some fixes were backported to 4.4.0-59 (4.4.0-58 was not affected), this bug was introduced to Xenial kernels; it is now fixed in 4.4.0-63. Any behavior on 4.8 kernels must be investigated separately, because all fixes that were backported to 4.4.0-63 are already present in 4.8.

Can you please open a new bug and attach all the logs and details you can, so we can investigate your problem and provide a fix? Please do not use this bug, because the fixes would be different anyway, and even though the symptoms may look alike, we consider them different bugs.

I appreciate you opening a new bug and providing this new report.

Thanks.
Cascardo.

Revision history for this message
Iain Buclaw (iainb) wrote :

This is the reported /proc/meminfo Buffers usage for 4 different kernel versions. We got the same OOM call traces on both 4.4.0-58 and 4.8.0-34; I highly doubt that is a coincidence.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

Hi, Mr. Iain Buclaw.

Memory usage reports could be related to something else. This bug was introduced in 4.4.0-59; you mention 4.4.0-58. We could certainly investigate the issue you see, just not on this bug. Much more data is necessary, but please don't attach new data to this bug. Your report relates to a different kernel, which has very different memory management code. I could ask you to test 4.4.0-63, but it is still possible that you would find problems there, because they would be unrelated to this bug.

Thank you.
Cascardo.

Revision history for this message
Iain Buclaw (iainb) wrote :

The OOM fixes were introduced in 4.4.0-58 according to the changelog, but sure.

Revision history for this message
Mike Williams (mdub) wrote :

I'm pretty sure this bug was introduced in 4.4.0-58, even though I reported it against 4.4.0-59. Still, +1 for a raising a separate bug against 4.8.

Thanks to Cascardo for fixing this one.

Revision history for this message
joconcepts (jonav) wrote :

Could somebody please confirm that the issue has been fixed with kernel 4.4.0-64.85? We had massive problems with OOM-killed qemu instances on our virtualization hosts and would not like to see this reintroduced.

Revision history for this message
Mathias Bogaert (mathias-bogaert) wrote :

I can confirm 4.4.0-64.85 fixes our OOM issues.

Revision history for this message
Anton (azenkov) wrote :

I still see OOM killer invocation on 4.4.0-64.85:

Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857840] java invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857846] java cpuset=/ mems_allowed=0-1
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857855] CPU: 27 PID: 47820 Comm: java Tainted: G W 4.4.0-64-generic #85-Ubuntu
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857857] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/12/2016
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857860] 0000000000000286 00000000ee496386 ffff882358f13b10 ffffffff813f8083
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857867] ffff882358f13cc8 ffff883c70a8f000 ffff882358f13b80 ffffffff8120b0fe
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857871] ffffffff81cd63bf 0000000000000000 ffffffff81e677e0 0000000000000206
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857875] Call Trace:
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857885] [<ffffffff813f8083>] dump_stack+0x63/0x90
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857893] [<ffffffff8120b0fe>] dump_header+0x5a/0x1c5
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857899] [<ffffffff81192812>] oom_kill_process+0x202/0x3c0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857902] [<ffffffff81192c39>] out_of_memory+0x219/0x460
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857907] [<ffffffff81198c28>] __alloc_pages_slowpath.constprop.88+0x938/0xad0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857911] [<ffffffff81199046>] __alloc_pages_nodemask+0x286/0x2a0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857915] [<ffffffff811990fb>] alloc_kmem_pages_node+0x4b/0xc0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857921] [<ffffffff8107ea5e>] copy_process+0x1be/0x1b70
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857930] [<ffffffff811c19bd>] ? handle_mm_fault+0xcbd/0x1820
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857935] [<ffffffff81406574>] ? call_rwsem_down_read_failed+0x14/0x30
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857938] [<ffffffff810805a0>] _do_fork+0x80/0x360
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857942] [<ffffffff81080929>] SyS_clone+0x19/0x20
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857948] [<ffffffff8183c5f2>] entry_SYSCALL_64_fastpath+0x16/0x71
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857969] Mem-Info:
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857981] active_anon:32518350 inactive_anon:2099 isolated_anon:0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857981] active_file:45384948 inactive_file:45384381 isolated_file:64
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857981] unevictable:914 dirty:104 writeback:0 unstable:0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857981] slab_reclaimable:1282591 slab_unreclaimable:39566
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857981] mapped:10291496 shmem:2227 pagetables:732932 bounce:0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857981] free:272957 free_pcp:1153 free_cma:0

Revision history for this message
Julian Kassat (j.kassat) wrote :

Same here on Linux 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux.

Out of memory: Kill process 11067 (java) score 28 or sacrifice child
Killed process 11067 (java) total-vm:3569724kB, anon-rss:211720kB, file-rss:20208kB
systemd-journald[247]: /dev/kmsg buffer overrun, some messages lost.
swap_free: Bad swap file entry 2000000000000000
BUG: Bad page map in process java pte:00000020 pmd:dac28067
addr:00007fc34ce69000 vm_flags:08000071 anon_vma: (null) mapping: (null) index:7fc34ce69
file: (null) fault: (null) mmap: (null) readpage: (null)
CPU: 1 PID: 11108 Comm: java Tainted: G B D 4.4.0-64-generic #85-Ubuntu
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
 0000000000000286 00000000f2f56de1 ffff8800d8ceba58 ffffffff813f8083
 00007fc34ce69000 ffff8800dac023e8 ffff8800d8cebaa8 ffffffff811be06f
 ffff8800d8ceba80 ffffffff811d518c 2000000000000000 0000000000000020
Call Trace:
 [<ffffffff813f8083>] dump_stack+0x63/0x90
 [<ffffffff811be06f>] print_bad_pte+0x1df/0x2a0
 [<ffffffff811d518c>] ? swap_info_get+0x7c/0xd0
 [<ffffffff811bf9f8>] unmap_page_range+0x468/0x7a0
 [<ffffffff811bfdad>] unmap_single_vma+0x7d/0xe0
 [<ffffffff811c0871>] unmap_vmas+0x51/0xa0
 [<ffffffff811c9df7>] exit_mmap+0xa7/0x170
 [<ffffffff8107e0a7>] mmput+0x57/0x130
 [<ffffffff81083f2a>] do_exit+0x27a/0xb00
 [<ffffffff8110046c>] ? __unqueue_futex+0x2c/0x60
 [<ffffffff81100f8e>] ? futex_wait+0x16e/0x280
 [<ffffffff81084833>] do_group_exit+0x43/0xb0
 [<ffffffff810909b2>] get_signal+0x292/0x600
 [<ffffffff8102e567>] do_signal+0x37/0x6f0
 [<ffffffff8122fb84>] ? mntput+0x24/0x40
 [<ffffffff81210ba0>] ? __fput+0x190/0x220
 [<ffffffff8100320c>] exit_to_usermode_loop+0x8c/0xd0
 [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
 [<ffffffff8183c750>] int_ret_from_sys_call+0x25/0x8f
BUG: Bad rss-counter state mm:ffff8800d8bc8800 idx:2 val:-1

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

Hi, Anton and Julian.

Can you attach complete logs for investigation?

Thanks.
Cascardo.

Revision history for this message
Julian Kassat (j.kassat) wrote :

Attached kern.log. Let me know in case you need more logs.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

Hi, Julian.

Do you have the output of dmesg after the incident before a reboot?

Cascardo.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

Julian, your logs indicate some possible swap corruption, would you mind opening a new bug and sending it using apport-bug?

Thanks.
Cascardo.

Revision history for this message
Julian Kassat (j.kassat) wrote :

Hi Cascardo,

there is no related dmesg output after the incident (just some lines from apt-daily.timer).

I filed a bug for the possible swap corruption issue here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1669707

Thanks so far.

Julian

Revision history for this message
Pete Cheslock (pete-cheslock) wrote :

We have been seeing this issue recently as well. We are running 4.4.0-66-generic #87-Ubuntu. I can attempt to downgrade to 4.4.0-57, but it's a large cluster with a lot of data, so it may take some time. Attached is a kern.log from the most recent OOM.

Revision history for this message
Flemming Hoffmeyer (flemming-b-h) wrote :

I am seeing this issue as well, on Arch kernel v 4.10.4-1

Revision history for this message
Michael Dye (dye.michael) wrote :

This is plaguing Horizon project Pi2 and Pi3 devices running Xenial 16.04.2 w/ kernel 4.4.0-1050-raspi2. From a pi2:

root@horizon-00000000a17d2187:~# uname -a
Linux horizon-00000000a17d2187 4.4.0-1050-raspi2 #57-Ubuntu SMP Wed Mar 22 12:52:22 UTC 2017 armv7l armv7l armv7l GNU/Linux
root@horizon-00000000a17d2187:~# free
              total        used        free      shared  buff/cache   available
Mem:         942128      149548       35456      494084      757124      239716
Swap:             0           0           0

Under these circumstances, the kernel's oom-killer will kill WiFi processes (rtl_rpcd), systemd-udevd, our Ethereum client (geth), and other critical processes in an attempt to stay afloat rather than using reclaimable RAM.

Revision history for this message
Mohammad Anwar Shah (mohammadanwarshah) wrote :

I was using 4.4.0-21 as reported by `uname -r`, which is the default in Kubuntu 16.04. The same bug appears on mainline kernel 4.10 too!

Now I'm confused. Which kernel should I upgrade to? Also, I only experience this in a KDE session with the Yandex or Chrome browser open.

Revision history for this message
iKazmi (alikazmi-2040) wrote :

I have 4.4.0-59 through 4.4.0-71 and 4.8.0-41 through 4.8.0-46 installed on my system and all are affected by this bug. Firefox, Chrome and NetBeans regularly get killed without warning and for no reason (I have something like 10GB+ of RAM and all 16GB of swap free at the time the process gets killed). Even KDE has been killed a couple of times while the system still had over 6GB RAM and 16GB swap free.

Yesterday, after the umpteenth time NetBeans was killed while I was in the middle of doing something, I finally decided to do something about this problem and installed kernel 4.10.9-041009 from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10.9/. Sadly, that doesn't seem to resolve the problem either, and the OOM killer is still overeager to kill user processes (Firefox and NetBeans have both been killed multiple times). At least KDE hasn't been killed so far.

Revision history for this message
kulwinder singh (kulwinder-careers) wrote :

Has anybody successfully tested 4.4.0.63 for the OOM-kill issue?

Revision history for this message
Anton (anton-powershop) wrote :

Yes, 4.4.0.63 solved our OOM issues (and we had plenty after 4.4.0.59). Ours were all headless servers (bare metal and VMs) though - no desktop usage.

I never experienced this issue on my home laptop either, but it had lots of RAM and was only lightly used during that period - not really a good data point.

Revision history for this message
Travisgevans (travisgevans) wrote :

I also haven't personally encountered any further OOM issues on my home desktop (used daily) with 4.4.0.63.

Revision history for this message
Mohammad Anwar Shah (mohammadanwarshah) wrote :

I'd like to emphasise that the OOM problem only happens with KDE. I have several DEs installed, including Unity, GNOME 3 and Cinnamon, but none of them caused an OOM, at least that I noticed. In KDE, most of the time when Chrome is open, it triggers an OOM. dmesg shows that sometimes kwin_x11 or plasmashell invoked the OOM killer.

Most of the time plasmashell crashes and the open tab in Chrome is killed; the Chrome application itself is still there. I then need to restart plasmashell by pressing Alt-F2 to bring up the run-command dialog and typing plasmashell there.

Last night even Firefox gave an OOM.

I'm attaching a dmesg log hoping that will be helpful.

Revision history for this message
Sebastian Unger (sebunger44) wrote :

This is still an issue in the current linux-raspi2 version. Were those changes ported to that kernel?

Revision history for this message
Sebastian Unger (sebunger44) wrote :

linux-raspi2 version 4.4.0.1055.56 that is.

Revision history for this message
kimo (ubuntu-oldfield) wrote :

I'm seeing oom-killer being invoked despite having 2GB free swap when using the kernel from linux-image-4.4.0-1055-raspi2 version 4.4.0-1055.62.

kimo (ubuntu-oldfield)
Changed in linux-raspi2 (Ubuntu):
status: New → Confirmed
Changed in linux-raspi2 (Ubuntu Xenial):
status: New → Confirmed
Revision history for this message
Sebastian Unger (sebunger44) wrote :

Also observed with 4.4.0-1054-raspi2. I'm now back on 4.4.0-1038-raspi2. I think that one was ok.

Revision history for this message
Nick Hatch (nicholas-hatch) wrote :

We're still having issues with higher-order allocations failing and triggering an OOM kill for unexplained reasons (on 4.4.0-78-generic).

I've attached the relevant OOM killer logs. It may be relevant to note that the server these logs are from is an Elasticsearch instance with a large (~32GB) mlock'ed heap.

Revision history for this message
Pete Cheslock (pete-cheslock) wrote :

@nicholas-hatch - what file system are your disks formatted as? I was able to stop the OOM's on my ES hosts by moving from XFS to EXT4. My belief is that there was a memory fragmentation issue with ES and many small files on XFS formatted volumes.

Revision history for this message
Chris (cmavr8) wrote :

The bug is still confirmed and not fixed for linux-raspi2 (Ubuntu), 5 months after being fixed for the main Ubuntu kernel.

Shouldn't this have some priority? Even apt upgrade breaks if I don't use the clear-cache workaround. I can live with it (a cron job to clear the cache; see the sketch after this comment), but this is not great for LTS.

Currently affected: Ubuntu 16.04.2 LTS, 4.4.0-1059-raspi2 #67-Ubuntu
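
A sketch of that cron workaround, using the drop_caches command mentioned earlier in this bug (the file path and hourly schedule are only illustrative):

# /etc/cron.d/drop-caches -- drop the page cache hourly as a stopgap
0 * * * *  root  sync && echo 3 > /proc/sys/vm/drop_caches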

Paolo Pisati (p-pisati)
Changed in linux-raspi2 (Ubuntu):
assignee: nobody → Paolo Pisati (p-pisati)
Revision history for this message
Paolo Pisati (p-pisati) wrote :
Revision history for this message
Chris (cmavr8) wrote :

Sure.
I undid the workaround, installed and booted the kernel and will test it for a few days. I'll keep you posted on results.

Thanks Paolo!

Revision history for this message
Chris (cmavr8) wrote :

Update: No sign of Out-of-memory errors or kills, after 3 days of testing the 4.4.0-1062-raspi2 kernel. I'll report back again next week.

Revision history for this message
kimo (ubuntu-oldfield) wrote :

4.4.0-1062-raspi2 is looking good - I've had it running for a week without oom-killer being invoked.

Revision history for this message
Chris (cmavr8) wrote :

Mine's also still stable (no OOMs), after running the patched kernel for 9 days, on a Raspberry pi 2 Model B v1.1.

Revision history for this message
Swe W Aung (sirswa) wrote :

Hi

I am experiencing this at one of our compute node hypervisors. The kernel version we are using is 4.4.0-83, but it seems to be having the issue described in this report.

[Mon Aug 7 00:19:42 2017] nova-compute invoked oom-killer: gfp_mask=0x2c200ca, order=0, oom_score_adj=0
[Mon Aug 7 00:19:42 2017] nova-compute cpuset=/ mems_allowed=0-1
[Mon Aug 7 00:19:42 2017] CPU: 7 PID: 2164484 Comm: nova-compute Tainted: G OE 4.4.0-83-generic #106-Ubuntu
[Mon Aug 7 00:19:42 2017] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017
[Mon Aug 7 00:19:42 2017] 0000000000000286 00000000d6004dce ffff88014e753a50 ffffffff813f9513
[Mon Aug 7 00:19:42 2017] ffff88014e753c08 ffff883fecf88e00 ffff88014e753ac0 ffffffff8120b53e
[Mon Aug 7 00:19:42 2017] 0000000000000015 0000000000000000 ffff881fe883b740 ffff883fe94f7000
[Mon Aug 7 00:19:42 2017] Call Trace:
[Mon Aug 7 00:19:42 2017] [<ffffffff813f9513>] dump_stack+0x63/0x90
[Mon Aug 7 00:19:42 2017] [<ffffffff81391c64>] ? apparmor_capable+0xc4/0x1b0
[Mon Aug 7 00:19:42 2017] [<ffffffff81192ae2>] oom_kill_process+0x202/0x3c0
[Mon Aug 7 00:19:42 2017] [<ffffffff81192f09>] out_of_memory+0x219/0x460
[Mon Aug 7 00:19:42 2017] [<ffffffff81198ef8>] __alloc_pages_slowpath.constprop.88+0x938/0xad0
[Mon Aug 7 00:19:42 2017] [<ffffffff81199316>] __alloc_pages_nodemask+0x286/0x2a0
[Mon Aug 7 00:19:42 2017] [<ffffffff811e467d>] alloc_pages_vma+0xad/0x250
[Mon Aug 7 00:19:42 2017] [<ffffffff811fad53>] do_huge_pmd_wp_page+0x153/0xb70
[Mon Aug 7 00:19:42 2017] [<ffffffff811c1a5f>] handle_mm_fault+0x90f/0x1820
[Mon Aug 7 00:19:42 2017] [<ffffffff8106b802>] ? do_page_fault+0x22/0x30
[Mon Aug 7 00:19:42 2017] [<ffffffff81842cf8>] ? page_fault+0x28/0x30
[Mon Aug 7 00:19:42 2017] [<ffffffff8106b577>] __do_page_fault+0x197/0x400
[Mon Aug 7 00:19:42 2017] [<ffffffff8106b802>] do_page_fault+0x22/0x30
[Mon Aug 7 00:19:42 2017] [<ffffffff81842cf8>] page_fault+0x28/0x30
[Mon Aug 7 00:19:42 2017] Mem-Info:
[Mon Aug 7 00:19:42 2017] active_anon:61350709 inactive_anon:2118817 isolated_anon:0
                            active_file:0 inactive_file:0 isolated_file:32
                            unevictable:915 dirty:0 writeback:8 unstable:0
                            slab_reclaimable:14082 slab_unreclaimable:64456
                            mapped:3492 shmem:329012 pagetables:142167 bounce:0
                            free:260204 free_pcp:4111 free_cma:0

[Tue Aug 8 05:50:08 2017] apt-check invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
[Tue Aug 8 05:50:08 2017] apt-check cpuset=/ mems_allowed=0-1
[Tue Aug 8 05:50:08 2017] CPU: 11 PID: 2538289 Comm: apt-check Tainted: G OE 4.4.0-83-generic #106-Ubuntu
[Tue Aug 8 05:50:08 2017] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017
[Tue Aug 8 05:50:08 2017] 0000000000000286 000000005e467cc9 ffff8820b44a39f8 ffffffff813f9513
[Tue Aug 8 05:50:08 2017] ffff8820b44a3bb0 ffff881fec15b800 ffff8820b44a3a68 ffffffff8120b53e
[Tue Aug 8 05:50:08 2017] 0000000000000015 ffffffff81e42ac0 ffff883fe996f980 ffffffffffffff04
[Tue Aug 8 05:50:08 2017] Call Trace:
[Tue Aug 8 05:50:08 2017] [<ff...


Revision history for this message
Jake Billo (ev98) wrote :

We are also experiencing this issue running linux-aws 4.4.0-1028.37, which tracks Ubuntu kernel 4.4.0-89.112. Our use case is very similar to comment #86 (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1655842/comments/86). In our case ElasticSearch 2.4.5 is running under Java 1.8.0_131 with a ~29GB heap; we downsized from 31GB as a troubleshooting effort with no change to the frequency of OOM. The issue also occurs regardless of vm.overcommit_memory being set to 0, 1 or 2.

The relevant data from kern.log (with redacted hostname) is attached; I'm happy to provide additional logs or test different kernels, but since our use case is i3-class instances in AWS, we need the nvme enhancements and enhanced network I/O provided by the linux-aws package.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

Please, do not cut the logs. Without the "invoked oom-killer" line, for example, it's hard to see the gfp flags and allocation order that failed.

Revision history for this message
Pete Cheslock (pete-cheslock) wrote :

I have seemingly solved this issue with linux-aws version 4.4.0-1016-aws, at the very least. The specific issue I was seeing was 2nd-order allocations failing when the OOM killer triggered. At the time I was thinking the issue was due to XFS and memory fragmentation with lots and lots of memory-mapped files in Elasticsearch/Lucene. When we moved to EXT4 the rate of the OOM killer firing dropped, but did not stop. We made the following 2 changes to sysctls, which have effectively stopped higher-order memory allocations from failing and the OOM killer from firing.

Also, these settings were used on i3.2xlarge hosts that have 60G of RAM - your mileage may vary. We do not run swap on our servers, so adding swap could likely have helped, but it is not an option for us.

vm.min_free_kbytes = 1000000 # We set this to leave about 1G of RAM available for the kernel, in the hope that even if memory is heavily fragmented there is still enough for Linux to grab a higher-order allocation fast enough before the OOM killer does things.

vm.zone_reclaim_mode = 1 # Our hope here was to get the kernel to be more aggressive in reclaiming memory.
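
For anyone trying those values, a sketch of applying them persistently (the file name is arbitrary, and the values are the ones quoted above, not a general recommendation):

$ printf 'vm.min_free_kbytes = 1000000\nvm.zone_reclaim_mode = 1\n' | sudo tee /etc/sysctl.d/60-oom-workaround.conf
$ sudo sysctl --system    # reloads all sysctl configuration, including the new file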

Revision history for this message
Jake Billo (ev98) wrote :

Apologies - the file was inadvertently split by logrotate. I have concatenated the entire contents of kern.log and kern.log.1 into the attached file; these are the only kern.log files in /var/log on the system.

I do have to redact the hostname in question, but it is a simple substitution of 'localhost' for the FQDN of the system.

Revision history for this message
Pete Cheslock (pete-cheslock) wrote :

> kthreadd invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0

Yea - that 2nd order allocation failure is the exact same issue I was able to see (same GFP mask also)

Revision history for this message
Swe W Aung (sirswa) wrote :

We have another case of OOM on one of the hosts that we upgraded to kernel 4.4.0-89 a week ago.

kern.log attached.

Revision history for this message
Swe W Aung (sirswa) wrote :

Attaching dmesg output

Revision history for this message
Jake Billo (ev98) wrote :

With the sysctl settings provided by Pete (vm.min_free_kbytes = 1000000 and vm.zone_reclaim_mode = 1), we've been running the linux-aws 4.4.0-1028.37 kernel successfully without an OOM killer invocation for about four days now. Previously we would have seen three or more occurrences of this per day, so it's a positive indication.

Revision history for this message
Willem (wdekker) wrote :

We have found this issue on 4.4.0-92 too, but only when the systems were put under stress.
Reverting back to 4.4.0-57 resolved it.

Revision history for this message
Willem (wdekker) wrote :

Attached kern.log

Paolo Pisati (p-pisati)
Changed in linux-raspi2 (Ubuntu):
status: Confirmed → Fix Committed
Changed in linux-raspi2 (Ubuntu Xenial):
status: Confirmed → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-aws (Ubuntu Xenial):
status: New → Confirmed
Changed in linux-aws (Ubuntu):
status: New → Confirmed
Revision history for this message
Vladimir Nicolici (vnicolici) wrote :

Not sure if it's the same issue, but we had an unexpected OOM with Ubuntu 16.04.3 LTS, 4.4.0-91.

Oct 31 23:52:25 db3 kernel: [6569272.882023] psql invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0

...

Oct 31 23:52:25 db3 kernel: [6569272.882154] Mem-Info:
Oct 31 23:52:25 db3 kernel: [6569272.882165] active_anon:38011018 inactive_anon:1422084 isolated_anon:0
Oct 31 23:52:25 db3 kernel: [6569272.882165] active_file:11699125 inactive_file:11727535 isolated_file:0
Oct 31 23:52:25 db3 kernel: [6569272.882165] unevictable:0 dirty:88019 writeback:2902991 unstable:23308
Oct 31 23:52:25 db3 kernel: [6569272.882165] slab_reclaimable:1455159 slab_unreclaimable:533985
Oct 31 23:52:25 db3 kernel: [6569272.882165] mapped:38499394 shmem:38495946 pagetables:33687177 bounce:0
Oct 31 23:52:25 db3 kernel: [6569272.882165] free:212612 free_pcp:0 free_cma:0
Oct 31 23:52:25 db3 kernel: [6569272.882172] Node 0 DMA free:13256kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15976kB managed:15892kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Oct 31 23:52:25 db3 kernel: [6569272.882182] lowmem_reserve[]: 0 1882 193368 193368 193368
Oct 31 23:52:25 db3 kernel: [6569272.882188] Node 0 DMA32 free:768204kB min:316kB low:392kB high:472kB active_anon:8kB inactive_anon:32kB active_file:20kB inactive_file:48kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2045556kB managed:1964868kB mlocked:0kB dirty:0kB writeback:44kB mapped:16kB shmem:12kB slab_reclaimable:729192kB slab_unreclaimable:35928kB kernel_stack:1920kB pagetables:415552kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 31 23:52:25 db3 kernel: [6569272.882196] lowmem_reserve[]: 0 0 191486 191486 191486
Oct 31 23:52:25 db3 kernel: [6569272.882201] Node 0 Normal free:34260kB min:32432kB low:40540kB high:48648kB active_anon:58162056kB inactive_anon:2546400kB active_file:18254204kB inactive_file:18282192kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:199229440kB managed:196081724kB mlocked:0kB dirty:152124kB writeback:4685924kB mapped:58223800kB shmem:58229824kB slab_reclaimable:2362116kB slab_unreclaimable:1123984kB kernel_stack:11056kB pagetables:94580096kB unstable:22108kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 31 23:52:25 db3 kernel: [6569272.882210] lowmem_reserve[]: 0 0 0 0 0
Oct 31 23:52:25 db3 kernel: [6569272.882215] Node 1 Normal free:34728kB min:32780kB low:40972kB high:49168kB active_anon:93882008kB inactive_anon:3141904kB active_file:28542276kB inactive_file:28627900kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:201326592kB managed:198178644kB mlocked:0kB dirty:199952kB writeback:6925996kB mapped:95773760kB shmem:95753948kB slab_reclaimable:2729328kB slab_unreclaimable:976028kB...


Revision history for this message
William DeLuca (qops1981) wrote :

We believe that we are experiencing this issue on kernel 4.4.0-1030-aws as well. We recently moved from 14.04 LTS to 16.04 LTS and are now experiencing OOM kills.

Revision history for this message
William DeLuca (qops1981) wrote :

Side question: is there something I can specifically look for on an Ubuntu install that would indicate whether that kernel has the fix or not? I assume the fix/not-fixed status indicators are set manually, and the fix could be out for AWS but not indicated.

description: updated
Revision history for this message
Erik Hess (p-we-x) wrote :

In our production environment of ~1800 nodes we've seen oom-kill events that looked similar to this bug's pattern - oom-kills killing large server processes while resident memory was far lower than available physical memory.

We were affected by the original bug and saw that issue readily addressed in newer kernel versions, as mentioned in the earlier comments in this ticket. However, we still kept seeing oom-kill events, albeit in far lower numbers over time, happening on kernel-upgraded systems. These were a mystery for a while, largely due to their infrequent occurrence.

After a lot of research we think we've pinned it down to a subset of our multi-socket servers that have more than one NUMA memory pool. After implementing some scripts to track NUMA stats, we've observed that one of the two NUMA pools is being fully utilized while the other has large amounts of memory to spare (often 90-95%). Either our server app, the JVM it's running on, or the kernel itself isn't handling the NUMA memory pooling well, and we're ending up exhausting an entire NUMA pool.

Work is ongoing to establish the causality chain that's leading to this. We don't yet have confirmation about whether it's something our app (or its libraries) is doing, whether we just need to make the JVM NUMA-aware with arguments, or whether there's kernel tuning to be done. But I did want to mention it here as a warning to folks running on multi-NUMA-pool multi-socket systems seeing similar behavior.
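
A sketch of the kind of per-node tracking described above (numastat ships with the numactl package; the interval is arbitrary):

$ numastat -m | grep -E 'MemTotal|MemFree|MemUsed'    # per-NUMA-node memory summary
$ watch -n 60 numastat -m                             # watch for one node running dry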

Revision history for this message
Trent Lloyd (lathiat) wrote :

You can potentially use numactl to launch the process and set a policy of interleaving allocations between NUMA nodes to avoid these 1 sided allocations. Tends to happen with servers that make big allocations from a single thread during startup, as commonly seen on mysqld servers and the innodb_buffer_pool for example.

numactl --interleave all /path/to/server/process --argument-1 #etc

Reference:
https://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/
