"Out of memory" errors after upgrade to 4.4.0-59

Bug #1655842 reported by Mike Williams on 2017-01-12
This bug affects 87 people
Affects: linux (Ubuntu) | Importance: High | Assigned to: Thadeu Lima de Souza Cascardo
Affects: linux (Ubuntu Xenial) | Importance: High | Assigned to: Thadeu Lima de Souza Cascardo

Bug Description

I recently replaced some Xenial servers, and started experiencing "Out of memory" problems with the default kernel.

We bake Amazon AMIs based on an official Ubuntu-provided image (ami-e6b58e85, in ap-southeast-2, from https://cloud-images.ubuntu.com/locator/ec2/). Previous versions of our AMI included "4.4.0-57-generic", but the latest version picked up "4.4.0-59-generic" as part of a "dist-upgrade".

Instances booted using the new AMI have been using more memory, and experiencing OOM issues - sometimes during boot, and sometimes a while afterwards. An example from the system log is:

[ 130.113411] cloud-init[1560]: Cloud-init v. 0.7.8 running 'modules:final' at Wed, 11 Jan 2017 22:07:53 +0000. Up 29.28 seconds.
[ 130.124219] cloud-init[1560]: Cloud-init v. 0.7.8 finished at Wed, 11 Jan 2017 22:09:35 +0000. Datasource DataSourceEc2. Up 130.09 seconds
[29871.137128] Out of memory: Kill process 2920 (ruby) score 107 or sacrifice child
[29871.140816] Killed process 2920 (ruby) total-vm:675048kB, anon-rss:51184kB, file-rss:2164kB
[29871.449209] Out of memory: Kill process 3257 (splunkd) score 97 or sacrifice child
[29871.453282] Killed process 3258 (splunkd) total-vm:66272kB, anon-rss:6676kB, file-rss:0kB
[29871.677910] Out of memory: Kill process 2647 (fluentd) score 51 or sacrifice child
[29871.681872] Killed process 2647 (fluentd) total-vm:117944kB, anon-rss:23956kB, file-rss:1356kB
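For anyone triaging similar reports, the "Killed process" lines above are easy to tabulate. This is just an illustrative sketch, not part of the report; the printf sample replays two lines from the log above and stands in for a real `dmesg` feed:

```shell
# Extract PID and process name from oom-killer "Killed process" lines.
# In real use you would feed this from dmesg or /var/log/kern.log.
printf '%s\n' \
  '[29871.140816] Killed process 2920 (ruby) total-vm:675048kB, anon-rss:51184kB, file-rss:2164kB' \
  '[29871.681872] Killed process 2647 (fluentd) total-vm:117944kB, anon-rss:23956kB, file-rss:1356kB' |
awk '/Killed process/ { gsub(/[()]/, "", $5); print $4, $5 }'
```

With `dmesg | awk '/Killed process/ { gsub(/[()]/, "", $5); print $4, $5 }'` the same one-liner gives a quick survey of which services were hit.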

I have a hunch that this may be related to the fix for https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1647400, introduced in linux (4.4.0-58.79).

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-59-generic 4.4.0-59.80
ProcVersionSignature: User Name 4.4.0-59.80-generic 4.4.35
Uname: Linux 4.4.0-59-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jan 12 06:29 seq
 crw-rw---- 1 root audio 116, 33 Jan 12 06:29 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.1-0ubuntu2.4
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Thu Jan 12 06:38:45 2017
Ec2AMI: ami-0f93966c
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: ap-southeast-2a
Ec2InstanceType: t2.nano
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb: Error: command ['lsusb'] failed with exit code 1:
MachineType: Xen HVM domU
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 cirrusdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-59-generic root=UUID=fb0fef08-f3c5-40bf-9776-f7ba00fe72be ro console=tty1 console=ttyS0
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-59-generic N/A
 linux-backports-modules-4.4.0-59-generic N/A
 linux-firmware 1.157.6
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 12/09/2016
dmi.bios.vendor: Xen
dmi.bios.version: 4.2.amazon
dmi.chassis.type: 1
dmi.chassis.vendor: Xen
dmi.modalias: dmi:bvnXen:bvr4.2.amazon:bd12/09/2016:svnXen:pnHVMdomU:pvr4.2.amazon:cvnXen:ct1:cvr:
dmi.product.name: HVM domU
dmi.product.version: 4.2.amazon
dmi.sys.vendor: Xen

Mike Williams (mdub) wrote :

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → High
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Joseph Salisbury (jsalisbury) wrote :

I built a Xenial test kernel with the following two commits reverted:

c630ec12d831 mm, oom: rework oom detection
57e9ef475661 mm: throttle on IO only when there are too many dirty and writeback pages

This test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1655842/

Can you test this kernel and see if it resolves this bug?

Thanks in advance!

Fabian Grünbichler (f-gruenbichler) wrote :

You could also try cherry-picking https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=6b4e3181d7bd5ca5ab6f45929e4a5ffa7ab4ab7f, but that will probably need some more in-between patches as well.

Reverting the two commits fixed the issue for our users (Proxmox VE, which uses a kernel based on the 4.4.x one from 16.04).

David F. (malteworld) wrote :

@f-gruenbichler: I already tried to cherry-pick that patch a while ago and it doesn't work because that patch is based on work that isn't in the 4.4.* kernel branch, not even including Canonical's backports from later branches.

Mike Williams (mdub) wrote :

Thanks jsalisbury. We have deployed using your test kernel (from http://kernel.ubuntu.com/~jsalisbury/lp1655842/), and experienced no OOM issues.

Allen Wild (aswild) wrote :

I manage a set of build servers for CPU/IO intensive builds using Yocto/OpenEmbedded. Ubuntu 14.04.5 with the 4.4 Xenial kernel. After updating to 4.4.0-59 the builds started failing because of the OOM killer.

Rolling back to 4.4.0-57 fixed the OOMs for me.

Can you try the kernel at [1]? It includes the patches, which are also published at the same location.

[1] http://people.canonical.com/~cascardo/lp1655842/

Thanks.
Cascardo.

Changed in linux (Ubuntu):
assignee: nobody → Thadeu Lima de Souza Cascardo (cascardo)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Thadeu Lima de Souza Cascardo (cascardo)
Stéphane Graber (stgraber) wrote :

Just a note that Joe's armhf kernel has been working well for me.

I can't test cascardo's kernel as it's not built for armhf.

Thadeu Lima de Souza Cascardo (cascardo) wrote :

I will upload armhf binaries for those kernels and let you know. It's important to try those because they include an alternative solution that we would rather use instead of the one with the reverted patches.

Danny B (danny.b) wrote :

Using Cascardo's kernel fixes the problem for me.

It was a bit of a hassle to install though because there's no linux-headers-4.4.0-62_4.4.0-62.83_all.deb at the link and linux-headers-generic depends on it.

Here's where to find it:
amd64: https://launchpad.net/ubuntu/xenial/amd64/linux-headers-4.4.0-62/4.4.0-62.83
armhf: https://launchpad.net/ubuntu/xenial/armhf/linux-headers-4.4.0-62/4.4.0-62.83

Ben French (octoamit) on 2017-01-21
Changed in linux (Ubuntu):
status: Triaged → In Progress
Stéphane Graber (stgraber) wrote :

I've had a few armhf systems running cascardo's kernel and so far no sign of the OOM or any other problem with it.

Mike Williams (mdub) wrote :

Cascardo: we've tried your test kernel, and it looks good - we've seen no OOM problems.

Cris (cristianpeguero25) wrote :

Hi, I'd like to install Cascardo's kernel, since I've been having the same issue, though strangely not on all of the Xenial machines running 4.4.0-59-generic.
Could someone tell me how to install Cascardo's kernel without completely messing up my machine?

Thanks

xb5i7o (xb5i7o) wrote :

Hi, I am having the exact same issue on a PC with 18 GB of RAM! Kernel 4.4.0-59-generic.

Please can this be fixed as soon as possible in the next kernel update.

It's killing processes such as Firefox and VirtualBox for no good reason while only about 4 GB is actually in use.

Hope this can be fixed soon; it's getting worse as time passes.

Eric Desrochers (slashd) wrote :

The patchset[1] for bug LP #1655842 was submitted on Jan 24th 2017 and acked by the kernel team the same day[2].

The fix should be part of the following kernel release cycle:

cycle: 27-Jan through 18-Feb[3]
====
27-Jan Last day for kernel commits for this cycle.
30-Jan - 04-Feb Kernel prep week.
05-Feb - 17-Feb Bug verification & regression testing.
20-Feb Release to -updates.
====

[1] - "Xenial PATCH 00/11] Fixes OOM for LP #1655842"
[2] - "ACK: [Xenial PATCH 00/11] Fixes OOM for LP #1655842"
[3] - https://wiki.ubuntu.com/KernelTeam/Newsletter

- Eric

Changed in linux (Ubuntu Xenial):
status: Triaged → In Progress
Eric Desrochers (slashd) wrote :

Additional note:

Applied in master-next on Jan 26th 2017[1]

[1] - "APPLIED: [Xenial PATCH 00/11] Fixes OOM for LP #1655842"

- Eric

Eric Desrochers (slashd) on 2017-01-27
tags: added: sts

@slashd It sounds really strange to me that I should wait until 20-Feb for a fix for this bug, when it is clearly a regression introduced by the latest kernel upgrade. Is there no way to speed up fixing this regression?

For now we have had to downgrade all our Xenial systems to linux-image-4.4.0-57-generic to avoid this bug.

Gaudenz

Eric Desrochers (slashd) wrote :

@Gaudenz Steinlin (gaudenz-debian),

It will take about 3 weeks to land in the -updates pocket, but you can expect a call for testing a proposed package by the end of the week.

- Eric

Tim Gardner (timg-tpi) on 2017-01-31
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu):
status: In Progress → Fix Released

This is a severe bug. It should be treated as a high-priority bugfix that cannot wait 3 weeks.

Just as a note for newcomers reading this, I can confirm the bug is NOT fixed in the officially released 4.4.0-62.83.

Krzysztof Dryja (cih997) wrote :

I could not reboot my machine, and the ugly workaround for this issue was to log in as root and clear the system caches:

echo 3 > /proc/sys/vm/drop_caches

This made my machine stable again, at least for the time I needed.

This is fixed in 4.4.0-63.84, which will be available in proposed soon.

Shelby Cain (alyandon) wrote :

@nate Thank you! You just saved me a lot of hassle as I was about to unpin the 4.4.0-57 kernel and update a bunch of machines on the assumption the fix was in that version.

Sebastian Unger (sebunger44) wrote :

As a note: I believe this also affects the armhf kernel 4.4.0-1040-raspi2 for the Raspberry Pi.

David Glasser (glasser) wrote :

I've been struggling with this bug for nearly a week and only now found this issue. Thanks for fixing it!

For the sake of others finding it, here's the stack trace part of the oom-killer log, which contains some terms I searched for a while ago that aren't mentioned here yet.

docker invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=-1000
docker cpuset=/ mems_allowed=0
CPU: 11 PID: 4472 Comm: docker Tainted: G W 4.4.0-62-generic #83-Ubuntu
Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
 0000000000000286 0000000057f64c94 ffff880dfb5efaf0 ffffffff813f7c63
 ffff880dfb5efcc8 ffff880fbfda0000 ffff880dfb5efb60 ffffffff8120ad4e
 ffffffff81cd2d7f 0000000000000000 ffffffff81e67760 0000000000000206
Call Trace:
 [<ffffffff813f7c63>] dump_stack+0x63/0x90
 [<ffffffff8120ad4e>] dump_header+0x5a/0x1c5
 [<ffffffff811926c2>] oom_kill_process+0x202/0x3c0
 [<ffffffff81192ae9>] out_of_memory+0x219/0x460
 [<ffffffff81198a5d>] __alloc_pages_slowpath.constprop.88+0x8fd/0xa70
 [<ffffffff81198e56>] __alloc_pages_nodemask+0x286/0x2a0
 [<ffffffff81198f0b>] alloc_kmem_pages_node+0x4b/0xc0
 [<ffffffff8107ea5e>] copy_process+0x1be/0x1b70
 [<ffffffff8139225c>] ? apparmor_file_alloc_security+0x5c/0x220
 [<ffffffff811ed04a>] ? kmem_cache_alloc+0x1ca/0x1f0
 [<ffffffff81348263>] ? security_file_alloc+0x33/0x50
 [<ffffffff810caeb1>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
 [<ffffffff810805a0>] _do_fork+0x80/0x360
 [<ffffffff81080929>] SyS_clone+0x19/0x20
 [<ffffffff818385f2>] entry_SYSCALL_64_fastpath+0x16/0x71

Hajo Locke (hajo-locke) wrote :

When will this new kernel be released? This bug is killing our MySQL servers. Booting old kernels is only a poor workaround. I think a lot of people with busy servers will have a problem.

This is the second time we have been hit by a big bug within a short time. In October 2016 our nameservers ran into problems because of bug 1634892.
Is LTS Ubuntu still the right system for servers?

This bug also appears to affect linux-image-4.8.0-34-generic in 16.04.1 Xenial.

Hi, Luk.

linux-image-4.8.0-34-generic should not be affected by this. If you see unexpected OOM problems, please open a new bug report and attach the kernel logs.

Thanks.
Cascardo.

xb5i7o (xb5i7o) wrote :

Just by the way: 4.4.0-62-generic has the exact same problem. Even after uninstalling 4.4.0-59-generic, my system at some point auto-updated to 4.4.0-62-generic. Only 4.4.0-57-generic is safe for now.

Nick Maynard (nick-maynard) wrote :

LTS Ubuntu with -updates shouldn't have this sort of issue; this is, frankly, unforgivable.

We need a new kernel urgently in -updates, and I'd expect serious discussions within the kernel team to understand what caused this issue and prevent it from recurring.

Anton Piatek (anton-piatek) wrote :

If this kernel is not going to hit -updates shortly (i.e. within days), can something be done to pull or downgrade the broken kernel? At least revert linux-image-generic to depend on linux-image-4.4.0-57-generic, which doesn't have the issue; that would stop more people from upgrading to a broken kernel.

Having this sort of break in an LTS kernel is not inspiring at all.

Eric Desrochers (slashd) wrote :

The fix is now available for testing in kernel version 4.4.0-63.84, if you enable -proposed[1]:

$ apt-cache policy linux-image-4.4.0-63-generic
linux-image-4.4.0-63-generic:
  Installed: (none)
  ==> Candidate: 4.4.0-63.84
  Version table:
     4.4.0-63.84 500
        500 http://archive.ubuntu.com/ubuntu ==>xenial-proposed/main amd64 Packages

$ apt-get changelog linux-image-4.4.0-63-generic | egrep "1655842"
 ==> * "Out of memory" errors after upgrade to 4.4.0-59 (LP: #1655842)

[1] - https://wiki.ubuntu.com/Testing/EnableProposed

- Eric
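
For reference, enabling -proposed for a one-off test like this usually amounts to two small files, per the EnableProposed wiki linked above. A sketch, with illustrative file names; the pin keeps -proposed packages from being installed unless explicitly requested:

```
# /etc/apt/sources.list.d/xenial-proposed.list (file name is illustrative)
deb http://archive.ubuntu.com/ubuntu xenial-proposed restricted main multiverse universe

# /etc/apt/preferences.d/proposed-updates (file name is illustrative)
# Priority 400 < 500, so nothing is pulled from -proposed unless asked
# for with e.g. "apt-get install -t xenial-proposed linux-image-4.4.0-63-generic"
Package: *
Pin: release a=xenial-proposed
Pin-Priority: 400
```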

Oliver O. (oliver-o456i) wrote :

Testing...

Enabled proposed (https://wiki.ubuntu.com/Testing/EnableProposed).

Installed kernel packages:

# apt-get install -s -t xenial-proposed 'linux-headers-4.4.0.63$' 'linux-headers-4.4.0.63-generic$' 'linux-image-4.4.0.63-generic$' 'linux-image-extra-4.4.0.63-generic$'

Rebooted.

# cat /proc/version_signature
Ubuntu 4.4.0-63.84-generic 4.4.44

kulwinder singh wrote :

Which is the safer kernel now, 4.4.0-57-generic or 4.4.0-63-generic?

David Glasser (glasser) wrote :

kulwinder singh: Either one, but nothing in between.

-57 will reintroduce a few (unrelated) security bugs as well as the bug https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1647400 whose fix caused this one, but is easier to enable and has been tested for longer.

-63 should fix this bug, the older bug, and the intermediary security bugs, but requires you to enable the "proposed" repository, and hasn't been tested for quite as long.

Anything in between has this bug.
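
The affected range above can be checked mechanically. A small sketch (my own helper, not an official tool) using `sort -V` to test whether a kernel base version falls in the broken window [4.4.0-59, 4.4.0-63):

```shell
# affected VERSION: prints "affected" if 4.4.0-59 <= VERSION < 4.4.0-63.
# Uses sort -V (coreutils) for version ordering; pass the base version,
# e.g. "$(uname -r)" with the "-generic" suffix trimmed off.
affected() {
    v="$1"; lo="4.4.0-59"; hi="4.4.0-63"
    ge_lo=$(printf '%s\n%s\n' "$lo" "$v" | sort -V | head -n1)  # smaller of lo, v
    lt_hi=$(printf '%s\n%s\n' "$v" "$hi" | sort -V | head -n1)  # smaller of v, hi
    if [ "$ge_lo" = "$lo" ] && [ "$lt_hi" = "$v" ] && [ "$v" != "$hi" ]; then
        echo "affected"
    else
        echo "not affected"
    fi
}
affected 4.4.0-57
affected 4.4.0-62
affected 4.4.0-63
```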

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
David Glasser (glasser) wrote :

Cascardo: Just to be clear, are you looking for verification from anyone in the world, or from specific kernel testers?

(I'd like to help, but I'm only able to reproduce the issue in production, and the process of debugging this issue when we ran into it was already more restarts than is good for my service right now (we settled on downgrading for the moment).)

David F. (malteworld) wrote :

@nick-maynard: Why is such a bug unforgivable? You can just boot a previous kernel instead. If you're concerned about availability then don't reboot in the first place unless there's an important security patch.

David Glasser (glasser) wrote :

To be fair, there have been multiple USN-reported kernel security patches fixed in post-57 kernels.

Travisgevans (travisgevans) wrote :

Don't forget that the earlier kernels are affected by Bug #1647400, which does something even worse (hangs the system). I've verified that it affected my particular system before 4.4.0-59, and it may explain a couple of lockups I had previously experienced during normal operation on earlier kernels. -59 fixes that bug but introduces the premature OOM kill issue; if it weren't for the kernels currently in proposed (assuming they indeed fix this bug), I wouldn't really have a reliable kernel to use at all.

With the 4.4.0-59 kernel, I got hit with two unexplained OOM kills, each occurring within about 3 days of uptime. I then tested the -62 kernel in proposed for just under 14 days and didn't see any OOM kills, and I've now been testing -63 for a couple of days and haven't seen any issues yet. However, it might help if anyone has an idea how the OOM kill bug might be reliably reproduced. “5 working days” isn't very long to be reliably sure the problem is solved otherwise; it took more than half that time after upgrading to -59 for me to hit the bug by chance.

On Fri, 10 Feb 2017, Travisgevans wrote:

> However, it might help if anyone has an idea how the OOM kill bug might
> be reliably reproduced. “5 working days” isn't very long to reliably be
> sure the problem is solved otherwise; it took more than half that time
> upon upgrading to -59 for me to hit the bug by chance.

I had a job (duplicity) that would oom every time under -59 and -62. With
-63 from proposed, it doesn't.

--
Nate Eldredge
<email address hidden>

Otto Wayne (ottowayne) wrote :

I see this bug on Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-1042-raspi2 armv7l) as described here: https://superuser.com/questions/1176773/ubuntu-on-rpi-starts-killing-processes-when-ram-is-filled-up-by-cache

The workaround by Krzysztof Dryja (cih997) works for me as well but is very ugly and temporary.

wurlyfan (wurlyfan) wrote :

Firefox and Insync were killed pretty reliably for me, but other packages as well. I was getting half-a-dozen oom kills a day before I switched back to 57. My second workstation is fully updated and doesn't show any sign of this issue.


Ivan Kozik (ludios) wrote :

I've been using -63 for a while now (even before it was in proposed, via an sbuild setup) on a machine that had OOM problems with -59, and I haven't noticed any issues.

Serge Victor (ser) wrote :

-63 works for me as well, thank you!

Oliver O. (oliver-o456i) on 2017-02-11
tags: added: verification-done-xenial
removed: verification-needed-xenial
Oliver O. (oliver-o456i) wrote :

Tested Ubuntu 4.4.0-63.84-generic 4.4.44 on a desktop system with a workload which previously led to Chrome processes being OOM-killed.

Situation with 4.4.0-62-generic: between 8 and 54 processes OOM-killed per 24-hour period
Situation with 4.4.0-63-generic: no OOM-kills during 46 hours of testing

Looks solved. No negative side effects encountered.

VSHN (vshn) wrote :

you should re-release 4.4.0-62 as linux-image-chaosmonkey-virtual

Javier Bernal (javierbernal) wrote :

Like Luk (#29), I upgraded to 4.8.0-34, but in my case the problem disappeared. I ran the system for two days without any OOM kills. Before that, simply copying a big file (9 GB+) would trigger it. My system has 16 GB RAM and runs 16.04.1.

To throw some more light on the matter: two machines were upgraded to 4.4.0-59 on 16th Jan, with the same load, but only one of them is reporting OOM kills. Is anybody else experiencing the same scenario?

Thanks to those who reported/fixed this bug.

Out of curiosity, was this a bug in the 4.4.0-59 kernel itself or in Ubuntu's packaging of the kernel, i.e. were other (non-Ubuntu) Linux users impacted?

Did anybody else notice a pattern in what "invoked oom-killer"? I see a pattern where cron jobs or scripts invoke the oom-killer daily at almost the same time...

Charles Wright (wrighrc) wrote :

I was curious if I could answer Sridhar's question as I had the same question.

The problem appears to have been introduced in Ubuntu's backport of selected upstream commits from 4.7 to address bug #1647400.

From the comments in that case, it appears 6b4e3181d7bd5ca5ab6f45929e4a5ffa7ab4ab7f should have also been selected.

From looking at the history, I gather that if I had been running stock 4.4 kernels I would not have been affected by the OOM issue.

I'm basing this on tracking down 0a0337e0d1d134465778a16f5cbea95086e8e9e0 in the mainline kernel.

description: updated
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-63.84

---------------
linux (4.4.0-63.84) xenial; urgency=low

  [ Thadeu Lima de Souza Cascardo ]

  * Release Tracking Bug
    - LP: #1660704

  * Backport Dirty COW patch to prevent wineserver freeze (LP: #1658270)
    - SAUCE: mm: Respect FOLL_FORCE/FOLL_COW for thp

  * Kdump through NMI SMP and single core not working on Ubuntu16.10
    (LP: #1630924)
    - x86/hyperv: Handle unknown NMIs on one CPU when unknown_nmi_panic
    - SAUCE: hv: don't reset hv_context.tsc_page on crash

  * [regression 4.8.0-14 -> 4.8.0-17] keyboard and touchscreen lost on Acer
    Chromebook R11 (LP: #1630238)
    - [Config] CONFIG_PINCTRL_CHERRYVIEW=y

  * Call trace when testing fstat stressor on ppc64el with virtual keyboard and
    mouse present (LP: #1652132)
    - SAUCE: HID: usbhid: Quirk a AMI virtual mouse and keyboard with ALWAYS_POLL

  * VLAN SR-IOV regression for IXGBE driver (LP: #1658491)
    - ixgbe: Force VLNCTRL.VFE to be set in all VMDq paths

  * "Out of memory" errors after upgrade to 4.4.0-59 (LP: #1655842)
    - mm, page_alloc: convert alloc_flags to unsigned
    - mm, compaction: change COMPACT_ constants into enum
    - mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED
    - mm, compaction: simplify __alloc_pages_direct_compact feedback interface
    - mm, compaction: distinguish between full and partial COMPACT_COMPLETE
    - mm, compaction: abstract compaction feedback to helpers
    - mm, oom: protect !costly allocations some more
    - mm: consider compaction feedback also for costly allocation
    - mm, oom, compaction: prevent from should_compact_retry looping for ever for
      costly orders
    - mm, oom: protect !costly allocations some more for !CONFIG_COMPACTION
    - mm, oom: prevent premature OOM killer invocation for high order request

  * Backport 3 patches to fix bugs with AIX clients using IBMVSCSI Target Driver
    (LP: #1657194)
    - SAUCE: ibmvscsis: Fix max transfer length
    - SAUCE: ibmvscsis: fix sleeping in interrupt context
    - SAUCE: ibmvscsis: Fix srp_transfer_data fail return code

  * NVMe: adapter is missing after abnormal shutdown followed by quick reboot,
    quirk needed (LP: #1656913)
    - nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too

  * Ubuntu 16.10 KVM SRIOV: if enable sriov while ping flood is running ping
    will stop working (LP: #1625318)
    - PCI: Do any VF BAR updates before enabling the BARs
    - PCI: Ignore BAR updates on virtual functions
    - PCI: Update BARs using property bits appropriate for type
    - PCI: Separate VF BAR updates from standard BAR updates
    - PCI: Don't update VF BARs while VF memory space is enabled
    - PCI: Remove pci_resource_bar() and pci_iov_resource_bar()
    - PCI: Decouple IORESOURCE_ROM_ENABLE and PCI_ROM_ADDRESS_ENABLE
    - PCI: Add comments about ROM BAR updating

  * Linux rtc self test fails in a VM under xenial (LP: #1649718)
    - kvm: x86: Convert ioapic->rtc_status.dest_map to a struct
    - kvm: x86: Track irq vectors in ioapic->rtc_status.dest_map
    - kvm: x86: Check dest_map->vector to match eoi signals for rtc

  * Xenial update to v4.4.44 stable releas...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Iain Buclaw (ibuclaw) wrote :

The bug is present on 4.8 kernels too.

---
[251529.693133] CPU: 3 PID: 1547 Comm: icinga2 Not tainted 4.8.0-34-generic #36~16.04.1-Ubuntu
[251529.693134] Hardware name: MSI MS-7823/B85M-G43 (MS-7823), BIOS V3.14B3 06/23/2014
[251529.693135] 0000000000000286 00000000e9fa7ede ffff95c3f774bb38 ffffffffab62d7b3
[251529.693137] ffff95c3f774bcc8 ffff95c3f23bd700 ffff95c3f774bba0 ffffffffab42e9bb
[251529.693138] ffff95c3f774bb40 0000000000000000 0000000000000000 0000000000000000
[251529.693140] Call Trace:
[251529.693145] [<ffffffffab62d7b3>] dump_stack+0x63/0x90
[251529.693148] [<ffffffffab42e9bb>] dump_header+0x5c/0x1dc
[251529.693151] [<ffffffffab3a5836>] oom_kill_process+0x226/0x3f0
[251529.693153] [<ffffffffab3a5daa>] out_of_memory+0x35a/0x3f0
[251529.693155] [<ffffffffab3ab06b>] __alloc_pages_slowpath+0x9fb/0xa20
[251529.693157] [<ffffffffab3ab34a>] __alloc_pages_nodemask+0x2ba/0x300
[251529.693160] [<ffffffffab280726>] copy_process.part.30+0x146/0x1b50
[251529.693162] [<ffffffffab95c66d>] ? sock_recvmsg+0x3d/0x50
[251529.693163] [<ffffffffab95c8aa>] ? SYSC_recvfrom+0xda/0x150
[251529.693164] [<ffffffffab282327>] _do_fork+0xe7/0x3f0
[251529.693166] [<ffffffffab95e171>] ? __sys_recvmsg+0x51/0x90
[251529.693168] [<ffffffffab2826d9>] SyS_clone+0x19/0x20
[251529.693170] [<ffffffffab203bae>] do_syscall_64+0x5e/0xc0
[251529.693174] [<ffffffffaba96625>] entry_SYSCALL64_slow_path+0x25/0x25
[251529.693174] Mem-Info:
[251529.693177] active_anon:339565 inactive_anon:133615 isolated_anon:0
                 active_file:3938458 inactive_file:328087 isolated_file:0
                 unevictable:8 dirty:200 writeback:37 unstable:0
                 slab_reclaimable:3365424 slab_unreclaimable:16102
                 mapped:9114 shmem:1459 pagetables:2462 bounce:0
                 free:49449 free_pcp:32 free_cma:0
---

Had 5 servers knocked out over the weekend.

4.8 kernels are not affected by this bug. If you see OOMs on 4.8 kernels, please file a new bug with all the relevant details and logs.

Thanks.
Cascardo.

Iain Buclaw (ibuclaw) wrote :

Yes, they are. I'm seeing the same exorbitant memory usage on 4.8.0-36 that we had on 4.4.0-58.

This bug is not about memory use. It's about the Linux kernel triggering the OOM killer when higher-order (order-2) allocations are requested and progress cannot be made. This affected Linux 4.7 on mainline and was fixed in Linux 4.7 stable and Linux 4.8. When some fixes were backported to 4.4.0-59 (4.4.0-58 was not affected), this bug was introduced to Xenial kernels; it is now fixed in 4.4.0-63. Any behavior on 4.8 kernels must be investigated separately, because all the fixes that were backported to 4.4.0-63 are already present in 4.8.

Can you please open a new bug and attach all the logs and details you can, so we can investigate your problem and provide a fix? Please do not use this bug, because the fixes would be different anyway, and even though the symptoms may look alike, we consider them different bugs.

I appreciate you opening a new bug and providing this new report.

Thanks.
Cascardo.
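
As a side note on the numbers in these traces: order=2 means the kernel wants 2^2 = 4 physically contiguous pages, which with 4 KiB pages (x86-64) is 16 KiB; that matches the traces all going through copy_process/_do_fork, where the child's kernel stack is allocated. The arithmetic, as a sketch:

```shell
# Size of an order-n buddy allocation, assuming 4 KiB pages (x86-64).
# order=2, as seen in the oom-killer traces in this thread, is
# 4 contiguous pages = 16 KiB.
order=2
page_kib=4
echo "order-$order = $(( (1 << order) * page_kib )) KiB"
```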

Iain Buclaw (ibuclaw) wrote :

This is the reported /proc/meminfo Buffers usage for 4 different kernel versions. We got the same OOM call traces on both 4.4.0-58 and 4.8.0-34, I highly doubt that to be a coincidence.

Hi, Mr. Iain Buclaw.

Memory usage reports could be related to something else. This bug was introduced in 4.4.0-59; you mention 4.4.0-58. We could certainly investigate the issue you see, just not on this bug. Much more data is necessary, but please don't attach new data to this bug. Your report relates to a different kernel, which has very different memory management code. I could ask you to test 4.4.0-63, but it is still possible that you would find problems there, because they are unrelated to this bug.

Thank you.
Cascardo.

Iain Buclaw (ibuclaw) wrote :

The OOM fixes were introduced in 4.4.0-58 according to the changelog, but sure.

Mike Williams (mdub) wrote :

I'm pretty sure this bug was introduced in 4.4.0-58, even though I reported it against 4.4.0-59. Still, +1 for a raising a separate bug against 4.8.

Thanks to Cascardo for fixing this one.

joconcepts (jonav) wrote :

Could somebody please confirm that the issue has been fixed with kernel 4.4.0-64.85? We had massive problems with OOM-killed qemu instances on our virtualization hosts and would not like to have this introduced again.

I can confirm 4.4.0-64.85 fixes our OOM issues.

Anton (azenkov) wrote :

I still see OOM killer invocation on 4.4.0-64.85:

Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857840] java invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857846] java cpuset=/ mems_allowed=0-1
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857855] CPU: 27 PID: 47820 Comm: java Tainted: G W 4.4.0-64-generic #85-Ubuntu
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857857] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/12/2016
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857860] 0000000000000286 00000000ee496386 ffff882358f13b10 ffffffff813f8083
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857867] ffff882358f13cc8 ffff883c70a8f000 ffff882358f13b80 ffffffff8120b0fe
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857871] ffffffff81cd63bf 0000000000000000 ffffffff81e677e0 0000000000000206
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857875] Call Trace:
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857885] [<ffffffff813f8083>] dump_stack+0x63/0x90
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857893] [<ffffffff8120b0fe>] dump_header+0x5a/0x1c5
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857899] [<ffffffff81192812>] oom_kill_process+0x202/0x3c0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857902] [<ffffffff81192c39>] out_of_memory+0x219/0x460
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857907] [<ffffffff81198c28>] __alloc_pages_slowpath.constprop.88+0x938/0xad0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857911] [<ffffffff81199046>] __alloc_pages_nodemask+0x286/0x2a0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857915] [<ffffffff811990fb>] alloc_kmem_pages_node+0x4b/0xc0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857921] [<ffffffff8107ea5e>] copy_process+0x1be/0x1b70
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857930] [<ffffffff811c19bd>] ? handle_mm_fault+0xcbd/0x1820
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857935] [<ffffffff81406574>] ? call_rwsem_down_read_failed+0x14/0x30
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857938] [<ffffffff810805a0>] _do_fork+0x80/0x360
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857942] [<ffffffff81080929>] SyS_clone+0x19/0x20
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857948] [<ffffffff8183c5f2>] entry_SYSCALL_64_fastpath+0x16/0x71
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857969] Mem-Info:
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857981] active_anon:32518350 inactive_anon:2099 isolated_anon:0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857981] active_file:45384948 inactive_file:45384381 isolated_file:64
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857981] unevictable:914 dirty:104 writeback:0 unstable:0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857981] slab_reclaimable:1282591 slab_unreclaimable:39566
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857981] mapped:10291496 shmem:2227 pagetables:732932 bounce:0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857981] free:272957 free_pcp:1153 free_cma:0
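The `order=2` in the trace above means the kernel needed 4 contiguous pages (16 KiB, the size of the kernel stack that fork allocates on x86-64), so an OOM can fire from fragmentation even when plenty of single pages are free. Whether such blocks exist can be read from /proc/buddyinfo; a minimal parsing sketch, using an illustrative sample line rather than data from the affected host:

```python
# Sketch: count free contiguous blocks of a given order in a
# /proc/buddyinfo line. The sample values are illustrative only.
sample = "Node 0, zone   Normal   4381   2314    120      0      0      0      0      0      0      0      0"

def free_blocks_at_order(buddyinfo_line, order):
    # Columns after "zone <name>" are free-block counts for orders 0..10.
    counts = buddyinfo_line.split()[4:]
    return int(counts[order])

# A failed order=2 allocation (as in the trace above) suggests this
# column was at or near zero on the affected machine at the time.
print(free_blocks_at_order(sample, 2))  # prints 120
```

Reading the real file (`open("/proc/buddyinfo")`) line by line on a live system gives one such row per zone.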

Julian Kassat (j.kassat) wrote :

Same here on Linux 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux.

Out of memory: Kill process 11067 (java) score 28 or sacrifice child
Killed process 11067 (java) total-vm:3569724kB, anon-rss:211720kB, file-rss:20208kB
systemd-journald[247]: /dev/kmsg buffer overrun, some messages lost.
swap_free: Bad swap file entry 2000000000000000
BUG: Bad page map in process java pte:00000020 pmd:dac28067
addr:00007fc34ce69000 vm_flags:08000071 anon_vma: (null) mapping: (null) index:7fc34ce69
file: (null) fault: (null) mmap: (null) readpage: (null)
CPU: 1 PID: 11108 Comm: java Tainted: G B D 4.4.0-64-generic #85-Ubuntu
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
 0000000000000286 00000000f2f56de1 ffff8800d8ceba58 ffffffff813f8083
 00007fc34ce69000 ffff8800dac023e8 ffff8800d8cebaa8 ffffffff811be06f
 ffff8800d8ceba80 ffffffff811d518c 2000000000000000 0000000000000020
Call Trace:
 [<ffffffff813f8083>] dump_stack+0x63/0x90
 [<ffffffff811be06f>] print_bad_pte+0x1df/0x2a0
 [<ffffffff811d518c>] ? swap_info_get+0x7c/0xd0
 [<ffffffff811bf9f8>] unmap_page_range+0x468/0x7a0
 [<ffffffff811bfdad>] unmap_single_vma+0x7d/0xe0
 [<ffffffff811c0871>] unmap_vmas+0x51/0xa0
 [<ffffffff811c9df7>] exit_mmap+0xa7/0x170
 [<ffffffff8107e0a7>] mmput+0x57/0x130
 [<ffffffff81083f2a>] do_exit+0x27a/0xb00
 [<ffffffff8110046c>] ? __unqueue_futex+0x2c/0x60
 [<ffffffff81100f8e>] ? futex_wait+0x16e/0x280
 [<ffffffff81084833>] do_group_exit+0x43/0xb0
 [<ffffffff810909b2>] get_signal+0x292/0x600
 [<ffffffff8102e567>] do_signal+0x37/0x6f0
 [<ffffffff8122fb84>] ? mntput+0x24/0x40
 [<ffffffff81210ba0>] ? __fput+0x190/0x220
 [<ffffffff8100320c>] exit_to_usermode_loop+0x8c/0xd0
 [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
 [<ffffffff8183c750>] int_ret_from_sys_call+0x25/0x8f
BUG: Bad rss-counter state mm:ffff8800d8bc8800 idx:2 val:-1
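When triaging many hosts, "Killed process" lines like those pasted in this thread can be extracted mechanically. A small sketch (the regex mirrors the log format quoted above; the sample text is copied from this comment):

```python
import re

# Sample taken verbatim from the kern.log excerpt above.
log = """Out of memory: Kill process 11067 (java) score 28 or sacrifice child
Killed process 11067 (java) total-vm:3569724kB, anon-rss:211720kB, file-rss:20208kB"""

# Pull out the victim's PID, command name, and memory figures (kB).
m = re.search(
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\) "
    r"total-vm:(?P<vm>\d+)kB, anon-rss:(?P<anon>\d+)kB, file-rss:(?P<file>\d+)kB",
    log)

rss_kb = int(m.group("anon")) + int(m.group("file"))
print(m.group("comm"), rss_kb)  # prints: java 231928
```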

Hi, Anton and Julian.

Can you attach complete logs for investigation?

Thanks.
Cascardo.

Julian Kassat (j.kassat) wrote :

Attached kern.log. Let me know in case you need more logs.

Hi, Julian.

Do you have the output of dmesg after the incident before a reboot?

Cascardo.

Julian, your logs indicate some possible swap corruption, would you mind opening a new bug and sending it using apport-bug?

Thanks.
Cascardo.

Julian Kassat (j.kassat) wrote :

Hi Cascardo,

there is no related dmesg output after the incident (just some lines from apt-daily.timer).

I filed a bug for the possible swap corruption issue here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1669707

Thanks so far.

Julian

Pete Cheslock (pete-cheslock) wrote :

We have been seeing this issue recently as well. We are running 4.4.0-66-generic #87-Ubuntu. I can attempt to downgrade to 4.4.0-57, but it's a large cluster with a lot of data, so it may take some time. Attached is a kern.log from the most recent OOM.

I am seeing this issue as well, on the Arch kernel 4.10.4-1.

Michael Dye (dye.michael) wrote :

This is plaguing Horizon project Pi2 and Pi3 devices running Xenial 16.04.2 w/ kernel 4.4.0-1050-raspi2. From a pi2:

root@horizon-00000000a17d2187:~# uname -a
Linux horizon-00000000a17d2187 4.4.0-1050-raspi2 #57-Ubuntu SMP Wed Mar 22 12:52:22 UTC 2017 armv7l armv7l armv7l GNU/Linux
root@horizon-00000000a17d2187:~# free
              total        used        free      shared  buff/cache   available
Mem:         942128      149548       35456      494084      757124      239716
Swap:             0           0           0

Under these circumstances, the kernel's oom-killer will kill Wi-Fi processes (rtl_rpcd), systemd-udevd, our Ethereum client (geth), and other critical processes in an attempt to stay afloat, rather than reclaiming RAM that should be reclaimable.
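For context on the `free` output above: buff/cache is mostly reclaimable page cache, but the shared (shmem/tmpfs) portion is not, which is why `available` (239716 kB) sits well below free + buff/cache. A back-of-the-envelope sketch with the numbers from this report (a crude upper bound, not the kernel's actual MemAvailable heuristic, which also subtracts low watermarks):

```python
# All values in kB, copied from the Pi2 "free" output in this thread.
mem_free = 35456
buff_cache = 757124
shared = 494084          # shmem/tmpfs pages are NOT freely reclaimable

# Page cache minus shmem is roughly the part the kernel could drop;
# the kernel's own MemAvailable estimate (239716 kB here) is lower
# because it applies watermarks and more conservative accounting.
upper_bound = mem_free + (buff_cache - shared)
print(upper_bound)  # prints 298496
```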

I was using 4.4.0-21 (as reported by `uname -r`), which is the default in Kubuntu 16.04. The same bug appears on mainline kernel 4.10 too!

Now I'm confused: which kernel should I upgrade to? Also, I experience this only in a KDE session with the Yandex or Chrome browser open.

iKazmi (alikazmi-2040) wrote :

I have kernels 4.4.0-59 through 4.4.0-71 and 4.8.0-41 through 4.8.0-46 installed on my system, and all are affected by this bug. Firefox, Chrome, and Netbeans regularly get killed without warning and for no reason (I have something like 10 GB+ of RAM and all 16 GB of swap free at the time the process gets killed). Even KDE has been killed a couple of times while the system still had over 6 GB of RAM and 16 GB of swap free.

Yesterday, after the umpteenth time Netbeans was killed while I was in the middle of doing something, I finally decided to do something about this problem and installed kernel 4.10.9-041009 from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10.9/. Sadly, that doesn't seem to resolve the problem either, and the OOM killer is still overeager to kill user processes (Firefox and Netbeans have both been killed multiple times). At least KDE hasn't been killed so far.

Has anybody successfully tested 4.4.0.63 for the OOM-kill issue?

Anton (anton-powershop) wrote :

Yes, 4.4.0.63 solved our OOM issues (and we had plenty after 4.4.0.59). Ours were all headless servers (bare metal and VMs), though - no desktop usage.

I never experienced this issue on my home laptop either, but it had lots of RAM and was only lightly used during that period - not really a good data point.

Travisgevans (travisgevans) wrote :

I also haven't personally encountered any further OOM issues on my home desktop (used daily) with 4.4.0.63.

I'd like to emphasise that the OOM problem only happens with KDE. I have several DEs installed, including Unity, GNOME 3, and Cinnamon, but none of them caused an OOM, at least that I noticed. In KDE, however, most of the time when Chrome is open it triggers an OOM; dmesg shows that sometimes kwin_x11 or plasmashell invoked the OOM killer.

Most of the time plasmashell crashes and the open tab in Chrome is killed, although the Chrome application itself keeps running. I then need to restart plasmashell by pressing Alt-F2 to bring up the run-command dialog and typing plasmashell there.

Last night, even Firefox triggered an OOM.

I'm attaching a dmesg log in the hope that it will be helpful.
