"Out of memory" errors after upgrade to 4.4.0-59

Bug #1655842 reported by Mike Williams
This bug affects 101 people
Affects                  Status         Importance  Assigned to                     Milestone
linux (Ubuntu)           Fix Released   High        Thadeu Lima de Souza Cascardo
  Xenial                 Fix Released   High        Thadeu Lima de Souza Cascardo
linux-aws (Ubuntu)       Confirmed      Undecided   Unassigned
  Xenial                 Confirmed      Undecided   Unassigned
linux-raspi2 (Ubuntu)    Fix Committed  Undecided   Paolo Pisati
  Xenial                 Fix Committed  Undecided   Unassigned

Bug Description

After the fix for LP #1647400 (a bug that caused freezes under some workloads), some users started noticing regular OOMs. Those OOMs were reported under this bug and were fixed after some releases.

Some of the affected kernels are documented below. To check your particular kernel, read its changelog and look for 1655842 and 1647400 (see the example below). If it has the fix for 1647400 but not the fix for 1655842, then it is affected.
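
For example, a quick check along those lines (a sketch, assuming the changelog for the installed image package is available through apt, as in the apt-get changelog example later in this report):

$ apt-get changelog linux-image-$(uname -r) | grep -E '1655842|1647400'
# affected if the output mentions 1647400 but not 1655842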

You may still notice regressions compared to kernels that did not have the fixes for either bug. However, reverting all the fixes would bring the freeze bug back, so that is not a viable way forward.

If you see any regressions, mainly in the form of OOMs, please report a new bug. Different workloads may require different solutions, and further fixes may be needed, whether upstream or not. The best way to get such fixes applied is to report the problem under a new bug that can be verified: being able to reproduce the bug makes it possible to verify that the fixes really fix it.

Kernels affected:

linux 4.4.0-58, 4.4.0-59, 4.4.0-60, 4.4.0-61, 4.4.0-62.
linux-raspi2 4.4.0-1039 to 4.4.0-1042 and 4.4.0-1044 to 4.4.0-1071

Particular kernels NOT affected by THIS bug:

linux-aws

To reiterate, if you find an OOM with an affected kernel, please upgrade.
If you find an OOM with a non-affected kernel, please report a new bug. We want to investigate it and fix it.
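
A hedged note on filing that new bug: the standard Ubuntu apport tooling collects the kernel logs automatically, for example:

$ ubuntu-bug linux
# (apport-bug linux is equivalent); attach the full dmesg output from after the OOM if it is not picked up automatically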

===================
I recently replaced some Xenial servers, and started experiencing "Out of memory" problems with the default kernel.

We bake Amazon AMIs based on an official Ubuntu-provided image (ami-e6b58e85, in ap-southeast-2, from https://cloud-images.ubuntu.com/locator/ec2/). Previous versions of our AMI included "4.4.0-57-generic", but the latest version picked up "4.4.0-59-generic" as part of a "dist-upgrade".

Instances booted using the new AMI have been using more memory, and experiencing OOM issues - sometimes during boot, and sometimes a while afterwards. An example from the system log is:

[ 130.113411] cloud-init[1560]: Cloud-init v. 0.7.8 running 'modules:final' at Wed, 11 Jan 2017 22:07:53 +0000. Up 29.28 seconds.
[ 130.124219] cloud-init[1560]: Cloud-init v. 0.7.8 finished at Wed, 11 Jan 2017 22:09:35 +0000. Datasource DataSourceEc2. Up 130.09 seconds
[29871.137128] Out of memory: Kill process 2920 (ruby) score 107 or sacrifice child
[29871.140816] Killed process 2920 (ruby) total-vm:675048kB, anon-rss:51184kB, file-rss:2164kB
[29871.449209] Out of memory: Kill process 3257 (splunkd) score 97 or sacrifice child
[29871.453282] Killed process 3258 (splunkd) total-vm:66272kB, anon-rss:6676kB, file-rss:0kB
[29871.677910] Out of memory: Kill process 2647 (fluentd) score 51 or sacrifice child
[29871.681872] Killed process 2647 (fluentd) total-vm:117944kB, anon-rss:23956kB, file-rss:1356kB

I have a hunch that this may be related to the fix for https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1647400, introduced in linux (4.4.0-58.79).

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-59-generic 4.4.0-59.80
ProcVersionSignature: User Name 4.4.0-59.80-generic 4.4.35
Uname: Linux 4.4.0-59-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jan 12 06:29 seq
 crw-rw---- 1 root audio 116, 33 Jan 12 06:29 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.1-0ubuntu2.4
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Thu Jan 12 06:38:45 2017
Ec2AMI: ami-0f93966c
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: ap-southeast-2a
Ec2InstanceType: t2.nano
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb: Error: command ['lsusb'] failed with exit code 1:
MachineType: Xen HVM domU
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 cirrusdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-59-generic root=UUID=fb0fef08-f3c5-40bf-9776-f7ba00fe72be ro console=tty1 console=ttyS0
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-59-generic N/A
 linux-backports-modules-4.4.0-59-generic N/A
 linux-firmware 1.157.6
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 12/09/2016
dmi.bios.vendor: Xen
dmi.bios.version: 4.2.amazon
dmi.chassis.type: 1
dmi.chassis.vendor: Xen
dmi.modalias: dmi:bvnXen:bvr4.2.amazon:bd12/09/2016:svnXen:pnHVMdomU:pvr4.2.amazon:cvnXen:ct1:cvr:
dmi.product.name: HVM domU
dmi.product.version: 4.2.amazon
dmi.sys.vendor: Xen

Revision history for this message
Mike Williams (mdub) wrote :
Revision history for this message
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → High
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built a Xenial test kernel with the following two commits reverted:

c630ec12d831 mm, oom: rework oom detection
57e9ef475661 mm: throttle on IO only when there are too many dirty and writeback pages

This test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1655842/

Can you test this kernel and see if it resolves this bug?

Thanks in advance!
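
A minimal sketch of installing such a test kernel, assuming the image and extra .debs for your architecture have been downloaded from the link above (the filename globs are placeholders for whatever is actually published there):

$ sudo dpkg -i ./linux-image-4.4.0-*.deb ./linux-image-extra-4.4.0-*.deb
$ sudo reboot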

Revision history for this message
Fabian Grünbichler (f-gruenbichler) wrote :

you could also try cherry-picking https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=6b4e3181d7bd5ca5ab6f45929e4a5ffa7ab4ab7f , but that will probably need some more inbetween patches as well..

reverting the two commits fixed the issue for our users (Proxmox VE, which uses a kernel based on the 4.4.x one from 16.04)

Revision history for this message
David F. (malteworld) wrote :

@f-gruenbichler: I already tried to cherry-pick that patch a while ago and it doesn't work because that patch is based on work that isn't in the 4.4.* kernel branch, not even including Canonical's backports from later branches.

Revision history for this message
Mike Williams (mdub) wrote :

Thanks jsalisbury. We have deployed using your test kernel (from http://kernel.ubuntu.com/~jsalisbury/lp1655842/), and experienced no OOM issues.

Revision history for this message
Allen Wild (aswild) wrote :

I manage a set of build servers for CPU/IO intensive builds using Yocto/OpenEmbedded. Ubuntu 14.04.5 with the 4.4 Xenial kernel. After updating to 4.4.0-59 the builds started failing because of the OOM killer.

Rolling back to 4.4.0-57 fixed the OOMs for me.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

Can you try the kernel at [1], which includes the patches that are also at [1]?

[1] http://people.canonical.com/~cascardo/lp1655842/

Thanks.
Cascardo.

Changed in linux (Ubuntu):
assignee: nobody → Thadeu Lima de Souza Cascardo (cascardo)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Thadeu Lima de Souza Cascardo (cascardo)
Revision history for this message
Stéphane Graber (stgraber) wrote :

Just a note that Joe's armhf kernel has been working well for me.

I can't test cascardo's kernel as it's not built for armhf.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

I will upload armhf binaries for those kernels and let you know. It's important to try those because they include an alternative solution that we would rather use instead of the one with the reverted patches.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :
Revision history for this message
Danny B (danny.b) wrote :

Using Cascardo's kernel fixes the problem for me.

It was a bit of a hassle to install though because there's no linux-headers-4.4.0-62_4.4.0-62.83_all.deb at the link and linux-headers-generic depends on it.

Here's where to find it:
amd64: https://launchpad.net/ubuntu/xenial/amd64/linux-headers-4.4.0-62/4.4.0-62.83
armhf: https://launchpad.net/ubuntu/xenial/armhf/linux-headers-4.4.0-62/4.4.0-62.83

Ben French (octoamit)
Changed in linux (Ubuntu):
status: Triaged → In Progress
Revision history for this message
Stéphane Graber (stgraber) wrote :

I've had a few armhf systems running cascardo's kernel and so far no sign of the OOM or any other problem with it.

Revision history for this message
Mike Williams (mdub) wrote :

Cascardo: we've tried your test kernel, and it looks good - we've seen no OOM problems.

Revision history for this message
Cris (cristianpeguero25) wrote :

Hi, I'd like to install Cascardo's kernel since I've been having the same issue, though strangely not on all of
the xenial machines running 4.4.0-59-generic.
Could someone tell me how to install Cascardo's kernel without completely messing up my machine?

Thanks

Revision history for this message
xb5i7o (xb5i7o) wrote :

Hi, I am having the exact same issue on a PC with 18GB of RAM, kernel 4.4.0-59-generic.

Please can this be fixed as soon as possible in the next kernel update?

It's killing processes such as Firefox and VirtualBox for no good reason while only about 4GB is actually in use.

I hope this can be fixed soon; it's becoming worse as time passes.

Revision history for this message
Eric Desrochers (slashd) wrote :

The patchset[1] for bug "LP #1655842" was submitted on Jan 24th 2017 and acked by the kernel team the same day[2].

The patch should be part of the following kernel release cycle:

cycle: 27-Jan through 18-Feb[3]
====
27-Jan Last day for kernel commits for this cycle
30-Jan - 04-Feb Kernel prep week.
05-Feb - 17-Feb Bug verification & Regression testing..
20-Feb Release to -updates.
====

[1] - "Xenial PATCH 00/11] Fixes OOM for LP #1655842"
[2] - "ACK: [Xenial PATCH 00/11] Fixes OOM for LP #1655842"
[3] - https://wiki.ubuntu.com/KernelTeam/Newsletter

- Eric

Changed in linux (Ubuntu Xenial):
status: Triaged → In Progress
Revision history for this message
Eric Desrochers (slashd) wrote :

Additional note:

Applied in master-next on Jan 26th 2017[1]

[1] - "APPLIED: [Xenial PATCH 00/11] Fixes OOM for LP #1655842"

- Eric

Eric Desrochers (slashd)
tags: added: sts
Revision history for this message
Gaudenz Steinlin (gaudenz-debian) wrote :

@slashd It sounds really strange to me that I should wait until 20-Feb for a fix for this bug when this is clearly a regression introduced with the latest kernel upgrade. Is there no way to speed things up to fix this regression?

For now we have had to downgrade all our xenial systems to linux-image-4.4.0-57-generic to avoid this bug.

Gaudenz

Revision history for this message
Eric Desrochers (slashd) wrote :

@Gaudenz Steinlin (gaudenz-debian),

It will take 3 weeks to land in the -updates pocket, but you can expect a call for testing of a proposed package by EOW.

- Eric

Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu):
status: In Progress → Fix Released
Revision history for this message
Luk van den Borne (luk-vandenborne) wrote :

This is a severe bug. It should be treated as a high-priority bugfix that cannot wait 3 weeks.

Revision history for this message
Nate Eldredge (nate-thatsmathematics) wrote :

Just as a note for newcomers reading this, I can confirm the bug is NOT fixed in the officially released 4.4.0-62.83.

Revision history for this message
Krzysztof Dryja (cih997) wrote :

I could not reboot my machine and the ugly workaround for this issue was to login as root and clear system caches:

echo 3 > /proc/sys/vm/drop_caches

This made my machine stable again, at least for the time I needed.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

This is fixed in 4.4.0-63.84, which will be available in proposed soon.

Revision history for this message
Shelby Cain (alyandon) wrote :

@nate Thank you! You just saved me a lot of hassle as I was about to unpin the 4.4.0-57 kernel and update a bunch of machines on the assumption the fix was in that version.
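
For reference, one way to keep apt from pulling a newer (still affected) kernel until the fix lands is to hold the kernel meta-packages (a sketch, not an official recommendation; remember to unhold once the fixed kernel is released):

$ sudo apt-mark hold linux-generic linux-image-generic linux-headers-generic
$ sudo apt-mark unhold linux-generic linux-image-generic linux-headers-generic   # later, to resume kernel updates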

Revision history for this message
Sebastian Unger (sebunger44) wrote :

As a note: I believe this also affects the armhf kernel 4.4.0-1040-raspi2 for the Raspberry Pi.

Revision history for this message
David Glasser (glasser) wrote :

I've been struggling with this bug for nearly a week and only now found this issue. Thanks for fixing it!

For the sake of others finding it, here's the stack trace part of the oom-killer log, which contains some terms I searched for a while ago that aren't mentioned here yet.

docker invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=-1000
docker cpuset=/ mems_allowed=0
CPU: 11 PID: 4472 Comm: docker Tainted: G W 4.4.0-62-generic #83-Ubuntu
Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
 0000000000000286 0000000057f64c94 ffff880dfb5efaf0 ffffffff813f7c63
 ffff880dfb5efcc8 ffff880fbfda0000 ffff880dfb5efb60 ffffffff8120ad4e
 ffffffff81cd2d7f 0000000000000000 ffffffff81e67760 0000000000000206
Call Trace:
 [<ffffffff813f7c63>] dump_stack+0x63/0x90
 [<ffffffff8120ad4e>] dump_header+0x5a/0x1c5
 [<ffffffff811926c2>] oom_kill_process+0x202/0x3c0
 [<ffffffff81192ae9>] out_of_memory+0x219/0x460
 [<ffffffff81198a5d>] __alloc_pages_slowpath.constprop.88+0x8fd/0xa70
 [<ffffffff81198e56>] __alloc_pages_nodemask+0x286/0x2a0
 [<ffffffff81198f0b>] alloc_kmem_pages_node+0x4b/0xc0
 [<ffffffff8107ea5e>] copy_process+0x1be/0x1b70
 [<ffffffff8139225c>] ? apparmor_file_alloc_security+0x5c/0x220
 [<ffffffff811ed04a>] ? kmem_cache_alloc+0x1ca/0x1f0
 [<ffffffff81348263>] ? security_file_alloc+0x33/0x50
 [<ffffffff810caeb1>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
 [<ffffffff810805a0>] _do_fork+0x80/0x360
 [<ffffffff81080929>] SyS_clone+0x19/0x20
 [<ffffffff818385f2>] entry_SYSCALL_64_fastpath+0x16/0x71

Revision history for this message
Hajo Locke (hajo-locke) wrote :

When will this new kernel be released? This bug is killing our MySQL servers. Booting old kernels is only a poor workaround. I think a lot of people with busy servers will have a problem.

This is the 2nd time we have been hit by a big bug within a short time. In October 2016 our nameservers had problems because of bug 1634892.
Is LTS Ubuntu still the right system for servers?

Revision history for this message
Luk van den Borne (luk-vandenborne) wrote :

This bug also appears to affect linux-image-4.8.0-34-generic in 16.04.1 Xenial.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

Hi, Luk.

linux-image-4.8.0-34-generic should not be affected by this. If you see unexpected OOM problems, please open a new bug report and attach the kernel logs.

Thanks.
Cascardo.

Revision history for this message
xb5i7o (xb5i7o) wrote :

Just by the way, 4.4.0-62-generic has the exact same problem. Even after uninstalling 4.4.0-59-generic, my system at some point auto-updated to 4.4.0-62-generic. Only 4.4.0-57-generic is safe for now.

Revision history for this message
Nick Maynard (nick-maynard) wrote :

LTS Ubuntu with -updates shouldn't have this sort of issue - this is, frankly, unforgivable.

We need a new kernel urgently in -updates, and I'd expect serious discussions within the kernel team to understand what has caused this issue and avoid it reoccurring.

Revision history for this message
Anton Piatek (anton-piatek) wrote :

If this kernel is not going to hit -updates shortly (i.e. days), can something be done to pull or downgrade the broken kernel? At least revert linux-image-generic to depend on linux-image-4.4.0-57-generic, which doesn't have the issue; that would stop more people from upgrading to a broken kernel.

Having this sort of break in an LTS kernel is not inspiring at all.

Revision history for this message
Eric Desrochers (slashd) wrote :

The fix is now available for testing in kernel version 4.4.0-63.84, if you enable proposed[1]

$ apt-cache policy linux-image-4.4.0-63-generic
linux-image-4.4.0-63-generic:
  Installed: (none)
  ==> Candidate: 4.4.0-63.84
  Version table:
     4.4.0-63.84 500
        500 http://archive.ubuntu.com/ubuntu ==>xenial-proposed/main amd64 Packages

$ apt-get changelog linux-image-4.4.0-63-generic | egrep "1655842"
 ==> * "Out of memory" errors after upgrade to 4.4.0-59 (LP: #1655842)

[1] - https://wiki.ubuntu.com/Testing/EnableProposed

- Eric
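
A sketch of enabling -proposed for this test (see the wiki page in [1] for the recommended pinning so that only the packages you select come from -proposed):

$ echo 'deb http://archive.ubuntu.com/ubuntu xenial-proposed restricted main multiverse universe' | sudo tee /etc/apt/sources.list.d/xenial-proposed.list
$ sudo apt-get update
$ sudo apt-get install -t xenial-proposed linux-image-4.4.0-63-generic linux-image-extra-4.4.0-63-generic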

Revision history for this message
Oliver O. (oliver-o456i) wrote :

Testing...

Enabled proposed (https://wiki.ubuntu.com/Testing/EnableProposed).

Installed kernel packages:

# apt-get install -s -t xenial-proposed 'linux-headers-4.4.0.63$' 'linux-headers-4.4.0.63-generic$' 'linux-image-4.4.0.63-generic$' 'linux-image-extra-4.4.0.63-generic$'

Rebooted.

# cat /proc/version_signature
Ubuntu 4.4.0-63.84-generic 4.4.44

Revision history for this message
kulwinder singh (kulwinder-careers) wrote :

Which one is safer now, 4.4.0-57-generic or 4.4.0-63-generic?

Revision history for this message
David Glasser (glasser) wrote :

kulwinder singh: Either one, but nothing in between.

-57 will reintroduce a few (unrelated) security bugs as well as the bug https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1647400 whose fix caused this one, but is easier to enable and has been tested for longer.

-63 should fix this bug, the older bug, and the intermediary security bugs, but requires you to enable the "proposed" repository, and hasn't been tested for quite as long.

Anything in between has this bug.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Revision history for this message
David Glasser (glasser) wrote :

Cascardo: Just to be clear, are you looking for verification from anyone in the world, or from specific kernel testers?

(I'd like to help, but I'm only able to reproduce the issue in production, and the process of debugging this issue when we ran into it was already more restarts than is good for my service right now (we settled on downgrading for the moment).)

Revision history for this message
David F. (malteworld) wrote :

@nick-maynard: Why is such a bug unforgivable? You can just boot a previous kernel instead. If you're concerned about availability then don't reboot in the first place unless there's an important security patch.

Revision history for this message
David Glasser (glasser) wrote :

To be fair, there have been multiple USN-reported kernel security patches fixed in post-57 kernels.

Revision history for this message
Travisgevans (travisgevans) wrote :

Don't forget that the earlier kernels are affected by Bug #1647400, which does something even worse (hang the system). I've verified that it affected my particular system before 4.4.0-59, and it may explain a couple of lockups I had previously experienced during normal operation when using previous kernels. -59 fixes the bug but introduces the premature OOM kill issue; if it weren't for the kernels currently in proposed (assuming they indeed fix this bug), I wouldn't really have a reliable kernel at all to use.

With the 4.4.0-59 kernel, I got hit with two unexplained OOM kills, each occurring within about 3 days of uptime. I then tested the -62 kernel in proposed for just under 14 days and didn't see any OOM kills, and I've now been testing -63 for a couple of days and haven't seen any issues yet. However, it might help if anyone has an idea how the OOM kill bug might be reliably reproduced. “5 working days” isn't very long to reliably be sure the problem is solved otherwise; it took more than half that time upon upgrading to -59 for me to hit the bug by chance.

Revision history for this message
Nate Eldredge (nate-thatsmathematics) wrote : Re: [Bug 1655842] Re: "Out of memory" errors after upgrade to 4.4.0-59

On Fri, 10 Feb 2017, Travisgevans wrote:

> However, it might help if anyone has an idea how the OOM kill bug might
> be reliably reproduced. “5 working days” isn't very long to reliably be
> sure the problem is solved otherwise; it took more than half that time
> upon upgrading to -59 for me to hit the bug by chance.

I had a job (duplicity) that would oom every time under -59 and -62. With
-63 from proposed, it doesn't.

--
Nate Eldredge
<email address hidden>

Revision history for this message
Otto Wayne (ottowayne) wrote :

I see this bug on Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-1042-raspi2 armv7l) as described here: https://superuser.com/questions/1176773/ubuntu-on-rpi-starts-killing-processes-when-ram-is-filled-up-by-cache

The workaround by Krzysztof Dryja (cih997) works for me as well but is very ugly and temporary.

Revision history for this message
wurlyfan (wurlyfan) wrote :

Firefox and Insync were killed pretty reliably for me, but other programs were as well. I was getting half a dozen OOM kills a day before I switched back to -57. My second workstation is fully updated and doesn't show any sign of this issue.

Revision history for this message
Ivan Kozik (ludios) wrote :

I've been using -63 for a while now (even before it was in proposed, via an sbuild setup) on a machine that had OOM problems with -59, and I haven't noticed any issues.

Revision history for this message
Serge Victor (ser) wrote :

-63 works for me as well, thank you!

Oliver O. (oliver-o456i)
tags: added: verification-done-xenial
removed: verification-needed-xenial
Revision history for this message
Oliver O. (oliver-o456i) wrote :

Tested Ubuntu 4.4.0-63.84-generic 4.4.44 on a desktop system with a workload which previously led to Chrome processes being OOM-killed.

Situation with 4.4.0-62-generic: between 8 and 54 processes OOM-killed per 24-hour period
Situation with 4.4.0-63-generic: no OOM-kills during 46 hours of testing

Looks solved. No negative side-effects encountered.

Revision history for this message
VSHN (vshn) wrote :

you should re-release 4.4.0-62 as linux-image-chaosmonkey-virtual

Revision history for this message
Javier Bernal (javierbernal) wrote :

Like Luk (#29), I upgraded to 4.8.0-34, but the problem disappeared for me. I ran the system for two days without any OOM kills. Before that, simply copying a big file (9GB+) would trigger it. My system has 16GB of RAM and runs 16.04.1.

Revision history for this message
kulwinder singh (kulwinder-careers) wrote :

To throw some more light on the matter: two machines were upgraded to 4.4.0-59 on 16th Jan with the same load, but only one of them is reporting OOM kills. Is anybody experiencing the same scenario?

Revision history for this message
Sridhar Chandramouli (ridsharc) wrote :

Thanks to those who reported/fixed this bug.

Out of curiosity, was this a bug in the 4.4.0-59 kernel itself or in Ubuntu's packaging of the kernel, i.e. were other (non-Ubuntu) Linux users impacted?

Revision history for this message
kulwinder singh (kulwinder-careers) wrote :

Did anybody also notice a pattern in what "invoked oom-killer"? I see a pattern where cron jobs or scripts invoke the oom-killer daily at almost the same time.

Revision history for this message
Charles Wright (wrighrc) wrote :

I was curious if I could answer Sridhar's question as I had the same question.

The introduction of the problem appears to be in Ubuntu's packaging of select upstream commits from 4.7 to address bug #1647400.

From the comments in that case, it appears 6b4e3181d7bd5ca5ab6f45929e4a5ffa7ab4ab7f should have also been selected.

From looking at the history I gather that if I had been running stock 4.4 kernels I would not have been affected by the OOM issue.

I'm basing this on tracking down 0a0337e0d1d134465778a16f5cbea95086e8e9e0 in the mainline kernel.

description: updated
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-63.84

---------------
linux (4.4.0-63.84) xenial; urgency=low

  [ Thadeu Lima de Souza Cascardo ]

  * Release Tracking Bug
    - LP: #1660704

  * Backport Dirty COW patch to prevent wineserver freeze (LP: #1658270)
    - SAUCE: mm: Respect FOLL_FORCE/FOLL_COW for thp

  * Kdump through NMI SMP and single core not working on Ubuntu16.10
    (LP: #1630924)
    - x86/hyperv: Handle unknown NMIs on one CPU when unknown_nmi_panic
    - SAUCE: hv: don't reset hv_context.tsc_page on crash

  * [regression 4.8.0-14 -> 4.8.0-17] keyboard and touchscreen lost on Acer
    Chromebook R11 (LP: #1630238)
    - [Config] CONFIG_PINCTRL_CHERRYVIEW=y

  * Call trace when testing fstat stressor on ppc64el with virtual keyboard and
    mouse present (LP: #1652132)
    - SAUCE: HID: usbhid: Quirk a AMI virtual mouse and keyboard with ALWAYS_POLL

  * VLAN SR-IOV regression for IXGBE driver (LP: #1658491)
    - ixgbe: Force VLNCTRL.VFE to be set in all VMDq paths

  * "Out of memory" errors after upgrade to 4.4.0-59 (LP: #1655842)
    - mm, page_alloc: convert alloc_flags to unsigned
    - mm, compaction: change COMPACT_ constants into enum
    - mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED
    - mm, compaction: simplify __alloc_pages_direct_compact feedback interface
    - mm, compaction: distinguish between full and partial COMPACT_COMPLETE
    - mm, compaction: abstract compaction feedback to helpers
    - mm, oom: protect !costly allocations some more
    - mm: consider compaction feedback also for costly allocation
    - mm, oom, compaction: prevent from should_compact_retry looping for ever for
      costly orders
    - mm, oom: protect !costly allocations some more for !CONFIG_COMPACTION
    - mm, oom: prevent premature OOM killer invocation for high order request

  * Backport 3 patches to fix bugs with AIX clients using IBMVSCSI Target Driver
    (LP: #1657194)
    - SAUCE: ibmvscsis: Fix max transfer length
    - SAUCE: ibmvscsis: fix sleeping in interrupt context
    - SAUCE: ibmvscsis: Fix srp_transfer_data fail return code

  * NVMe: adapter is missing after abnormal shutdown followed by quick reboot,
    quirk needed (LP: #1656913)
    - nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too

  * Ubuntu 16.10 KVM SRIOV: if enable sriov while ping flood is running ping
    will stop working (LP: #1625318)
    - PCI: Do any VF BAR updates before enabling the BARs
    - PCI: Ignore BAR updates on virtual functions
    - PCI: Update BARs using property bits appropriate for type
    - PCI: Separate VF BAR updates from standard BAR updates
    - PCI: Don't update VF BARs while VF memory space is enabled
    - PCI: Remove pci_resource_bar() and pci_iov_resource_bar()
    - PCI: Decouple IORESOURCE_ROM_ENABLE and PCI_ROM_ADDRESS_ENABLE
    - PCI: Add comments about ROM BAR updating

  * Linux rtc self test fails in a VM under xenial (LP: #1649718)
    - kvm: x86: Convert ioapic->rtc_status.dest_map to a struct
    - kvm: x86: Track irq vectors in ioapic->rtc_status.dest_map
    - kvm: x86: Check dest_map->vector to match eoi signals for rtc

  * Xenial update to v4.4.44 stable releas...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Revision history for this message
Iain Buclaw (iainb) wrote :

The bug is present on 4.8 kernels too.

---
[251529.693133] CPU: 3 PID: 1547 Comm: icinga2 Not tainted 4.8.0-34-generic #36~16.04.1-Ubuntu
[251529.693134] Hardware name: MSI MS-7823/B85M-G43 (MS-7823), BIOS V3.14B3 06/23/2014
[251529.693135] 0000000000000286 00000000e9fa7ede ffff95c3f774bb38 ffffffffab62d7b3
[251529.693137] ffff95c3f774bcc8 ffff95c3f23bd700 ffff95c3f774bba0 ffffffffab42e9bb
[251529.693138] ffff95c3f774bb40 0000000000000000 0000000000000000 0000000000000000
[251529.693140] Call Trace:
[251529.693145] [<ffffffffab62d7b3>] dump_stack+0x63/0x90
[251529.693148] [<ffffffffab42e9bb>] dump_header+0x5c/0x1dc
[251529.693151] [<ffffffffab3a5836>] oom_kill_process+0x226/0x3f0
[251529.693153] [<ffffffffab3a5daa>] out_of_memory+0x35a/0x3f0
[251529.693155] [<ffffffffab3ab06b>] __alloc_pages_slowpath+0x9fb/0xa20
[251529.693157] [<ffffffffab3ab34a>] __alloc_pages_nodemask+0x2ba/0x300
[251529.693160] [<ffffffffab280726>] copy_process.part.30+0x146/0x1b50
[251529.693162] [<ffffffffab95c66d>] ? sock_recvmsg+0x3d/0x50
[251529.693163] [<ffffffffab95c8aa>] ? SYSC_recvfrom+0xda/0x150
[251529.693164] [<ffffffffab282327>] _do_fork+0xe7/0x3f0
[251529.693166] [<ffffffffab95e171>] ? __sys_recvmsg+0x51/0x90
[251529.693168] [<ffffffffab2826d9>] SyS_clone+0x19/0x20
[251529.693170] [<ffffffffab203bae>] do_syscall_64+0x5e/0xc0
[251529.693174] [<ffffffffaba96625>] entry_SYSCALL64_slow_path+0x25/0x25
[251529.693174] Mem-Info:
[251529.693177] active_anon:339565 inactive_anon:133615 isolated_anon:0
                 active_file:3938458 inactive_file:328087 isolated_file:0
                 unevictable:8 dirty:200 writeback:37 unstable:0
                 slab_reclaimable:3365424 slab_unreclaimable:16102
                 mapped:9114 shmem:1459 pagetables:2462 bounce:0
                 free:49449 free_pcp:32 free_cma:0
---

Had 5 servers knocked out over the weekend.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

4.8 kernels are not affected by this bug. If you have OOMs on 4.8 kernels, please file a new bug with all the revelant details and logs.

Thanks.
Cascardo.

Revision history for this message
Iain Buclaw (iainb) wrote :

Yes they are. I'm seeing the same exorbitant memory usage on 4.8.0-36 that we had on 4.4.0-58.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

This bug is not about memory use. It's about the Linux kernel triggering the OOM killer when higher-order (order 2) allocations are requested and progress cannot be made. This affected Linux 4.7 on mainline and was fixed on Linux 4.7 stable and Linux 4.8. When some fixes were backported to 4.4.0-59 (4.4.0-58 was not affected), this bug was introduced to Xenial kernels; it is now fixed in 4.4.0-63. Any behavior on 4.8 kernels must be investigated separately, because all fixes that were backported to 4.4.0-63 are already present in 4.8.

Can you please open a new bug and attach all the logs and details you can, so we can investigate your problem and provide a fix? Please do not use this bug, because the fixes would be different anyway, and even though the symptoms may look alike, we consider them different bugs.

I appreciate you opening a new bug and providing this new report.

Thanks.
Cascardo.

Revision history for this message
Iain Buclaw (iainb) wrote :

This is the reported /proc/meminfo Buffers usage for 4 different kernel versions. We got the same OOM call traces on both 4.4.0-58 and 4.8.0-34; I highly doubt that is a coincidence.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

Hi, Mr. Iain Buclaw.

Memory usage reports could be related to something else. This bug was introduced in 4.4.0-59; you mention 4.4.0-58. We could certainly investigate the issue you see, just not on this bug. Much more data is necessary, but please don't attach new data to this bug. Your report relates to a different kernel, which has very different memory management code. I could ask you to test 4.4.0-63, but it is still possible that you would find problems there, because they would be unrelated to this bug.

Thank you.
Cascardo.

Revision history for this message
Iain Buclaw (iainb) wrote :

The OOM fixes were introduced in 4.4.0-58 according to the changelog, but sure.

Revision history for this message
Mike Williams (mdub) wrote :

I'm pretty sure this bug was introduced in 4.4.0-58, even though I reported it against 4.4.0-59. Still, +1 for a raising a separate bug against 4.8.

Thanks to Cascardo for fixing this one.

Revision history for this message
joconcepts (jonav) wrote :

Could somebody please confirm that the issue has been fixed with kernel 4.4.0-64.85? We had massive problems with OOM-killed qemu instances on our virtualization hosts and would not like to see this reintroduced.

Revision history for this message
Mathias Bogaert (mathias-bogaert) wrote :

I can confirm 4.4.0-64.85 fixes our OOM issues.

Revision history for this message
Anton (azenkov) wrote :

I still see OOM killer invocation on 4.4.0-64.85:

Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857840] java invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857846] java cpuset=/ mems_allowed=0-1
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857855] CPU: 27 PID: 47820 Comm: java Tainted: G W 4.4.0-64-generic #85-Ubuntu
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857857] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/12/2016
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857860] 0000000000000286 00000000ee496386 ffff882358f13b10 ffffffff813f8083
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857867] ffff882358f13cc8 ffff883c70a8f000 ffff882358f13b80 ffffffff8120b0fe
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857871] ffffffff81cd63bf 0000000000000000 ffffffff81e677e0 0000000000000206
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857875] Call Trace:
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857885] [<ffffffff813f8083>] dump_stack+0x63/0x90
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857893] [<ffffffff8120b0fe>] dump_header+0x5a/0x1c5
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857899] [<ffffffff81192812>] oom_kill_process+0x202/0x3c0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857902] [<ffffffff81192c39>] out_of_memory+0x219/0x460
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857907] [<ffffffff81198c28>] __alloc_pages_slowpath.constprop.88+0x938/0xad0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857911] [<ffffffff81199046>] __alloc_pages_nodemask+0x286/0x2a0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857915] [<ffffffff811990fb>] alloc_kmem_pages_node+0x4b/0xc0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857921] [<ffffffff8107ea5e>] copy_process+0x1be/0x1b70
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857930] [<ffffffff811c19bd>] ? handle_mm_fault+0xcbd/0x1820
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857935] [<ffffffff81406574>] ? call_rwsem_down_read_failed+0x14/0x30
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857938] [<ffffffff810805a0>] _do_fork+0x80/0x360
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857942] [<ffffffff81080929>] SyS_clone+0x19/0x20
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857948] [<ffffffff8183c5f2>] entry_SYSCALL_64_fastpath+0x16/0x71
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857969] Mem-Info:
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857981] active_anon:32518350 inactive_anon:2099 isolated_anon:0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857981] active_file:45384948 inactive_file:45384381 isolated_file:64
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857981] unevictable:914 dirty:104 writeback:0 unstable:0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857981] slab_reclaimable:1282591 slab_unreclaimable:39566
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857981] mapped:10291496 shmem:2227 pagetables:732932 bounce:0
Feb 27 15:43:09 ip-10-0-9-47 kernel: [61385.857981] free:272957 free_pcp:1153 free_cma:0

Revision history for this message
Julian Kassat (j.kassat) wrote :

Same here on Linux 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux.

Out of memory: Kill process 11067 (java) score 28 or sacrifice child
Killed process 11067 (java) total-vm:3569724kB, anon-rss:211720kB, file-rss:20208kB
systemd-journald[247]: /dev/kmsg buffer overrun, some messages lost.
swap_free: Bad swap file entry 2000000000000000
BUG: Bad page map in process java pte:00000020 pmd:dac28067
addr:00007fc34ce69000 vm_flags:08000071 anon_vma: (null) mapping: (null) index:7fc34ce69
file: (null) fault: (null) mmap: (null) readpage: (null)
CPU: 1 PID: 11108 Comm: java Tainted: G B D 4.4.0-64-generic #85-Ubuntu
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
 0000000000000286 00000000f2f56de1 ffff8800d8ceba58 ffffffff813f8083
 00007fc34ce69000 ffff8800dac023e8 ffff8800d8cebaa8 ffffffff811be06f
 ffff8800d8ceba80 ffffffff811d518c 2000000000000000 0000000000000020
Call Trace:
 [<ffffffff813f8083>] dump_stack+0x63/0x90
 [<ffffffff811be06f>] print_bad_pte+0x1df/0x2a0
 [<ffffffff811d518c>] ? swap_info_get+0x7c/0xd0
 [<ffffffff811bf9f8>] unmap_page_range+0x468/0x7a0
 [<ffffffff811bfdad>] unmap_single_vma+0x7d/0xe0
 [<ffffffff811c0871>] unmap_vmas+0x51/0xa0
 [<ffffffff811c9df7>] exit_mmap+0xa7/0x170
 [<ffffffff8107e0a7>] mmput+0x57/0x130
 [<ffffffff81083f2a>] do_exit+0x27a/0xb00
 [<ffffffff8110046c>] ? __unqueue_futex+0x2c/0x60
 [<ffffffff81100f8e>] ? futex_wait+0x16e/0x280
 [<ffffffff81084833>] do_group_exit+0x43/0xb0
 [<ffffffff810909b2>] get_signal+0x292/0x600
 [<ffffffff8102e567>] do_signal+0x37/0x6f0
 [<ffffffff8122fb84>] ? mntput+0x24/0x40
 [<ffffffff81210ba0>] ? __fput+0x190/0x220
 [<ffffffff8100320c>] exit_to_usermode_loop+0x8c/0xd0
 [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
 [<ffffffff8183c750>] int_ret_from_sys_call+0x25/0x8f
BUG: Bad rss-counter state mm:ffff8800d8bc8800 idx:2 val:-1

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

Hi, Anton and Julian.

Can you attach complete logs for investigation?

Thanks.
Cascardo.

Revision history for this message
Julian Kassat (j.kassat) wrote :

Attached kern.log. Let me know in case you need more logs.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

Hi, Julian.

Do you have the output of dmesg after the incident before a reboot?

Cascardo.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

Julian, your logs indicate some possible swap corruption, would you mind opening a new bug and sending it using apport-bug?

Thanks.
Cascardo.

Revision history for this message
Julian Kassat (j.kassat) wrote :

Hi Cascardo,

there is no related dmesg output after the incident (just some lines from apt-daily.timer).

I filed a bug for the possible swap corruption issue here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1669707

Thanks so far.

Julian

Revision history for this message
Pete Cheslock (pete-cheslock) wrote :

We have been seeing this issue recently as well. We are running 4.4.0-66-generic #87-Ubuntu. I can attempt to downgrade to 4.4.0-57, but it's a large cluster with a lot of data, so it may take some time. Attached is a kern.log from the most recent OOM.

Revision history for this message
Flemming Hoffmeyer (flemming-b-h) wrote :

I am seeing this issue as well, on Arch kernel v 4.10.4-1

Revision history for this message
Michael Dye (dye.michael) wrote :

This is plaguing Horizon project Pi2 and Pi3 devices running Xenial 16.04.2 w/ kernel 4.4.0-1050-raspi2. From a pi2:

root@horizon-00000000a17d2187:~# uname -a
Linux horizon-00000000a17d2187 4.4.0-1050-raspi2 #57-Ubuntu SMP Wed Mar 22 12:52:22 UTC 2017 armv7l armv7l armv7l GNU/Linux
root@horizon-00000000a17d2187:~# free
              total        used        free      shared  buff/cache   available
Mem:         942128      149548       35456      494084      757124      239716
Swap:             0           0           0

Under these circumstances, the kernel's oom-killer will kill WiFi processes (rtl_rpcd), systemd-udevd, our Ethereum client (geth), and other critical processes in an attempt to stay afloat rather than using reclaimable RAM.

Revision history for this message
Mohammad Anwar Shah (mohammadanwarshah) wrote :

I was using 4.4.0-21 as reported by `uname -r`, which is the default in Kubuntu 16.04. The same bug appears on mainline kernel 4.10 too!

Now I'm confused. Which kernel should I upgrade to? Also, I only experience this in a KDE session with the Yandex or Chrome browser open.

Revision history for this message
iKazmi (alikazmi-2040) wrote :

I have 4.4.0-59 through 4.4.0-71 and 4.8.0-41 through 4.8.0-46 installed on my system and all are affected by this bug. Firefox, Chrome and NetBeans regularly get killed without warning and for no reason (I have something like 10GB+ of RAM and all 16GB of swap free at the time the process gets killed). Even KDE has been killed a couple of times while the system still had over 6GB RAM and 16GB swap free.

Yesterday, after the umpteenth time NetBeans was killed while I was in the middle of doing something, I finally decided to do something about this problem and installed kernel 4.10.9-041009 from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10.9/. Sadly, that doesn't seem to resolve the problem either, and the OOM killer is still overeager to kill user processes (Firefox and NetBeans have both been killed multiple times). At least KDE hasn't been killed so far.

Revision history for this message
kulwinder singh (kulwinder-careers) wrote :

Has anybody successfully tested 4.4.0.63 for the OOM-kill issue?

Revision history for this message
Anton (anton-powershop) wrote :

Yes, 4.4.0.63 solved our OOM issues (and we had plenty after 4.4.0.59). Ours were all headless servers (bare metal and VMs) though - no desktop usage.

I never experienced this issue on my home laptop either, but it had lots of RAM and was only lightly used during that period - not really a good data point.

Revision history for this message
Travisgevans (travisgevans) wrote :

I also haven't personally encountered any further OOM issues on my home desktop (used daily) with 4.4.0.63.

Revision history for this message
Mohammad Anwar Shah (mohammadanwarshah) wrote :

I'd like to emphasise that the OOM problem only happens with KDE. I have several DEs installed, including Unity, GNOME 3 and Cinnamon, but none of them caused an OOM, at least that I noticed. In KDE, most of the time when Chrome is open, it triggers an OOM. dmesg shows that sometimes kwin_x11 or plasmashell invoked the OOM killer.

Most of the time plasmashell crashes and the open tab in Chrome is killed; the Chrome application itself is still there. I then need to restart plasmashell by pressing Alt-F2 to bring up the run-command dialog and typing plasmashell there.

Last night even Firefox gave an OOM.

I'm attaching a dmesg log hoping that will be helpful.

Revision history for this message
Sebastian Unger (sebunger44) wrote :

This is still an issue in the current linux-raspi2 version. Were those changes ported to that kernel?

Revision history for this message
Sebastian Unger (sebunger44) wrote :

linux-raspi2 version 4.4.0.1055.56 that is.

Revision history for this message
kimo (ubuntu-oldfield) wrote :

I'm seeing oom-killer being invoked despite having 2GB free swap when using the kernel from linux-image-4.4.0-1055-raspi2 version 4.4.0-1055.62.

kimo (ubuntu-oldfield)
Changed in linux-raspi2 (Ubuntu):
status: New → Confirmed
Changed in linux-raspi2 (Ubuntu Xenial):
status: New → Confirmed
Revision history for this message
Sebastian Unger (sebunger44) wrote :

Also observed with 4.4.0-1054-raspi2. I'm now back on 4.4.0-1038-raspi2. I think that one was ok.

Revision history for this message
Nick Hatch (nicholas-hatch) wrote :

We're still having issues with higher-order allocations failing and triggering an OOM kill for unexplained reasons (on 4.4.0-78-generic).

I've attached the relevant OOM killer logs. It may be relevant to note that the server these logs are from is an Elasticsearch instance with a large (~32GB) mlock'ed heap.

Revision history for this message
Pete Cheslock (pete-cheslock) wrote :

@nicholas-hatch - what file system are your disks formatted as? I was able to stop the OOM's on my ES hosts by moving from XFS to EXT4. My belief is that there was a memory fragmentation issue with ES and many small files on XFS formatted volumes.

Revision history for this message
Chris (cmavr8) wrote :

The bug is still confirmed and not fixed for linux-raspi2 (Ubuntu), 5 months after being fixed for the main Ubuntu kernel.

Shouldn't this have some priority? Even apt upgrade breaks if I don't use the clear-cache workaround. I can live with it (a cron job to clear the cache; see the sketch after this comment), but this is not great for LTS.

Currently affected: Ubuntu 16.04.2 LTS, 4.4.0-1059-raspi2 #67-Ubuntu
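
A sketch of that cron workaround, using the drop_caches command mentioned earlier in this bug (the file path and hourly schedule are only illustrative):

# /etc/cron.d/drop-caches -- drop the page cache hourly as a stopgap
0 * * * *  root  sync && echo 3 > /proc/sys/vm/drop_caches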

Paolo Pisati (p-pisati)
Changed in linux-raspi2 (Ubuntu):
assignee: nobody → Paolo Pisati (p-pisati)
Revision history for this message
Paolo Pisati (p-pisati) wrote :
Revision history for this message
Chris (cmavr8) wrote :

Sure.
I undid the workaround, installed and booted the kernel and will test it for a few days. I'll keep you posted on results.

Thanks Paolo!

Revision history for this message
Chris (cmavr8) wrote :

Update: No sign of Out-of-memory errors or kills, after 3 days of testing the 4.4.0-1062-raspi2 kernel. I'll report back again next week.

Revision history for this message
kimo (ubuntu-oldfield) wrote :

4.4.0-1062-raspi2 is looking good - I've had it running for a week without oom-killer being invoked.

Revision history for this message
Chris (cmavr8) wrote :

Mine's also still stable (no OOMs), after running the patched kernel for 9 days, on a Raspberry pi 2 Model B v1.1.

Revision history for this message
Swe W Aung (sirswa) wrote :

Hi

I am experiencing this at one of our compute node hypervisors. The kernel version we are using is 4.4.0-83, but it seems to be having the issue described in this report.

[Mon Aug 7 00:19:42 2017] nova-compute invoked oom-killer: gfp_mask=0x2c200ca, order=0, oom_score_adj=0
[Mon Aug 7 00:19:42 2017] nova-compute cpuset=/ mems_allowed=0-1
[Mon Aug 7 00:19:42 2017] CPU: 7 PID: 2164484 Comm: nova-compute Tainted: G OE 4.4.0-83-generic #106-Ubuntu
[Mon Aug 7 00:19:42 2017] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017
[Mon Aug 7 00:19:42 2017] 0000000000000286 00000000d6004dce ffff88014e753a50 ffffffff813f9513
[Mon Aug 7 00:19:42 2017] ffff88014e753c08 ffff883fecf88e00 ffff88014e753ac0 ffffffff8120b53e
[Mon Aug 7 00:19:42 2017] 0000000000000015 0000000000000000 ffff881fe883b740 ffff883fe94f7000
[Mon Aug 7 00:19:42 2017] Call Trace:
[Mon Aug 7 00:19:42 2017] [<ffffffff813f9513>] dump_stack+0x63/0x90
[Mon Aug 7 00:19:42 2017] [<ffffffff81391c64>] ? apparmor_capable+0xc4/0x1b0
[Mon Aug 7 00:19:42 2017] [<ffffffff81192ae2>] oom_kill_process+0x202/0x3c0
[Mon Aug 7 00:19:42 2017] [<ffffffff81192f09>] out_of_memory+0x219/0x460
[Mon Aug 7 00:19:42 2017] [<ffffffff81198ef8>] __alloc_pages_slowpath.constprop.88+0x938/0xad0
[Mon Aug 7 00:19:42 2017] [<ffffffff81199316>] __alloc_pages_nodemask+0x286/0x2a0
[Mon Aug 7 00:19:42 2017] [<ffffffff811e467d>] alloc_pages_vma+0xad/0x250
[Mon Aug 7 00:19:42 2017] [<ffffffff811fad53>] do_huge_pmd_wp_page+0x153/0xb70
[Mon Aug 7 00:19:42 2017] [<ffffffff811c1a5f>] handle_mm_fault+0x90f/0x1820
[Mon Aug 7 00:19:42 2017] [<ffffffff8106b802>] ? do_page_fault+0x22/0x30
[Mon Aug 7 00:19:42 2017] [<ffffffff81842cf8>] ? page_fault+0x28/0x30
[Mon Aug 7 00:19:42 2017] [<ffffffff8106b577>] __do_page_fault+0x197/0x400
[Mon Aug 7 00:19:42 2017] [<ffffffff8106b802>] do_page_fault+0x22/0x30
[Mon Aug 7 00:19:42 2017] [<ffffffff81842cf8>] page_fault+0x28/0x30
[Mon Aug 7 00:19:42 2017] Mem-Info:
[Mon Aug 7 00:19:42 2017] active_anon:61350709 inactive_anon:2118817 isolated_anon:0
                            active_file:0 inactive_file:0 isolated_file:32
                            unevictable:915 dirty:0 writeback:8 unstable:0
                            slab_reclaimable:14082 slab_unreclaimable:64456
                            mapped:3492 shmem:329012 pagetables:142167 bounce:0
                            free:260204 free_pcp:4111 free_cma:0

[Tue Aug 8 05:50:08 2017] apt-check invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
[Tue Aug 8 05:50:08 2017] apt-check cpuset=/ mems_allowed=0-1
[Tue Aug 8 05:50:08 2017] CPU: 11 PID: 2538289 Comm: apt-check Tainted: G OE 4.4.0-83-generic #106-Ubuntu
[Tue Aug 8 05:50:08 2017] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017
[Tue Aug 8 05:50:08 2017] 0000000000000286 000000005e467cc9 ffff8820b44a39f8 ffffffff813f9513
[Tue Aug 8 05:50:08 2017] ffff8820b44a3bb0 ffff881fec15b800 ffff8820b44a3a68 ffffffff8120b53e
[Tue Aug 8 05:50:08 2017] 0000000000000015 ffffffff81e42ac0 ffff883fe996f980 ffffffffffffff04
[Tue Aug 8 05:50:08 2017] Call Trace:
[Tue Aug 8 05:50:08 2017] [<ff...


Revision history for this message
Jake Billo (ev98) wrote :

We are also experiencing this issue running linux-aws 4.4.0-1028.37, which tracks Ubuntu kernel 4.4.0-89.112. Our use case is very similar to comment #86 (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1655842/comments/86). In our case ElasticSearch 2.4.5 is running under Java 1.8.0_131 with a ~29GB heap; we downsized from 31GB as a troubleshooting effort with no change to the frequency of OOM. The issue also occurs regardless of vm.overcommit_memory being set to 0, 1 or 2.

The relevant data from kern.log (with redacted hostname) is attached; I'm happy to provide additional logs or test different kernels, but since our use case is i3-class instances in AWS, we need the nvme enhancements and enhanced network I/O provided by the linux-aws package.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

Please, do not cut the logs. Without the "invoked oom-killer" line, for example, it's hard to see the gfp flags and allocation order that failed.

Revision history for this message
Pete Cheslock (pete-cheslock) wrote :

I have seemingly solved this issue with linux-aws version 4.4.0-1016-aws, at the very least. The specific issue I was seeing was 2nd-order allocations failing when the OOM killer triggered. At the time I was thinking the issue was due to XFS and memory fragmentation with lots and lots of memory-mapped files in Elasticsearch/Lucene. When we moved to EXT4 the rate of the OOM killer firing dropped, but did not stop. We made the following 2 changes to sysctls, which have effectively stopped higher-order memory allocations from failing and the OOM killer from firing.

Also, these settings were used on i3.2xlarge hosts that have 60G of RAM - your mileage may vary. We do not run swap on our servers, so adding swap could likely have helped, but it is not an option for us.

vm.min_free_kbytes = 1000000 # We set this to leave about 1G of RAM available for the kernel, in the hope that even if memory is heavily fragmented there is still enough for Linux to grab a higher-order allocation fast enough before the OOM killer does things.

vm.zone_reclaim_mode = 1 # Our hope here was to get the kernel to be more aggressive in reclaiming memory.
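
For anyone trying those values, a sketch of applying them persistently (the file name is arbitrary, and the values are the ones quoted above, not a general recommendation):

$ printf 'vm.min_free_kbytes = 1000000\nvm.zone_reclaim_mode = 1\n' | sudo tee /etc/sysctl.d/60-oom-workaround.conf
$ sudo sysctl --system    # reloads all sysctl configuration, including the new file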

Revision history for this message
Jake Billo (ev98) wrote :

Apologies - the file was inadvertently split by logrotate. I have concatenated the entire contents of kern.log and kern.log.1 into the attached file; these are the only kern.log files in /var/log on the system.

I do have to redact the hostname in question, but it is a simple substitution of 'localhost' for the FQDN of the system.

Revision history for this message
Pete Cheslock (pete-cheslock) wrote :

> kthreadd invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0

Yea - that 2nd order allocation failure is the exact same issue I was able to see (same GFP mask also)

Revision history for this message
Swe W Aung (sirswa) wrote :

We have another case of OOM on one of the hosts that we upgraded to kernel 4.4.0-89 a week ago.

kern.log attached.

Revision history for this message
Swe W Aung (sirswa) wrote :

Attaching dmesg output

Revision history for this message
Jake Billo (ev98) wrote :

With the sysctl settings provided by Pete (vm.min_free_kbytes = 1000000 and vm.zone_reclaim_mode = 1), we've been running the linux-aws 4.4.0-1028.37 kernel successfully without an OOM killer invocation for about four days now. Previously we would have seen three or more occurrences of this per day, so it's a positive indication.

Revision history for this message
Willem (wdekker) wrote :

We have found this issue on 4.4.0-92 too, but only when the systems were put under stress.
Reverting back to 4.4.0-57 resolved it.

Revision history for this message
Willem (wdekker) wrote :

Attached kern.log

Paolo Pisati (p-pisati)
Changed in linux-raspi2 (Ubuntu):
status: Confirmed → Fix Committed
Changed in linux-raspi2 (Ubuntu Xenial):
status: Confirmed → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-aws (Ubuntu Xenial):
status: New → Confirmed
Changed in linux-aws (Ubuntu):
status: New → Confirmed
Revision history for this message
Vladimir Nicolici (vnicolici) wrote :

Not sure if it's the same issue, but we had an unexpected OOM with Ubuntu 16.04.3 LTS, 4.4.0-91.

Oct 31 23:52:25 db3 kernel: [6569272.882023] psql invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0

...

Oct 31 23:52:25 db3 kernel: [6569272.882154] Mem-Info:
Oct 31 23:52:25 db3 kernel: [6569272.882165] active_anon:38011018 inactive_anon:1422084 isolated_anon:0
Oct 31 23:52:25 db3 kernel: [6569272.882165] active_file:11699125 inactive_file:11727535 isolated_file:0
Oct 31 23:52:25 db3 kernel: [6569272.882165] unevictable:0 dirty:88019 writeback:2902991 unstable:23308
Oct 31 23:52:25 db3 kernel: [6569272.882165] slab_reclaimable:1455159 slab_unreclaimable:533985
Oct 31 23:52:25 db3 kernel: [6569272.882165] mapped:38499394 shmem:38495946 pagetables:33687177 bounce:0
Oct 31 23:52:25 db3 kernel: [6569272.882165] free:212612 free_pcp:0 free_cma:0
Oct 31 23:52:25 db3 kernel: [6569272.882172] Node 0 DMA free:13256kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15976kB managed:15892kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Oct 31 23:52:25 db3 kernel: [6569272.882182] lowmem_reserve[]: 0 1882 193368 193368 193368
Oct 31 23:52:25 db3 kernel: [6569272.882188] Node 0 DMA32 free:768204kB min:316kB low:392kB high:472kB active_anon:8kB inactive_anon:32kB active_file:20kB inactive_file:48kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2045556kB managed:1964868kB mlocked:0kB dirty:0kB writeback:44kB mapped:16kB shmem:12kB slab_reclaimable:729192kB slab_unreclaimable:35928kB kernel_stack:1920kB pagetables:415552kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 31 23:52:25 db3 kernel: [6569272.882196] lowmem_reserve[]: 0 0 191486 191486 191486
Oct 31 23:52:25 db3 kernel: [6569272.882201] Node 0 Normal free:34260kB min:32432kB low:40540kB high:48648kB active_anon:58162056kB inactive_anon:2546400kB active_file:18254204kB inactive_file:18282192kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:199229440kB managed:196081724kB mlocked:0kB dirty:152124kB writeback:4685924kB mapped:58223800kB shmem:58229824kB slab_reclaimable:2362116kB slab_unreclaimable:1123984kB kernel_stack:11056kB pagetables:94580096kB unstable:22108kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 31 23:52:25 db3 kernel: [6569272.882210] lowmem_reserve[]: 0 0 0 0 0
Oct 31 23:52:25 db3 kernel: [6569272.882215] Node 1 Normal free:34728kB min:32780kB low:40972kB high:49168kB active_anon:93882008kB inactive_anon:3141904kB active_file:28542276kB inactive_file:28627900kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:201326592kB managed:198178644kB mlocked:0kB dirty:199952kB writeback:6925996kB mapped:95773760kB shmem:95753948kB slab_reclaimable:2729328kB slab_unreclaimable:976028kB...


Revision history for this message
William DeLuca (qops1981) wrote :

We believe that we are experiencing this issue on kernel 4.4.0-1030-aws as well. We recently moved from 14.04 LTS to 16.04 LTS and are now experiencing OOM kills.

Revision history for this message
William DeLuca (qops1981) wrote :

Side question: is there something I can specifically look for on an Ubuntu install that would indicate whether that kernel has the fix or not? I assume the fix/not-fixed status indicators are set manually, and the fix could be out for AWS but not indicated.

description: updated
Revision history for this message
Erik Hess (p-we-x) wrote :

In our production environment of ~1800 nodes we've seen oom-kill events that looked similar to this bug's pattern - oom-kills killing large server processes while resident memory was far lower than available physical memory.

We were affected by the original bug and saw that issue readily addressed in newer kernel versions, as mentioned in the earlier comments in this ticket. However, we still kept seeing oom-kill events, albeit in far lower numbers over time, happening on kernel-upgraded systems. These were a mystery for a while, largely due to their infrequent occurrence.

After a lot of research we think we've pinned it down to a subset of our multi-socket servers that have more than one NUMA memory pool. After implementing some scripts to track NUMA stats, we've observed that one of the two NUMA pools is being fully utilized while the other has large amounts of memory to spare (often 90-95%). Either our server app, the JVM it's running on, or the kernel itself isn't handling the NUMA memory pooling well, and we're ending up exhausting an entire NUMA pool.

Work is ongoing to establish the causality chain that's leading to this. We don't yet have confirmation about whether it's something our app (or its libraries) is doing, whether we just need to make the JVM NUMA-aware with arguments, or whether there's kernel tuning to be done. But I did want to mention it here as a warning to folks running on multi-NUMA-pool multi-socket systems seeing similar behavior.
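
A sketch of the kind of per-node tracking described above (numastat ships with the numactl package; the interval is arbitrary):

$ numastat -m | grep -E 'MemTotal|MemFree|MemUsed'    # per-NUMA-node memory summary
$ watch -n 60 numastat -m                             # watch for one node running dry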

Revision history for this message
Trent Lloyd (lathiat) wrote :

You can potentially use numactl to launch the process and set a policy of interleaving allocations between NUMA nodes to avoid these 1 sided allocations. Tends to happen with servers that make big allocations from a single thread during startup, as commonly seen on mysqld servers and the innodb_buffer_pool for example.

numactl --interleave all /path/to/server/process --argument-1 #etc

Reference:
https://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/
