"Out of memory" errors after upgrade to 4.4.0-59

Bug #1655842 reported by Mike Williams on 2017-01-12
This bug affects 96 people
Affects                       Importance  Assigned to
linux (Ubuntu)                High        Thadeu Lima de Souza Cascardo
linux (Ubuntu Xenial)         High        Thadeu Lima de Souza Cascardo
linux-aws (Ubuntu)            Undecided   Unassigned
linux-aws (Ubuntu Xenial)     Undecided   Unassigned
linux-raspi2 (Ubuntu)         Undecided   Paolo Pisati
linux-raspi2 (Ubuntu Xenial)  Undecided   Unassigned

Bug Description

After a fix for LP #1647400, a bug that caused freezes under some workloads, some users noticed regular OOMs. Those OOMs were reported under this bug and fixed over the following releases.

Some of the affected kernels are documented below. To check your particular kernel, read its changelog and look for 1655842 and 1647400. If it has the fix for 1647400 but not the fix for 1655842, it is affected.

You may still notice regressions compared to kernels that had neither fix. However, reverting all the fixes would bring the freeze bug back, so that is not a viable solution going forward.

If you see any regressions, mainly in the form of OOMs, please report a new bug. Different workloads may require different solutions, and further fixes may be needed, whether upstream or not. The best way to get such fixes applied is to report the problem in a new bug that can be verified; being able to reproduce the bug makes it possible to verify that the fixes really fix it.
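The changelog check described above can be scripted. This is only a sketch (the helper name is made up), keyed on whether the changelog mentions the 1647400 fix without the 1655842 fix:

```shell
# Hypothetical helper: reads a kernel changelog on stdin and reports
# whether that kernel is affected by this bug (i.e. it has the fix for
# LP #1647400 but not the fix for LP #1655842).
is_affected_by_lp1655842() {
    log=$(cat)
    if printf '%s' "$log" | grep -q 1647400 &&
       ! printf '%s' "$log" | grep -q 1655842; then
        echo "affected"
    else
        echo "not affected"
    fi
}

# Example on a real system (needs network access):
#   apt-get changelog "linux-image-$(uname -r)" | is_affected_by_lp1655842
```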

Kernels affected:

linux 4.4.0-58, 4.4.0-59, 4.4.0-60, 4.4.0-61, 4.4.0-62.
linux-raspi2 4.4.0-1039 to 4.4.0-1042 and 4.4.0-1044 to 4.4.0-1071

Particular kernels NOT affected by THIS bug:

linux-aws

To reiterate, if you find an OOM with an affected kernel, please upgrade.
If you find an OOM with a non-affected kernel, please report a new bug. We want to investigate it and fix it.

===================
I recently replaced some Xenial servers, and started experiencing "Out of memory" problems with the default kernel.

We bake Amazon AMIs based on an official Ubuntu-provided image (ami-e6b58e85, in ap-southeast-2, from https://cloud-images.ubuntu.com/locator/ec2/). Previous versions of our AMI included "4.4.0-57-generic", but the latest version picked up "4.4.0-59-generic" as part of a "dist-upgrade".

Instances booted using the new AMI have been using more memory, and experiencing OOM issues - sometimes during boot, and sometimes a while afterwards. An example from the system log is:

[ 130.113411] cloud-init[1560]: Cloud-init v. 0.7.8 running 'modules:final' at Wed, 11 Jan 2017 22:07:53 +0000. Up 29.28 seconds.
[ 130.124219] cloud-init[1560]: Cloud-init v. 0.7.8 finished at Wed, 11 Jan 2017 22:09:35 +0000. Datasource DataSourceEc2. Up 130.09 seconds
[29871.137128] Out of memory: Kill process 2920 (ruby) score 107 or sacrifice child
[29871.140816] Killed process 2920 (ruby) total-vm:675048kB, anon-rss:51184kB, file-rss:2164kB
[29871.449209] Out of memory: Kill process 3257 (splunkd) score 97 or sacrifice child
[29871.453282] Killed process 3258 (splunkd) total-vm:66272kB, anon-rss:6676kB, file-rss:0kB
[29871.677910] Out of memory: Kill process 2647 (fluentd) score 51 or sacrifice child
[29871.681872] Killed process 2647 (fluentd) total-vm:117944kB, anon-rss:23956kB, file-rss:1356kB

I have a hunch that this may be related to the fix for https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1647400, introduced in linux (4.4.0-58.79).

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-59-generic 4.4.0-59.80
ProcVersionSignature: User Name 4.4.0-59.80-generic 4.4.35
Uname: Linux 4.4.0-59-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jan 12 06:29 seq
 crw-rw---- 1 root audio 116, 33 Jan 12 06:29 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.1-0ubuntu2.4
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Thu Jan 12 06:38:45 2017
Ec2AMI: ami-0f93966c
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: ap-southeast-2a
Ec2InstanceType: t2.nano
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb: Error: command ['lsusb'] failed with exit code 1:
MachineType: Xen HVM domU
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 cirrusdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-59-generic root=UUID=fb0fef08-f3c5-40bf-9776-f7ba00fe72be ro console=tty1 console=ttyS0
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-59-generic N/A
 linux-backports-modules-4.4.0-59-generic N/A
 linux-firmware 1.157.6
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 12/09/2016
dmi.bios.vendor: Xen
dmi.bios.version: 4.2.amazon
dmi.chassis.type: 1
dmi.chassis.vendor: Xen
dmi.modalias: dmi:bvnXen:bvr4.2.amazon:bd12/09/2016:svnXen:pnHVMdomU:pvr4.2.amazon:cvnXen:ct1:cvr:
dmi.product.name: HVM domU
dmi.product.version: 4.2.amazon
dmi.sys.vendor: Xen

Mike Williams (mdub) wrote :

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → High
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Joseph Salisbury (jsalisbury) wrote :

I built a Xenial test kernel with the following two commits reverted:

c630ec12d831 mm, oom: rework oom detection
57e9ef475661 mm: throttle on IO only when there are too many dirty and writeback pages

This test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1655842/

Can you test this kernel and see if it resolves this bug?

Thanks in advance!

You could also try cherry-picking https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=6b4e3181d7bd5ca5ab6f45929e4a5ffa7ab4ab7f , but that will probably need some more in-between patches as well.

Reverting the two commits fixed the issue for our users (Proxmox VE, which uses a kernel based on the 4.4.x one from 16.04).

David F. (malteworld) wrote :

@f-gruenbichler: I already tried to cherry-pick that patch a while ago and it doesn't work because that patch is based on work that isn't in the 4.4.* kernel branch, not even including Canonical's backports from later branches.

Mike Williams (mdub) wrote :

Thanks jsalisbury. We have deployed using your test kernel (from http://kernel.ubuntu.com/~jsalisbury/lp1655842/), and experienced no OOM issues.

Allen Wild (aswild) wrote :

I manage a set of build servers for CPU/IO intensive builds using Yocto/OpenEmbedded. Ubuntu 14.04.5 with the 4.4 Xenial kernel. After updating to 4.4.0-59 the builds started failing because of the OOM killer.

Rolling back to 4.4.0-57 fixed the OOMs for me.

Can you try the kernel at [1]? It includes the patches, which are also posted at [1].

[1] http://people.canonical.com/~cascardo/lp1655842/

Thanks.
Cascardo.

Changed in linux (Ubuntu):
assignee: nobody → Thadeu Lima de Souza Cascardo (cascardo)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Thadeu Lima de Souza Cascardo (cascardo)
Stéphane Graber (stgraber) wrote :

Just a note that Joe's armhf kernel has been working well for me.

I can't test cascardo's kernel as it's not built for armhf.

I will upload armhf binaries for those kernels and let you know. It's important to try those because they include an alternative solution that we would rather use instead of the one with the reverted patches.

Danny B (danny.b) wrote :

Using Cascardo's kernel fixes the problem for me.

It was a bit of a hassle to install though because there's no linux-headers-4.4.0-62_4.4.0-62.83_all.deb at the link and linux-headers-generic depends on it.

Here's where to find it:
amd64: https://launchpad.net/ubuntu/xenial/amd64/linux-headers-4.4.0-62/4.4.0-62.83
armhf: https://launchpad.net/ubuntu/xenial/armhf/linux-headers-4.4.0-62/4.4.0-62.83

Ben French (octoamit) on 2017-01-21
Changed in linux (Ubuntu):
status: Triaged → In Progress
Stéphane Graber (stgraber) wrote :

I've had a few armhf systems running cascardo's kernel and so far no sign of the OOM or any other problem with it.

Mike Williams (mdub) wrote :

Cascardo: we've tried your test kernel, and it looks good - we've seen no OOM problems.

Cris (cristianpeguero25) wrote :

Hi, I'd like to install Cascardo's kernel since I've been having the same issue, though strangely not on all of the Xenial machines running 4.4.0-59-generic.
Could someone tell me how to install Cascardo's kernel without completely messing up my machine?

Thanks

xb5i7o (xb5i7o) wrote :

Hi, I am having the exact same issue on a PC with 18 GB of RAM! Kernel 4.4.0-59-generic.

Please can this be fixed as soon as possible in the next kernel update.

It's killing processes such as Firefox and VirtualBox for no good reason while only 4 GB is really in use.

Hope this can be fixed soon; it's getting worse as time passes.

Eric Desrochers (slashd) wrote :

The patchset[1] for bug LP #1655842 was submitted on Jan 24th 2017 and acked by the kernel team on the same day[2].

The patch should be part of the following kernel release cycle:

cycle: 27-Jan through 18-Feb[3]
====
27-Jan Last day for kernel commits for this cycle
30-Jan - 04-Feb Kernel prep week.
05-Feb - 17-Feb Bug verification & Regression testing.
20-Feb Release to -updates.
====

[1] - "Xenial PATCH 00/11] Fixes OOM for LP #1655842"
[2] - "ACK: [Xenial PATCH 00/11] Fixes OOM for LP #1655842"
[3] - https://wiki.ubuntu.com/KernelTeam/Newsletter

- Eric

Changed in linux (Ubuntu Xenial):
status: Triaged → In Progress
Eric Desrochers (slashd) wrote :

Additional note:

Applied in master-next on Jan 26th 2017[1]

[1] - "APPLIED: [Xenial PATCH 00/11] Fixes OOM for LP #1655842"

- Eric

Eric Desrochers (slashd) on 2017-01-27
tags: added: sts

@slashd It sounds really strange to me that I should wait until 20-Feb for a fix for this bug when it is clearly a regression introduced with the latest kernel upgrade. Is there no way to speed things up to fix this regression?

Currently we had to downgrade all our xenial systems to linux-image-4.4.0-57-generic to avoid this bug.

Gaudenz

Eric Desrochers (slashd) wrote :

@Gaudenz Steinlin (gaudenz-debian),

It will take 3 weeks to land in the -updates pocket, but you can expect a call for testing of a proposed package by EOW.

- Eric

Tim Gardner (timg-tpi) on 2017-01-31
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu):
status: In Progress → Fix Released

This is a severe bug. It should be treated as a high-priority bugfix that cannot wait 3 weeks.

Just as a note for newcomers reading this, I can confirm the bug is NOT fixed in the officially released 4.4.0-62.83.

Krzysztof Dryja (cih997) wrote :

I could not reboot my machine, and the ugly workaround for this issue was to log in as root and clear the system caches:

echo 3 > /proc/sys/vm/drop_caches

This made my machine stable again, at least for the time I needed.

This is fixed in 4.4.0-63.84, which will be available in -proposed soon.

Shelby Cain (alyandon) wrote :

@nate Thank you! You just saved me a lot of hassle as I was about to unpin the 4.4.0-57 kernel and update a bunch of machines on the assumption the fix was in that version.

Sebastian Unger (sebunger44) wrote :

As a note: I believe this also affects the armhf kernel 4.4.0-1040-raspi2 for the Raspberry Pi.

David Glasser (glasser) wrote :

I've been struggling with this bug for nearly a week and only now found this issue. Thanks for fixing it!

For the sake of others finding it, here's the stack trace part of the oom-killer log, which contains some terms I searched for a while ago that aren't mentioned here yet.

docker invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=-1000
docker cpuset=/ mems_allowed=0
CPU: 11 PID: 4472 Comm: docker Tainted: G W 4.4.0-62-generic #83-Ubuntu
Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
 0000000000000286 0000000057f64c94 ffff880dfb5efaf0 ffffffff813f7c63
 ffff880dfb5efcc8 ffff880fbfda0000 ffff880dfb5efb60 ffffffff8120ad4e
 ffffffff81cd2d7f 0000000000000000 ffffffff81e67760 0000000000000206
Call Trace:
 [<ffffffff813f7c63>] dump_stack+0x63/0x90
 [<ffffffff8120ad4e>] dump_header+0x5a/0x1c5
 [<ffffffff811926c2>] oom_kill_process+0x202/0x3c0
 [<ffffffff81192ae9>] out_of_memory+0x219/0x460
 [<ffffffff81198a5d>] __alloc_pages_slowpath.constprop.88+0x8fd/0xa70
 [<ffffffff81198e56>] __alloc_pages_nodemask+0x286/0x2a0
 [<ffffffff81198f0b>] alloc_kmem_pages_node+0x4b/0xc0
 [<ffffffff8107ea5e>] copy_process+0x1be/0x1b70
 [<ffffffff8139225c>] ? apparmor_file_alloc_security+0x5c/0x220
 [<ffffffff811ed04a>] ? kmem_cache_alloc+0x1ca/0x1f0
 [<ffffffff81348263>] ? security_file_alloc+0x33/0x50
 [<ffffffff810caeb1>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
 [<ffffffff810805a0>] _do_fork+0x80/0x360
 [<ffffffff81080929>] SyS_clone+0x19/0x20
 [<ffffffff818385f2>] entry_SYSCALL_64_fastpath+0x16/0x71

Hajo Locke (hajo-locke) wrote :

When will this new kernel be released? This bug is killing our MySQL servers. Booting old kernels is only a bad workaround. I think a lot of people with busy servers will have a problem.

This is the 2nd time we have been hit by a big bug within a short time. In Oct 2016 our nameservers had problems because of bug 1634892.
Is LTS Ubuntu still the right system for servers?

This bug also appears to affect linux-image-4.8.0-34-generic in 16.04.1 Xenial.

Hi, Luk.

linux-image-4.8.0-34-generic should not be affected by this. If you see unexpected OOM problems, please open a new bug report and attach the kernel logs.

Thanks.
Cascardo.

xb5i7o (xb5i7o) wrote :

Just by the way: 4.4.0-62-generic has the exact same problem. Even after uninstalling 4.4.0-59-generic, my system at some point auto-updated to 4.4.0-62-generic. Only 4.4.0-57-generic is safe for now.

Nick Maynard (nick-maynard) wrote :

LTS Ubuntu with -updates shouldn't have this sort of issue - this is, frankly, unforgivable.

We need a new kernel urgently in -updates, and I'd expect serious discussions within the kernel team to understand what caused this issue and avoid it recurring.

Anton Piatek (anton-piatek) wrote :

If this kernel is not going to hit -updates shortly (i.e. in days), can something be done to pull or downgrade the broken kernel? At least revert linux-image-generic to depend on linux-image-4.4.0-57-generic, which doesn't have the issues; that would stop more people from upgrading to a broken kernel.

Having this sort of break in an LTS kernel is not inspiring at all.

Eric Desrochers (slashd) wrote :

The fix is now available for testing in kernel version 4.4.0-63.84, if you enable proposed[1]

$ apt-cache policy linux-image-4.4.0-63-generic
linux-image-4.4.0-63-generic:
  Installed: (none)
  ==> Candidate: 4.4.0-63.84
  Version table:
     4.4.0-63.84 500
        500 http://archive.ubuntu.com/ubuntu ==>xenial-proposed/main amd64 Packages

$ apt-get changelog linux-image-4.4.0-63-generic | egrep "1655842"
 ==> * "Out of memory" errors after upgrade to 4.4.0-59 (LP: #1655842)

[1] - https://wiki.ubuntu.com/Testing/EnableProposed

- Eric

Oliver O. (oliver-o456i) wrote :

Testing...

Enabled proposed (https://wiki.ubuntu.com/Testing/EnableProposed).

Installed kernel packages:

# apt-get install -s -t xenial-proposed 'linux-headers-4.4.0.63$' 'linux-headers-4.4.0.63-generic$' 'linux-image-4.4.0.63-generic$' 'linux-image-extra-4.4.0.63-generic$'

Rebooted.

# cat /proc/version_signature
Ubuntu 4.4.0-63.84-generic 4.4.44

Which is the safer kernel now, 4.4.0-57-generic or 4.4.0-63-generic?

David Glasser (glasser) wrote :

kulwinder singh: Either one, but nothing in between.

-57 will reintroduce a few (unrelated) security bugs as well as the bug https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1647400 whose fix caused this one, but is easier to enable and has been tested for longer.

-63 should fix this bug, the older bug, and the intermediary security bugs, but requires you to enable the "proposed" repository, and hasn't been tested for quite as long.

Anything in between has this bug.
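The version guidance above for the generic kernel can be captured in a tiny check. This is a sketch only; the affected ABI range (58 through 62) comes from this bug's description, and the helper name is made up:

```shell
# Hypothetical helper: is a Xenial kernel 4.4.0-N-generic affected by this
# bug? Per the description, ABIs 58 through 62 are affected; -57 predates
# the 1647400 fix and -63 carries the fix for this bug.
affected_by_lp1655842() {
    abi=$1
    [ "$abi" -ge 58 ] && [ "$abi" -le 62 ]
}

# Example: check the running kernel (assumes a 4.4.0-N-generic version):
#   abi=$(uname -r | sed -n 's/^4\.4\.0-\([0-9]*\)-generic$/\1/p')
#   affected_by_lp1655842 "$abi" && echo "affected; please upgrade"
```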

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
David Glasser (glasser) wrote :

Cascardo: Just to be clear, are you looking for verification from anyone in the world, or from specific kernel testers?

(I'd like to help, but I'm only able to reproduce the issue in production, and debugging it when we ran into it already involved more restarts than is good for my service right now; we settled on downgrading for the moment.)

David F. (malteworld) wrote :

@nick-maynard: Why is such a bug unforgivable? You can just boot a previous kernel instead. If you're concerned about availability then don't reboot in the first place unless there's an important security patch.

Oliver O. (oliver-o456i) on 2017-02-11
tags: added: verification-done-xenial
removed: verification-needed-xenial
description: updated
Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released

Julian, your logs indicate some possible swap corruption. Would you mind opening a new bug and sending it using apport-bug?

Thanks.
Cascardo.

Julian Kassat (j.kassat) wrote :

Hi Cascardo,

There is no related dmesg output after the incident (just some lines from apt-daily.timer).

I filed a bug for the possible swap corruption issue here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1669707

Thanks so far.

Julian

Pete Cheslock (pete-cheslock) wrote :

We have been seeing this issue recently as well. We are running 4.4.0-66-generic #87-Ubuntu. I can attempt to downgrade to 4.4.0-57, but it's a large cluster with a lot of data, so it may take some time. Attached is a kern.log from the most recent OOM.

I am seeing this issue as well, on Arch kernel v 4.10.4-1

Michael Dye (dye.michael) wrote :

This is plaguing Horizon project Pi2 and Pi3 devices running Xenial 16.04.2 with kernel 4.4.0-1050-raspi2. From a Pi2:

root@horizon-00000000a17d2187:~# uname -a
Linux horizon-00000000a17d2187 4.4.0-1050-raspi2 #57-Ubuntu SMP Wed Mar 22 12:52:22 UTC 2017 armv7l armv7l armv7l GNU/Linux
root@horizon-00000000a17d2187:~# free
              total    used    free  shared  buff/cache  available
Mem:         942128  149548   35456  494084      757124     239716
Swap:             0       0       0

Under these circumstances, the kernel's oom-killer will kill WiFi processes (rtl_rpcd), systemd-udevd, our Ethereum client (geth), and other critical processes in an attempt to stay afloat, rather than using reclaimable RAM.

I was using 4.4.0-21 (as reported by `uname -r`), which is the default in Kubuntu 16.04. The same bug appears on mainline kernel 4.10 too!

Now I'm confused. Which kernel should I upgrade to? Also, I experience this only in a KDE session with the Yandex or Chrome browser open.

iKazmi (alikazmi-2040) wrote :

I have 4.4.0-59 till 4.4.0-71 and 4.8.0-41 till 4.8.0-46 installed on my system and all are affected by this bug. Firefox, Chrome and Netbeans regularly get killed without a warning and for no reason (since I have something like 10GB+ RAM and all 16GB Swap free at the time the process gets killed). Even KDE has been killed a couple of times while the system still had over 6GB RAM and 16GB Swap free.

Yesterday, after the umpteenth time Netbeans was killed while I was in the middle of doing something, I finally decided to do something about this problem and installed Kernel 4.10.9-041009 from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10.9/. Sadly, that doesn't seem to resolve the problem either and Oom Killer is still overeager to kill user processes (Firefox and Netbeans have both been killed multiple times). At least KDE hasn't been killed so far.

Has anybody successfully tested 4.4.0-63 for the OOM-kill issue?

Anton (anton-powershop) wrote :

Yes, 4.4.0.63 solved our OOM issues (and we had plenty after 4.4.0.59). Ours were all headless servers (bare metal and VMs) though; no desktop usage.

But I never experienced this issue with my home laptop either; it had lots of RAM and was only lightly used during that period, so not really a good data point.

Travisgevans (travisgevans) wrote :

I also haven't personally encountered any further OOM issues on my home desktop (used daily) with 4.4.0.63.

I'd like to emphasise that the OOM problem only happens with KDE. I have several DEs installed, including Unity, GNOME 3 and Cinnamon, but none of them caused an OOM, at least that I noticed. In KDE, however, most of the time when Chrome is open it triggers an OOM; dmesg shows that sometimes kwin_x11 or plasmashell invoked the OOM killer.

Most of the time plasmashell crashes and the open tab in Chrome is killed, though the Chrome application itself survives. I then need to restart plasmashell by pressing Alt-F2 to bring up the run-command dialog and typing plasmashell there.

Last night even Firefox triggered an OOM.

I'm attaching a dmesg log, hoping it will be helpful.

Sebastian Unger (sebunger44) wrote :

This is still an issue in the current linux-raspi2 version. Were those changes ported to that kernel?

Sebastian Unger (sebunger44) wrote :

linux-raspi2 version 4.4.0.1055.56 that is.

kimo (ubuntu-oldfield) wrote :

I'm seeing oom-killer being invoked despite having 2GB free swap when using the kernel from linux-image-4.4.0-1055-raspi2 version 4.4.0-1055.62.

kimo (ubuntu-oldfield) on 2017-05-25
Changed in linux-raspi2 (Ubuntu):
status: New → Confirmed
Changed in linux-raspi2 (Ubuntu Xenial):
status: New → Confirmed
Sebastian Unger (sebunger44) wrote :

Also observed with 4.4.0-1054-raspi2. I'm now back on 4.4.0-1038-raspi2. I think that one was ok.

Nick Hatch (nicholas-hatch) wrote :

We're still having issues with higher-order allocations failing and triggering an OOM kill for no explicable reason (on 4.4.0-78-generic).

I've attached the relevant OOM killer logs. It may be relevant to note that the server these logs are from is an Elasticsearch instance with a large (~32GB) mlock'ed heap.

Pete Cheslock (pete-cheslock) wrote :

@nicholas-hatch - what file system are your disks formatted with? I was able to stop the OOMs on my ES hosts by moving from XFS to ext4. My belief is that there was a memory fragmentation issue with ES and many small files on XFS-formatted volumes.

Chris (cmavr8) wrote :

The bug is still confirmed and not fixed for linux-raspi2 (Ubuntu), 5 months after being fixed for the main Ubuntu kernel.

Shouldn't this have some priority? Even apt upgrade breaks if I don't use the clear-cache workaround. I can live with it (a cron job to clear the cache), but this is not great for LTS.

Currently affected: Ubuntu 16.04.2 LTS, 4.4.0-1059-raspi2 #67-Ubuntu
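For anyone stuck on an affected raspi2 kernel, the clear-cache workaround mentioned above can be automated with a cron entry along these lines. This is a stopgap only, not a fix; the file path and hourly schedule are arbitrary examples:

```
# /etc/cron.d/drop-caches  (hypothetical path; workaround only, not a fix)
# Sync and drop the page cache hourly so memory-hungry tasks like
# apt upgrade keep working on affected kernels.
0 * * * * root sync && echo 3 > /proc/sys/vm/drop_caches
```

Dropping caches costs performance until the caches warm up again, so remove the entry once a fixed kernel is installed.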

Paolo Pisati (p-pisati) on 2017-06-30
Changed in linux-raspi2 (Ubuntu):
assignee: nobody → Paolo Pisati (p-pisati)
Chris (cmavr8) wrote :

Sure.
I undid the workaround, installed and booted the kernel and will test it for a few days. I'll keep you posted on results.

Thanks Paolo!

Chris (cmavr8) wrote :

Update: No sign of Out-of-memory errors or kills, after 3 days of testing the 4.4.0-1062-raspi2 kernel. I'll report back again next week.

kimo (ubuntu-oldfield) wrote :

4.4.0-1062-raspi2 is looking good - I've had it running for a week without oom-killer being invoked.

Chris (cmavr8) wrote :

Mine's also still stable (no OOMs), after running the patched kernel for 9 days, on a Raspberry pi 2 Model B v1.1.

sirswa (sirswa) wrote :

Hi

I am experiencing this on one of our compute-node hypervisors. The kernel version we are using is 4.4.0-83, but it seems to have the issue reported here.

[Mon Aug 7 00:19:42 2017] nova-compute invoked oom-killer: gfp_mask=0x2c200ca, order=0, oom_score_adj=0
[Mon Aug 7 00:19:42 2017] nova-compute cpuset=/ mems_allowed=0-1
[Mon Aug 7 00:19:42 2017] CPU: 7 PID: 2164484 Comm: nova-compute Tainted: G OE 4.4.0-83-generic #106-Ubuntu
[Mon Aug 7 00:19:42 2017] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017
[Mon Aug 7 00:19:42 2017] 0000000000000286 00000000d6004dce ffff88014e753a50 ffffffff813f9513
[Mon Aug 7 00:19:42 2017] ffff88014e753c08 ffff883fecf88e00 ffff88014e753ac0 ffffffff8120b53e
[Mon Aug 7 00:19:42 2017] 0000000000000015 0000000000000000 ffff881fe883b740 ffff883fe94f7000
[Mon Aug 7 00:19:42 2017] Call Trace:
[Mon Aug 7 00:19:42 2017] [<ffffffff813f9513>] dump_stack+0x63/0x90
[Mon Aug 7 00:19:42 2017] [<ffffffff81391c64>] ? apparmor_capable+0xc4/0x1b0
[Mon Aug 7 00:19:42 2017] [<ffffffff81192ae2>] oom_kill_process+0x202/0x3c0
[Mon Aug 7 00:19:42 2017] [<ffffffff81192f09>] out_of_memory+0x219/0x460
[Mon Aug 7 00:19:42 2017] [<ffffffff81198ef8>] __alloc_pages_slowpath.constprop.88+0x938/0xad0
[Mon Aug 7 00:19:42 2017] [<ffffffff81199316>] __alloc_pages_nodemask+0x286/0x2a0
[Mon Aug 7 00:19:42 2017] [<ffffffff811e467d>] alloc_pages_vma+0xad/0x250
[Mon Aug 7 00:19:42 2017] [<ffffffff811fad53>] do_huge_pmd_wp_page+0x153/0xb70
[Mon Aug 7 00:19:42 2017] [<ffffffff811c1a5f>] handle_mm_fault+0x90f/0x1820
[Mon Aug 7 00:19:42 2017] [<ffffffff8106b802>] ? do_page_fault+0x22/0x30
[Mon Aug 7 00:19:42 2017] [<ffffffff81842cf8>] ? page_fault+0x28/0x30
[Mon Aug 7 00:19:42 2017] [<ffffffff8106b577>] __do_page_fault+0x197/0x400
[Mon Aug 7 00:19:42 2017] [<ffffffff8106b802>] do_page_fault+0x22/0x30
[Mon Aug 7 00:19:42 2017] [<ffffffff81842cf8>] page_fault+0x28/0x30
[Mon Aug 7 00:19:42 2017] Mem-Info:
[Mon Aug 7 00:19:42 2017] active_anon:61350709 inactive_anon:2118817 isolated_anon:0
                            active_file:0 inactive_file:0 isolated_file:32
                            unevictable:915 dirty:0 writeback:8 unstable:0
                            slab_reclaimable:14082 slab_unreclaimable:64456
                            mapped:3492 shmem:329012 pagetables:142167 bounce:0
                            free:260204 free_pcp:4111 free_cma:0

[Tue Aug 8 05:50:08 2017] apt-check invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
[Tue Aug 8 05:50:08 2017] apt-check cpuset=/ mems_allowed=0-1
[Tue Aug 8 05:50:08 2017] CPU: 11 PID: 2538289 Comm: apt-check Tainted: G OE 4.4.0-83-generic #106-Ubuntu
[Tue Aug 8 05:50:08 2017] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017
[Tue Aug 8 05:50:08 2017] 0000000000000286 000000005e467cc9 ffff8820b44a39f8 ffffffff813f9513
[Tue Aug 8 05:50:08 2017] ffff8820b44a3bb0 ffff881fec15b800 ffff8820b44a3a68 ffffffff8120b53e
[Tue Aug 8 05:50:08 2017] 0000000000000015 ffffffff81e42ac0 ffff883fe996f980 ffffffffffffff04
[Tue Aug 8 05:50:08 2017] Call Trace:
[Tue Aug 8 05:50:08 2017] [<ff...


Jake Billo (ev98) wrote :

We are also experiencing this issue running linux-aws 4.4.0-1028.37, which tracks Ubuntu kernel 4.4.0-89.112. Our use case is very similar to comment #86 (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1655842/comments/86). In our case Elasticsearch 2.4.5 is running under Java 1.8.0_131 with a ~29GB heap; we downsized from 31GB as a troubleshooting effort, with no change to the frequency of OOMs. The issue also occurs regardless of vm.overcommit_memory being set to 0, 1 or 2.

The relevant data from kern.log (with redacted hostname) is attached; I'm happy to provide additional logs or test different kernels, but since our use case is i3-class instances in AWS, we need the nvme enhancements and enhanced network I/O provided by the linux-aws package.

Please, do not cut the logs. Without the "invoked oom-killer" line, for example, it's hard to see the gfp flags and allocation order that failed.

Pete Cheslock (pete-cheslock) wrote :

I have seemingly solved this issue with linux-aws version 4.4.0-1016-aws, at the very least. The specific issue I was seeing was 2nd-order allocations failing when the OOM killer triggered. At the time I thought the issue was due to XFS and memory fragmentation with lots and lots of memory-mapped files in Elasticsearch/Lucene. When we moved to ext4 the rate of OOM-killer firing dropped, but did not stop. We made the following two sysctl changes, which have effectively stopped higher-order memory allocations from failing and the OOM killer from firing.

Also, these settings were used on i3.2xlarge hosts that have 60G of RAM; your mileage may vary. We do not run swap on our servers, so adding swap would likely have helped, but it's not an option for us.

vm.min_free_kbytes = 1000000 # Leaves about 1G of RAM available to the kernel, in the hope that even with heavily fragmented memory there is still enough for a higher-order allocation to succeed before the OOM killer steps in.

vm.zone_reclaim_mode = 1 # Our hope here was to make the kernel more aggressive about reclaiming memory.
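Persisted, Pete's two settings would look something like this (the file name is an example; the values come from the comment above and were tuned for 60 GB i3.2xlarge hosts without swap, so adjust for your own hardware):

```
# /etc/sysctl.d/60-oom-workaround.conf  (hypothetical file name)
# Keep ~1G free so higher-order allocations can still succeed
# under memory fragmentation.
vm.min_free_kbytes = 1000000
# Reclaim memory within a NUMA zone more aggressively.
vm.zone_reclaim_mode = 1
```

Apply without a reboot with `sudo sysctl --system`.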

Jake Billo (ev98) wrote :

Apologies - the file was inadvertently split by logrotate. I have concatenated the entire contents of kern.log and kern.log.1 into the attached file; these are the only kern.log files in /var/log on the system.

I do have to redact the hostname in question, but it is a simple substitution of 'localhost' for the FQDN of the system.

Pete Cheslock (pete-cheslock) wrote :

> kthreadd invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0

Yea - that 2nd order allocation failure is the exact same issue I was able to see (same GFP mask also)

sirswa (sirswa) wrote :

We have another case of OOM on one of the hosts that we upgraded to kernel 4.4.0-89 a week ago.

kern.log attached.

sirswa (sirswa) wrote :

Attaching dmesg output

Jake Billo (ev98) wrote :

With the sysctl settings provided by Pete (vm.min_free_kbytes = 1000000 and vm.zone_reclaim_mode = 1), we've been running the linux-aws 4.4.0-1028.37 kernel successfully without an OOM killer invocation for about four days now. Previously we would have seen three or more occurrences of this per day, so it's a positive indication.

Willem (wdekker) wrote :

We have found this issue on 4.4.0-92 too.
But only when the systems were put under stress.
Reverting back to 4.4.0-57 resolved it.

Willem (wdekker) wrote :

Attached kern.log

Paolo Pisati (p-pisati) on 2017-09-11
Changed in linux-raspi2 (Ubuntu):
status: Confirmed → Fix Committed
Changed in linux-raspi2 (Ubuntu Xenial):
status: Confirmed → Fix Committed
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-aws (Ubuntu Xenial):
status: New → Confirmed
Changed in linux-aws (Ubuntu):
status: New → Confirmed
Vladimir Nicolici (vnicolici) wrote :

Not sure if it's the same issue, but we had an unexpected OOM with Ubuntu 16.04.3 LTS, 4.4.0-91.

Oct 31 23:52:25 db3 kernel: [6569272.882023] psql invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0

...

Oct 31 23:52:25 db3 kernel: [6569272.882154] Mem-Info:
Oct 31 23:52:25 db3 kernel: [6569272.882165] active_anon:38011018 inactive_anon:1422084 isolated_anon:0
Oct 31 23:52:25 db3 kernel: [6569272.882165] active_file:11699125 inactive_file:11727535 isolated_file:0
Oct 31 23:52:25 db3 kernel: [6569272.882165] unevictable:0 dirty:88019 writeback:2902991 unstable:23308
Oct 31 23:52:25 db3 kernel: [6569272.882165] slab_reclaimable:1455159 slab_unreclaimable:533985
Oct 31 23:52:25 db3 kernel: [6569272.882165] mapped:38499394 shmem:38495946 pagetables:33687177 bounce:0
Oct 31 23:52:25 db3 kernel: [6569272.882165] free:212612 free_pcp:0 free_cma:0
Oct 31 23:52:25 db3 kernel: [6569272.882172] Node 0 DMA free:13256kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15976kB managed:15892kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Oct 31 23:52:25 db3 kernel: [6569272.882182] lowmem_reserve[]: 0 1882 193368 193368 193368
Oct 31 23:52:25 db3 kernel: [6569272.882188] Node 0 DMA32 free:768204kB min:316kB low:392kB high:472kB active_anon:8kB inactive_anon:32kB active_file:20kB inactive_file:48kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2045556kB managed:1964868kB mlocked:0kB dirty:0kB writeback:44kB mapped:16kB shmem:12kB slab_reclaimable:729192kB slab_unreclaimable:35928kB kernel_stack:1920kB pagetables:415552kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 31 23:52:25 db3 kernel: [6569272.882196] lowmem_reserve[]: 0 0 191486 191486 191486
Oct 31 23:52:25 db3 kernel: [6569272.882201] Node 0 Normal free:34260kB min:32432kB low:40540kB high:48648kB active_anon:58162056kB inactive_anon:2546400kB active_file:18254204kB inactive_file:18282192kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:199229440kB managed:196081724kB mlocked:0kB dirty:152124kB writeback:4685924kB mapped:58223800kB shmem:58229824kB slab_reclaimable:2362116kB slab_unreclaimable:1123984kB kernel_stack:11056kB pagetables:94580096kB unstable:22108kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 31 23:52:25 db3 kernel: [6569272.882210] lowmem_reserve[]: 0 0 0 0 0
Oct 31 23:52:25 db3 kernel: [6569272.882215] Node 1 Normal free:34728kB min:32780kB low:40972kB high:49168kB active_anon:93882008kB inactive_anon:3141904kB active_file:28542276kB inactive_file:28627900kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:201326592kB managed:198178644kB mlocked:0kB dirty:199952kB writeback:6925996kB mapped:95773760kB shmem:95753948kB slab_reclaimable:2729328kB slab_unreclaimable:976028kB...

(remainder of the log truncated)

William DeLuca (qops1981) wrote :

We believe we are experiencing this issue on kernel 4.4.0-1030-aws as well. We recently moved from 14.04 LTS to 16.04 LTS and are now experiencing OOM kills.

William DeLuca (qops1981) wrote :

Side question: is there something I can specifically look for on an Ubuntu install that would indicate whether a kernel has the fix? I assume the "Fix Committed"/"Fix Released" statuses are set manually, so the fix could already be out for AWS without being indicated here.
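As the bug description notes, one way to check is to look for the two bug numbers in the installed kernel's changelog: a kernel is affected if it mentions the fix for 1647400 but not 1655842. A rough sketch; the changelog path below is the stock Ubuntu package layout and may differ for some flavours:

```shell
# Search the running kernel's changelog for both bug numbers.
# Affected = 1647400 present, 1655842 absent.
zgrep -E '1647400|1655842' \
  "/usr/share/doc/linux-image-$(uname -r)/changelog.Debian.gz"
```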

description: updated
Erik Hess (p-we-x) wrote :

In our production environment of ~1800 nodes we've seen oom-kill events that looked similar to this bug's pattern: oom-kills killing large server processes while their resident memory was far lower than the available physical memory.

We were affected by the original bug and saw that issue readily addressed in newer kernel versions, as mentioned in earlier comments on this ticket. However, we still kept seeing oom-kill events, albeit in far lower numbers over time, on kernel-upgraded systems. These were a mystery for a while, largely due to their infrequent occurrence.

After a lot of research we think we've pinned it down to a subset of our multi-socket servers that have more than one NUMA memory pool. After implementing some scripts to track NUMA stats, we've observed that one of the two NUMA pools is being fully utilized while the other has large amounts of memory to spare (often 90-95% free). Either our server app, the JVM it's running on, or the kernel itself isn't handling the NUMA memory pooling well, and we end up exhausting an entire NUMA pool.

Work is ongoing to trace the causality chain that leads to this. We don't yet have confirmation whether it's something our app (or its libraries) is doing, whether we just need to make the JVM NUMA-aware with arguments, or whether there's kernel tuning to be done. But I did want to mention it here as a warning to folks running multi-socket systems with multiple NUMA pools who are seeing similar behavior.
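For anyone wanting to check for the same imbalance, per-node memory stats can be read with numastat (from the numactl package) or directly from sysfs. A rough sketch; an exhausted node shows MemFree near zero while the other node still has plenty:

```shell
# Per-node memory summary (requires the numactl package):
numastat -m | grep -E 'MemFree|MemUsed'

# Same data from sysfs, no extra packages needed:
grep -E 'MemTotal|MemFree' /sys/devices/system/node/node*/meminfo
```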

