"Out of memory" errors after upgrade to 4.4.0-59

Bug #1655842 reported by Mike Williams on 2017-01-12
This bug affects 96 people
Affects                       Importance  Assigned to
linux (Ubuntu)                High        Thadeu Lima de Souza Cascardo
linux (Ubuntu Xenial)         High        Thadeu Lima de Souza Cascardo
linux-aws (Ubuntu)            Undecided   Unassigned
linux-aws (Ubuntu Xenial)     Undecided   Unassigned
linux-raspi2 (Ubuntu)         Undecided   Paolo Pisati
linux-raspi2 (Ubuntu Xenial)  Undecided   Unassigned

Bug Description

After a fix for LP #1647400, a bug that caused freezes under some workloads, some users noticed regular OOMs. Those OOMs were reported under this bug and fixed over the following releases.

Some of the affected kernels are documented below. To check your particular kernel, read its changelog and look for 1655842 and 1647400. If it has the fix for 1647400 but not the fix for 1655842, it is affected.

You may still notice regressions compared to kernels that had neither fix. However, reverting all the fixes would bring the freeze bug back, so that is not a viable solution going forward.

If you see any regressions, mainly in the form of OOMs, please report a new bug. Different workloads may require different solutions, and further fixes may be needed, whether upstream or not. The best way to get such fixes applied is to report the problem in a new bug that can be verified; being able to reproduce the bug makes it possible to verify that the fixes really fix it.
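The changelog check described above can be scripted. This is only a sketch (the helper name is made up), keyed on whether the changelog mentions the 1647400 fix without the 1655842 fix:

```shell
# Hypothetical helper: reads a kernel changelog on stdin and reports
# whether that kernel is affected by this bug (i.e. it has the fix for
# LP #1647400 but not the fix for LP #1655842).
is_affected_by_lp1655842() {
    log=$(cat)
    if printf '%s' "$log" | grep -q 1647400 &&
       ! printf '%s' "$log" | grep -q 1655842; then
        echo "affected"
    else
        echo "not affected"
    fi
}

# Example on a real system (needs network access):
#   apt-get changelog "linux-image-$(uname -r)" | is_affected_by_lp1655842
```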

Kernels affected:

linux 4.4.0-58, 4.4.0-59, 4.4.0-60, 4.4.0-61, 4.4.0-62.
linux-raspi2 4.4.0-1039 to 4.4.0-1042 and 4.4.0-1044 to 4.4.0-1071

Particular kernels NOT affected by THIS bug:

linux-aws

To reiterate, if you find an OOM with an affected kernel, please upgrade.
If you find an OOM with a non-affected kernel, please report a new bug. We want to investigate it and fix it.

===================
I recently replaced some Xenial servers, and started experiencing "Out of memory" problems with the default kernel.

We bake Amazon AMIs based on an official Ubuntu-provided image (ami-e6b58e85, in ap-southeast-2, from https://cloud-images.ubuntu.com/locator/ec2/). Previous versions of our AMI included "4.4.0-57-generic", but the latest version picked up "4.4.0-59-generic" as part of a "dist-upgrade".

Instances booted using the new AMI have been using more memory, and experiencing OOM issues - sometimes during boot, and sometimes a while afterwards. An example from the system log is:

[ 130.113411] cloud-init[1560]: Cloud-init v. 0.7.8 running 'modules:final' at Wed, 11 Jan 2017 22:07:53 +0000. Up 29.28 seconds.
[ 130.124219] cloud-init[1560]: Cloud-init v. 0.7.8 finished at Wed, 11 Jan 2017 22:09:35 +0000. Datasource DataSourceEc2. Up 130.09 seconds
[29871.137128] Out of memory: Kill process 2920 (ruby) score 107 or sacrifice child
[29871.140816] Killed process 2920 (ruby) total-vm:675048kB, anon-rss:51184kB, file-rss:2164kB
[29871.449209] Out of memory: Kill process 3257 (splunkd) score 97 or sacrifice child
[29871.453282] Killed process 3258 (splunkd) total-vm:66272kB, anon-rss:6676kB, file-rss:0kB
[29871.677910] Out of memory: Kill process 2647 (fluentd) score 51 or sacrifice child
[29871.681872] Killed process 2647 (fluentd) total-vm:117944kB, anon-rss:23956kB, file-rss:1356kB

I have a hunch that this may be related to the fix for https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1647400, introduced in linux (4.4.0-58.79).

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-59-generic 4.4.0-59.80
ProcVersionSignature: User Name 4.4.0-59.80-generic 4.4.35
Uname: Linux 4.4.0-59-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jan 12 06:29 seq
 crw-rw---- 1 root audio 116, 33 Jan 12 06:29 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.1-0ubuntu2.4
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Thu Jan 12 06:38:45 2017
Ec2AMI: ami-0f93966c
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: ap-southeast-2a
Ec2InstanceType: t2.nano
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb: Error: command ['lsusb'] failed with exit code 1:
MachineType: Xen HVM domU
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 cirrusdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-59-generic root=UUID=fb0fef08-f3c5-40bf-9776-f7ba00fe72be ro console=tty1 console=ttyS0
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-59-generic N/A
 linux-backports-modules-4.4.0-59-generic N/A
 linux-firmware 1.157.6
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 12/09/2016
dmi.bios.vendor: Xen
dmi.bios.version: 4.2.amazon
dmi.chassis.type: 1
dmi.chassis.vendor: Xen
dmi.modalias: dmi:bvnXen:bvr4.2.amazon:bd12/09/2016:svnXen:pnHVMdomU:pvr4.2.amazon:cvnXen:ct1:cvr:
dmi.product.name: HVM domU
dmi.product.version: 4.2.amazon
dmi.sys.vendor: Xen

Mike Williams (mdub) wrote :

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → High
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Joseph Salisbury (jsalisbury) wrote :

I built a Xenial test kernel with the following two commits reverted:

c630ec12d831 mm, oom: rework oom detection
57e9ef475661 mm: throttle on IO only when there are too many dirty and writeback pages

This test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1655842/

Can you test this kernel and see if it resolves this bug?

Thanks in advance!

You could also try cherry-picking https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=6b4e3181d7bd5ca5ab6f45929e4a5ffa7ab4ab7f , but that will probably need some more in-between patches as well.

Reverting the two commits fixed the issue for our users (Proxmox VE, which uses a kernel based on the 4.4.x one from 16.04).

David F. (malteworld) wrote :

@f-gruenbichler: I already tried to cherry-pick that patch a while ago and it doesn't work because that patch is based on work that isn't in the 4.4.* kernel branch, not even including Canonical's backports from later branches.

Mike Williams (mdub) wrote :

Thanks jsalisbury. We have deployed using your test kernel (from http://kernel.ubuntu.com/~jsalisbury/lp1655842/), and experienced no OOM issues.

Allen Wild (aswild) wrote :

I manage a set of build servers for CPU/IO intensive builds using Yocto/OpenEmbedded. Ubuntu 14.04.5 with the 4.4 Xenial kernel. After updating to 4.4.0-59 the builds started failing because of the OOM killer.

Rolling back to 4.4.0-57 fixed the OOMs for me.

Can you try the kernel at [1]? It includes the patches, which are also posted at [1].

[1] http://people.canonical.com/~cascardo/lp1655842/

Thanks.
Cascardo.

Changed in linux (Ubuntu):
assignee: nobody → Thadeu Lima de Souza Cascardo (cascardo)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Thadeu Lima de Souza Cascardo (cascardo)
Stéphane Graber (stgraber) wrote :

Just a note that Joe's armhf kernel has been working well for me.

I can't test cascardo's kernel as it's not built for armhf.

I will upload armhf binaries for those kernels and let you know. It's important to try those because they include an alternative solution that we would rather use instead of the one with the reverted patches.

Danny B (danny.b) wrote :

Using Cascardo's kernel fixes the problem for me.

It was a bit of a hassle to install though because there's no linux-headers-4.4.0-62_4.4.0-62.83_all.deb at the link and linux-headers-generic depends on it.

Here's where to find it:
amd64: https://launchpad.net/ubuntu/xenial/amd64/linux-headers-4.4.0-62/4.4.0-62.83
armhf: https://launchpad.net/ubuntu/xenial/armhf/linux-headers-4.4.0-62/4.4.0-62.83

Ben French (octoamit) on 2017-01-21
Changed in linux (Ubuntu):
status: Triaged → In Progress
Stéphane Graber (stgraber) wrote :

I've had a few armhf systems running cascardo's kernel and so far no sign of the OOM or any other problem with it.

Mike Williams (mdub) wrote :

Cascardo: we've tried your test kernel, and it looks good - we've seen no OOM problems.

Cris (cristianpeguero25) wrote :

Hi, I'd like to install Cascardo's kernel since I've been having the same issue, though strangely not on all of the Xenial machines running 4.4.0-59-generic.
Could someone tell me how to install Cascardo's kernel without completely messing up my machine?

Thanks

xb5i7o (xb5i7o) wrote :

Hi, I am having the exact same issue on a PC with 18 GB of RAM! Kernel 4.4.0-59-generic.

Please can this be fixed as soon as possible in the next kernel update.

It's killing processes such as Firefox and VirtualBox for no good reason while only 4 GB is really in use.

Hope this can be fixed soon; it's getting worse as time passes.

Eric Desrochers (slashd) wrote :

The patchset[1] for bug LP #1655842 was submitted on Jan 24th 2017 and acked by the kernel team on the same day[2].

The patch should be part of the following kernel release cycle:

cycle: 27-Jan through 18-Feb[3]
====
27-Jan Last day for kernel commits for this cycle
30-Jan - 04-Feb Kernel prep week.
05-Feb - 17-Feb Bug verification & Regression testing.
20-Feb Release to -updates.
====

[1] - "Xenial PATCH 00/11] Fixes OOM for LP #1655842"
[2] - "ACK: [Xenial PATCH 00/11] Fixes OOM for LP #1655842"
[3] - https://wiki.ubuntu.com/KernelTeam/Newsletter

- Eric

Changed in linux (Ubuntu Xenial):
status: Triaged → In Progress
Eric Desrochers (slashd) wrote :

Additional note:

Applied in master-next on Jan 26th 2017[1]

[1] - "APPLIED: [Xenial PATCH 00/11] Fixes OOM for LP #1655842"

- Eric

Eric Desrochers (slashd) on 2017-01-27
tags: added: sts

@slashd It sounds really strange to me that I should wait until 20-Feb for a fix for this bug when it is clearly a regression introduced with the latest kernel upgrade. Is there no way to speed things up to fix this regression?

Currently we had to downgrade all our xenial systems to linux-image-4.4.0-57-generic to avoid this bug.

Gaudenz

Eric Desrochers (slashd) wrote :

@Gaudenz Steinlin (gaudenz-debian),

It will take 3 weeks to land in the -updates pocket, but you can expect a call for testing of a proposed package by EOW.

- Eric

Tim Gardner (timg-tpi) on 2017-01-31
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu):
status: In Progress → Fix Released

This is a severe bug. It should be treated as a high-priority bugfix that cannot wait 3 weeks.

Just as a note for newcomers reading this, I can confirm the bug is NOT fixed in the officially released 4.4.0-62.83.

Krzysztof Dryja (cih997) wrote :

I could not reboot my machine, and the ugly workaround for this issue was to log in as root and clear the system caches:

echo 3 > /proc/sys/vm/drop_caches

This made my machine stable again, at least for the time I needed.

This is fixed in 4.4.0-63.84, which will be available in -proposed soon.

Shelby Cain (alyandon) wrote :

@nate Thank you! You just saved me a lot of hassle as I was about to unpin the 4.4.0-57 kernel and update a bunch of machines on the assumption the fix was in that version.

Sebastian Unger (sebunger44) wrote :

As a note: I believe this also affects the armhf kernel 4.4.0-1040-raspi2 for the Raspberry Pi.

David Glasser (glasser) wrote :

I've been struggling with this bug for nearly a week and only now found this issue. Thanks for fixing it!

For the sake of others finding it, here's the stack trace part of the oom-killer log, which contains some terms I searched for a while ago that aren't mentioned here yet.

docker invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=-1000
docker cpuset=/ mems_allowed=0
CPU: 11 PID: 4472 Comm: docker Tainted: G W 4.4.0-62-generic #83-Ubuntu
Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
 0000000000000286 0000000057f64c94 ffff880dfb5efaf0 ffffffff813f7c63
 ffff880dfb5efcc8 ffff880fbfda0000 ffff880dfb5efb60 ffffffff8120ad4e
 ffffffff81cd2d7f 0000000000000000 ffffffff81e67760 0000000000000206
Call Trace:
 [<ffffffff813f7c63>] dump_stack+0x63/0x90
 [<ffffffff8120ad4e>] dump_header+0x5a/0x1c5
 [<ffffffff811926c2>] oom_kill_process+0x202/0x3c0
 [<ffffffff81192ae9>] out_of_memory+0x219/0x460
 [<ffffffff81198a5d>] __alloc_pages_slowpath.constprop.88+0x8fd/0xa70
 [<ffffffff81198e56>] __alloc_pages_nodemask+0x286/0x2a0
 [<ffffffff81198f0b>] alloc_kmem_pages_node+0x4b/0xc0
 [<ffffffff8107ea5e>] copy_process+0x1be/0x1b70
 [<ffffffff8139225c>] ? apparmor_file_alloc_security+0x5c/0x220
 [<ffffffff811ed04a>] ? kmem_cache_alloc+0x1ca/0x1f0
 [<ffffffff81348263>] ? security_file_alloc+0x33/0x50
 [<ffffffff810caeb1>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
 [<ffffffff810805a0>] _do_fork+0x80/0x360
 [<ffffffff81080929>] SyS_clone+0x19/0x20
 [<ffffffff818385f2>] entry_SYSCALL_64_fastpath+0x16/0x71

Hajo Locke (hajo-locke) wrote :

When will this new kernel be released? This bug is killing our MySQL servers. Booting old kernels is only a bad workaround. I think a lot of people with busy servers will have a problem.

This is the 2nd time we have been hit by a big bug within a short time. In Oct 2016 our nameservers had problems because of bug 1634892.
Is LTS Ubuntu still the right system for servers?

This bug also appears to affect linux-image-4.8.0-34-generic in 16.04.1 Xenial.

Hi, Luk.

linux-image-4.8.0-34-generic should not be affected by this. If you see unexpected OOM problems, please open a new bug report and attach the kernel logs.

Thanks.
Cascardo.

xb5i7o (xb5i7o) wrote :

Just by the way: 4.4.0-62-generic has the exact same problem. Even after uninstalling 4.4.0-59-generic, my system at some point auto-updated to 4.4.0-62-generic. Only 4.4.0-57-generic is safe for now.

Nick Maynard (nick-maynard) wrote :

LTS Ubuntu with -updates shouldn't have this sort of issue - this is, frankly, unforgivable.

We need a new kernel urgently in -updates, and I'd expect serious discussions within the kernel team to understand what caused this issue and avoid it recurring.

Anton Piatek (anton-piatek) wrote :

If this kernel is not going to hit -updates shortly (i.e. in days), can something be done to pull or downgrade the broken kernel? At least revert linux-image-generic to depend on linux-image-4.4.0-57-generic, which doesn't have the issues; that would stop more people from upgrading to a broken kernel.

Having this sort of break in an LTS kernel is not inspiring at all.

Eric Desrochers (slashd) wrote :

The fix is now available for testing in kernel version 4.4.0-63.84, if you enable proposed[1]

$ apt-cache policy linux-image-4.4.0-63-generic
linux-image-4.4.0-63-generic:
  Installed: (none)
  ==> Candidate: 4.4.0-63.84
  Version table:
     4.4.0-63.84 500
        500 http://archive.ubuntu.com/ubuntu ==>xenial-proposed/main amd64 Packages

$ apt-get changelog linux-image-4.4.0-63-generic | egrep "1655842"
 ==> * "Out of memory" errors after upgrade to 4.4.0-59 (LP: #1655842)

[1] - https://wiki.ubuntu.com/Testing/EnableProposed

- Eric

Oliver O. (oliver-o456i) wrote :

Testing...

Enabled proposed (https://wiki.ubuntu.com/Testing/EnableProposed).

Installed kernel packages:

# apt-get install -s -t xenial-proposed 'linux-headers-4.4.0.63$' 'linux-headers-4.4.0.63-generic$' 'linux-image-4.4.0.63-generic$' 'linux-image-extra-4.4.0.63-generic$'

Rebooted.

# cat /proc/version_signature
Ubuntu 4.4.0-63.84-generic 4.4.44

Which is the safer kernel now, 4.4.0-57-generic or 4.4.0-63-generic?

David Glasser (glasser) wrote :

kulwinder singh: Either one, but nothing in between.

-57 will reintroduce a few (unrelated) security bugs as well as the bug https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1647400 whose fix caused this one, but is easier to enable and has been tested for longer.

-63 should fix this bug, the older bug, and the intermediary security bugs, but requires you to enable the "proposed" repository, and hasn't been tested for quite as long.

Anything in between has this bug.
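The version guidance above for the generic kernel can be captured in a tiny check. This is a sketch only; the affected ABI range (58 through 62) comes from this bug's description, and the helper name is made up:

```shell
# Hypothetical helper: is a Xenial kernel 4.4.0-N-generic affected by this
# bug? Per the description, ABIs 58 through 62 are affected; -57 predates
# the 1647400 fix and -63 carries the fix for this bug.
affected_by_lp1655842() {
    abi=$1
    [ "$abi" -ge 58 ] && [ "$abi" -le 62 ]
}

# Example: check the running kernel (assumes a 4.4.0-N-generic version):
#   abi=$(uname -r | sed -n 's/^4\.4\.0-\([0-9]*\)-generic$/\1/p')
#   affected_by_lp1655842 "$abi" && echo "affected; please upgrade"
```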

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
David Glasser (glasser) wrote :

Cascardo: Just to be clear, are you looking for verification from anyone in the world, or from specific kernel testers?

(I'd like to help, but I'm only able to reproduce the issue in production, and debugging it when we ran into it already involved more restarts than is good for my service right now; we settled on downgrading for the moment.)

David F. (malteworld) wrote :

@nick-maynard: Why is such a bug unforgivable? You can just boot a previous kernel instead. If you're concerned about availability then don't reboot in the first place unless there's an important security patch.

Oliver O. (oliver-o456i) on 2017-02-11
tags: added: verification-done-xenial
removed: verification-needed-xenial
description: updated
Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released

Julian, your logs indicate some possible swap corruption. Would you mind opening a new bug and sending it using apport-bug?

Thanks.
Cascardo.

Julian Kassat (j.kassat) wrote :

Hi Cascardo,

There is no related dmesg output after the incident (just some lines from apt-daily.timer).

I filed a bug for the possible swap corruption issue here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1669707

Thanks so far.

Julian

Pete Cheslock (pete-cheslock) wrote :

We have been seeing this issue recently as well. We are running 4.4.0-66-generic #87-Ubuntu. I can attempt to downgrade to 4.4.0-57, but it's a large cluster with a lot of data, so it may take some time. Attached is a kern.log from the most recent OOM.

I am seeing this issue as well, on Arch kernel v 4.10.4-1

Michael Dye (dye.michael) wrote :

This is plaguing Horizon project Pi2 and Pi3 devices running Xenial 16.04.2 with kernel 4.4.0-1050-raspi2. From a Pi2:

root@horizon-00000000a17d2187:~# uname -a
Linux horizon-00000000a17d2187 4.4.0-1050-raspi2 #57-Ubuntu SMP Wed Mar 22 12:52:22 UTC 2017 armv7l armv7l armv7l GNU/Linux
root@horizon-00000000a17d2187:~# free
              total    used    free  shared  buff/cache  available
Mem:         942128  149548   35456  494084      757124     239716
Swap:             0       0       0

Under these circumstances, the kernel's oom-killer will kill WiFi processes (rtl_rpcd), systemd-udevd, our Ethereum client (geth), and other critical processes in an attempt to stay afloat, rather than using reclaimable RAM.

I was using 4.4.0-21 (as reported by `uname -r`), which is the default in Kubuntu 16.04. The same bug appears on mainline kernel 4.10 too!

Now I'm confused. Which kernel should I upgrade to? Also, I experience this only in a KDE session with the Yandex or Chrome browser open.

iKazmi (alikazmi-2040) wrote :

I have 4.4.0-59 till 4.4.0-71 and 4.8.0-41 till 4.8.0-46 installed on my system and all are affected by this bug. Firefox, Chrome and Netbeans regularly get killed without a warning and for no reason (since I have something like 10GB+ RAM and all 16GB Swap free at the time the process gets killed). Even KDE has been killed a couple of times while the system still had over 6GB RAM and 16GB Swap free.

Yesterday, after the umpteenth time Netbeans was killed while I was in the middle of doing something, I finally decided to do something about this problem and installed Kernel 4.10.9-041009 from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10.9/. Sadly, that doesn't seem to resolve the problem either and Oom Killer is still overeager to kill user processes (Firefox and Netbeans have both been killed multiple times). At least KDE hasn't been killed so far.

Has anybody successfully tested 4.4.0-63 for the OOM-kill issue?

Anton (anton-powershop) wrote :

Yes, 4.4.0.63 solved our OOM issues (and we had plenty after 4.4.0.59). Ours were all headless servers (bare metal and VMs) though; no desktop usage.

But I never experienced this issue with my home laptop either; it had lots of RAM and was only lightly used during that period, so not really a good data point.

Travisgevans (travisgevans) wrote :

I also haven't personally encountered any further OOM issues on my home desktop (used daily) with 4.4.0.63.

I'd like to emphasise that the OOM problem only happens with KDE. I have several DEs installed, including Unity, GNOME 3 and Cinnamon, but none of them caused an OOM, at least that I noticed. In KDE, however, most of the time when Chrome is open it triggers an OOM; dmesg shows that sometimes kwin_x11 or plasmashell invoked the OOM killer.

Most of the time plasmashell crashes and the open tab in Chrome is killed, though the Chrome application itself survives. I then need to restart plasmashell by pressing Alt-F2 to bring up the run-command dialog and typing plasmashell there.

Last night even Firefox triggered an OOM.

I'm attaching a dmesg log, hoping it will be helpful.

Sebastian Unger (sebunger44) wrote :

This is still an issue in the current linux-raspi2 version. Were those changes ported to that kernel?

Sebastian Unger (sebunger44) wrote :

linux-raspi2 version 4.4.0.1055.56 that is.

kimo (ubuntu-oldfield) wrote :

I'm seeing oom-killer being invoked despite having 2GB free swap when using the kernel from linux-image-4.4.0-1055-raspi2 version 4.4.0-1055.62.

kimo (ubuntu-oldfield) on 2017-05-25
Changed in linux-raspi2 (Ubuntu):
status: New → Confirmed
Changed in linux-raspi2 (Ubuntu Xenial):
status: New → Confirmed
Sebastian Unger (sebunger44) wrote :

Also observed with 4.4.0-1054-raspi2. I'm now back on 4.4.0-1038-raspi2. I think that one was ok.

Nick Hatch (nicholas-hatch) wrote :

We're still having issues with higher-order allocations failing and triggering an OOM kill for no explicable reason (on 4.4.0-78-generic).

I've attached the relevant OOM killer logs. It may be relevant to note that the server these logs are from is an Elasticsearch instance with a large (~32GB) mlock'ed heap.

Pete Cheslock (pete-cheslock) wrote :

@nicholas-hatch - what file system are your disks formatted with? I was able to stop the OOMs on my ES hosts by moving from XFS to ext4. My belief is that there was a memory fragmentation issue with ES and many small files on XFS-formatted volumes.

Chris (cmavr8) wrote :

The bug is still confirmed and not fixed for linux-raspi2 (Ubuntu), 5 months after being fixed for the main Ubuntu kernel.

Shouldn't this have some priority? Even apt upgrade breaks if I don't use the clear-cache workaround. I can live with it (a cron job to clear the cache), but this is not great for LTS.

Currently affected: Ubuntu 16.04.2 LTS, 4.4.0-1059-raspi2 #67-Ubuntu
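For anyone stuck on an affected raspi2 kernel, the clear-cache workaround mentioned above can be automated with a cron entry along these lines. This is a stopgap only, not a fix; the file path and hourly schedule are arbitrary examples:

```
# /etc/cron.d/drop-caches  (hypothetical path; workaround only, not a fix)
# Sync and drop the page cache hourly so memory-hungry tasks like
# apt upgrade keep working on affected kernels.
0 * * * * root sync && echo 3 > /proc/sys/vm/drop_caches
```

Dropping caches costs performance until the caches warm up again, so remove the entry once a fixed kernel is installed.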

Paolo Pisati (p-pisati) on 2017-06-30
Changed in linux-raspi2 (Ubuntu):
assignee: nobody → Paolo Pisati (p-pisati)
Chris (cmavr8) wrote :

Sure.
I undid the workaround, installed and booted the kernel and will test it for a few days. I'll keep you posted on results.

Thanks Paolo!

Chris (cmavr8) wrote :

Update: No sign of Out-of-memory errors or kills, after 3 days of testing the 4.4.0-1062-raspi2 kernel. I'll report back again next week.

kimo (ubuntu-oldfield) wrote :

4.4.0-1062-raspi2 is looking good - I've had it running for a week without oom-killer being invoked.

Chris (cmavr8) wrote :

Mine's also still stable (no OOMs), after running the patched kernel for 9 days, on a Raspberry pi 2 Model B v1.1.

sirswa (sirswa) wrote :

Hi

I am experiencing this on one of our compute-node hypervisors. The kernel version we are using is 4.4.0-83, but it seems to have the issue reported here.

[Mon Aug 7 00:19:42 2017] nova-compute invoked oom-killer: gfp_mask=0x2c200ca, order=0, oom_score_adj=0
[Mon Aug 7 00:19:42 2017] nova-compute cpuset=/ mems_allowed=0-1
[Mon Aug 7 00:19:42 2017] CPU: 7 PID: 2164484 Comm: nova-compute Tainted: G OE 4.4.0-83-generic #106-Ubuntu
[Mon Aug 7 00:19:42 2017] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017
[Mon Aug 7 00:19:42 2017] 0000000000000286 00000000d6004dce ffff88014e753a50 ffffffff813f9513
[Mon Aug 7 00:19:42 2017] ffff88014e753c08 ffff883fecf88e00 ffff88014e753ac0 ffffffff8120b53e
[Mon Aug 7 00:19:42 2017] 0000000000000015 0000000000000000 ffff881fe883b740 ffff883fe94f7000
[Mon Aug 7 00:19:42 2017] Call Trace:
[Mon Aug 7 00:19:42 2017] [<ffffffff813f9513>] dump_stack+0x63/0x90
[Mon Aug 7 00:19:42 2017] [<ffffffff81391c64>] ? apparmor_capable+0xc4/0x1b0
[Mon Aug 7 00:19:42 2017] [<ffffffff81192ae2>] oom_kill_process+0x202/0x3c0
[Mon Aug 7 00:19:42 2017] [<ffffffff81192f09>] out_of_memory+0x219/0x460
[Mon Aug 7 00:19:42 2017] [<ffffffff81198ef8>] __alloc_pages_slowpath.constprop.88+0x938/0xad0
[Mon Aug 7 00:19:42 2017] [<ffffffff81199316>] __alloc_pages_nodemask+0x286/0x2a0
[Mon Aug 7 00:19:42 2017] [<ffffffff811e467d>] alloc_pages_vma+0xad/0x250
[Mon Aug 7 00:19:42 2017] [<ffffffff811fad53>] do_huge_pmd_wp_page+0x153/0xb70
[Mon Aug 7 00:19:42 2017] [<ffffffff811c1a5f>] handle_mm_fault+0x90f/0x1820
[Mon Aug 7 00:19:42 2017] [<ffffffff8106b802>] ? do_page_fault+0x22/0x30
[Mon Aug 7 00:19:42 2017] [<ffffffff81842cf8>] ? page_fault+0x28/0x30
[Mon Aug 7 00:19:42 2017] [<ffffffff8106b577>] __do_page_fault+0x197/0x400
[Mon Aug 7 00:19:42 2017] [<ffffffff8106b802>] do_page_fault+0x22/0x30
[Mon Aug 7 00:19:42 2017] [<ffffffff81842cf8>] page_fault+0x28/0x30
[Mon Aug 7 00:19:42 2017] Mem-Info:
[Mon Aug 7 00:19:42 2017] active_anon:61350709 inactive_anon:2118817 isolated_anon:0
                            active_file:0 inactive_file:0 isolated_file:32
                            unevictable:915 dirty:0 writeback:8 unstable:0
                            slab_reclaimable:14082 slab_unreclaimable:64456
                            mapped:3492 shmem:329012 pagetables:142167 bounce:0
                            free:260204 free_pcp:4111 free_cma:0

[Tue Aug 8 05:50:08 2017] apt-check invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
[Tue Aug 8 05:50:08 2017] apt-check cpuset=/ mems_allowed=0-1
[Tue Aug 8 05:50:08 2017] CPU: 11 PID: 2538289 Comm: apt-check Tainted: G OE 4.4.0-83-generic #106-Ubuntu
[Tue Aug 8 05:50:08 2017] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017
[Tue Aug 8 05:50:08 2017] 0000000000000286 000000005e467cc9 ffff8820b44a39f8 ffffffff813f9513
[Tue Aug 8 05:50:08 2017] ffff8820b44a3bb0 ffff881fec15b800 ffff8820b44a3a68 ffffffff8120b53e
[Tue Aug 8 05:50:08 2017] 0000000000000015 ffffffff81e42ac0 ffff883fe996f980 ffffffffffffff04
[Tue Aug 8 05:50:08 2017] Call Trace:
[Tue Aug 8 05:50:08 2017] [<ff...


Jake Billo (ev98) wrote :

We are also experiencing this issue running linux-aws 4.4.0-1028.37, which tracks Ubuntu kernel 4.4.0-89.112. Our use case is very similar to comment #86 (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1655842/comments/86). In our case Elasticsearch 2.4.5 is running under Java 1.8.0_131 with a ~29GB heap; we downsized from 31GB as a troubleshooting effort, with no change to the frequency of OOMs. The issue also occurs regardless of vm.overcommit_memory being set to 0, 1 or 2.

The relevant data from kern.log (with redacted hostname) is attached; I'm happy to provide additional logs or test different kernels, but since our use case is i3-class instances in AWS, we need the nvme enhancements and enhanced network I/O provided by the linux-aws package.

Please, do not cut the logs. Without the "invoked oom-killer" line, for example, it's hard to see the gfp flags and allocation order that failed.

Pete Cheslock (pete-cheslock) wrote :

I have seemingly solved this issue with linux-aws version 4.4.0-1016-aws, at the very least. The specific issue I was seeing was 2nd-order allocations failing when the OOM killer triggered. At the time I thought the issue was due to XFS and memory fragmentation with lots and lots of memory-mapped files in Elasticsearch/Lucene. When we moved to ext4 the rate of OOM-killer firing dropped, but did not stop. We made the following two sysctl changes, which have effectively stopped higher-order memory allocations from failing and the OOM killer from firing.

Also, these settings were used on i3.2xlarge hosts that have 60G of RAM; your mileage may vary. We do not run swap on our servers, so adding swap would likely have helped, but it's not an option for us.

vm.min_free_kbytes = 1000000 # Leaves about 1G of RAM available to the kernel, in the hope that even with heavily fragmented memory there is still enough for a higher-order allocation to succeed before the OOM killer steps in.

vm.zone_reclaim_mode = 1 # Our hope here was to make the kernel more aggressive about reclaiming memory.
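Persisted, Pete's two settings would look something like this (the file name is an example; the values come from the comment above and were tuned for 60 GB i3.2xlarge hosts without swap, so adjust for your own hardware):

```
# /etc/sysctl.d/60-oom-workaround.conf  (hypothetical file name)
# Keep ~1G free so higher-order allocations can still succeed
# under memory fragmentation.
vm.min_free_kbytes = 1000000
# Reclaim memory within a NUMA zone more aggressively.
vm.zone_reclaim_mode = 1
```

Apply without a reboot with `sudo sysctl --system`.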

Jake Billo (ev98) wrote :

Apologies - the file was inadvertently split by logrotate. I have concatenated the entire contents of kern.log and kern.log.1 into the attached file; these are the only kern.log files in /var/log on the system.

I do have to redact the hostname in question, but it is a simple substitution of 'localhost' for the FQDN of the system.

Pete Cheslock (pete-cheslock) wrote :

> kthreadd invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0

Yea - that 2nd order allocation failure is the exact same issue I was able to see (same GFP mask also)

sirswa (sirswa) wrote :

We have another case of OOM on one of the hosts that we upgraded to kernel 4.4.0-89 a week ago.

kern.log attached.

sirswa (sirswa) wrote :

Attaching dmesg output

Jake Billo (ev98) wrote :

With the sysctl settings provided by Pete (vm.min_free_kbytes = 1000000 and vm.zone_reclaim_mode = 1), we've been running the linux-aws 4.4.0-1028.37 kernel successfully without an OOM killer invocation for about four days now. Previously we would have seen three or more occurrences of this per day, so it's a positive indication.

Willem (wdekker) wrote :

We have found this issue on 4.4.0-92 too.
But only when the systems were put under stress.
Reverting back to 4.4.0-57 resolved it.

Willem (wdekker) wrote :

Attached kern.log

Paolo Pisati (p-pisati) on 2017-09-11
Changed in linux-raspi2 (Ubuntu):
status: Confirmed → Fix Committed
Changed in linux-raspi2 (Ubuntu Xenial):
status: Confirmed → Fix Committed
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-aws (Ubuntu Xenial):
status: New → Confirmed
Changed in linux-aws (Ubuntu):
status: New → Confirmed
Vladimir Nicolici (vnicolici) wrote :

Not sure if it's the same issue, but we had an unexpected OOM with Ubuntu 16.04.3 LTS, 4.4.0-91.

Oct 31 23:52:25 db3 kernel: [6569272.882023] psql invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0

...

Oct 31 23:52:25 db3 kernel: [6569272.882154] Mem-Info:
Oct 31 23:52:25 db3 kernel: [6569272.882165] active_anon:38011018 inactive_anon:1422084 isolated_anon:0
Oct 31 23:52:25 db3 kernel: [6569272.882165] active_file:11699125 inactive_file:11727535 isolated_file:0
Oct 31 23:52:25 db3 kernel: [6569272.882165] unevictable:0 dirty:88019 writeback:2902991 unstable:23308
Oct 31 23:52:25 db3 kernel: [6569272.882165] slab_reclaimable:1455159 slab_unreclaimable:533985
Oct 31 23:52:25 db3 kernel: [6569272.882165] mapped:38499394 shmem:38495946 pagetables:33687177 bounce:0
Oct 31 23:52:25 db3 kernel: [6569272.882165] free:212612 free_pcp:0 free_cma:0
Oct 31 23:52:25 db3 kernel: [6569272.882172] Node 0 DMA free:13256kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15976kB managed:15892kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Oct 31 23:52:25 db3 kernel: [6569272.882182] lowmem_reserve[]: 0 1882 193368 193368 193368
Oct 31 23:52:25 db3 kernel: [6569272.882188] Node 0 DMA32 free:768204kB min:316kB low:392kB high:472kB active_anon:8kB inactive_anon:32kB active_file:20kB inactive_file:48kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2045556kB managed:1964868kB mlocked:0kB dirty:0kB writeback:44kB mapped:16kB shmem:12kB slab_reclaimable:729192kB slab_unreclaimable:35928kB kernel_stack:1920kB pagetables:415552kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 31 23:52:25 db3 kernel: [6569272.882196] lowmem_reserve[]: 0 0 191486 191486 191486
Oct 31 23:52:25 db3 kernel: [6569272.882201] Node 0 Normal free:34260kB min:32432kB low:40540kB high:48648kB active_anon:58162056kB inactive_anon:2546400kB active_file:18254204kB inactive_file:18282192kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:199229440kB managed:196081724kB mlocked:0kB dirty:152124kB writeback:4685924kB mapped:58223800kB shmem:58229824kB slab_reclaimable:2362116kB slab_unreclaimable:1123984kB kernel_stack:11056kB pagetables:94580096kB unstable:22108kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 31 23:52:25 db3 kernel: [6569272.882210] lowmem_reserve[]: 0 0 0 0 0
Oct 31 23:52:25 db3 kernel: [6569272.882215] Node 1 Normal free:34728kB min:32780kB low:40972kB high:49168kB active_anon:93882008kB inactive_anon:3141904kB active_file:28542276kB inactive_file:28627900kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:201326592kB managed:198178644kB mlocked:0kB dirty:199952kB writeback:6925996kB mapped:95773760kB shmem:95753948kB slab_reclaimable:2729328kB slab_unreclaimable:976028kB...

(remainder of the log truncated)

William DeLuca (qops1981) wrote :

We believe we are experiencing this issue on kernel 4.4.0-1030-aws as well. We recently moved from 14.04 LTS to 16.04 LTS and are now experiencing OOM kills.

William DeLuca (qops1981) wrote :

Side question: is there something I can specifically look for on an Ubuntu install that would indicate whether a kernel has the fix? I assume the "Fix Committed"/"Fix Released" statuses are set manually, so the fix could already be out for AWS without being indicated here.
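As the bug description notes, one way to check is to look for the two bug numbers in the installed kernel's changelog: a kernel is affected if it mentions the fix for 1647400 but not 1655842. A rough sketch; the changelog path below is the stock Ubuntu package layout and may differ for some flavours:

```shell
# Search the running kernel's changelog for both bug numbers.
# Affected = 1647400 present, 1655842 absent.
zgrep -E '1647400|1655842' \
  "/usr/share/doc/linux-image-$(uname -r)/changelog.Debian.gz"
```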

description: updated
Erik Hess (p-we-x) wrote :

In our production environment of ~1800 nodes we've seen oom-kill events that looked similar to this bug's pattern: oom-kills killing large server processes while their resident memory was far lower than the available physical memory.

We were affected by the original bug and saw that issue readily addressed in newer kernel versions, as mentioned in earlier comments on this ticket. However, we still kept seeing oom-kill events, albeit in far lower numbers over time, on kernel-upgraded systems. These were a mystery for a while, largely due to their infrequent occurrence.

After a lot of research we think we've pinned it down to a subset of our multi-socket servers that have more than one NUMA memory pool. After implementing some scripts to track NUMA stats, we've observed that one of the two NUMA pools is being fully utilized while the other has large amounts of memory to spare (often 90-95% free). Either our server app, the JVM it's running on, or the kernel itself isn't handling the NUMA memory pooling well, and we end up exhausting an entire NUMA pool.

Work is ongoing to trace the causality chain that leads to this. We don't yet have confirmation whether it's something our app (or its libraries) is doing, whether we just need to make the JVM NUMA-aware with arguments, or whether there's kernel tuning to be done. But I did want to mention it here as a warning to folks running multi-socket systems with multiple NUMA pools who are seeing similar behavior.
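For anyone wanting to check for the same imbalance, per-node memory stats can be read with numastat (from the numactl package) or directly from sysfs. A rough sketch; an exhausted node shows MemFree near zero while the other node still has plenty:

```shell
# Per-node memory summary (requires the numactl package):
numastat -m | grep -E 'MemFree|MemUsed'

# Same data from sysfs, no extra packages needed:
grep -E 'MemTotal|MemFree' /sys/devices/system/node/node*/meminfo
```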

