"Out of memory" errors after upgrade to 4.4.0-59 + 4.8.0-34

Bug #1666260 reported by Iain Buclaw
This bug affects 5 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: High
Assigned to: Unassigned

Bug Description

Same as #1655842 - Started seeing oom-killer on multiple servers upgraded to 4.4.0-59.

Unlike #1655842, also seeing the same oom-killer on multiple servers updated to 4.8.0-34.

First I upgraded all the 4.8 servers to 4.8.0-36, then downgraded a few to 4.4.0-63. I'm seeing an even more pronounced change in memory usage, so I can only assume that 4.4.0-63 has the same problem as 4.4.0-59 and 4.8.0-34: either because #1655842 is not fixed, or because it is only fixed for certain kinds of workloads.

These are the changes I'm seeing in our memory graphs between 4.4.0-59 and 4.4.0-63/4.8.0-34.

The symptoms I'm seeing are:

Upgrading 4.4.0-57 -> 4.4.0-59:
- /proc/meminfo:Buffers: Up from 9GB to 15GB
- /proc/meminfo:Cached: Up from 5GB to 10GB
- /proc/meminfo:SReclaimable: Down from 15GB to 5GB
- /proc/meminfo:SUnreclaim: Staying at 50MB

Upgrading 4.4.0-57 -> 4.4.0-63:
- /proc/meminfo:Buffers: Up from 9GB to 26GB
- /proc/meminfo:Cached: Down from 5GB to 300MB
- /proc/meminfo:SReclaimable: Down from 15GB to 2GB
- /proc/meminfo:SUnreclaim: Down from 50MB to 30MB

Upgrading 4.4.0-57 -> 4.8.0-34:
- /proc/meminfo:Buffers: Up from 9GB to 14GB
- /proc/meminfo:Cached: Down from 5GB to 2GB
- /proc/meminfo:SReclaimable: Down from 15GB to 14GB
- /proc/meminfo:SUnreclaim: Staying at 50MB

Setting vm.vfs_cache_pressure = 300 seems to have a positive effect: no OOMs with it so far.
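For anyone wanting to try the same workaround, here is a minimal sketch of applying it on Ubuntu (the drop-in file name is illustrative, and 300 is just the value from this report, not a tuned recommendation):

```shell
# Raise vfs_cache_pressure so the kernel reclaims dentry/inode slab
# caches more aggressively (the default is 100).  Needs root.
sysctl -w vm.vfs_cache_pressure=300

# Persist it across reboots via a sysctl.d drop-in (hypothetical name).
echo 'vm.vfs_cache_pressure = 300' | tee /etc/sysctl.d/60-vfs-cache-pressure.conf
sysctl --system
```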

Downgrading to 4.4.0-57 also works.

I'll also note that I haven't had a definitive OOM on 4.4.0-63 yet, but the shift in memory usage is far from what I'd expect to be normal on the particular servers where I'm experiencing crashes.
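A quick way to watch for that shift without full memory graphs is to sample the same /proc/meminfo fields listed above (a sketch; values are in kB):

```shell
#!/bin/sh
# Print the /proc/meminfo fields that shifted across the kernel upgrades.
# Run it under watch(1) or from cron to see the drift over time, e.g.:
#   watch -n 60 ./meminfo-watch.sh
awk '/^(Buffers|Cached|SReclaimable|SUnreclaim):/ { printf "%s %s kB  ", $1, $2 }
     END { print "" }' /proc/meminfo
```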

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-63-generic 4.4.0-63.84
ProcVersionSignature: Ubuntu 4.4.0-63.84-generic 4.4.44
Uname: Linux 4.4.0-63-generic x86_64
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access '/dev/snd/': No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.1-0ubuntu2.5
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
Date: Mon Feb 20 16:15:56 2017
InstallationDate: Installed on 2012-06-04 (1721 days ago)
InstallationMedia:

IwConfig:
 lo no wireless extensions.

 eth0 no wireless extensions.
Lsusb: Error: [Errno 2] No such file or directory: 'lsusb'
MachineType: System manufacturer System Product Name
PciMultimedia:

ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.4.0-63-generic root=UUID=b790930f-ad81-4b27-a353-a4b3d6a29007 ro nomodeset nomdmonddf nomdmonisw
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-63-generic N/A
 linux-backports-modules-4.4.0-63-generic N/A
 linux-firmware 1.157.8
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: Upgraded to xenial on 2017-02-16 (4 days ago)
dmi.bios.date: 10/17/2011
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 1106
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: P8H67-M PRO
dmi.board.vendor: ASUSTeK Computer INC.
dmi.board.version: Rev 1.xx
dmi.chassis.asset.tag: Asset-1234567890
dmi.chassis.type: 3
dmi.chassis.vendor: Chassis Manufacture
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1106:bd10/17/2011:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKComputerINC.:rnP8H67-MPRO:rvrRev1.xx:cvnChassisManufacture:ct3:cvrChassisVersion:
dmi.product.name: System Product Name
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

Revision history for this message
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Iain Buclaw (iainb) wrote :

Attaching the call trace for 4.4.0-59, this is from Jan 16th.

Revision history for this message
Iain Buclaw (iainb) wrote :

Attaching the call trace for 4.8.0-34, this is from Feb 19th.

Revision history for this message
Iain Buclaw (iainb) wrote :

I think this is happening right now on one of the servers running 4.4.0-63.

SReclaimable is down from 15GB to 5GB, and Buffers has been slowly rising over the last hour, from 12GB to approaching 25GB.

---
# cat /proc/meminfo
MemTotal: 32856000 kB
MemFree: 1331808 kB
MemAvailable: 30994136 kB
Buffers: 24951992 kB
Cached: 388368 kB
SwapCached: 188 kB
Active: 22271936 kB
Inactive: 4383080 kB
Active(anon): 786576 kB
Inactive(anon): 534244 kB
Active(file): 21485360 kB
Inactive(file): 3848836 kB
Unevictable: 32 kB
Mlocked: 32 kB
SwapTotal: 33553332 kB
SwapFree: 33552460 kB
Dirty: 2048 kB
Writeback: 0 kB
AnonPages: 1314500 kB
Mapped: 42820 kB
Shmem: 6164 kB
Slab: 4765816 kB
SReclaimable: 4730460 kB
SUnreclaim: 35356 kB
KernelStack: 4512 kB
PageTables: 10164 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 49981332 kB
Committed_AS: 1352800 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 856064 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 188008 kB
DirectMap2M: 33275904 kB
---

Revision history for this message
Iain Buclaw (iainb) wrote :

SLAB is still dropping.

Revision history for this message
Iain Buclaw (iainb) wrote :

Attaching current slabtop
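For reference, a similar snapshot can be pulled non-interactively straight from /proc/slabinfo (a sketch: reading that file requires root, and the column positions assumed here are those of slabinfo version 2.1):

```shell
# Top slab caches by total object count, without an interactive slabtop.
# Column 1 is the cache name, column 3 the number of objects; the first
# two lines of /proc/slabinfo are headers.
awk 'NR > 2 { printf "%-28s %10d objs\n", $1, $3 }' /proc/slabinfo \
    | sort -k2 -rn | head
```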

---
# cat /proc/meminfo
MemTotal: 32856000 kB
MemFree: 381632 kB
MemAvailable: 30990804 kB
Buffers: 28709688 kB
Cached: 221664 kB
SwapCached: 184 kB
Active: 24854232 kB
Inactive: 5393552 kB
Active(anon): 783780 kB
Inactive(anon): 538772 kB
Active(file): 24070452 kB
Inactive(file): 4854780 kB
Unevictable: 32 kB
Mlocked: 32 kB
SwapTotal: 33553332 kB
SwapFree: 33552460 kB
Dirty: 2472 kB
Writeback: 0 kB
AnonPages: 1316236 kB
Mapped: 42768 kB
Shmem: 6164 kB
Slab: 2120976 kB
SReclaimable: 2086268 kB
SUnreclaim: 34708 kB
KernelStack: 4480 kB
PageTables: 9944 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 49981332 kB
Committed_AS: 1326540 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 868352 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 188008 kB
DirectMap2M: 33275904 kB

Revision history for this message
Iain Buclaw (iainb) wrote :

And 4.8.0-36 is affected by this bug also.

Revision history for this message
Iain Buclaw (iainb) wrote :

No OOM on servers running 4.4.0-63 just yet, but the memory usage on them is weird, to say the least. The Buffers/SLAB ratio is completely different from 4.4.0-57.

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key
Iain Buclaw (iainb)
description: updated
description: updated
description: updated
Revision history for this message
Iain Buclaw (iainb) wrote :

And again on 4.8.0-36. Upgraded server to 4.8.0-39.

Revision history for this message
Iain Buclaw (iainb) wrote :

And again on 4.8.0-36. Upgraded server to 4.8.0-39. (How many times must I keep on doing this?)

Revision history for this message
Iain Buclaw (iainb) wrote :

Another 4 servers OOM'd on 4.8.0-36 over the weekend. Upgraded them to 4.8.0-39.

Revision history for this message
Rasmus Larsen (rla-2) wrote :

I can confirm hitting this on 4.8.41 as well.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.11 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.11-rc3

Revision history for this message
Pete Cheslock (pete-cheslock) wrote :

I've tried setting vm.vfs_cache_pressure = 300 per the top post, and I'm still seeing regular (daily) OOMs on 4.4.0-66-generic.

Revision history for this message
Rasmus Larsen (rla-2) wrote :

I have only been able to reproduce this on our production workloads, and we've recently downgraded from HWE (4.8.0-X) to GA (4.4.0-66), where we currently aren't seeing any issues.

If I see the issue reappear, I'll try to verify it with a mainline kernel. If someone finds a way to reproduce this consistently, I'll happily help bisect it.

Revision history for this message
Iain Buclaw (iainb) wrote :

As the original poster: given that I stopped posting OOM dumps, perhaps things petered out on 4.8.0-39 or later.

What was particular about the load that triggered this bug was heavy IO putting cache pressure on ext4, on a system with zero locality of reference in anything read from or written to disk (SSD-backed storage).

In any case, by May the data storage servers that had been triggering this issue had been decommissioned and the IO strategy had changed. Writes now go to a raw block device before being flushed to the filesystem periodically using O_DSYNC, taking the ext4 disk cache out of the equation.
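For illustration, dd can express that kind of write path, since its oflag=dsync opens the output file with O_DSYNC (the device and path names below are hypothetical, not the actual tooling used on these servers):

```shell
# Stage incoming writes on a raw block device, bypassing ext4 and its
# page/buffer cache entirely (hypothetical device and input names).
dd if=incoming.dat of=/dev/sdX1 bs=1M oflag=direct

# Periodically flush the staged data to a file on the filesystem with
# O_DSYNC, so each block reaches stable storage as it is written.
dd if=/dev/sdX1 of=/data/store/segment-0001 bs=1M count=512 oflag=dsync
```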

The HWE kernel is now 4.10 and, judging by the edge packages, will soon be 4.13, so maybe it's been fixed in that time. However, I'm no longer able to confirm or deny that, as there's no way for me to reproduce it any more. As per Rasmus' comment, it's something that only happened on production workloads.

Revision history for this message
Iain Buclaw (iainb) wrote :

Servers are now running on 4.10 kernels, so I guess no one cares about 4.8 anymore.

Iain Buclaw (iainb)
tags: added: kernel-fixed-upstream