memory leak on AWS kernels when using docker

Bug #1925261 reported by Paul Friel
Affects             Status  Importance  Assigned to  Milestone
linux-aws (Ubuntu)  New     Undecided   Unassigned
Focal               New     Undecided   Unassigned

Bug Description

Ever since the release of the "ubuntu-bionic-18.04-amd64-server-20200729" EC2 Ubuntu AMI, which ships the "5.3.0-1032-aws" kernel, we have been hitting a 100%-reproducible memory leak that causes our app running under Docker to be OOM-killed.

The scenario: we have an app running in a Docker container that occasionally catches a crash happening within itself. When that happens, it spawns another process, which triggers a gdb dump of the parent app. Normally this works fine, but under these specific kernels memory usage grows and grows until it hits the container's memory limit, at which point the container is killed.
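As a rough illustration of the dump step (the variable and output path here are made up; our real handler is more involved):

# Child process attaches gdb to the crashed parent and dumps a core file.
# PARENT_PID and /tmp/app.core are illustrative.
gdb --batch -p "$PARENT_PID" \
    -ex "generate-core-file /tmp/app.core" \
    -ex detach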

I have tested using several of the latest available Ubuntu AMIs including the latest "ubuntu-bionic-18.04-amd64-server-20210415" which has the "5.4.0-1045-aws" kernel and the bug still exists.

I also tested a number of the mainline kernels and found that the fix for this memory leak was introduced in the v5.9-rc4 kernel (https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.9-rc4/CHANGES).
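For reference, each mainline build was tested roughly like this (exact .deb filenames, elided below, come from the mainline page linked above):

# Grab the amd64 image and modules .debs for a given mainline build.
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.9-rc4/amd64/<linux-image-unsigned-...-generic_..._amd64.deb>
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.9-rc4/amd64/<linux-modules-...-generic_..._amd64.deb>
sudo dpkg -i linux-*.deb
sudo reboot
uname -r   # confirm the test kernel actually booted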

Do you all have any idea if or when that set of changes will be backported into a supported kernel for Ubuntu 18.04 or 20.04?

Release we are running:
root@<redacted>:~# lsb_release -rd
Description: Ubuntu 18.04.5 LTS
Release: 18.04

Docker / containerd.io versions:
- containerd.io: 1.4.4-1
- docker-ce: 5:20.10.5~3-0~ubuntu-bionic

Latest supported kernel I tried that still exhibits the memory leak:
root@hostname:~# apt-cache policy linux-aws
linux-aws:
  Installed: 5.4.0.1045.27
  Candidate: 5.4.0.1045.27
  Version table:
 *** 5.4.0.1045.27 500
        500 http://us-east-1.ec2.archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages
        100 /var/lib/dpkg/status
     4.15.0.1007.7 500
        500 http://us-east-1.ec2.archive.ubuntu.com/ubuntu bionic/main amd64 Packages

Thanks,
Paul

Paul Friel (pfriel)
description: updated
Kleber Sacilotto de Souza (kleber-souza) wrote:

Thank you Paul for the bug report.

Just as additional information: the 5.3.0 series kernels are no longer supported. If you want to continue using Ubuntu 18.04 with security and bug fixes, you will eventually need to upgrade to a 5.4.0-based AWS kernel, as shown below.
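On Bionic the linux-aws metapackage already tracks the 5.4-based AWS kernel (visible in your apt-cache output above), so the upgrade is roughly:

sudo apt update
sudo apt install linux-aws   # pulls the current 5.4 AWS kernel on 18.04
sudo reboot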

Are you able to point at exactly which commit from v5.9-rc4 fixes the memory leak you are experiencing? Most of the commits mentioning "leak" in their titles have already been applied to the 5.4 kernels, so if you can still reproduce the issue with the latest 5.4 AWS kernel we are probably missing another fix.

Thank you.

Paul Friel (pfriel) wrote:

Kleber,

I am not sure exactly which commit fixes the issue we are experiencing. I will put some time into bisecting the commits introduced in v5.9-rc4 and building/testing kernels with that code to see if I can narrow down the exact commit that introduced the fix.
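Roughly, I expect the bisect to look like this (endpoints illustrative; since I'm hunting a fix rather than a regression, I'll use bisect's custom terms):

# Find the commit that FIXED the leak between v5.9-rc3 (leak present)
# and v5.9-rc4 (leak gone).
git bisect start --term-old=broken --term-new=fixed
git bisect fixed v5.9-rc4
git bisect broken v5.9-rc3
# At each step: build the kernel, boot it, run the gdb-dump repro, then mark:
git bisect fixed     # leak gone at this commit
git bisect broken    # leak still present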

Thanks,
Paul

Paul Friel (pfriel) wrote:

Kleber,

I finally had some time today to narrow down which commit fixes this issue for us; it is below:

commit 7514c0362ffdd9af953ae94334018e7356b31313
Merge: 9322c47b21b9 428fc0aff4e5
Author: Linus Torvalds <email address hidden>
Date: Sat Sep 5 13:28:40 2020 -0700

    Merge branch 'akpm' (patches from Andrew)

    Merge misc fixes from Andrew Morton:
     "19 patches.

      Subsystems affected by this patch series: MAINTAINERS, ipc, fork,
      checkpatch, lib, and mm (memcg, slub, pagemap, madvise, migration,
      hugetlb)"

    * emailed patches from Andrew Morton <email address hidden>:
      include/linux/log2.h: add missing () around n in roundup_pow_of_two()
      mm/khugepaged.c: fix khugepaged's request size in collapse_file
      mm/hugetlb: fix a race between hugetlb sysctl handlers
      mm/hugetlb: try preferred node first when alloc gigantic page from cma
      mm/migrate: preserve soft dirty in remove_migration_pte()
      mm/migrate: remove unnecessary is_zone_device_page() check
      mm/rmap: fixup copying of soft dirty and uffd ptes
      mm/migrate: fixup setting UFFD_WP flag
      mm: madvise: fix vma user-after-free
      checkpatch: fix the usage of capture group ( ... )
      fork: adjust sysctl_max_threads definition to match prototype
      ipc: adjust proc_ipc_sem_dointvec definition to match prototype
      mm: track page table modifications in __apply_to_page_range()
      MAINTAINERS: IA64: mark Status as Odd Fixes only
      MAINTAINERS: add LLVM maintainers
      MAINTAINERS: update Cavium/Marvell entries
      mm: slub: fix conversion of freelist_corrupted()
      mm: memcg: fix memcg reclaim soft lockup
      memcg: fix use-after-free in uncharge_batch

I also verified that the latest available Ubuntu 18.04 kernel as of today (5.4.0-1054.57~18.04.1) still hits this memory leak issue for us.

Please let me know if you need any further information from us to hopefully get this fix pulled into the supported Ubuntu 18.04 AWS kernel.

Thanks,
Paul

Kleber Sacilotto de Souza (kleber-souza) wrote:

Hi Paul,

Thank you for coming back to us. I will ask my colleagues who work on the AWS kernels to take a look at this.

Paul Friel (pfriel) wrote:

Kleber,

Sounds good, thank you!

Tim Gardner (timg-tpi) wrote (last edit):

Paul - It seems unlikely that this merge commit is the fix, since it contains no code changes, nor do any of the commits in that block reference memory leaks. However, you could try a git bisect over just the commits in that block (see the sketch below). Bisecting a non-linear history can sometimes be quite difficult.
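For example, the commits brought in by that merge can be listed from its two parents (shown in the "Merge:" line above):

# List the 19 emailed patches the merge pulled in.
git log --oneline 9322c47b21b9..428fc0aff4e5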

Paul Friel (pfriel) wrote:

Tim / Kleber,

Thanks for your response on this and I apologize, you are correct that commit 7514c0362ffdd9af953ae94334018e7356b31313 was not the fix for our issue. I had previously just tested the last handful of commits in 5.9.0-rc4 and didn't realize that 7514c0362ffdd9af953ae94334018e7356b31313 was a merge commit and the other commits that didn't include the fix had parent commits prior to this fix being implemented. I tested more kernels this week and narrowed in on commit a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a which appears to fix our issue:

commit a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a
Author: Peter Xu <email address hidden>
Date: Fri Aug 21 19:49:57 2020 -0400

    mm/gup: Remove enfornced COW mechanism

    With the more strict (but greatly simplified) page reuse logic in
    do_wp_page(), we can safely go back to the world where cow is not
    enforced with writes.

    This essentially reverts commit 17839856fd58 ("gup: document and work
    around 'COW can break either way' issue"). There are some context
    differences due to some changes later on around it:

      2170ecfa7688 ("drm/i915: convert get_user_pages() --> pin_user_pages()", 2020-06-03)
      376a34efa4ee ("mm/gup: refactor and de-duplicate gup_fast() code", 2020-06-03)

    Some lines moved back and forth with those, but this revert patch should
    have striped out and covered all the enforced cow bits anyways.

    Suggested-by: Linus Torvalds <email address hidden>
    Signed-off-by: Peter Xu <email address hidden>
    Signed-off-by: Linus Torvalds <email address hidden>

To verify that this is the proper fix for the issue we are running into, I built a kernel from the parent of this fix (1a0cf26323c80e2f1c58fc04f15686de61bfab0c) and verified that it exhibited the broken behavior (memory spikes within our container while running gdb, eventually causing Docker to OOM-kill the container once it hits the hard memory limit we have set). I then took the 1a0cf26323c80e2f1c58fc04f15686de61bfab0c code, cherry-picked a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a, built and ran that kernel, and verified that I could no longer reproduce the issue.
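For reference, each verification build followed roughly this pattern (config and packaging details omitted):

# Reproduce: build the parent of the suspected fix.
git checkout 1a0cf26323c80e2f1c58fc04f15686de61bfab0c
make olddefconfig && make -j"$(nproc)" bindeb-pkg
# Verify: cherry-pick just the fix on top and rebuild.
git cherry-pick a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a
make -j"$(nproc)" bindeb-pkg
sudo dpkg -i ../linux-image-*.deb && sudo reboot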

I also built kernels with cfc905f158eaa099d6258031614d11869e7ef71c, 4facb95b7adaf77e2da73aafb9ba60996fe42a12 and 9e2369c06c8a181478039258a4598c1ddd2cadfa and verified those exhibited the broken behavior. I then took those same commits, cherry-picked the a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a fix into them, and verified that it fixed the behavior we are seeing.

Here is a list of all the commits that I tested in the past few days to narrow in on this commit as the fix:

9322c47b21b9e05d7f9c037aa2c472e9f0dc7f3b - FIXED
b17164e258e3888d376a7434415013175d637377 - FIXED
1ef6ea0efe8e68d0299dad44c39dc6ad9e5d1f39 - FIXED
c183edff33fdcd639d222a8f473bf44602adc655 - BROKEN - parent commits were based off rc1 branch, prior to a308c71 fix
c70672d8d316ebd46ea447effadfe57ab7a30a50 - FIXED
09274aed9021642cb3e5e0eb0e657a13ee3eafed - FIXED
16bf121b2ddebd4421bd73098eaae1500dd40389 - FIXED
41bef91c8aa351255cd19e7e72608ee86f7f4bab - FIXED
f162626a038ec06da98ac38ce3d6bdbd715e9c5f - FIXED
d824e0809ce3c9e935f3aa37381cda7fd4...


Tim Gardner (timg-tpi) wrote:

Hi Paul - I'm having trouble figuring out exactly which kernel you're using. The backport of commit a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a ("mm/gup: Remove enfornced COW mechanism") to Focal:linux-aws 5.4.0-1056.59 is not clean, and doesn't look correct. There are likely prerequisite patches required before a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a can be applied. I'm leery of patching the mm subsystem too much since it could have unintended side effects.

Have you considered using the Focal:linux-aws-5.11 kernel? I understand that could be a big change.

Paul Friel (pfriel) wrote:

Tim,

We are running Ubuntu 18.04 with the 5.3.0-1030-aws kernel because that is the last Ubuntu-provided AMI (ubuntu-bionic-18.04-amd64-server-20200716) that does not contain this kernel bug. I tried installing the latest supported Ubuntu 18.04 kernel again yesterday (5.4.0-1055-aws) and verified that the bug still exists.

Today I confirmed that installing the linux-image-5.11.0-1016-aws focal kernel from https://packages.ubuntu.com/focal-updates/linux-image-5.11.0-1016-aws on an Ubuntu 18.04 instance does fix this issue for us. My understanding from https://ubuntu.com/about/release-cycle#ubuntu-kernel-release-cycle is that this configuration isn't really supported/suggested, though, right? We would prefer to use the fully supported Ubuntu 18.04 AWS kernel if possible.
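For the record, the install was just a matter of pulling the .debs manually (exact filenames from that page elided):

# Downloaded the image and modules .debs from packages.ubuntu.com, then:
wget <linux-image-5.11.0-1016-aws ... .deb> <linux-modules-5.11.0-1016-aws ... .deb>
sudo dpkg -i linux-*.deb
sudo reboot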

Thanks,
Paul

Tim Gardner (timg-tpi) wrote:

Paul,

The 5.3 kernel you are using is completely unsupported. Until a fix is identified for 5.4, your only other choice for a supported kernel is to install linux-aws-edge (see below). That package will keep you on a supported kernel through the release of 22.04 (the next LTS). Note, however, that linux-aws-edge will transition you through three kernel versions.
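Switching over is a one-line change on Bionic:

sudo apt update && sudo apt install linux-aws-edge   # rolls forward with each new AWS edge kernel
sudo reboot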
