nr_writeback memory leak in kernel 4.15.0-137+
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | In Progress | Undecided | Unassigned |
Bionic | In Progress | Medium | Tim Gardner |
Bug Description
SRU Justification
[Impact]
Ubuntu 18.04.5 LTS 4.15.0 kernels at version 4.15.0-137 and above contain a memory leak because they include a patch from the upstream kernel but not the follow-up fix for that patch, which was released later.
Bad patch in bionic:linux 2c17fa778db8564
This issue manifests as a steadily increasing amount of memory attributed to the writeback queue, which never returns to zero. This can be seen either as the value of `nr_writeback` in /proc/vmstat or as the value of `Writeback` in /proc/meminfo.
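As a minimal sketch of how the two values named above can be read, the helpers below pull `nr_writeback` from a vmstat-style file and `Writeback` from a meminfo-style file. The function names and the optional file argument are assumptions made for illustration (the argument lets the helpers run against a saved copy); they are not part of the original report.

```shell
# Print the value of one counter from a vmstat-style file ("name value" lines).
# The optional second argument is a hypothetical convenience for reading a
# saved copy; it defaults to /proc/vmstat.
vmstat_counter() (
    name=$1
    file=${2:-/proc/vmstat}
    awk -v key="$name" '$1 == key { print $2; found = 1 } END { exit !found }' "$file"
)

# Print the Writeback value (in kB) from a meminfo-style file.
# The exact match on "Writeback:" deliberately skips "WritebackTmp:".
meminfo_writeback_kb() (
    file=${1:-/proc/meminfo}
    awk '$1 == "Writeback:" { print $2; found = 1 } END { exit !found }' "$file"
)
```

On a live system, `vmstat_counter nr_writeback` and `meminfo_writeback_kb` with no file argument read the current values.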
Ordinarily these values should be at or around zero, but on our servers we observe the `nr_writeback` value increasing to over 8 million (32GB of memory), at which point it isn't long before system IO slows to a crawl (tens of Kb/s). Our servers have 256GB of memory and perform many CI-related activities. The issue appears to be related to concurrent writing to disk and can be demonstrated with a simple testcase (see [Test Plan] below).
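The relationship between the page count and the memory figure quoted above can be checked with a short calculation. This is a sketch that assumes `nr_writeback` counts pages and that the page size is 4 KiB (typical on x86_64, but not stated in the report); the function name `pages_to_mib` is made up for illustration.

```shell
# Convert a page count to MiB.
# The page size defaults to the running system's value from getconf;
# 4096 bytes is assumed in the example call below.
pages_to_mib() (
    pages=$1
    page_size=${2:-$(getconf PAGESIZE)}
    echo $(( pages * page_size / 1024 / 1024 ))
)

pages_to_mib 8388608 4096   # prints 32768, i.e. 32 GiB for 8 Mi pages
```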
On our heavily used systems this memory leak can result in an unstable server after 2-3 days, requiring a reboot to fix it.
After much investigation the issue appears to be because the patch "mm: memcontrol: fix excessive complexity in memory.stat reporting" was brought in to the 4.15.0-137 Ubuntu kernel (see https:/
The required patch is here: https:/
I have checked the release notes for Ubuntu versions -137 to -143, and none include this second patch that should fix the issue. (I checked https:/
We do not observe this on the 5.4.0 kernel (the supported HWE kernel on 18.04.5), which includes this second patch. That kernel may also include other patches, so we do not know whether any other fixes are also required, but the one linked above appears to be needed and matches our symptoms.
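To find out which machines fall in the range described in this section, a check along the following lines may help. It is only a sketch: the function name is hypothetical, it treats any 4.15.0 kernel with an ABI number of 137 or higher as affected, and it applies no upper bound because this report does not name a fixed 4.15.0 build.

```shell
# Return 0 if a release string such as "4.15.0-141-generic" is a 4.15.0
# kernel with ABI number >= 137, and non-zero otherwise.
# With no argument, the running kernel (uname -r) is checked.
is_affected_kernel() (
    release=${1:-$(uname -r)}
    case $release in
        4.15.0-*) ;;
        *) exit 1 ;;          # other kernel series are out of scope here
    esac
    abi=${release#4.15.0-}    # strip the "4.15.0-" prefix
    abi=${abi%%[!0-9]*}       # keep only the leading digits
    [ -n "$abi" ] && [ "$abi" -ge 137 ]
)
```

For example, `is_affected_kernel && echo "kernel is in the affected range"` tests the running kernel.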
[Test Plan]
Testcase:
The following is enough to permanently increase the value of `nr_writeback` on our systems (by about 2000 during most executions):
```
date
grep nr_writeback /proc/vmstat
mkdir -p /docker/
seq -w 1 100000 | xargs -n1 -I% sh -c 'dd if=/dev/urandom of=/docker/
seq -w 1 100000 | xargs -n1 -I% sh -c 'dd if=/dev/urandom of=/docker/
seq -w 1 100000 | xargs -n1 -I% sh -c 'dd if=/dev/urandom of=/docker/
seq -w 1 100000 | xargs -n1 -I% sh -c 'dd if=/dev/urandom of=/docker/
seq -w 1 100000 | xargs -n1 -I% sh -c 'dd if=/dev/urandom of=/docker/
wait $(jobs -p)
grep nr_writeback /proc/vmstat
date
```
Subsequent iterations of the test raise it further, and on a system doing a lot of writing from a lot of different processes, it can rise quickly.
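To quantify the rise across one run of the testcase, the counter can be sampled before and after a command. The wrapper below is a rough sketch, not part of the original testcase: the function names and the `VMSTAT_FILE` override are assumptions introduced so the logic can be exercised against a saved file.

```shell
# Print the current nr_writeback value.
# VMSTAT_FILE is a hypothetical override; it defaults to /proc/vmstat.
nr_writeback_now() {
    awk '$1 == "nr_writeback" { print $2 }' "${VMSTAT_FILE:-/proc/vmstat}"
}

# Run the given command and print how much nr_writeback changed across it.
# The command's own output is discarded so only the delta is printed.
writeback_delta() (
    before=$(nr_writeback_now)
    "$@" >/dev/null 2>&1
    after=$(nr_writeback_now)
    echo $(( ${after:-0} - ${before:-0} ))
)
```

Because writeback may still be legitimately in flight right after a run, it is probably worth letting the system settle (for example with `sync` and a short wait) inside the wrapped command before trusting the delta as leaked pages.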
System details:
lsb_release -rd
Description: Ubuntu 18.04.5 LTS
Release: 18.04
Affected kernel: 4.15.0-137 onwards (current latest version tried was 4.15.0-142)
e.g.
apt-cache policy linux-image-
linux-image-
Installed: 4.15.0-141.145
Candidate: 4.15.0-141.145
Version table:
*** 4.15.0-141.145 500
500 http://
500 http://
100 /var/lib/
According to https:/
We likely have other servers, used in other services, that are less heavily loaded and so have not been as affected by this issue. I may be able to get the equivalent diagnostics from them after confirming that they demonstrate the same issue with my testcase.
Workaround:
After several weeks narrowing this down, our only option was to upgrade our servers to the 5.4 kernel, which is included as the HWE kernel in 18.04.5:
apt update && apt install --install-
We have now upgraded most of our heavily used systems, where this is a major issue, to the 5.4.0 kernel, which seemed to be our only option. Many of our colleagues cannot upgrade in this way, and the issue seems to be affecting them to varying degrees depending on the nature of their workloads.
---
ProblemType: Bug
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 Apr 27 04:12 seq
crw-rw---- 1 root audio 116, 33 Apr 27 04:12 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.9-0ubuntu7.23
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 18.04
HibernationDevice: RESUME=
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
MachineType: Supermicro PIO-848B-
Package: linux (not installed)
PciMultimedia:
ProcFB: 0 astdrmfb
ProcKernelCmdLine: BOOT_IMAGE=
ProcVersionSign
RelatedPackageV
linux-
linux-
linux-firmware 1.173.20
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
Tags: bionic
Uname: Linux 4.15.0-141-generic x86_64
UnreportableReason: This report is about a package that is not installed.
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:
WifiSyslog:
_MarkForUpload: False
dmi.bios.date: 10/18/2016
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 2.1
dmi.board.
dmi.board.name: X10QBi
dmi.board.vendor: Supermicro
dmi.board.version: 1.01A
dmi.chassis.
dmi.chassis.type: 1
dmi.chassis.vendor: Supermicro
dmi.chassis.
dmi.modalias: dmi:bvnAmerican
dmi.product.family: SMC X10
dmi.product.name: PIO-848B-
dmi.product.
dmi.sys.vendor: Supermicro
[Where problems could occur]
Memory leakage could continue. The new spinlocks could cause some performance degradation.
[Other Info]
These patches have been accepted to v4.14.y
description: updated
Changed in linux (Ubuntu Bionic):
status: New → In Progress
importance: Undecided → Medium
This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:
apport-collect 1926081
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.