Writeback not flushing to disk in 4.15.0-137-generic and above

Bug #1922466 reported by Christoph Dwertmann
This bug affects 4 people
Affects: linux-signed-hwe (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi!

We've come across some interesting behaviour in kernel 4.15.0-137.141~16.04.1 and above.

After booting a fresh Ubuntu 16.04 instance on AWS, we replace the AWS kernel with "linux-image-4.15.0-140-generic" (4.15.0-140.144~16.04.1) and reboot. Then we generate some I/O by running fio for a while:

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=3600 --time_based --end_fsync=1

It doesn't matter whether fio is run against the boot disk or an attached secondary disk. After stopping fio, we notice that some pages are stuck in "writeback" and are apparently not being flushed to disk:

# lsb_release -rd
Description: Ubuntu 16.04.7 LTS
Release: 16.04
# cat /proc/vmstat | grep "nr_writeback "
nr_writeback 80
# cat /proc/meminfo | grep Writeback:
Writeback: 320 kB

This doesn't clear, even days later. Running more fio only increases the number of writeback pages.
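To show that the counter really is stuck rather than just slow to drain, it can be sampled at intervals on an otherwise idle machine. The following is only a rough sketch; the function names `nr_writeback` and `watch_writeback` are made up for illustration and are not part of any tool mentioned in this report.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Print the nr_writeback counter (in pages) from a vmstat-format file.
# Defaults to /proc/vmstat; the optional file argument only eases testing.
# The exact match on field 1 excludes nr_writeback_temp, like the
# trailing space in: grep "nr_writeback "
nr_writeback() {
    awk '$1 == "nr_writeback" { print $2 }' "${1:-/proc/vmstat}"
}

# Sample the counter COUNT times, INTERVAL seconds apart, with a timestamp.
watch_writeback() {
    interval="${1:-60}"
    count="${2:-10}"
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s nr_writeback=%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$(nr_writeback)"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example: ten samples, one minute apart
# watch_writeback 60 10
```

If the value stays constant and non-zero across samples while nothing is writing, the pages are not being flushed. With 4 KiB pages, the 80 pages above correspond to the 320 kB shown in /proc/meminfo.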

Downgrading the kernel to 4.15.0-136.140~16.04.1 resolves the issue: no writeback pages get stuck. Going over the kernel changelog, I can see that the following patchset was applied between -136 and -137, but I'm not sure whether it is related to the issue: https://www.spinics.net/lists/stable/msg435893.html

Kernels 4.15.0-137-generic and above took down our Ceph cluster. It seems that once the amount of "writeback" reaches the "dirty_bytes" ceiling, all subsequent writes to the disk become incredibly slow. This is from an idle production system (not on AWS) running 16.04 with kernel 4.15.0-139-generic:

# lsb_release -rd
Description: Ubuntu 16.04.4 LTS
Release: 16.04
# cat /proc/sys/vm/dirty_bytes
629145600
# cat /proc/sys/vm/dirty_background_bytes
314572800
# cat /proc/meminfo | grep Writeback:
Writeback: 572108 kB
# dd if=/dev/zero of=/test bs=1M count=10; rm /test
10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 126.529 s, 82.9 kB/s
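
To put the numbers above in relation: /proc/meminfo reports Writeback in kB while dirty_bytes is in bytes, so the two need converting before they can be compared. A small sketch of that arithmetic follows; `writeback_pct` is a hypothetical helper name, and the idea that the ratio is what triggers the slowdown is only my assumption.

```shell
#!/bin/sh
# Hypothetical helper for illustration only.
# Express the Writeback figure from /proc/meminfo (given in kB) as an
# integer percentage (rounded down) of vm.dirty_bytes (given in bytes).
writeback_pct() {
    wb_kb="$1"
    dirty_bytes="$2"
    echo $(( wb_kb * 1024 * 100 / dirty_bytes ))
}

# Values from the production system above:
# writeback_pct 572108 629145600

# Live values (requires a non-zero vm.dirty_bytes, i.e. dirty_ratio not in use):
# writeback_pct "$(awk '/^Writeback:/ { print $2 }' /proc/meminfo)" \
#               "$(cat /proc/sys/vm/dirty_bytes)"
```

For the system above this gives 93, i.e. the stuck writeback pages alone are already at roughly 93% of the dirty_bytes limit and well above dirty_background_bytes.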

Could there be a bug in kernel 4.15.0-137-generic and above?

Thank you!
Kind regards,

Christoph Dwertmann

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.15.0-140-generic 4.15.0-140.144~16.04.1
ProcVersionSignature: Ubuntu 4.15.0-140.144~16.04.1-generic 4.15.18
Uname: Linux 4.15.0-140-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.30
Architecture: amd64
Date: Sun Apr 4 03:39:25 2021
Ec2AMI: ami-041e1cc8f4c429789
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: ap-southeast-2c
Ec2InstanceType: c5ad.xlarge
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-signed-hwe
UpgradeStatus: No upgrade log present (probably fresh install)

Christoph Dwertmann (cdwertmann) wrote :

I'd like to add that this bug also affects 18.04 LTS (Bionic) as it uses the same kernel.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-signed-hwe (Ubuntu):
status: New → Confirmed