Infinite loop in __blkdev_issue_discard() consumes the entire system memory when formatting a raid array
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Undecided | Unassigned |
Bionic | Fix Released | Medium | Matthew Ruffell |
Bug Description
BugLink: https:/
[Impact]
There is a regression in the block layer in 4.15.0-56 which causes the machine to consume all system memory and bring out the OOM reaper when attempting to format a newly created raid array, backed by a set of NVMe disks, with an xfs filesystem.
The symptom is all system memory ending up in the kmalloc-256 slab, accompanied by a nondescript call trace.
The problem is caused by the following two commits:
commit 3c2f83d8bcbedeb
Author: Ming Lei <email address hidden>
Date: Fri Oct 12 15:53:10 2018 +0800
Subject: block: don't deal with discard limit in blkdev_issue_discard()
BugLink: https:/
Upstream-commit: 744889b7cbb56a6
Pastebin: https:/
commit b515257f186e532
Author: Ming Lei <email address hidden>
Date: Mon Oct 29 20:57:17 2018 +0800
Subject: block: make sure discard bio is aligned with logical block size
BugLink: https:/
Upstream-commit: 1adfc5e4136f596
Pastebin: https:/
Now, the fault is triggered in two stages. First, "block: don't deal with discard limit in blkdev_issue_discard()" reworks the loop in __blkdev_issue_discard() so that req_sects is initialised to nr_sects:
int __blkdev_issue_discard(...)
{
    ...
    while (nr_sects) {
        unsigned int req_sects = nr_sects;
        sector_t end_sect;

        end_sect = sector + req_sects;
        ...
        nr_sects -= req_sects;
        sector = end_sect;
        ...
    }
If req_sects is 0, then end_sect is always equal to sector, and, most importantly, nr_sects is decremented in only one place, by req_sects. If req_sects is 0, nr_sects never changes, which is the infinite loop condition.
Now, since req_sects is initially equal to nr_sects, the loop would never be entered in the first place if nr_sects is 0.
This is where the second commit, "block: make sure discard bio is aligned with logical block size" comes in.
This commit adds a line to the above loop, to allow req_sects to be set to a new value:
int __blkdev_issue_discard(...)
{
    ...
    while (nr_sects) {
        unsigned int req_sects = nr_sects;
        sector_t end_sect;

        req_sects = min(req_sects, bio_allowed_max_sectors(q));
        end_sect = sector + req_sects;
        ...
        nr_sects -= req_sects;
        sector = end_sect;
        ...
    }
We see that req_sects is now the minimum of itself and bio_allowed_max_sectors(q), which is defined as:
static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
{
    return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
}
queue_logical_block_size() is defined as:
static inline unsigned short queue_logical_block_size(struct request_queue *q)
{
    int retval = 512;

    if (q && q->limits.logical_block_size)
        retval = q->limits.logical_block_size;

    return retval;
}
If q->limits.logical_block_size is observed as 0, then round_down(UINT_MAX, 0) works out to 0, and
bio_allowed_max_sectors(q) returns 0.
This causes nr_sects to never be decremented, since req_sects is 0, and req_sects itself can never change, since a min() against a bound of 0 will always yield 0.
From there the infinite loop iterates, filling up the kmalloc-256 slab with newly created bio entries, until all memory is exhausted and the OOM reaper comes out and starts killing processes, which is ineffective, since the memory is being leaked by the kernel itself.
[Fix]
The fix comes in the form of:
commit a55264933f12c2f
Author: Mikulas Patocka <email address hidden>
Date: Tue Jul 3 13:34:22 2018 -0400
Subject: block: fix infinite loop if the device loses discard capability
BugLink: https:/
Upstream-commit: b88aef36b87c978
Pastebin: https:/
This adds a check right after req_sects is clamped by min(req_sects, bio_allowed_max_sectors(q)):
...
    req_sects = min(req_sects, bio_allowed_max_sectors(q));
    if (!req_sects)
        goto fail;
...
From there things work as normal. As "block: fix infinite loop if the device loses discard capability" points out, all of this is triggered by a race: if the underlying device is reloaded with a metadata table that doesn't support the discard operation, q->limits.max_discard_sectors is set to 0.
The fix is part of 4.15.0-59 and linux-gcp 4.15.0-1041, both currently sitting in -proposed.
[Testcase]
We need a machine which can have a series of NVMe drives attached to it, and I used a default instance on Google Cloud Platform.
Select the 18.04 image, and add a disk to the system. From the dropdown, select local scratch storage, then NVMe-based storage, and then 8 disks.
Start the instance. Make sure it is running linux-gcp 4.15.0-1040 or linux-generic 4.15.0-56.
Create a raid array with:
# mdadm --create /dev/md0 --level=0 --raid-devices=8 /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3 /dev/nvme0n4 /dev/nvme0n5 /dev/nvme0n6 /dev/nvme0n7 /dev/nvme0n8
# mkfs.xfs -f /dev/md0
The call to mkfs.xfs will consume all memory, and the ssh session will be disconnected since sshd gets killed by the oom reaper.
If you reconnect, and look at dmesg, all system memory will be in kmalloc-256.
This is fixed in linux-generic 4.15.0-59 and linux-gcp 4.15.0-1041, which is currently in -proposed.
If you install the above kernel and retest, things go along as expected.
[Regression Potential]
The fix was accepted in upstream -stable and is a direct fix to the two commits which caused the issues, meaning there is a low probability of any new regressions being added.
All changes are limited to discarding on block devices, and while a fairly core part of the kernel, the changes are small and are focused on the 0 condition.
In case of regression, simply revert all three offending commits.
Changed in linux (Ubuntu Bionic):
status: New → Fix Committed
importance: Undecided → Medium
assignee: nobody → Matthew Ruffell (mruffell)
description: updated
tags: added: sts
This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:
apport-collect 1842271
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.