Infinite loop in __blkdev_issue_discard() consumes the entire system memory when formatting a raid array

Bug #1842271 reported by Matthew Ruffell on 2019-09-02
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Undecided
Unassigned
Bionic
Medium
Matthew Ruffell

Bug Description

BugLink: https://bugs.launchpad.net/bugs/1842271

[Impact]

There is a regression present in block in 4.15.0-56, which causes the machine to consume all system memory and cause the OOM reaper to come out, when attempting to format a newly created raid array on a series of NVMe cluster with an xfs filesystem.

The symptoms can be found with all system memory ending up in kmalloc-256 slab, with a nondescript call trace being printed.

The problem is caused by the below two commits:

commit 3c2f83d8bcbedeb89efcaf55ae64a99dce9d7e34
Author: Ming Lei <email address hidden>
Date: Fri Oct 12 15:53:10 2018 +0800
Subject: block: don't deal with discard limit in blkdev_issue_discard()
BugLink: https://bugs.launchpad.net/bugs/1836802
Upstream-commit: 744889b7cbb56a64f957e65ade7cb65fe3f35714
Pastebin: https://paste.ubuntu.com/p/WKsmwKgCMc/

commit b515257f186e532e0668f7deabcb04b5d27505cf
Author: Ming Lei <email address hidden>
Date: Mon Oct 29 20:57:17 2018 +0800
Subject: block: make sure discard bio is aligned with logical block size
BugLink: https://bugs.launchpad.net/bugs/1836802
Upstream-commit: 1adfc5e4136f5967d591c399aff95b3b035f16b7
Pastebin: https://paste.ubuntu.com/p/TcnTck64PJ/

Now, the fault was triggered in two stages. Firstly, in "block: don't deal with discard limit in blkdev_issue_discard()" a while loop was changed such that there is a possibility of an infinite loop if __blkdev_issue_discard() is called with nr_sects > 0 and req_sects somehow becomes 0:

int __blkdev_issue_discard(..., sector_t nr_sects, ...)
{
...
while (nr_sects) {
    unsigned int req_sects = nr_sects;
    sector_t end_sect;

    end_sect = sector + req_sects;
...
    nr_sects -= req_sects;
    sector = end_sect;
...
}

if req_sects is 0, then end_sect is always equal to sector, and the most important part, nr_sects is only decremented in one place, by req_sects, which if 0, would lead to the infinite loop condition.

Now, since req_sects is initially equal to nr_sects, the loop would never be entered in the first place if nr_sects is 0.

This is where the second commit, "block: make sure discard bio is aligned with logical block size" comes in.

This commit adds a line to the above loop, to allow req_sects to be set to a new value:

int __blkdev_issue_discard(..., sector_t nr_sects, ...)
{
...
while (nr_sects) {
    unsigned int req_sects = nr_sects;
    sector_t end_sect;

    req_sects = min(req_sects, bio_allowed_max_sectors(q));

    end_sect = sector + req_sects;
...
    nr_sects -= req_sects;
    sector = end_sect;
...
}

We see that req_sects will now be the minimum of itself and bio_allowed_max_sectors(q), a new function introduced by the same commit.

static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
{
       return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
}

queue_logical_block_size(q) queries the hardware for the logical block size of the underlying device.

static inline unsigned short queue_logical_block_size(struct request_queue *q)
{
    int retval = 512;

    if (q && q->limits.logical_block_size)
        retval = q->limits.logical_block_size;

    return retval;
}

if q->limits.logical_block_size is 512 or smaller, then bit shifted right by 9 yields 0.

bio_allowed_max_sectors will return 0, and the min with req_sects == nr_sects will favour the new 0.

This causes nr_sects to never be decremented since req_sects is 0, and req_sects will never change since the min() that takes in itself will always favour the 0.

From there the infinite loop iterates and fills up the kmalloc-256 slab with newly created bio entries, until all memory is exhausted and the OOM reaper comes out and starts killing processes, which is ineffective since this is a kernel memory leak.

[Fix]

The fix comes in the form of:

commit a55264933f12c2fdc28a66841c4724021e8c1caf
Author: Mikulas Patocka <email address hidden>
Date: Tue Jul 3 13:34:22 2018 -0400
Subject: block: fix infinite loop if the device loses discard capability
BugLink: https://bugs.launchpad.net/bugs/1837257
Upstream-commit: b88aef36b87c9787a4db724923ec4f57dfd513f3
Pastebin: https://paste.ubuntu.com/p/bgC9vQnB8x/

This adds a check right after the min(req_sects, bio_allowed_max_sectors(q)); to test if req_sects has been set to 0, and if it has, to exit the loop and move into failure handling:

...
req_sects = min(req_sects, bio_allowed_max_sectors(q));
if (!req_sects)
    goto fail;
...

From there things work as normal. As "block: fix infinite loop if the device loses discard capability" points out, all of this is triggered due to a race where if underlying device is reloaded with a metadata table that doesn't support the discard operation, then q->limits.max_discard_sectors is set to 0 and has a knock on effect of setting q->limits.logical_block_size to strange values and leads to the infinite loop and out of memory condition.

Now, as mentioned before, the fix is a part of 4.15.0-59, or linux-gcp 4.15.0-1041, which is currently sitting in -proposed.

[Testcase]

We need a machine which can have a series of NVMe drives attached to it, and I used a default instance on Google Cloud Platform.

Select the 18.04 image, and add a dsik to the system. Select from the dropbox local scratch storage, then NVMe based storage, and then 8 disks.

Start the instance. Make sure it is running linux-gcp-1040 or linux-generic 4.15.0-56.

Create a raid array with:

# mdadm --create /dev/md0 --level=0 --raid-devices=8 /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3 /dev/nvme0n4 /dev/nvme0n5 /dev/nvme0n6 /dev/nvme0n7 /dev/nvme0n8
# mkfs.xfs -f /dev/md0

The call to mkfs.xfs will consume all memory, and the ssh session will be disconnected since sshd gets killed by the oom reaper.

If you reconnect, and look at dmesg, all system memory will be in kmalloc-256.

This is fixed in linux-generic 4.15.0-59 and linux-gcp 4.15.0-1041, which is currently in -proposed.

If you install the above kernel and retest, things go along as expected.

[Regression Potential]

The fix was accepted in upstream -stable and is a direct fix to the two commits which caused the issues, meaning there is a low probability of any new regressions being added.

All changes are limited to discarding on block devices, and while a fairly core part of the kernel, the changes are small and are focused on the 0 condition.

In case of regression, simply revert all three offending commits.

Tags: sts Edit Tag help
Changed in linux (Ubuntu Bionic):
status: New → Fix Committed
importance: Undecided → Medium
assignee: nobody → Matthew Ruffell (mruffell)
description: updated
tags: added: sts

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1842271

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Matthew Ruffell (mruffell) wrote :

This was released in 4.15.0-59-generic and works as intended. Marking Fix Released. Note, this still needs to land in linux-gcp for Bionic due to some technical reasons.

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Po-Hsu Lin (cypressyew) wrote :

Patch could be found in Bionic-gcp master branch and Eoan, mark this one as Fix-released.

Changed in linux (Ubuntu):
status: Incomplete → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers