Activity log for bug #1842271

Date Who What changed Old value New value Message
2019-09-02 05:39:16 Matthew Ruffell bug added bug
2019-09-02 05:39:27 Matthew Ruffell nominated for series Ubuntu Bionic
2019-09-02 05:39:27 Matthew Ruffell bug task added linux (Ubuntu Bionic)
2019-09-02 05:39:39 Matthew Ruffell linux (Ubuntu Bionic): status New Fix Committed
2019-09-02 05:39:42 Matthew Ruffell linux (Ubuntu Bionic): importance Undecided Medium
2019-09-02 05:39:45 Matthew Ruffell linux (Ubuntu Bionic): assignee Matthew Ruffell (mruffell)
2019-09-02 05:40:49 Matthew Ruffell description Updated the description to add the BugLink URL; the old value was identical except for the missing URL, so the final description is reproduced once below.

BugLink: https://bugs.launchpad.net/bugs/1842271

[Impact]

There is a regression in the block layer in 4.15.0-56 which causes the machine to consume all system memory, bringing out the OOM reaper, when formatting a newly created RAID array spanning a series of NVMe disks with an xfs filesystem. The symptom is all system memory ending up in the kmalloc-256 slab, with a nondescript call trace being printed.

The problem is caused by the following two commits:

commit 3c2f83d8bcbedeb89efcaf55ae64a99dce9d7e34
Author: Ming Lei <ming.lei@redhat.com>
Date: Fri Oct 12 15:53:10 2018 +0800
Subject: block: don't deal with discard limit in blkdev_issue_discard()
BugLink: https://bugs.launchpad.net/bugs/1836802
Upstream-commit: 744889b7cbb56a64f957e65ade7cb65fe3f35714
Pastebin: https://paste.ubuntu.com/p/WKsmwKgCMc/

commit b515257f186e532e0668f7deabcb04b5d27505cf
Author: Ming Lei <ming.lei@redhat.com>
Date: Mon Oct 29 20:57:17 2018 +0800
Subject: block: make sure discard bio is aligned with logical block size
BugLink: https://bugs.launchpad.net/bugs/1836802
Upstream-commit: 1adfc5e4136f5967d591c399aff95b3b035f16b7
Pastebin: https://paste.ubuntu.com/p/TcnTck64PJ/

The fault is triggered in two stages.
Firstly, "block: don't deal with discard limit in blkdev_issue_discard()" changed a while loop such that an infinite loop becomes possible if __blkdev_issue_discard() is called with nr_sects > 0 and req_sects somehow becomes 0:

int __blkdev_issue_discard(..., sector_t nr_sects, ...)
{
    ...
    while (nr_sects) {
        unsigned int req_sects = nr_sects;
        sector_t end_sect;

        end_sect = sector + req_sects;
        ...
        nr_sects -= req_sects;
        sector = end_sect;
        ...
    }
    ...
}

If req_sects is 0, then end_sect is always equal to sector, and, most importantly, nr_sects is only decremented in one place, by req_sects, which, if 0, leads to the infinite loop condition.

Now, since req_sects is initially equal to nr_sects, the loop would never be entered in the first place if nr_sects were 0. This is where the second commit, "block: make sure discard bio is aligned with logical block size", comes in. It adds a line to the above loop which allows req_sects to be clamped to a new value:

int __blkdev_issue_discard(..., sector_t nr_sects, ...)
{
    ...
    while (nr_sects) {
        unsigned int req_sects = nr_sects;
        sector_t end_sect;

        req_sects = min(req_sects, bio_allowed_max_sectors(q));
        end_sect = sector + req_sects;
        ...
        nr_sects -= req_sects;
        sector = end_sect;
        ...
    }
    ...
}

req_sects is now the minimum of itself and bio_allowed_max_sectors(q), a new function introduced by the same commit:

static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
{
    return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
}

queue_logical_block_size(q) queries the logical block size of the underlying device:

static inline unsigned short queue_logical_block_size(struct request_queue *q)
{
    int retval = 512;

    if (q && q->limits.logical_block_size)
        retval = q->limits.logical_block_size;

    return retval;
}

If q->limits.logical_block_size holds a strange value such that the round_down() result is smaller than 512, the shift right by 9 yields 0.
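The failure mode of this loop can be simulated in user space. The sketch below is hypothetical and is not the kernel code: allowed_max stands in for the value returned by bio_allowed_max_sectors(q), and a bounded iteration count stands in for "loops forever".

```c
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* User-space sketch of the buggy discard loop. Runs the loop body at
 * most max_iters times and reports whether nr_sects ever reached 0.
 * With allowed_max == 0, req_sects clamps to 0 every iteration, so
 * nr_sects is never decremented and the loop makes no progress. */
static int buggy_loop_terminates(uint64_t nr_sects, unsigned int allowed_max,
                                 int max_iters)
{
    while (nr_sects && max_iters-- > 0) {
        unsigned int req_sects = nr_sects;       /* truncates; fine for small inputs */
        req_sects = MIN(req_sects, allowed_max); /* clamps to 0 when allowed_max == 0 */
        nr_sects -= req_sects;                   /* no progress when req_sects == 0 */
    }
    return nr_sects == 0;
}
```

With a sane clamp (e.g. 256 sectors) the loop drains nr_sects and terminates; with a clamp of 0 it spins until the iteration cap is hit, which in the kernel corresponds to allocating bio structures forever.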
bio_allowed_max_sectors() then returns 0, and the min() against req_sects == nr_sects favours the new 0. nr_sects is therefore never decremented, and req_sects never changes, since the min() that takes in itself always favours the 0. From there the infinite loop iterates, filling the kmalloc-256 slab with newly created bio entries until all memory is exhausted and the OOM reaper comes out and starts killing processes - which is ineffective, since this is a kernel memory leak.

[Fix]

The fix comes in the form of:

commit a55264933f12c2fdc28a66841c4724021e8c1caf
Author: Mikulas Patocka <mpatocka@redhat.com>
Date: Tue Jul 3 13:34:22 2018 -0400
Subject: block: fix infinite loop if the device loses discard capability
BugLink: https://bugs.launchpad.net/bugs/1837257
Upstream-commit: b88aef36b87c9787a4db724923ec4f57dfd513f3
Pastebin: https://paste.ubuntu.com/p/bgC9vQnB8x/

It adds a check right after the min(req_sects, bio_allowed_max_sectors(q)) to test whether req_sects has been clamped to 0, and if so, to exit the loop and move into failure handling:

    ...
    req_sects = min(req_sects, bio_allowed_max_sectors(q));
    if (!req_sects)
        goto fail;
    ...

From there things work as normal. As "block: fix infinite loop if the device loses discard capability" points out, all of this is triggered by a race: if the underlying device is reloaded with a metadata table that doesn't support the discard operation, then q->limits.max_discard_sectors is set to 0, which has the knock-on effect of setting q->limits.logical_block_size to strange values, leading to the infinite loop and out-of-memory condition.

As mentioned before, the fix is part of 4.15.0-59, or linux-gcp 4.15.0-1041, which is currently sitting in -proposed.

[Testcase]

We need a machine with a series of NVMe drives attached; I used a default instance on Google Cloud Platform. Select the 18.04 image, and add a disk to the system.
From the dropdown, select local scratch storage, then NVMe-based storage, and then 8 disks. Start the instance. Make sure it is running linux-gcp 4.15.0-1040 or linux-generic 4.15.0-56.

Create a RAID array and format it with:

# mdadm --create /dev/md0 --level=0 --raid-devices=8 /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3 /dev/nvme0n4 /dev/nvme0n5 /dev/nvme0n6 /dev/nvme0n7 /dev/nvme0n8
# mkfs.xfs -f /dev/md0

The call to mkfs.xfs will consume all memory, and the ssh session will be disconnected once sshd is killed by the OOM reaper. If you reconnect and look at dmesg, all system memory will be in kmalloc-256.

This is fixed in linux-generic 4.15.0-59 and linux-gcp 4.15.0-1041, currently in -proposed. If you install either kernel and retest, things go along as expected.

[Regression Potential]

The fix was accepted into upstream -stable and is a direct fix for the two commits which caused the issue, so there is a low probability of new regressions. All changes are limited to discards on block devices, and while that is a fairly core part of the kernel, the changes are small and focused on the 0 condition. In case of regression, simply revert all three offending commits.
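The guard described in the [Fix] section can be added to the same kind of user-space simulation. Again, this is a hypothetical sketch rather than the kernel code; -1 stands in for the error path reached via goto fail.

```c
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Sketch of the fixed discard loop: when the clamp produces 0, bail
 * out with an error instead of spinning forever. Returns 0 on success
 * and -1 on the no-progress condition (the kernel's 'goto fail'). */
static int fixed_loop(uint64_t nr_sects, unsigned int allowed_max)
{
    while (nr_sects) {
        unsigned int req_sects = nr_sects;
        req_sects = MIN(req_sects, allowed_max);
        if (!req_sects)
            return -1;       /* corresponds to the new 'goto fail' path */
        nr_sects -= req_sects;
    }
    return 0;
}
```

The success path is unchanged; only the previously unreachable zero-progress case now terminates, which is why the regression risk of the fix is considered low.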
2019-09-02 05:41:05 Matthew Ruffell tags sts
2019-09-02 06:30:05 Ubuntu Kernel Bot linux (Ubuntu): status New Incomplete
2019-09-08 22:56:27 Matthew Ruffell linux (Ubuntu Bionic): status Fix Committed Fix Released
2019-10-03 06:59:24 Po-Hsu Lin linux (Ubuntu): status Incomplete Fix Released