Activity log for bug #1896578

Date Who What changed Old value New value Message
2020-09-22 07:02:43 Matthew Ruffell bug added bug
2020-09-22 07:02:55 Matthew Ruffell nominated for series Ubuntu Groovy
2020-09-22 07:02:55 Matthew Ruffell bug task added linux (Ubuntu Groovy)
2020-09-22 07:02:55 Matthew Ruffell nominated for series Ubuntu Focal
2020-09-22 07:02:55 Matthew Ruffell bug task added linux (Ubuntu Focal)
2020-09-22 07:02:55 Matthew Ruffell nominated for series Ubuntu Bionic
2020-09-22 07:02:55 Matthew Ruffell bug task added linux (Ubuntu Bionic)
2020-09-22 07:03:04 Matthew Ruffell linux (Ubuntu Bionic): status New In Progress
2020-09-22 07:03:07 Matthew Ruffell linux (Ubuntu Focal): status New In Progress
2020-09-22 07:03:10 Matthew Ruffell linux (Ubuntu Groovy): status New In Progress
2020-09-22 07:03:16 Matthew Ruffell linux (Ubuntu Bionic): importance Undecided Medium
2020-09-22 07:03:18 Matthew Ruffell linux (Ubuntu Focal): importance Undecided Medium
2020-09-22 07:03:20 Matthew Ruffell linux (Ubuntu Groovy): importance Undecided Medium
2020-09-22 07:03:24 Matthew Ruffell linux (Ubuntu Bionic): assignee Matthew Ruffell (mruffell)
2020-09-22 07:03:26 Matthew Ruffell linux (Ubuntu Focal): assignee Matthew Ruffell (mruffell)
2020-09-22 07:03:29 Matthew Ruffell linux (Ubuntu Groovy): assignee Matthew Ruffell (mruffell)
2020-09-22 07:03:40 Matthew Ruffell tags sts
2020-09-22 07:04:09 Matthew Ruffell description BugLink: https://bugs.launchpad.net/bugs/ [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. It is currently in the md-next tree [1], and I am expecting the commits to be merged during the 5.10 merge window. 
[1] https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-next commit 5b2374a6c221f28c74913d208bb5376a7ee3bf70 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:23 2020 +0800 Subject: md/raid10: improve discard request for far layout Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=5b2374a6c221f28c74913d208bb5376a7ee3bf70 commit 8f694215ae4c7abf1e6c985803a1aad0db748d07 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:22 2020 +0800 Subject: md/raid10: improve raid10 discard request Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=8f694215ae4c7abf1e6c985803a1aad0db748d07 commit 6fcfa8732a8cfea7828a9444c855691c481ee557 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:01 2020 +0800 Subject: md/raid10: pull codes that wait for blocked dev into one function Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=6fcfa8732a8cfea7828a9444c855691c481ee557 commit 6f4fed152a5e483af2227156ce7b6263aeeb5c84 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:00 2020 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=6f4fed152a5e483af2227156ce7b6263aeeb5c84 commit 7197f1a616caf85508d81c7f5c9f065ffaebf027 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:42:59 2020 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=7197f1a616caf85508d81c7f5c9f065ffaebf027 It follows a similar strategy which was implemented in Raid0 in the below commit, which was merged in 4.12-rc2: commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0 Author: Shaohua Li <shli@fb.com> Date: Sun May 7 17:36:24 2017 -0700 Subject: md/md0: optimize raid0 discard handling Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. $ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s I built a test kernel based on 5.9-rc6 with the above patches, and we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. [Regression Potential] If a regression were to occur, then it would affect operations which would trigger block discard operations, such as mkfs and fstrim, on Raid10 only. 
Other Raid levels would not be affected, although, I should note there will be a small risk of regression to Raid0, due to one of its functions being re-factored and split out, for use in both Raid0 and Raid10. The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, or SSD devices which do not support block discard would not be affected. If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>" which would skip block discard entirely. BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. It is currently in the md-next tree [1], and I am expecting the commits to be merged during the 5.10 merge window. 
[1] https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-next commit 5b2374a6c221f28c74913d208bb5376a7ee3bf70 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:23 2020 +0800 Subject: md/raid10: improve discard request for far layout Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=5b2374a6c221f28c74913d208bb5376a7ee3bf70 commit 8f694215ae4c7abf1e6c985803a1aad0db748d07 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:22 2020 +0800 Subject: md/raid10: improve raid10 discard request Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=8f694215ae4c7abf1e6c985803a1aad0db748d07 commit 6fcfa8732a8cfea7828a9444c855691c481ee557 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:01 2020 +0800 Subject: md/raid10: pull codes that wait for blocked dev into one function Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=6fcfa8732a8cfea7828a9444c855691c481ee557 commit 6f4fed152a5e483af2227156ce7b6263aeeb5c84 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:00 2020 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=6f4fed152a5e483af2227156ce7b6263aeeb5c84 commit 7197f1a616caf85508d81c7f5c9f065ffaebf027 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:42:59 2020 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=7197f1a616caf85508d81c7f5c9f065ffaebf027 It follows a similar strategy which was implemented in Raid0 in the below commit, which was merged in 4.12-rc2: commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0 Author: Shaohua Li <shli@fb.com> Date: Sun May 7 17:36:24 2017 -0700 Subject: md/md0: optimize raid0 discard handling Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. $ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s I built a test kernel based on 5.9-rc6 with the above patches, and we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. [Regression Potential] If a regression were to occur, then it would affect operations which would trigger block discard operations, such as mkfs and fstrim, on Raid10 only. 
Other Raid levels would not be affected, although I should note there is a small risk of regression to Raid0, since one of its functions is being refactored and split out for use in both Raid0 and Raid10. The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, or SSD devices which do not support block discard, would not be affected. If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>", which skips block discard entirely.
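(A rough sense of scale, added here as an editor's back-of-the-envelope figure rather than something stated in the report: with discard_max_bytes capped at 524288 bytes, discarding 1.9 TB, taken as 1.9 x 10^12 bytes, needs on the order of 3.6 million bios, whereas the same range fits within the 2199023255040-byte hardware limit in a single request.)
$ echo $(( 1900000000000 / 524288 ))
3623962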
2020-09-22 07:15:17 Matthew Ruffell description BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. It is currently in the md-next tree [1], and I am expecting the commits to be merged during the 5.10 merge window. 
[1] https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-next commit 5b2374a6c221f28c74913d208bb5376a7ee3bf70 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:23 2020 +0800 Subject: md/raid10: improve discard request for far layout Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=5b2374a6c221f28c74913d208bb5376a7ee3bf70 commit 8f694215ae4c7abf1e6c985803a1aad0db748d07 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:22 2020 +0800 Subject: md/raid10: improve raid10 discard request Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=8f694215ae4c7abf1e6c985803a1aad0db748d07 commit 6fcfa8732a8cfea7828a9444c855691c481ee557 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:01 2020 +0800 Subject: md/raid10: pull codes that wait for blocked dev into one function Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=6fcfa8732a8cfea7828a9444c855691c481ee557 commit 6f4fed152a5e483af2227156ce7b6263aeeb5c84 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:00 2020 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=6f4fed152a5e483af2227156ce7b6263aeeb5c84 commit 7197f1a616caf85508d81c7f5c9f065ffaebf027 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:42:59 2020 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=7197f1a616caf85508d81c7f5c9f065ffaebf027 It follows a similar strategy which was implemented in Raid0 in the below commit, which was merged in 4.12-rc2: commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0 Author: Shaohua Li <shli@fb.com> Date: Sun May 7 17:36:24 2017 -0700 Subject: md/md0: optimize raid0 discard handling Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. $ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s I built a test kernel based on 5.9-rc6 with the above patches, and we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. [Regression Potential] If a regression were to occur, then it would affect operations which would trigger block discard operations, such as mkfs and fstrim, on Raid10 only. 
Other Raid levels would not be affected, although, I should note there will be a small risk of regression to Raid0, due to one of its functions being re-factored and split out, for use in both Raid0 and Raid10. The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, or SSD devices which do not support block discard would not be affected. If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>" which would skip block discard entirely. BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. It is currently in the md-next tree [1], and I am expecting the commits to be merged during the 5.10 merge window. 
[1] https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-next commit 5b2374a6c221f28c74913d208bb5376a7ee3bf70 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:23 2020 +0800 Subject: md/raid10: improve discard request for far layout Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=5b2374a6c221f28c74913d208bb5376a7ee3bf70 commit 8f694215ae4c7abf1e6c985803a1aad0db748d07 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:22 2020 +0800 Subject: md/raid10: improve raid10 discard request Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=8f694215ae4c7abf1e6c985803a1aad0db748d07 commit 6fcfa8732a8cfea7828a9444c855691c481ee557 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:01 2020 +0800 Subject: md/raid10: pull codes that wait for blocked dev into one function Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=6fcfa8732a8cfea7828a9444c855691c481ee557 commit 6f4fed152a5e483af2227156ce7b6263aeeb5c84 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:00 2020 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=6f4fed152a5e483af2227156ce7b6263aeeb5c84 commit 7197f1a616caf85508d81c7f5c9f065ffaebf027 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:42:59 2020 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=7197f1a616caf85508d81c7f5c9f065ffaebf027 It follows a similar strategy which was implemented in Raid0 in the below commit, which was merged in 4.12-rc2: commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0 Author: Shaohua Li <shli@fb.com> Date: Sun May 7 17:36:24 2017 -0700 Subject: md/md0: optimize raid0 discard handling Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. $ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s I built a test kernel based on 5.9-rc6 with the above patches, and we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Regression Potential] If a regression were to occur, then it would affect operations which would trigger block discard operations, such as mkfs and fstrim, on Raid10 only. 
Other Raid levels would not be affected, although I should note there is a small risk of regression to Raid0, since one of its functions is being refactored and split out for use in both Raid0 and Raid10. The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, or SSD devices which do not support block discard, would not be affected. If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>", which skips block discard entirely.
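(For anyone needing the workaround named in the [Regression Potential] section above, the switch quoted there skips the discard pass at mkfs time for XFS; the ext4 equivalent is added here for completeness as a standard mke2fs option and is not mentioned in the bug itself.)
$ sudo mkfs.xfs -K /dev/md0
$ sudo mkfs.ext4 -E nodiscard /dev/md0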
2020-10-21 23:45:29 Matthew Ruffell description BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. It is currently in the md-next tree [1], and I am expecting the commits to be merged during the 5.10 merge window. 
[1] https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-next commit 5b2374a6c221f28c74913d208bb5376a7ee3bf70 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:23 2020 +0800 Subject: md/raid10: improve discard request for far layout Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=5b2374a6c221f28c74913d208bb5376a7ee3bf70 commit 8f694215ae4c7abf1e6c985803a1aad0db748d07 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:22 2020 +0800 Subject: md/raid10: improve raid10 discard request Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=8f694215ae4c7abf1e6c985803a1aad0db748d07 commit 6fcfa8732a8cfea7828a9444c855691c481ee557 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:01 2020 +0800 Subject: md/raid10: pull codes that wait for blocked dev into one function Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=6fcfa8732a8cfea7828a9444c855691c481ee557 commit 6f4fed152a5e483af2227156ce7b6263aeeb5c84 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:00 2020 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=6f4fed152a5e483af2227156ce7b6263aeeb5c84 commit 7197f1a616caf85508d81c7f5c9f065ffaebf027 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:42:59 2020 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=7197f1a616caf85508d81c7f5c9f065ffaebf027 It follows a similar strategy which was implemented in Raid0 in the below commit, which was merged in 4.12-rc2: commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0 Author: Shaohua Li <shli@fb.com> Date: Sun May 7 17:36:24 2017 -0700 Subject: md/md0: optimize raid0 discard handling Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. $ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s I built a test kernel based on 5.9-rc6 with the above patches, and we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Regression Potential] If a regression were to occur, then it would affect operations which would trigger block discard operations, such as mkfs and fstrim, on Raid10 only. 
Other Raid levels would not be affected, although, I should note there will be a small risk of regression to Raid0, due to one of its functions being re-factored and split out, for use in both Raid0 and Raid10. The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, or SSD devices which do not support block discard would not be affected. If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>" which would skip block discard entirely. BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1. 
commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:23 2020 +0800 Subject: md/raid10: improve discard request for far layout Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359 commit bcc90d280465ebd51ab8688be86e1f00c62dccf9 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:22 2020 +0800 Subject: md/raid10: improve raid10 discard request Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9 commit f046f5d0d79cdb968f219ce249e497fd1accf484 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:01 2020 +0800 Subject: md/raid10: pull codes that wait for blocked dev into one function Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484 commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:00 2020 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:42:59 2020 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. The following commit enables Radid10 to use large discards, instead of splitting into many bios, since the technical hurdles have now been removed. commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 16:40:12 2020 -0400 Subject: dm raid: remove unnecessary discard limits for raid10 Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28 All the commits mentioned follow a similar strategy which was implemented in Raid0 in the below commit, which was merged in 4.12-rc2, which fixed block discard performance issues in Raid0: commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0 Author: Shaohua Li <shli@fb.com> Date: Sun May 7 17:36:24 2017 -0700 Subject: md/md0: optimize raid0 discard handling Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. $ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s I built a test kernel based on 5.9-rc6 with the above patches, and we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. 
The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Regression Potential] If a regression were to occur, then it would affect operations which would trigger block discard operations, such as mkfs and fstrim, on Raid10 only. Other Raid levels would not be affected, although, I should note there will be a small risk of regression to Raid0, due to one of its functions being re-factored and split out, for use in both Raid0 and Raid10. The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, or SSD devices which do not support block discard would not be affected. If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>" which would skip block discard entirely.
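(An illustrative check, not part of the report: whether a kernel source tree already carries the discard patchset listed above can be seen by searching the merge-window range for discard-related md commits. The tag range below assumes a clone that has the v5.9 and v5.10-rc1 tags.)
$ git log --oneline v5.9..v5.10-rc1 -- drivers/md/ | grep -i discard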
2020-10-25 02:34:14 Matthew Ruffell description BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1. 
commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:23 2020 +0800 Subject: md/raid10: improve discard request for far layout Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359 commit bcc90d280465ebd51ab8688be86e1f00c62dccf9 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:22 2020 +0800 Subject: md/raid10: improve raid10 discard request Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9 commit f046f5d0d79cdb968f219ce249e497fd1accf484 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:01 2020 +0800 Subject: md/raid10: pull codes that wait for blocked dev into one function Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484 commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:00 2020 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:42:59 2020 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. The following commit enables Radid10 to use large discards, instead of splitting into many bios, since the technical hurdles have now been removed. commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 16:40:12 2020 -0400 Subject: dm raid: remove unnecessary discard limits for raid10 Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28 All the commits mentioned follow a similar strategy which was implemented in Raid0 in the below commit, which was merged in 4.12-rc2, which fixed block discard performance issues in Raid0: commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0 Author: Shaohua Li <shli@fb.com> Date: Sun May 7 17:36:24 2017 -0700 Subject: md/md0: optimize raid0 discard handling Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. $ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s I built a test kernel based on 5.9-rc6 with the above patches, and we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. 
The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Regression Potential] If a regression were to occur, then it would affect operations which would trigger block discard operations, such as mkfs and fstrim, on Raid10 only. Other Raid levels would not be affected, although, I should note there will be a small risk of regression to Raid0, due to one of its functions being re-factored and split out, for use in both Raid0 and Raid10. The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, or SSD devices which do not support block discard would not be affected. If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>" which would skip block discard entirely. BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1. 
commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:23 2020 +0800 Subject: md/raid10: improve discard request for far layout Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359 commit bcc90d280465ebd51ab8688be86e1f00c62dccf9 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:22 2020 +0800 Subject: md/raid10: improve raid10 discard request Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9 commit f046f5d0d79cdb968f219ce249e497fd1accf484 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:01 2020 +0800 Subject: md/raid10: pull codes that wait for blocked dev into one function Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484 commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:00 2020 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:42:59 2020 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. The following commits enable Radid10 to use large discards, instead of splitting into many bios, since the technical hurdles have now been removed. commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 13:14:52 2020 -0400 Subject: dm raid: fix discard limits for raid1 and raid10 Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512 commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 16:40:12 2020 -0400 Subject: dm raid: remove unnecessary discard limits for raid10 Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28 All the commits mentioned follow a similar strategy which was implemented in Raid0 in the below commit, which was merged in 4.12-rc2, which fixed block discard performance issues in Raid0: commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0 Author: Shaohua Li <shli@fb.com> Date: Sun May 7 17:36:24 2017 -0700 Subject: md/md0: optimize raid0 discard handling Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. 
$ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s I built a test kernel based on 5.9-rc6 with the above patches, and we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Regression Potential] If a regression were to occur, then it would affect operations which would trigger block discard operations, such as mkfs and fstrim, on Raid10 only. Other Raid levels would not be affected, although, I should note there will be a small risk of regression to Raid0, due to one of its functions being re-factored and split out, for use in both Raid0 and Raid10. The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, or SSD devices which do not support block discard would not be affected. If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>" which would skip block discard entirely.
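(The stack in the description passes through blk_ioctl_discard(), which is the same path the blkdiscard(8) utility drives, so the discard behaviour can also be timed directly against the array without creating a filesystem. This is an editor-added suggestion, not a step from the report, and it destroys any data on the device.)
$ time sudo blkdiscard /dev/md0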
2020-10-25 02:54:35 Matthew Ruffell description BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1. 
commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:23 2020 +0800 Subject: md/raid10: improve discard request for far layout Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359 commit bcc90d280465ebd51ab8688be86e1f00c62dccf9 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:22 2020 +0800 Subject: md/raid10: improve raid10 discard request Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9 commit f046f5d0d79cdb968f219ce249e497fd1accf484 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:01 2020 +0800 Subject: md/raid10: pull codes that wait for blocked dev into one function Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484 commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:00 2020 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:42:59 2020 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. The following commits enable Radid10 to use large discards, instead of splitting into many bios, since the technical hurdles have now been removed. commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 13:14:52 2020 -0400 Subject: dm raid: fix discard limits for raid1 and raid10 Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512 commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 16:40:12 2020 -0400 Subject: dm raid: remove unnecessary discard limits for raid10 Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28 All the commits mentioned follow a similar strategy which was implemented in Raid0 in the below commit, which was merged in 4.12-rc2, which fixed block discard performance issues in Raid0: commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0 Author: Shaohua Li <shli@fb.com> Date: Sun May 7 17:36:24 2017 -0700 Subject: md/md0: optimize raid0 discard handling Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. 
$ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s I built a test kernel based on 5.9-rc6 with the above patches, and we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Regression Potential] If a regression were to occur, then it would affect operations which would trigger block discard operations, such as mkfs and fstrim, on Raid10 only. Other Raid levels would not be affected, although, I should note there will be a small risk of regression to Raid0, due to one of its functions being re-factored and split out, for use in both Raid0 and Raid10. The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, or SSD devices which do not support block discard would not be affected. If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>" which would skip block discard entirely. BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. 
For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1. commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:42:59 2020 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:00 2020 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 commit f046f5d0d79cdb968f219ce249e497fd1accf484 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:01 2020 +0800 Subject: md/raid10: pull codes that wait for blocked dev into one function Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484 commit bcc90d280465ebd51ab8688be86e1f00c62dccf9 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:22 2020 +0800 Subject: md/raid10: improve raid10 discard request Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9 commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:23 2020 +0800 Subject: md/raid10: improve discard request for far layout Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359 There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. The following commits enable Radid10 to use large discards, instead of splitting into many bios, since the technical hurdles have now been removed. 
commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 13:14:52 2020 -0400 Subject: dm raid: fix discard limits for raid1 and raid10 Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512 commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 16:40:12 2020 -0400 Subject: dm raid: remove unnecessary discard limits for raid10 Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28 All the commits mentioned follow a similar strategy which was implemented in Raid0 in the below commit, which was merged in 4.12-rc2, which fixed block discard performance issues in Raid0: commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0 Author: Shaohua Li <shli@fb.com> Date: Sun May 7 17:36:24 2017 -0700 Subject: md/md0: optimize raid0 discard handling Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. $ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s I built a test kernel based on 5.9-rc6 with the above patches, and we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Regression Potential] If a regression were to occur, then it would affect operations which would trigger block discard operations, such as mkfs and fstrim, on Raid10 only. Other Raid levels would not be affected, although, I should note there will be a small risk of regression to Raid0, due to one of its functions being re-factored and split out, for use in both Raid0 and Raid10. The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, or SSD devices which do not support block discard would not be affected. If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>" which would skip block discard entirely.
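[Editor's illustration, not part of the original log entry] The [Impact] analysis in the entry above can be checked with a small shell sketch. It assumes the same /dev/md0 array and nvme0n1 member device named in the description, and the bio estimate is simple arithmetic on the 512k limit and 1.9TB device size quoted there:

# Hedged sketch: compare the discard limits of the md array and one member device.
# Device names follow the description; adjust for your system.
$ for dev in md0 nvme0n1; do
>   echo "$dev: $(cat /sys/block/$dev/queue/discard_max_bytes) / $(cat /sys/block/$dev/queue/discard_max_hw_bytes)"
> done
# With discard_max_bytes capped at 524288 (512k), discarding one 1.9TB device is
# split into roughly 1900000000000 / 524288 bio requests, i.e. about 3.6 million:
$ echo $(( 1900000000000 / 524288 ))
3623962

That bio count is what keeps blkdev_issue_discard() busy for the 8 to 11 minutes observed in the mkfs.xfs and fstrim runs.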
2020-10-25 08:07:01 Matthew Ruffell description BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1. 
commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:42:59 2020 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:00 2020 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 commit f046f5d0d79cdb968f219ce249e497fd1accf484 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:01 2020 +0800 Subject: md/raid10: pull codes that wait for blocked dev into one function Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484 commit bcc90d280465ebd51ab8688be86e1f00c62dccf9 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:22 2020 +0800 Subject: md/raid10: improve raid10 discard request Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9 commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:23 2020 +0800 Subject: md/raid10: improve discard request for far layout Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359 There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. The following commits enable Radid10 to use large discards, instead of splitting into many bios, since the technical hurdles have now been removed. commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 13:14:52 2020 -0400 Subject: dm raid: fix discard limits for raid1 and raid10 Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512 commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 16:40:12 2020 -0400 Subject: dm raid: remove unnecessary discard limits for raid10 Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28 All the commits mentioned follow a similar strategy which was implemented in Raid0 in the below commit, which was merged in 4.12-rc2, which fixed block discard performance issues in Raid0: commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0 Author: Shaohua Li <shli@fb.com> Date: Sun May 7 17:36:24 2017 -0700 Subject: md/md0: optimize raid0 discard handling Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. 
$ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s I built a test kernel based on 5.9-rc6 with the above patches, and we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Regression Potential] If a regression were to occur, then it would affect operations which would trigger block discard operations, such as mkfs and fstrim, on Raid10 only. Other Raid levels would not be affected, although, I should note there will be a small risk of regression to Raid0, due to one of its functions being re-factored and split out, for use in both Raid0 and Raid10. The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, or SSD devices which do not support block discard would not be affected. If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>" which would skip block discard entirely. BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. 
For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1. commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:42:59 2020 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:00 2020 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 commit f046f5d0d79cdb968f219ce249e497fd1accf484 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:01 2020 +0800 Subject: md/raid10: pull codes that wait for blocked dev into one function Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484 commit bcc90d280465ebd51ab8688be86e1f00c62dccf9 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:22 2020 +0800 Subject: md/raid10: improve raid10 discard request Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9 commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:23 2020 +0800 Subject: md/raid10: improve discard request for far layout Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359 There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. The following commits enable Radid10 to use large discards, instead of splitting into many bios, since the technical hurdles have now been removed. 
commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 13:14:52 2020 -0400 Subject: dm raid: fix discard limits for raid1 and raid10 Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512 commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 16:40:12 2020 -0400 Subject: dm raid: remove unnecessary discard limits for raid10 Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28 All the commits mentioned follow a similar strategy which was implemented in Raid0 in the below commit, which was merged in 4.12-rc2, which fixed block discard performance issues in Raid0: commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0 Author: Shaohua Li <shli@fb.com> Date: Sun May 7 17:36:24 2017 -0700 Subject: md/md0: optimize raid0 discard handling Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 The commits more or less cherry pick to the 5.8, 5.4 and 4.15 kernels, with the following minor fixups: 1) submit_bio_noacct() needed to be renamed to generic_make_request() since it was recently changed in: commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead Author: Christoph Hellwig <hch@lst.de> Date: Wed Jul 1 10:59:44 2020 +0200 Subject: block: rename generic_make_request to submit_bio_noacct Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead 2) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of" '&' removed for one of their arguments for the 4.15 kernel, due to changes made in: commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8 Author: Kent Overstreet <kent.overstreet@gmail.com> Date: Sun May 20 18:25:52 2018 -0400 Subject: md: convert to bioset_init()/mempool_init() Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8 3) The 4.15 kernel does not need "dm raid: fix discard limits for raid1 and raid10" and "dm raid: remove unnecessary discard limits for raid10" due to not having the following commit, which was merged in 5.1-rc1: commit 61697a6abd24acba941359c6268a94f4afe4a53d Author: Mike Snitzer <snitzer@redhat.com> Date: Fri Jan 18 14:19:26 2019 -0500 Subject: dm: eliminate 'split_discard_bios' flag from DM target interface Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. 
$ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA: https://launchpad.net/~mruffell/+archive/ubuntu/sf291726-test If you install a test kernel, we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. Performance Matrix (AWS i3.8xlarge): Kernel | mkfs.xfs | fstrim --------------------------------- 4.15 | 7m23.449s | 7m20.678s 5.4 | 8m23.219s | 8m23.927s 5.8 | 2m54.990s | 8m22.010s 4.15-test | 5.4-test | 0m6.075s | 0m3.150s 5.8-test | 0m2.753s | 0m2.999s The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Regression Potential] If a regression were to occur, then it would affect operations which would trigger block discard operations, such as mkfs and fstrim, on Raid10 only. Other Raid levels would not be affected, although, I should note there will be a small risk of regression to Raid0, due to one of its functions being re-factored and split out, for use in both Raid0 and Raid10. The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, or SSD devices which do not support block discard would not be affected. If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>" which would skip block discard entirely.
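[Editor's illustration, not part of the original log entry] The [Testcase] steps above can be condensed into one script. This is a sketch only: the script name is hypothetical, it assumes the same four NVMe devices listed in the description, and it will destroy any data on them:

$ cat > repro-raid10-discard.sh << 'EOF'
#!/bin/bash
set -e
mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
time mkfs.xfs /dev/md0            # minutes on an affected kernel, ~4s when patched
mkdir -p /mnt/disk
mount /dev/md0 /mnt/disk
time fstrim /mnt/disk             # same pattern as mkfs.xfs
cat /sys/block/md0/queue/discard_max_bytes
EOF
$ chmod +x repro-raid10-discard.sh
$ sudo ./repro-raid10-discard.sh

Running it once on a stock kernel and once on a patched kernel gives the before/after numbers quoted in the description.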
2020-10-27 22:51:47 Matthew Ruffell description BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1. 
commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:42:59 2020 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:00 2020 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 commit f046f5d0d79cdb968f219ce249e497fd1accf484 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:01 2020 +0800 Subject: md/raid10: pull codes that wait for blocked dev into one function Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484 commit bcc90d280465ebd51ab8688be86e1f00c62dccf9 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:22 2020 +0800 Subject: md/raid10: improve raid10 discard request Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9 commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:23 2020 +0800 Subject: md/raid10: improve discard request for far layout Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359 There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. The following commits enable Radid10 to use large discards, instead of splitting into many bios, since the technical hurdles have now been removed. commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 13:14:52 2020 -0400 Subject: dm raid: fix discard limits for raid1 and raid10 Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512 commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 16:40:12 2020 -0400 Subject: dm raid: remove unnecessary discard limits for raid10 Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28 All the commits mentioned follow a similar strategy which was implemented in Raid0 in the below commit, which was merged in 4.12-rc2, which fixed block discard performance issues in Raid0: commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0 Author: Shaohua Li <shli@fb.com> Date: Sun May 7 17:36:24 2017 -0700 Subject: md/md0: optimize raid0 discard handling Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 The commits more or less cherry pick to the 5.8, 5.4 and 4.15 kernels, with the following minor fixups: 1) submit_bio_noacct() needed to be renamed to generic_make_request() since it was recently changed in: commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead Author: Christoph Hellwig <hch@lst.de> Date: Wed Jul 1 10:59:44 2020 +0200 Subject: block: rename generic_make_request to submit_bio_noacct Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead 2) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of" '&' removed for one of their arguments for the 4.15 kernel, due to changes made in: commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8 Author: Kent Overstreet <kent.overstreet@gmail.com> Date: Sun May 20 18:25:52 2018 -0400 Subject: md: convert to bioset_init()/mempool_init() Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8 3) The 4.15 kernel 
does not need "dm raid: fix discard limits for raid1 and raid10" and "dm raid: remove unnecessary discard limits for raid10" due to not having the following commit, which was merged in 5.1-rc1: commit 61697a6abd24acba941359c6268a94f4afe4a53d Author: Mike Snitzer <snitzer@redhat.com> Date: Fri Jan 18 14:19:26 2019 -0500 Subject: dm: eliminate 'split_discard_bios' flag from DM target interface Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. $ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA: https://launchpad.net/~mruffell/+archive/ubuntu/sf291726-test If you install a test kernel, we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. Performance Matrix (AWS i3.8xlarge): Kernel | mkfs.xfs | fstrim --------------------------------- 4.15 | 7m23.449s | 7m20.678s 5.4 | 8m23.219s | 8m23.927s 5.8 | 2m54.990s | 8m22.010s 4.15-test | 5.4-test | 0m6.075s | 0m3.150s 5.8-test | 0m2.753s | 0m2.999s The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Regression Potential] If a regression were to occur, then it would affect operations which would trigger block discard operations, such as mkfs and fstrim, on Raid10 only. Other Raid levels would not be affected, although, I should note there will be a small risk of regression to Raid0, due to one of its functions being re-factored and split out, for use in both Raid0 and Raid10. The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, or SSD devices which do not support block discard would not be affected. If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>" which would skip block discard entirely. BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. 
If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1. commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:42:59 2020 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:00 2020 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 commit f046f5d0d79cdb968f219ce249e497fd1accf484 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:01 2020 +0800 Subject: md/raid10: pull codes that wait for blocked dev into one function Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484 commit bcc90d280465ebd51ab8688be86e1f00c62dccf9 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:22 2020 +0800 Subject: md/raid10: improve raid10 discard request Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9 commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:23 2020 +0800 Subject: md/raid10: improve discard request for far layout Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359 There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. The following commits enable Radid10 to use large discards, instead of splitting into many bios, since the technical hurdles have now been removed. 
commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 13:14:52 2020 -0400 Subject: dm raid: fix discard limits for raid1 and raid10 Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512 commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 16:40:12 2020 -0400 Subject: dm raid: remove unnecessary discard limits for raid10 Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28 All the commits mentioned follow a similar strategy which was implemented in Raid0 in the below commit, which was merged in 4.12-rc2, which fixed block discard performance issues in Raid0: commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0 Author: Shaohua Li <shli@fb.com> Date: Sun May 7 17:36:24 2017 -0700 Subject: md/md0: optimize raid0 discard handling Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 The commits more or less cherry pick to the 5.8, 5.4 and 4.15 kernels, with the following minor fixups: 1) submit_bio_noacct() needed to be renamed to generic_make_request() since it was recently changed in: commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead Author: Christoph Hellwig <hch@lst.de> Date: Wed Jul 1 10:59:44 2020 +0200 Subject: block: rename generic_make_request to submit_bio_noacct Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead 2) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of" '&' removed for one of their arguments for the 4.15 kernel, due to changes made in: commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8 Author: Kent Overstreet <kent.overstreet@gmail.com> Date: Sun May 20 18:25:52 2018 -0400 Subject: md: convert to bioset_init()/mempool_init() Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8 3) The 4.15 kernel does not need "dm raid: fix discard limits for raid1 and raid10" and "dm raid: remove unnecessary discard limits for raid10" due to not having the following commit, which was merged in 5.1-rc1: commit 61697a6abd24acba941359c6268a94f4afe4a53d Author: Mike Snitzer <snitzer@redhat.com> Date: Fri Jan 18 14:19:26 2019 -0500 Subject: dm: eliminate 'split_discard_bios' flag from DM target interface Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. 
$ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA: https://launchpad.net/~mruffell/+archive/ubuntu/sf291726-test If you install a test kernel, we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. Performance Matrix (AWS i3.8xlarge): Kernel | mkfs.xfs | fstrim --------------------------------- 4.15 | 7m23.449s | 7m20.678s 5.4 | 8m23.219s | 8m23.927s 5.8 | 2m54.990s | 8m22.010s 4.15-test | 0m4.286s | 0m1.657s 5.4-test | 0m6.075s | 0m3.150s 5.8-test | 0m2.753s | 0m2.999s The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Regression Potential] If a regression were to occur, then it would affect operations which would trigger block discard operations, such as mkfs and fstrim, on Raid10 only. Other Raid levels would not be affected, although, I should note there will be a small risk of regression to Raid0, due to one of its functions being re-factored and split out, for use in both Raid0 and Raid10. The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, or SSD devices which do not support block discard would not be affected. If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>" which would skip block discard entirely.
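[Editor's illustration, not part of the original log entry] The [Regression Potential] section above names "mkfs.xfs -K" as the fallback if the patches misbehave. A hedged sketch of that workaround, plus a quick check that the devices actually advertise discard support, might look like this (device names again follow the description):

# Non-zero DISC-GRAN / DISC-MAX columns indicate the device supports discard.
$ lsblk -D /dev/md0 /dev/nvme0n1
# -K tells mkfs.xfs not to issue block discards for the device before formatting.
$ sudo mkfs.xfs -K /dev/md0
# There is no direct fstrim equivalent; defer manual fstrim runs (and the weekly
# fstrim.timer, where enabled) until a fixed kernel is installed.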
2020-10-28 23:07:00 Matthew Ruffell description BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1. 
commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:42:59 2020 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:00 2020 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 commit f046f5d0d79cdb968f219ce249e497fd1accf484 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:01 2020 +0800 Subject: md/raid10: pull codes that wait for blocked dev into one function Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484 commit bcc90d280465ebd51ab8688be86e1f00c62dccf9 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:22 2020 +0800 Subject: md/raid10: improve raid10 discard request Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9 commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:23 2020 +0800 Subject: md/raid10: improve discard request for far layout Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359 There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. The following commits enable Radid10 to use large discards, instead of splitting into many bios, since the technical hurdles have now been removed. commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 13:14:52 2020 -0400 Subject: dm raid: fix discard limits for raid1 and raid10 Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512 commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 16:40:12 2020 -0400 Subject: dm raid: remove unnecessary discard limits for raid10 Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28 All the commits mentioned follow a similar strategy which was implemented in Raid0 in the below commit, which was merged in 4.12-rc2, which fixed block discard performance issues in Raid0: commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0 Author: Shaohua Li <shli@fb.com> Date: Sun May 7 17:36:24 2017 -0700 Subject: md/md0: optimize raid0 discard handling Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 The commits more or less cherry pick to the 5.8, 5.4 and 4.15 kernels, with the following minor fixups: 1) submit_bio_noacct() needed to be renamed to generic_make_request() since it was recently changed in: commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead Author: Christoph Hellwig <hch@lst.de> Date: Wed Jul 1 10:59:44 2020 +0200 Subject: block: rename generic_make_request to submit_bio_noacct Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead 2) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of" '&' removed for one of their arguments for the 4.15 kernel, due to changes made in: commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8 Author: Kent Overstreet <kent.overstreet@gmail.com> Date: Sun May 20 18:25:52 2018 -0400 Subject: md: convert to bioset_init()/mempool_init() Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8 3) The 4.15 kernel 
does not need "dm raid: fix discard limits for raid1 and raid10" and "dm raid: remove unnecessary discard limits for raid10" due to not having the following commit, which was merged in 5.1-rc1: commit 61697a6abd24acba941359c6268a94f4afe4a53d Author: Mike Snitzer <snitzer@redhat.com> Date: Fri Jan 18 14:19:26 2019 -0500 Subject: dm: eliminate 'split_discard_bios' flag from DM target interface Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. $ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA: https://launchpad.net/~mruffell/+archive/ubuntu/sf291726-test If you install a test kernel, we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. Performance Matrix (AWS i3.8xlarge): Kernel | mkfs.xfs | fstrim --------------------------------- 4.15 | 7m23.449s | 7m20.678s 5.4 | 8m23.219s | 8m23.927s 5.8 | 2m54.990s | 8m22.010s 4.15-test | 0m4.286s | 0m1.657s 5.4-test | 0m6.075s | 0m3.150s 5.8-test | 0m2.753s | 0m2.999s The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Regression Potential] If a regression were to occur, then it would affect operations which would trigger block discard operations, such as mkfs and fstrim, on Raid10 only. Other Raid levels would not be affected, although, I should note there will be a small risk of regression to Raid0, due to one of its functions being re-factored and split out, for use in both Raid0 and Raid10. The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, or SSD devices which do not support block discard would not be affected. If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>" which would skip block discard entirely. BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. 
The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1. commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:42:59 2020 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:00 2020 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 commit f046f5d0d79cdb968f219ce249e497fd1accf484 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:01 2020 +0800 Subject: md/raid10: pull codes that wait for blocked dev into one function Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484 commit bcc90d280465ebd51ab8688be86e1f00c62dccf9 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:22 2020 +0800 Subject: md/raid10: improve raid10 discard request Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9 commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:23 2020 +0800 Subject: md/raid10: improve discard request for far layout Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359 There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. The following commits enable Radid10 to use large discards, instead of splitting into many bios, since the technical hurdles have now been removed. 
commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 13:14:52 2020 -0400 Subject: dm raid: fix discard limits for raid1 and raid10 Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512 commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 16:40:12 2020 -0400 Subject: dm raid: remove unnecessary discard limits for raid10 Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28 All the commits mentioned follow a similar strategy which was implemented in Raid0 in the below commit, which was merged in 4.12-rc2, which fixed block discard performance issues in Raid0: commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0 Author: Shaohua Li <shli@fb.com> Date: Sun May 7 17:36:24 2017 -0700 Subject: md/md0: optimize raid0 discard handling Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 The commits more or less cherry pick to the 5.8, 5.4 and 4.15 kernels, with the following minor fixups: 1) submit_bio_noacct() needed to be renamed to generic_make_request() since it was recently changed in: commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead Author: Christoph Hellwig <hch@lst.de> Date: Wed Jul 1 10:59:44 2020 +0200 Subject: block: rename generic_make_request to submit_bio_noacct Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead 2) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of" '&' removed for one of their arguments for the 4.15 kernel, due to changes made in: commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8 Author: Kent Overstreet <kent.overstreet@gmail.com> Date: Sun May 20 18:25:52 2018 -0400 Subject: md: convert to bioset_init()/mempool_init() Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8 3) The 4.15 kernel does not need "dm raid: fix discard limits for raid1 and raid10" and "dm raid: remove unnecessary discard limits for raid10" due to not having the following commit, which was merged in 5.1-rc1: commit 61697a6abd24acba941359c6268a94f4afe4a53d Author: Mike Snitzer <snitzer@redhat.com> Date: Fri Jan 18 14:19:26 2019 -0500 Subject: dm: eliminate 'split_discard_bios' flag from DM target interface Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d 4) The 4.15 kernel needed bio_clone_blkg_association() to be renamed to bio_clone_blkcg_association() due to it changing in: commit db6638d7d177a8bc74c9e539e2e0d7d061c767b1 Author: Dennis Zhou <dennis@kernel.org> Date: Wed Dec 5 12:10:35 2018 -0500 Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg https://github.com/torvalds/linux/commit/db6638d7d177a8bc74c9e539e2e0d7d061c767b1 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. 
$ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA: https://launchpad.net/~mruffell/+archive/ubuntu/sf291726-test If you install a test kernel, we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. Performance Matrix (AWS i3.8xlarge): Kernel | mkfs.xfs | fstrim --------------------------------- 4.15 | 7m23.449s | 7m20.678s 5.4 | 8m23.219s | 8m23.927s 5.8 | 2m54.990s | 8m22.010s 4.15-test | 0m4.286s | 0m1.657s 5.4-test | 0m6.075s | 0m3.150s 5.8-test | 0m2.753s | 0m2.999s The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Regression Potential] If a regression were to occur, then it would affect operations which would trigger block discard operations, such as mkfs and fstrim, on Raid10 only. Other Raid levels would not be affected, although, I should note there will be a small risk of regression to Raid0, due to one of its functions being re-factored and split out, for use in both Raid0 and Raid10. The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, or SSD devices which do not support block discard would not be affected. If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>" which would skip block discard entirely.
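[Editor's illustration, not part of the original log entry] A short verification sketch for a patched kernel, for example one built from the sf291726-test PPA referenced above; the expected discard_max_bytes value is the hardware limit quoted in the description, and the md0/mount paths are assumed to match the testcase:

$ uname -r                                        # confirm the test kernel is booted
$ cat /sys/block/md0/queue/discard_max_bytes      # expect the hardware limit below
2199023255040                                     # (524288 on an unpatched kernel)
$ sudo mount /dev/md0 /mnt/disk
$ time sudo fstrim -v /mnt/disk                   # should now complete in seconds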
2020-11-06 05:26:53 Ian May linux (Ubuntu Groovy): status In Progress Fix Committed
2020-11-06 06:05:09 Ian May linux (Ubuntu Focal): status In Progress Fix Committed
2020-11-06 06:20:53 Ian May linux (Ubuntu Bionic): status In Progress Fix Committed
2020-11-17 10:03:30 Ubuntu Kernel Bot tags sts sts verification-needed-bionic
2020-11-17 10:05:17 Ubuntu Kernel Bot tags sts verification-needed-bionic sts verification-needed-bionic verification-needed-focal
2020-11-17 10:07:37 Ubuntu Kernel Bot tags sts verification-needed-bionic verification-needed-focal sts verification-needed-bionic verification-needed-focal verification-needed-groovy
2020-11-18 03:57:12 Matthew Ruffell tags sts verification-needed-bionic verification-needed-focal verification-needed-groovy sts verification-done-groovy verification-needed-bionic verification-needed-focal
2020-11-18 04:04:40 Matthew Ruffell tags sts verification-done-groovy verification-needed-bionic verification-needed-focal sts verification-done-focal verification-done-groovy verification-needed-bionic
2020-11-18 04:12:39 Matthew Ruffell tags sts verification-done-focal verification-done-groovy verification-needed-bionic sts verification-done-bionic verification-done-focal verification-done-groovy
2020-11-30 15:46:09 Launchpad Janitor linux (Ubuntu Focal): status Fix Committed Fix Released
2020-11-30 15:46:09 Launchpad Janitor cve linked 2020-14351
2020-11-30 15:46:09 Launchpad Janitor cve linked 2020-4788
2020-12-01 17:43:22 Launchpad Janitor linux (Ubuntu Groovy): status Fix Committed Fix Released
2020-12-02 05:53:32 Launchpad Janitor linux (Ubuntu Bionic): status Fix Committed Fix Released
2020-12-09 02:17:01 Nivedita Singhvi bug added subscriber Nivedita Singhvi
2020-12-09 13:00:36 Eric Desrochers bug added subscriber Eric Desrochers
2021-01-11 14:58:14 Launchpad Janitor linux (Ubuntu): status In Progress Fix Released
2021-01-11 14:58:14 Launchpad Janitor cve linked 2021-1052
2021-01-11 14:58:14 Launchpad Janitor cve linked 2021-1053
2021-01-11 20:26:34 Matthew Ruffell linux (Ubuntu): status Fix Released In Progress
2021-01-11 20:26:37 Matthew Ruffell linux (Ubuntu Bionic): status Fix Released In Progress
2021-01-11 20:26:39 Matthew Ruffell linux (Ubuntu Focal): status Fix Released In Progress
2021-01-11 20:26:46 Matthew Ruffell linux (Ubuntu Groovy): status Fix Released In Progress
2021-02-11 02:24:05 Matthew Ruffell description BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1. 
commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:42:59 2020 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:00 2020 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 commit f046f5d0d79cdb968f219ce249e497fd1accf484 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:01 2020 +0800 Subject: md/raid10: pull codes that wait for blocked dev into one function Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484 commit bcc90d280465ebd51ab8688be86e1f00c62dccf9 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:22 2020 +0800 Subject: md/raid10: improve raid10 discard request Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9 commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:23 2020 +0800 Subject: md/raid10: improve discard request for far layout Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359 There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. The following commits enable Radid10 to use large discards, instead of splitting into many bios, since the technical hurdles have now been removed. commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 13:14:52 2020 -0400 Subject: dm raid: fix discard limits for raid1 and raid10 Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512 commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 16:40:12 2020 -0400 Subject: dm raid: remove unnecessary discard limits for raid10 Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28 All the commits mentioned follow a similar strategy which was implemented in Raid0 in the below commit, which was merged in 4.12-rc2, which fixed block discard performance issues in Raid0: commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0 Author: Shaohua Li <shli@fb.com> Date: Sun May 7 17:36:24 2017 -0700 Subject: md/md0: optimize raid0 discard handling Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 The commits more or less cherry pick to the 5.8, 5.4 and 4.15 kernels, with the following minor fixups: 1) submit_bio_noacct() needed to be renamed to generic_make_request() since it was recently changed in: commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead Author: Christoph Hellwig <hch@lst.de> Date: Wed Jul 1 10:59:44 2020 +0200 Subject: block: rename generic_make_request to submit_bio_noacct Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead 2) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of" '&' removed for one of their arguments for the 4.15 kernel, due to changes made in: commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8 Author: Kent Overstreet <kent.overstreet@gmail.com> Date: Sun May 20 18:25:52 2018 -0400 Subject: md: convert to bioset_init()/mempool_init() Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8 3) The 4.15 kernel 
does not need "dm raid: fix discard limits for raid1 and raid10" and "dm raid: remove unnecessary discard limits for raid10" due to not having the following commit, which was merged in 5.1-rc1: commit 61697a6abd24acba941359c6268a94f4afe4a53d Author: Mike Snitzer <snitzer@redhat.com> Date: Fri Jan 18 14:19:26 2019 -0500 Subject: dm: eliminate 'split_discard_bios' flag from DM target interface Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d 4) The 4.15 kernel needed bio_clone_blkg_association() to be renamed to bio_clone_blkcg_association() due to it changing in: commit db6638d7d177a8bc74c9e539e2e0d7d061c767b1 Author: Dennis Zhou <dennis@kernel.org> Date: Wed Dec 5 12:10:35 2018 -0500 Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg https://github.com/torvalds/linux/commit/db6638d7d177a8bc74c9e539e2e0d7d061c767b1 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. $ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA: https://launchpad.net/~mruffell/+archive/ubuntu/sf291726-test If you install a test kernel, we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. Performance Matrix (AWS i3.8xlarge): Kernel | mkfs.xfs | fstrim --------------------------------- 4.15 | 7m23.449s | 7m20.678s 5.4 | 8m23.219s | 8m23.927s 5.8 | 2m54.990s | 8m22.010s 4.15-test | 0m4.286s | 0m1.657s 5.4-test | 0m6.075s | 0m3.150s 5.8-test | 0m2.753s | 0m2.999s The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Regression Potential] If a regression were to occur, then it would affect operations which would trigger block discard operations, such as mkfs and fstrim, on Raid10 only. Other Raid levels would not be affected, although, I should note there will be a small risk of regression to Raid0, due to one of its functions being re-factored and split out, for use in both Raid0 and Raid10. The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, or SSD devices which do not support block discard would not be affected. If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>" which would skip block discard entirely. 
BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1. commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:42:59 2020 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:00 2020 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 commit f046f5d0d79cdb968f219ce249e497fd1accf484 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:01 2020 +0800 Subject: md/raid10: pull codes that wait for blocked dev into one function Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484 commit bcc90d280465ebd51ab8688be86e1f00c62dccf9 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:22 2020 +0800 Subject: md/raid10: improve raid10 discard request Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9 commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:23 2020 +0800 Subject: md/raid10: improve discard request for far layout Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359 There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. 
The following commits enable Radid10 to use large discards, instead of splitting into many bios, since the technical hurdles have now been removed. commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 13:14:52 2020 -0400 Subject: dm raid: fix discard limits for raid1 and raid10 Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512 commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 16:40:12 2020 -0400 Subject: dm raid: remove unnecessary discard limits for raid10 Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28 All the commits mentioned follow a similar strategy which was implemented in Raid0 in the below commit, which was merged in 4.12-rc2, which fixed block discard performance issues in Raid0: commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0 Author: Shaohua Li <shli@fb.com> Date: Sun May 7 17:36:24 2017 -0700 Subject: md/md0: optimize raid0 discard handling Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 The commits more or less cherry pick to the 5.8, 5.4 and 4.15 kernels, with the following minor fixups: 1) submit_bio_noacct() needed to be renamed to generic_make_request() since it was recently changed in: commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead Author: Christoph Hellwig <hch@lst.de> Date: Wed Jul 1 10:59:44 2020 +0200 Subject: block: rename generic_make_request to submit_bio_noacct Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead 2) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of" '&' removed for one of their arguments for the 4.15 kernel, due to changes made in: commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8 Author: Kent Overstreet <kent.overstreet@gmail.com> Date: Sun May 20 18:25:52 2018 -0400 Subject: md: convert to bioset_init()/mempool_init() Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8 3) The 4.15 kernel does not need "dm raid: fix discard limits for raid1 and raid10" and "dm raid: remove unnecessary discard limits for raid10" due to not having the following commit, which was merged in 5.1-rc1: commit 61697a6abd24acba941359c6268a94f4afe4a53d Author: Mike Snitzer <snitzer@redhat.com> Date: Fri Jan 18 14:19:26 2019 -0500 Subject: dm: eliminate 'split_discard_bios' flag from DM target interface Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d 4) The 4.15 kernel needed bio_clone_blkg_association() to be renamed to bio_clone_blkcg_association() due to it changing in: commit db6638d7d177a8bc74c9e539e2e0d7d061c767b1 Author: Dennis Zhou <dennis@kernel.org> Date: Wed Dec 5 12:10:35 2018 -0500 Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg https://github.com/torvalds/linux/commit/db6638d7d177a8bc74c9e539e2e0d7d061c767b1 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. 
$ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA: https://launchpad.net/~mruffell/+archive/ubuntu/lp1896578-test If you install a test kernel, we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. Performance Matrix (AWS i3.8xlarge): Kernel | mkfs.xfs | fstrim --------------------------------- 4.15 | 7m23.449s | 7m20.678s 5.4 | 8m23.219s | 8m23.927s 5.8 | 2m54.990s | 8m22.010s 4.15-test | 0m4.286s | 0m1.657s 5.4-test | 0m6.075s | 0m3.150s 5.8-test | 0m2.753s | 0m2.999s The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Regression Potential] If a regression were to occur, then it would affect operations which would trigger block discard operations, such as mkfs and fstrim, on Raid10 only. Other Raid levels would not be affected, although, I should note there will be a small risk of regression to Raid0, due to one of its functions being re-factored and split out, for use in both Raid0 and Raid10. The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, or SSD devices which do not support block discard would not be affected. If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>" which would skip block discard entirely.
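The "mkfs.xfs -K" workaround mentioned above can be checked directly; a small sketch, again assuming the array is /dev/md0 (per the mkfs.xfs man page, -K skips the discard pass at format time):

$ time sudo mkfs.xfs -f /dev/md0      # issues a full-device discard first; slow on an unpatched kernel
$ time sudo mkfs.xfs -f -K /dev/md0   # skips the discard pass, so format time no longer depends on the discard path

Note that -K only avoids the discard at mkfs time; fstrim and any online discard will still go through the slow path until a fixed kernel is running.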
2021-04-15 13:50:54 Evan Hoffman bug added subscriber Evan Hoffman
2021-05-03 05:08:32 Matthew Ruffell description BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1. 
commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:42:59 2020 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0 commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:00 2020 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3 commit f046f5d0d79cdb968f219ce249e497fd1accf484 Author: Xiao Ni <xni@redhat.com> Date: Tue Aug 25 13:43:01 2020 +0800 Subject: md/raid10: pull codes that wait for blocked dev into one function Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484 commit bcc90d280465ebd51ab8688be86e1f00c62dccf9 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:22 2020 +0800 Subject: md/raid10: improve raid10 discard request Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9 commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359 Author: Xiao Ni <xni@redhat.com> Date: Wed Sep 2 20:00:23 2020 +0800 Subject: md/raid10: improve discard request for far layout Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359 There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. The following commits enable Radid10 to use large discards, instead of splitting into many bios, since the technical hurdles have now been removed. commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 13:14:52 2020 -0400 Subject: dm raid: fix discard limits for raid1 and raid10 Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512 commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28 Author: Mike Snitzer <snitzer@redhat.com> Date: Thu Sep 24 16:40:12 2020 -0400 Subject: dm raid: remove unnecessary discard limits for raid10 Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28 All the commits mentioned follow a similar strategy which was implemented in Raid0 in the below commit, which was merged in 4.12-rc2, which fixed block discard performance issues in Raid0: commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0 Author: Shaohua Li <shli@fb.com> Date: Sun May 7 17:36:24 2017 -0700 Subject: md/md0: optimize raid0 discard handling Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0 The commits more or less cherry pick to the 5.8, 5.4 and 4.15 kernels, with the following minor fixups: 1) submit_bio_noacct() needed to be renamed to generic_make_request() since it was recently changed in: commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead Author: Christoph Hellwig <hch@lst.de> Date: Wed Jul 1 10:59:44 2020 +0200 Subject: block: rename generic_make_request to submit_bio_noacct Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead 2) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of" '&' removed for one of their arguments for the 4.15 kernel, due to changes made in: commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8 Author: Kent Overstreet <kent.overstreet@gmail.com> Date: Sun May 20 18:25:52 2018 -0400 Subject: md: convert to bioset_init()/mempool_init() Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8 3) The 4.15 kernel 
does not need "dm raid: fix discard limits for raid1 and raid10" and "dm raid: remove unnecessary discard limits for raid10" due to not having the following commit, which was merged in 5.1-rc1: commit 61697a6abd24acba941359c6268a94f4afe4a53d Author: Mike Snitzer <snitzer@redhat.com> Date: Fri Jan 18 14:19:26 2019 -0500 Subject: dm: eliminate 'split_discard_bios' flag from DM target interface Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d 4) The 4.15 kernel needed bio_clone_blkg_association() to be renamed to bio_clone_blkcg_association() due to it changing in: commit db6638d7d177a8bc74c9e539e2e0d7d061c767b1 Author: Dennis Zhou <dennis@kernel.org> Date: Wed Dec 5 12:10:35 2018 -0500 Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg https://github.com/torvalds/linux/commit/db6638d7d177a8bc74c9e539e2e0d7d061c767b1 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. $ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA: https://launchpad.net/~mruffell/+archive/ubuntu/lp1896578-test If you install a test kernel, we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. Performance Matrix (AWS i3.8xlarge): Kernel | mkfs.xfs | fstrim --------------------------------- 4.15 | 7m23.449s | 7m20.678s 5.4 | 8m23.219s | 8m23.927s 5.8 | 2m54.990s | 8m22.010s 4.15-test | 0m4.286s | 0m1.657s 5.4-test | 0m6.075s | 0m3.150s 5.8-test | 0m2.753s | 0m2.999s The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Regression Potential] If a regression were to occur, then it would affect operations which would trigger block discard operations, such as mkfs and fstrim, on Raid10 only. Other Raid levels would not be affected, although, I should note there will be a small risk of regression to Raid0, due to one of its functions being re-factored and split out, for use in both Raid0 and Raid10. The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, or SSD devices which do not support block discard would not be affected. If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>" which would skip block discard entirely. 
BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1. commit cf78408f937a67f59f5e90ee8e6cadeed7c128a8 Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:43 2021 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://github.com/torvalds/linux/commit/cf78408f937a67f59f5e90ee8e6cadeed7c128a8 commit c2968285925adb97b9aa4ede94c1f1ab61ce0925 Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:44 2021 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://github.com/torvalds/linux/commit/c2968285925adb97b9aa4ede94c1f1ab61ce0925 commit f2e7e269a7525317752d472bb48a549780e87d22 Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:45 2021 +0800 Subject: md/raid10: pull the code that wait for blocked dev into one function Link: https://github.com/torvalds/linux/commit/f2e7e269a7525317752d472bb48a549780e87d22 commit d30588b2731fb01e1616cf16c3fe79a1443e29aa Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:46 2021 +0800 Subject: md/raid10: improve raid10 discard request Link: https://github.com/torvalds/linux/commit/d30588b2731fb01e1616cf16c3fe79a1443e29aa commit 254c271da0712ea8914f187588e0f81f7678ee2f Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:47 2021 +0800 Subject: md/raid10: improve discard request for far layout Link: https://github.com/torvalds/linux/commit/254c271da0712ea8914f187588e0f81f7678ee2f There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. 
The following commits enable Radid10 to use large discards, instead of splitting into many bios, since the technical hurdles have now been removed. commit ca4a4e9a55beeb138bb06e3867f5e486da896d44 Author: Mike Snitzer <snitzer@redhat.com> Date: Fri Apr 30 14:38:37 2021 -0400 Subject: dm raid: remove unnecessary discard limits for raid0 and raid10 Link: https://github.com/torvalds/linux/commit/ca4a4e9a55beeb138bb06e3867f5e486da896d44 The commits more or less cherry pick to the 5.11, 5.8, 5.4 and 4.15 kernels, with the following minor backports: 1) submit_bio_noacct() needed to be renamed to generic_make_request() since it was recently changed in: commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead Author: Christoph Hellwig <hch@lst.de> Date: Wed Jul 1 10:59:44 2020 +0200 Subject: block: rename generic_make_request to submit_bio_noacct Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead 2) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of" '&' removed for one of their arguments for the 4.15 kernel, due to changes made in: commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8 Author: Kent Overstreet <kent.overstreet@gmail.com> Date: Sun May 20 18:25:52 2018 -0400 Subject: md: convert to bioset_init()/mempool_init() Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8 3) The 4.15 kernel does not need "dm raid: fix discard limits for raid1 and raid10" and "dm raid: remove unnecessary discard limits for raid10" due to not having the following commit, which was merged in 5.1-rc1: commit 61697a6abd24acba941359c6268a94f4afe4a53d Author: Mike Snitzer <snitzer@redhat.com> Date: Fri Jan 18 14:19:26 2019 -0500 Subject: dm: eliminate 'split_discard_bios' flag from DM target interface Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d 4) The 4.15 kernel needed bio_clone_blkg_association() to be renamed to bio_clone_blkcg_association() due to it changing in: commit db6638d7d177a8bc74c9e539e2e0d7d061c767b1 Author: Dennis Zhou <dennis@kernel.org> Date: Wed Dec 5 12:10:35 2018 -0500 Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg https://github.com/torvalds/linux/commit/db6638d7d177a8bc74c9e539e2e0d7d061c767b1 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. 
$ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA: https://launchpad.net/~mruffell/+archive/ubuntu/lp1896578-test If you install a test kernel, we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. Performance Matrix (AWS i3.8xlarge): Kernel | mkfs.xfs | fstrim --------------------------------- 4.15 | 7m23.449s | 7m20.678s 5.4 | 8m23.219s | 8m23.927s 5.8 | 2m54.990s | 8m22.010s 4.15-test | 0m4.286s | 0m1.657s 5.4-test | 0m6.075s | 0m3.150s 5.8-test | 0m2.753s | 0m2.999s The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Where problems can occur] A problem has occurred once before, with the previous revision of this patchset. This has been documented in bug 1907262, and caused a worst case scenario of data loss for some users, in this particular case, on the second and onward disks. This was due to two two faults: the first, incorrectly calculating the start offset for block discard for the second and extra disks. The second bug was an incorrect stripe size for far layouts. The kernel team was forced to revert the patches in an emergency and the faulty kernel was removed from the archive, and community users urged to avoid the faulty kernel. These bugs and a few other minor issues have now been corrected, and we have been testing the new patches since mid February. The patches have been tested against the testcase in bug 1907262 and do not cause the disks to become corrupted. The regression potential is still the same for this patchset though. If a regression were to occur, it could lead to data loss on Raid10 arrays backed by NVMe or SSD disks that support block discard. If a regression happens, users need to disable the fstrim systemd service as soon as possible, plan an emergency maintainance window, and downgrade the kernel to a previous release, or upgrade to a corrected kernel.
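For the mitigation described above, periodic discard on Ubuntu is typically driven by the fstrim.timer systemd unit from util-linux, and md can be asked to verify the array; a rough sketch, assuming the array is /dev/md0 (unit and device names may differ on a given system):

$ sudo systemctl disable --now fstrim.timer
$ echo check | sudo tee /sys/block/md0/md/sync_action
$ cat /proc/mdstat
$ cat /sys/block/md0/md/mismatch_cnt

A non-zero mismatch_cnt once the check completes is a strong hint that the mirrored copies have diverged and the array needs closer inspection before any further trimming or kernel changes.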
2021-05-03 05:08:47 Matthew Ruffell tags sts verification-done-bionic verification-done-focal verification-done-groovy sts
2021-05-03 05:32:41 Matthew Ruffell nominated for series Ubuntu Hirsute
2021-05-03 05:32:41 Matthew Ruffell bug task added linux (Ubuntu Hirsute)
2021-05-03 05:32:49 Matthew Ruffell linux (Ubuntu Hirsute): status New In Progress
2021-05-03 05:32:51 Matthew Ruffell linux (Ubuntu Hirsute): importance Undecided Medium
2021-05-03 05:32:53 Matthew Ruffell linux (Ubuntu Hirsute): assignee Matthew Ruffell (mruffell)
2021-05-03 23:16:08 Matthew Ruffell description BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1. 
commit cf78408f937a67f59f5e90ee8e6cadeed7c128a8 Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:43 2021 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://github.com/torvalds/linux/commit/cf78408f937a67f59f5e90ee8e6cadeed7c128a8 commit c2968285925adb97b9aa4ede94c1f1ab61ce0925 Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:44 2021 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://github.com/torvalds/linux/commit/c2968285925adb97b9aa4ede94c1f1ab61ce0925 commit f2e7e269a7525317752d472bb48a549780e87d22 Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:45 2021 +0800 Subject: md/raid10: pull the code that wait for blocked dev into one function Link: https://github.com/torvalds/linux/commit/f2e7e269a7525317752d472bb48a549780e87d22 commit d30588b2731fb01e1616cf16c3fe79a1443e29aa Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:46 2021 +0800 Subject: md/raid10: improve raid10 discard request Link: https://github.com/torvalds/linux/commit/d30588b2731fb01e1616cf16c3fe79a1443e29aa commit 254c271da0712ea8914f187588e0f81f7678ee2f Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:47 2021 +0800 Subject: md/raid10: improve discard request for far layout Link: https://github.com/torvalds/linux/commit/254c271da0712ea8914f187588e0f81f7678ee2f There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. The following commits enable Radid10 to use large discards, instead of splitting into many bios, since the technical hurdles have now been removed. commit ca4a4e9a55beeb138bb06e3867f5e486da896d44 Author: Mike Snitzer <snitzer@redhat.com> Date: Fri Apr 30 14:38:37 2021 -0400 Subject: dm raid: remove unnecessary discard limits for raid0 and raid10 Link: https://github.com/torvalds/linux/commit/ca4a4e9a55beeb138bb06e3867f5e486da896d44 The commits more or less cherry pick to the 5.11, 5.8, 5.4 and 4.15 kernels, with the following minor backports: 1) submit_bio_noacct() needed to be renamed to generic_make_request() since it was recently changed in: commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead Author: Christoph Hellwig <hch@lst.de> Date: Wed Jul 1 10:59:44 2020 +0200 Subject: block: rename generic_make_request to submit_bio_noacct Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead 2) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of" '&' removed for one of their arguments for the 4.15 kernel, due to changes made in: commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8 Author: Kent Overstreet <kent.overstreet@gmail.com> Date: Sun May 20 18:25:52 2018 -0400 Subject: md: convert to bioset_init()/mempool_init() Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8 3) The 4.15 kernel does not need "dm raid: fix discard limits for raid1 and raid10" and "dm raid: remove unnecessary discard limits for raid10" due to not having the following commit, which was merged in 5.1-rc1: commit 61697a6abd24acba941359c6268a94f4afe4a53d Author: Mike Snitzer <snitzer@redhat.com> Date: Fri Jan 18 14:19:26 2019 -0500 Subject: dm: eliminate 'split_discard_bios' flag from DM target interface Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d 4) The 4.15 kernel needed bio_clone_blkg_association() to be renamed to bio_clone_blkcg_association() due to it changing in: commit db6638d7d177a8bc74c9e539e2e0d7d061c767b1 Author: Dennis Zhou <dennis@kernel.org> 
Date: Wed Dec 5 12:10:35 2018 -0500 Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg https://github.com/torvalds/linux/commit/db6638d7d177a8bc74c9e539e2e0d7d061c767b1 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. $ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA: https://launchpad.net/~mruffell/+archive/ubuntu/lp1896578-test If you install a test kernel, we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. Performance Matrix (AWS i3.8xlarge): Kernel | mkfs.xfs | fstrim --------------------------------- 4.15 | 7m23.449s | 7m20.678s 5.4 | 8m23.219s | 8m23.927s 5.8 | 2m54.990s | 8m22.010s 4.15-test | 0m4.286s | 0m1.657s 5.4-test | 0m6.075s | 0m3.150s 5.8-test | 0m2.753s | 0m2.999s The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Where problems can occur] A problem has occurred once before, with the previous revision of this patchset. This has been documented in bug 1907262, and caused a worst case scenario of data loss for some users, in this particular case, on the second and onward disks. This was due to two two faults: the first, incorrectly calculating the start offset for block discard for the second and extra disks. The second bug was an incorrect stripe size for far layouts. The kernel team was forced to revert the patches in an emergency and the faulty kernel was removed from the archive, and community users urged to avoid the faulty kernel. These bugs and a few other minor issues have now been corrected, and we have been testing the new patches since mid February. The patches have been tested against the testcase in bug 1907262 and do not cause the disks to become corrupted. The regression potential is still the same for this patchset though. If a regression were to occur, it could lead to data loss on Raid10 arrays backed by NVMe or SSD disks that support block discard. If a regression happens, users need to disable the fstrim systemd service as soon as possible, plan an emergency maintainance window, and downgrade the kernel to a previous release, or upgrade to a corrected kernel. BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. 
For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1. commit cf78408f937a67f59f5e90ee8e6cadeed7c128a8 Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:43 2021 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://github.com/torvalds/linux/commit/cf78408f937a67f59f5e90ee8e6cadeed7c128a8 commit c2968285925adb97b9aa4ede94c1f1ab61ce0925 Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:44 2021 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://github.com/torvalds/linux/commit/c2968285925adb97b9aa4ede94c1f1ab61ce0925 commit f2e7e269a7525317752d472bb48a549780e87d22 Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:45 2021 +0800 Subject: md/raid10: pull the code that wait for blocked dev into one function Link: https://github.com/torvalds/linux/commit/f2e7e269a7525317752d472bb48a549780e87d22 commit d30588b2731fb01e1616cf16c3fe79a1443e29aa Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:46 2021 +0800 Subject: md/raid10: improve raid10 discard request Link: https://github.com/torvalds/linux/commit/d30588b2731fb01e1616cf16c3fe79a1443e29aa commit 254c271da0712ea8914f187588e0f81f7678ee2f Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:47 2021 +0800 Subject: md/raid10: improve discard request for far layout Link: https://github.com/torvalds/linux/commit/254c271da0712ea8914f187588e0f81f7678ee2f There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. The following commits enable Radid10 to use large discards, instead of splitting into many bios, since the technical hurdles have now been removed. 
commit ca4a4e9a55beeb138bb06e3867f5e486da896d44 Author: Mike Snitzer <snitzer@redhat.com> Date: Fri Apr 30 14:38:37 2021 -0400 Subject: dm raid: remove unnecessary discard limits for raid0 and raid10 Link: https://github.com/torvalds/linux/commit/ca4a4e9a55beeb138bb06e3867f5e486da896d44 The commits more or less cherry pick to the 5.11, 5.8, 5.4 and 4.15 kernels, with the following minor backports: 1) submit_bio_noacct() needed to be renamed to generic_make_request() since it was recently changed in: commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead Author: Christoph Hellwig <hch@lst.de> Date: Wed Jul 1 10:59:44 2020 +0200 Subject: block: rename generic_make_request to submit_bio_noacct Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead 2) In the 4.15, 5.4 and 5.8 kernels, trace_block_bio_remap() needs to have its request_queue argument put back in place. It was recently removed in: commit 1c02fca620f7273b597591065d366e2cca948d8f Author: Christoph Hellwig <hch@lst.de> Date: Thu Dec 3 17:21:38 2020 +0100 Subject: block: remove the request_queue argument to the block_bio_remap tracepoint Link: https://github.com/torvalds/linux/commit/1c02fca620f7273b597591065d366e2cca948d8f 3) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of" '&' removed for one of their arguments for the 4.15 kernel, due to changes made in: commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8 Author: Kent Overstreet <kent.overstreet@gmail.com> Date: Sun May 20 18:25:52 2018 -0400 Subject: md: convert to bioset_init()/mempool_init() Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8 4) The 4.15 kernel does not need "dm raid: remove unnecessary discard limits for raid0 and raid10" due to not having the following commit, which was merged in 5.1-rc1: commit 61697a6abd24acba941359c6268a94f4afe4a53d Author: Mike Snitzer <snitzer@redhat.com> Date: Fri Jan 18 14:19:26 2019 -0500 Subject: dm: eliminate 'split_discard_bios' flag from DM target interface Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d 5) The 4.15 kernel needed bio_clone_blkg_association() to be renamed to bio_clone_blkcg_association() due to it changing in: commit db6638d7d177a8bc74c9e539e2e0d7d061c767b1 Author: Dennis Zhou <dennis@kernel.org> Date: Wed Dec 5 12:10:35 2018 -0500 Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg https://github.com/torvalds/linux/commit/db6638d7d177a8bc74c9e539e2e0d7d061c767b1 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. 
$ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optionally, do an fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA: https://launchpad.net/~mruffell/+archive/ubuntu/lp1896578-test If you install a test kernel, you can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. Performance Matrix (AWS i3.8xlarge): Kernel | mkfs.xfs | fstrim --------------------------------- 4.15 | 7m23.449s | 7m20.678s 5.4 | 8m23.219s | 8m23.927s 5.8 | 2m54.990s | 8m22.010s 4.15-test | 0m4.286s | 0m1.657s 5.4-test | 0m6.075s | 0m3.150s 5.8-test | 0m2.753s | 0m2.999s The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Where problems can occur] A problem has occurred once before, with the previous revision of this patchset. This has been documented in bug 1907262, and caused a worst-case scenario of data loss for some users, in this particular case on the second and onward disks. This was due to two faults: the first was incorrectly calculating the start offset for block discard on the second and subsequent disks; the second was an incorrect stripe size for far layouts. The kernel team was forced to revert the patches in an emergency, the faulty kernel was removed from the archive, and community users were urged to avoid it. These bugs and a few other minor issues have now been corrected, and we have been testing the new patches since mid-February. The patches have been tested against the testcase in bug 1907262 and do not cause the disks to become corrupted. The regression potential is still the same for this patchset though. If a regression were to occur, it could lead to data loss on Raid10 arrays backed by NVMe or SSD disks that support block discard. If a regression happens, users need to disable the fstrim systemd service as soon as possible, plan an emergency maintenance window, and either downgrade the kernel to a previous release or upgrade to a corrected kernel.
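Before applying the backports described in this revision, the target tree can be inspected to confirm which of the listed fixups apply; a rough sketch, assuming an Ubuntu kernel source tree checked out in the current directory:

$ git grep -n "submit_bio_noacct" drivers/md/raid10.c || echo "older tree: keep generic_make_request() in the backport"
$ git grep -n "bioset_init" drivers/md/md.c || echo "pre-bioset_init tree: drop the '&' from the bio_set/mempool arguments"
$ git grep -n "block_bio_remap" drivers/md/ include/trace/events/block.h

The last grep shows both the callers in the md raid code and the current tracepoint prototype, which indicates whether the request_queue argument mentioned above has to be put back in place.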
2021-05-06 01:58:55 Matthew Ruffell description BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1. 
commit cf78408f937a67f59f5e90ee8e6cadeed7c128a8 Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:43 2021 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://github.com/torvalds/linux/commit/cf78408f937a67f59f5e90ee8e6cadeed7c128a8 commit c2968285925adb97b9aa4ede94c1f1ab61ce0925 Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:44 2021 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://github.com/torvalds/linux/commit/c2968285925adb97b9aa4ede94c1f1ab61ce0925 commit f2e7e269a7525317752d472bb48a549780e87d22 Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:45 2021 +0800 Subject: md/raid10: pull the code that wait for blocked dev into one function Link: https://github.com/torvalds/linux/commit/f2e7e269a7525317752d472bb48a549780e87d22 commit d30588b2731fb01e1616cf16c3fe79a1443e29aa Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:46 2021 +0800 Subject: md/raid10: improve raid10 discard request Link: https://github.com/torvalds/linux/commit/d30588b2731fb01e1616cf16c3fe79a1443e29aa commit 254c271da0712ea8914f187588e0f81f7678ee2f Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:47 2021 +0800 Subject: md/raid10: improve discard request for far layout Link: https://github.com/torvalds/linux/commit/254c271da0712ea8914f187588e0f81f7678ee2f There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. The following commits enable Radid10 to use large discards, instead of splitting into many bios, since the technical hurdles have now been removed. commit ca4a4e9a55beeb138bb06e3867f5e486da896d44 Author: Mike Snitzer <snitzer@redhat.com> Date: Fri Apr 30 14:38:37 2021 -0400 Subject: dm raid: remove unnecessary discard limits for raid0 and raid10 Link: https://github.com/torvalds/linux/commit/ca4a4e9a55beeb138bb06e3867f5e486da896d44 The commits more or less cherry pick to the 5.11, 5.8, 5.4 and 4.15 kernels, with the following minor backports: 1) submit_bio_noacct() needed to be renamed to generic_make_request() since it was recently changed in: commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead Author: Christoph Hellwig <hch@lst.de> Date: Wed Jul 1 10:59:44 2020 +0200 Subject: block: rename generic_make_request to submit_bio_noacct Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead 2) In the 4.15, 5.4 and 5.8 kernels, trace_block_bio_remap() needs to have its request_queue argument put back in place. 
It was recently removed in: commit 1c02fca620f7273b597591065d366e2cca948d8f Author: Christoph Hellwig <hch@lst.de> Date: Thu Dec 3 17:21:38 2020 +0100 Subject: block: remove the request_queue argument to the block_bio_remap tracepoint Link: https://github.com/torvalds/linux/commit/1c02fca620f7273b597591065d366e2cca948d8f 3) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of" '&' removed for one of their arguments for the 4.15 kernel, due to changes made in: commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8 Author: Kent Overstreet <kent.overstreet@gmail.com> Date: Sun May 20 18:25:52 2018 -0400 Subject: md: convert to bioset_init()/mempool_init() Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8 4) The 4.15 kernel does not need "dm raid: remove unnecessary discard limits for raid0 and raid10" due to not having the following commit, which was merged in 5.1-rc1: commit 61697a6abd24acba941359c6268a94f4afe4a53d Author: Mike Snitzer <snitzer@redhat.com> Date: Fri Jan 18 14:19:26 2019 -0500 Subject: dm: eliminate 'split_discard_bios' flag from DM target interface Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d 5) The 4.15 kernel needed bio_clone_blkg_association() to be renamed to bio_clone_blkcg_association() due to it changing in: commit db6638d7d177a8bc74c9e539e2e0d7d061c767b1 Author: Dennis Zhou <dennis@kernel.org> Date: Wed Dec 5 12:10:35 2018 -0500 Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg https://github.com/torvalds/linux/commit/db6638d7d177a8bc74c9e539e2e0d7d061c767b1 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. $ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA: https://launchpad.net/~mruffell/+archive/ubuntu/lp1896578-test If you install a test kernel, we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. Performance Matrix (AWS i3.8xlarge): Kernel | mkfs.xfs | fstrim --------------------------------- 4.15 | 7m23.449s | 7m20.678s 5.4 | 8m23.219s | 8m23.927s 5.8 | 2m54.990s | 8m22.010s 4.15-test | 0m4.286s | 0m1.657s 5.4-test | 0m6.075s | 0m3.150s 5.8-test | 0m2.753s | 0m2.999s The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Where problems can occur] A problem has occurred once before, with the previous revision of this patchset. 
This has been documented in bug 1907262, and caused a worst case scenario of data loss for some users, in this particular case, on the second and onward disks. This was due to two two faults: the first, incorrectly calculating the start offset for block discard for the second and extra disks. The second bug was an incorrect stripe size for far layouts. The kernel team was forced to revert the patches in an emergency and the faulty kernel was removed from the archive, and community users urged to avoid the faulty kernel. These bugs and a few other minor issues have now been corrected, and we have been testing the new patches since mid February. The patches have been tested against the testcase in bug 1907262 and do not cause the disks to become corrupted. The regression potential is still the same for this patchset though. If a regression were to occur, it could lead to data loss on Raid10 arrays backed by NVMe or SSD disks that support block discard. If a regression happens, users need to disable the fstrim systemd service as soon as possible, plan an emergency maintainance window, and downgrade the kernel to a previous release, or upgrade to a corrected kernel. BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on a i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0, takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests. For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once: $ cat /sys/block/nvme0n1/queue/discard_max_bytes 2199023255040 $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 Where the Raid10 md device only supports 512k: $ cat /sys/block/md0/queue/discard_max_bytes 524288 $ cat /sys/block/md0/queue/discard_max_hw_bytes 524288 If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_issue_discard() $ sudo cat /proc/1626/stack [<0>] wait_barrier+0x14c/0x230 [raid10] [<0>] regular_request_wait+0x39/0x150 [raid10] [<0>] raid10_write_request+0x11e/0x850 [raid10] [<0>] raid10_make_request+0xd7/0x150 [raid10] [<0>] md_handle_request+0x123/0x1a0 [<0>] md_submit_bio+0xda/0x120 [<0>] __submit_bio_noacct+0xde/0x320 [<0>] submit_bio_noacct+0x4d/0x90 [<0>] submit_bio+0x4f/0x1b0 [<0>] __blkdev_issue_discard+0x154/0x290 [<0>] blkdev_issue_discard+0x5d/0xc0 [<0>] blk_ioctl_discard+0xc4/0x110 [<0>] blkdev_common_ioctl+0x56c/0x840 [<0>] blkdev_ioctl+0xeb/0x270 [<0>] block_ioctl+0x3d/0x50 [<0>] __x64_sys_ioctl+0x91/0xc0 [<0>] do_syscall_64+0x38/0x90 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [Fix] Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.13-rc1. 
commit cf78408f937a67f59f5e90ee8e6cadeed7c128a8 Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:43 2021 +0800 Subject: md: add md_submit_discard_bio() for submitting discard bio Link: https://github.com/torvalds/linux/commit/cf78408f937a67f59f5e90ee8e6cadeed7c128a8 commit c2968285925adb97b9aa4ede94c1f1ab61ce0925 Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:44 2021 +0800 Subject: md/raid10: extend r10bio devs to raid disks Link: https://github.com/torvalds/linux/commit/c2968285925adb97b9aa4ede94c1f1ab61ce0925 commit f2e7e269a7525317752d472bb48a549780e87d22 Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:45 2021 +0800 Subject: md/raid10: pull the code that wait for blocked dev into one function Link: https://github.com/torvalds/linux/commit/f2e7e269a7525317752d472bb48a549780e87d22 commit d30588b2731fb01e1616cf16c3fe79a1443e29aa Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:46 2021 +0800 Subject: md/raid10: improve raid10 discard request Link: https://github.com/torvalds/linux/commit/d30588b2731fb01e1616cf16c3fe79a1443e29aa commit 254c271da0712ea8914f187588e0f81f7678ee2f Author: Xiao Ni <xni@redhat.com> Date: Thu Feb 4 15:50:47 2021 +0800 Subject: md/raid10: improve discard request for far layout Link: https://github.com/torvalds/linux/commit/254c271da0712ea8914f187588e0f81f7678ee2f There is also an additional commit which is required; it was merged after "md/raid10: improve raid10 discard request". The following commit enables Raid10 to use large discards, instead of splitting them into many bios, since the technical hurdles have now been removed. commit ca4a4e9a55beeb138bb06e3867f5e486da896d44 Author: Mike Snitzer <snitzer@redhat.com> Date: Fri Apr 30 14:38:37 2021 -0400 Subject: dm raid: remove unnecessary discard limits for raid0 and raid10 Link: https://github.com/torvalds/linux/commit/ca4a4e9a55beeb138bb06e3867f5e486da896d44 The commits cherry-pick to the 5.11, 5.8, 5.4 and 4.15 kernels more or less cleanly, with the following minor backports: 1) submit_bio_noacct() needed to be renamed back to generic_make_request(), since the function was only recently renamed in: commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead Author: Christoph Hellwig <hch@lst.de> Date: Wed Jul 1 10:59:44 2020 +0200 Subject: block: rename generic_make_request to submit_bio_noacct Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead 2) In the 4.15, 5.4 and 5.8 kernels, trace_block_bio_remap() needs to have its request_queue argument put back in place. 
It was recently removed in: commit 1c02fca620f7273b597591065d366e2cca948d8f Author: Christoph Hellwig <hch@lst.de> Date: Thu Dec 3 17:21:38 2020 +0100 Subject: block: remove the request_queue argument to the block_bio_remap tracepoint Link: https://github.com/torvalds/linux/commit/1c02fca620f7273b597591065d366e2cca948d8f 3) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of" '&' removed for one of their arguments for the 4.15 kernel, due to changes made in: commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8 Author: Kent Overstreet <kent.overstreet@gmail.com> Date: Sun May 20 18:25:52 2018 -0400 Subject: md: convert to bioset_init()/mempool_init() Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8 4) The 4.15 kernel does not need "dm raid: remove unnecessary discard limits for raid0 and raid10" due to not having the following commit, which was merged in 5.1-rc1: commit 61697a6abd24acba941359c6268a94f4afe4a53d Author: Mike Snitzer <snitzer@redhat.com> Date: Fri Jan 18 14:19:26 2019 -0500 Subject: dm: eliminate 'split_discard_bios' flag from DM target interface Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d 5) The 4.15 kernel needed bio_clone_blkg_association() to be renamed to bio_clone_blkcg_association() due to it changing in: commit db6638d7d177a8bc74c9e539e2e0d7d061c767b1 Author: Dennis Zhou <dennis@kernel.org> Date: Wed Dec 5 12:10:35 2018 -0500 Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg https://github.com/torvalds/linux/commit/db6638d7d177a8bc74c9e539e2e0d7d061c767b1 [Testcase] You will need a machine with at least 4x NVMe drives which support block discard. I use a i3.8xlarge instance on AWS, since it has all of these things. $ lsblk xvda 202:0 0 8G 0 disk └─xvda1 202:1 0 8G 0 part / nvme0n1 259:2 0 1.7T 0 disk nvme1n1 259:0 0 1.7T 0 disk nvme2n1 259:1 0 1.7T 0 disk nvme3n1 259:3 0 1.7T 0 disk Create a Raid10 array: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Format the array with XFS: $ time sudo mkfs.xfs /dev/md0 real 11m14.734s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk Optional, do a fstrim: $ time sudo fstrim /mnt/disk real 11m37.643s There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA: https://launchpad.net/~mruffell/+archive/ubuntu/lp1896578-test If you install a test kernel, we can see that performance dramatically improves: $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 $ time sudo mkfs.xfs /dev/md0 real 0m4.226s user 0m0.020s sys 0m0.148s $ sudo mkdir /mnt/disk $ sudo mount /dev/md0 /mnt/disk $ time sudo fstrim /mnt/disk real 0m1.991s user 0m0.020s sys 0m0.000s The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds. Performance Matrix (AWS i3.8xlarge): Kernel | mkfs.xfs | fstrim --------------------------------- 4.15 | 7m23.449s | 7m20.678s 5.4 | 8m23.219s | 8m23.927s 5.8 | 2m54.990s | 8m22.010s 4.15-test | 0m4.286s | 0m1.657s 5.4-test | 0m6.075s | 0m3.150s 5.8-test | 0m2.753s | 0m2.999s The test kernel also changes the discard_max_bytes to the underlying hardware limit: $ cat /sys/block/md0/queue/discard_max_bytes 2199023255040 [Where problems can occur] A problem has occurred once before, with the previous revision of this patchset. 
This has been documented in bug 1907262, and caused a worst-case scenario of data loss for some users, in this particular case, on the second and onward disks. This was due to two faults: the first was incorrectly calculating the start offset for block discard on the second and subsequent disks; the second was an incorrect stripe size for far layouts. The kernel team was forced to revert the patches in an emergency, the faulty kernel was removed from the archive, and community users were urged to avoid the faulty kernel. These bugs and a few other minor issues have now been corrected, and we have been testing the new patches since mid-February. The patches have been tested against the testcase in bug 1907262 and do not cause the disks to become corrupted. The regression potential is still the same for this patchset though. If a regression were to occur, it could lead to data loss on Raid10 arrays backed by NVMe or SSD disks that support block discard. If a regression happens, users need to disable the fstrim systemd service as soon as possible, plan an emergency maintenance window, and either downgrade the kernel to a previous release or upgrade to a corrected kernel.
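As a rough sketch of how a fixed kernel could be spot-checked against this bug, the commands below assume the Raid10 array from the testcase above (assembled as /dev/md0 and mounted on /mnt/disk); they are illustrative only and not the formal SRU verification steps:
$ uname -r                                      # confirm the kernel under test is actually running
$ cat /sys/block/md0/queue/discard_max_bytes    # should report the underlying NVMe limit (e.g. 2199023255040), not 524288
$ time sudo fstrim -v /mnt/disk                 # should now complete in seconds rather than minutes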
2021-05-20 12:51:59 Jerome Charaoui bug added subscriber Jerome Charaoui
2021-05-27 10:14:23 Kleber Sacilotto de Souza linux (Ubuntu Bionic): status In Progress Fix Committed
2021-05-27 10:17:55 Kleber Sacilotto de Souza linux (Ubuntu Focal): status In Progress Fix Committed
2021-05-27 10:19:37 Kleber Sacilotto de Souza linux (Ubuntu Groovy): status In Progress Fix Committed
2021-05-27 10:22:23 Kleber Sacilotto de Souza linux (Ubuntu Hirsute): status In Progress Fix Committed
2021-06-02 19:58:47 Ubuntu Kernel Bot tags sts sts verification-needed-hirsute
2021-06-03 03:28:16 Ubuntu Kernel Bot tags sts verification-needed-hirsute sts verification-needed-bionic verification-needed-hirsute
2021-06-03 03:30:30 Ubuntu Kernel Bot tags sts verification-needed-bionic verification-needed-hirsute sts verification-needed-bionic verification-needed-focal verification-needed-hirsute
2021-06-05 17:23:15 Ubuntu Kernel Bot tags sts verification-needed-bionic verification-needed-focal verification-needed-hirsute sts verification-needed-bionic verification-needed-focal verification-needed-groovy verification-needed-hirsute
2021-06-11 05:15:06 Matthew Ruffell tags sts verification-needed-bionic verification-needed-focal verification-needed-groovy verification-needed-hirsute sts verification-done-hirsute verification-needed-bionic verification-needed-focal verification-needed-groovy
2021-06-11 05:15:37 Matthew Ruffell tags sts verification-done-hirsute verification-needed-bionic verification-needed-focal verification-needed-groovy sts verification-done-groovy verification-done-hirsute verification-needed-bionic verification-needed-focal
2021-06-11 05:16:06 Matthew Ruffell tags sts verification-done-groovy verification-done-hirsute verification-needed-bionic verification-needed-focal sts verification-done-focal verification-done-groovy verification-done-hirsute verification-needed-bionic
2021-06-11 05:16:38 Matthew Ruffell tags sts verification-done-focal verification-done-groovy verification-done-hirsute verification-needed-bionic sts verification-done-bionic verification-done-focal verification-done-groovy verification-done-hirsute
2021-06-21 23:22:10 Launchpad Janitor linux (Ubuntu): status In Progress Fix Released
2021-06-21 23:22:10 Launchpad Janitor cve linked 2020-24586
2021-06-21 23:22:10 Launchpad Janitor cve linked 2020-24587
2021-06-21 23:22:10 Launchpad Janitor cve linked 2020-24588
2021-06-21 23:22:10 Launchpad Janitor cve linked 2020-26139
2021-06-21 23:22:10 Launchpad Janitor cve linked 2020-26141
2021-06-21 23:22:10 Launchpad Janitor cve linked 2020-26145
2021-06-21 23:22:10 Launchpad Janitor cve linked 2020-26147
2021-06-21 23:22:10 Launchpad Janitor cve linked 2021-20288
2021-06-21 23:22:10 Launchpad Janitor cve linked 2021-33200
2021-06-21 23:22:10 Launchpad Janitor cve linked 2021-3489
2021-06-21 23:22:10 Launchpad Janitor cve linked 2021-3490
2021-06-22 16:01:10 Launchpad Janitor linux (Ubuntu Hirsute): status Fix Committed Fix Released
2021-06-22 16:03:26 Launchpad Janitor linux (Ubuntu Groovy): status Fix Committed Fix Released
2021-06-22 16:03:26 Launchpad Janitor cve linked 2021-23133
2021-06-22 16:03:26 Launchpad Janitor cve linked 2021-31440
2021-06-22 16:05:32 Launchpad Janitor linux (Ubuntu Focal): status Fix Committed Fix Released
2021-06-22 16:11:02 Launchpad Janitor linux (Ubuntu Bionic): status Fix Committed Fix Released
2021-06-22 16:11:02 Launchpad Janitor cve linked 2021-3444
2021-06-22 16:11:02 Launchpad Janitor cve linked 2021-3600