BugLink: https://bugs.launchpad.net/bugs/
[Impact]
Block discard is very slow on Raid10, which causes common use cases that invoke block discard, such as mkfs and fstrim operations, to take a very long time.
For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices that support block discard, a mkfs.xfs operation on Raid10 takes between 8 and 11 minutes, whereas the same mkfs.xfs operation on Raid0 takes 4 seconds.
The bigger the devices, the longer it takes.
The cause is that Raid10 currently uses a 512k chunk size and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests.
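As a rough back-of-the-envelope check (illustrative only, using the 1.9TB and 512k figures above), the split works out to several million requests:
$ echo $((1900000000000 / 524288))
3623962
That is roughly 3.6 million bios just to discard one device's worth of capacity.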
For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once:
$ cat /sys/block/nvme0n1/queue/discard_max_bytes
2199023255040
$ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
2199023255040
Whereas the Raid10 md device only supports 512k:
$ cat /sys/block/md0/queue/discard_max_bytes
524288
$ cat /sys/block/md0/queue/discard_max_hw_bytes
524288
If we perform a mkfs.xfs operation on the /dev/md0 array, it takes over 11 minutes, and if we examine the stack, it is stuck in blkdev_issue_discard():
$ sudo cat /proc/1626/stack
[<0>] wait_barrier+0x14c/0x230 [raid10]
[<0>] regular_request_wait+0x39/0x150 [raid10]
[<0>] raid10_write_request+0x11e/0x850 [raid10]
[<0>] raid10_make_request+0xd7/0x150 [raid10]
[<0>] md_handle_request+0x123/0x1a0
[<0>] md_submit_bio+0xda/0x120
[<0>] __submit_bio_noacct+0xde/0x320
[<0>] submit_bio_noacct+0x4d/0x90
[<0>] submit_bio+0x4f/0x1b0
[<0>] __blkdev_issue_discard+0x154/0x290
[<0>] blkdev_issue_discard+0x5d/0xc0
[<0>] blk_ioctl_discard+0xc4/0x110
[<0>] blkdev_common_ioctl+0x56c/0x840
[<0>] blkdev_ioctl+0xeb/0x270
[<0>] block_ioctl+0x3d/0x50
[<0>] __x64_sys_ioctl+0x91/0xc0
[<0>] do_syscall_64+0x38/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Fix]
Xiao Ni has developed a patchset which resolves the block discard performance problems. It is currently in the md-next tree [1], and I am expecting the commits to be merged during the 5.10 merge window.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-next

commit 5b2374a6c221f28c74913d208bb5376a7ee3bf70
Author: Xiao Ni <email address hidden>
Date: Wed Sep 2 20:00:23 2020 +0800
Subject: md/raid10: improve discard request for far layout
Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=5b2374a6c221f28c74913d208bb5376a7ee3bf70

commit 8f694215ae4c7abf1e6c985803a1aad0db748d07
Author: Xiao Ni <email address hidden>
Date: Wed Sep 2 20:00:22 2020 +0800
Subject: md/raid10: improve raid10 discard request
Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=8f694215ae4c7abf1e6c985803a1aad0db748d07

commit 6fcfa8732a8cfea7828a9444c855691c481ee557
Author: Xiao Ni <email address hidden>
Date: Tue Aug 25 13:43:01 2020 +0800
Subject: md/raid10: pull codes that wait for blocked dev into one function
Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=6fcfa8732a8cfea7828a9444c855691c481ee557

commit 6f4fed152a5e483af2227156ce7b6263aeeb5c84
Author: Xiao Ni <email address hidden>
Date: Tue Aug 25 13:43:00 2020 +0800
Subject: md/raid10: extend r10bio devs to raid disks
Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=6f4fed152a5e483af2227156ce7b6263aeeb5c84

commit 7197f1a616caf85508d81c7f5c9f065ffaebf027
Author: Xiao Ni <email address hidden>
Date: Tue Aug 25 13:42:59 2020 +0800
Subject: md: add md_submit_discard_bio() for submitting discard bio
Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=7197f1a616caf85508d81c7f5c9f065ffaebf027
It follows a strategy similar to the one implemented for Raid0 in the commit below, which was merged in 4.12-rc2:
commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0
Author: Shaohua Li <email address hidden>
Date: Sun May 7 17:36:24 2017 -0700
Subject: md/md0: optimize raid0 discard handling
Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0
[Testcase]
You will need a machine with at least 4x NVMe drives that support block discard. I use an i3.8xlarge instance on AWS, since it meets these requirements.
$ lsblk
xvda 202:0 0 8G 0 disk
└─xvda1 202:1 0 8G 0 part /
nvme0n1 259:2 0 1.7T 0 disk
nvme1n1 259:0 0 1.7T 0 disk
nvme2n1 259:1 0 1.7T 0 disk
nvme3n1 259:3 0 1.7T 0 disk
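If you are testing on different hardware, you can first confirm that the drives actually advertise discard support (this check is not part of the original testcase; a non-zero DISC-MAX value indicates discard support):
$ lsblk --discard /dev/nvme0n1
$ cat /sys/block/nvme0n1/queue/discard_max_bytes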
Create a Raid10 array:
$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
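Optionally, verify the array state before formatting (not part of the original steps; a freshly created array may still be resyncing, which does not block the mkfs step):
$ cat /proc/mdstat
$ sudo mdadm --detail /dev/md0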
Format the array with XFS:
$ time sudo mkfs.xfs /dev/md0
real 11m14.734s
$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk
Optionally, do an fstrim:
$ time sudo fstrim /mnt/disk
real 11m37.643s
I built a test kernel based on 5.9-rc6 with the above patches, and we can see that performance dramatically improves:
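Before repeating the steps, it may help to confirm that the patched test kernel is actually the one running (the exact version string will depend on how the test kernel was built):
$ uname -r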
$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
$ time sudo mkfs.xfs /dev/md0
real 0m4.226s
user 0m0.020s
sys 0m0.148s
$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk
$ time sudo fstrim /mnt/disk
real 0m1.991s
user 0m0.020s
sys 0m0.000s
The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds.
[Regression Potential]
If a regression were to occur, it would affect operations that trigger block discard, such as mkfs and fstrim, on Raid10 only.
Other Raid levels would not be affected, although I should note there is a small risk of regression to Raid0, since one of its functions is refactored and split out for use in both Raid0 and Raid10.
The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices that support block discard will be affected. Traditional hard disks, or SSD devices that do not support block discard, would not be affected.
If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>", which skips block discard entirely.
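For reference, the workaround looks like this (the -K option tells mkfs.xfs not to discard blocks before formatting; fstrim can simply be avoided until a fixed kernel is available):
$ sudo mkfs.xfs -K /dev/md0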