Comment 0 for bug 1896578

Matthew Ruffell (mruffell) wrote :

BugLink: https://bugs.launchpad.net/bugs/

[Impact]

Block discard is very slow on Raid10, so common operations that issue block discards, such as mkfs and fstrim, take a very long time.

For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid10 takes between 8 and 11 minutes, whereas the same mkfs.xfs operation on Raid0 takes 4 seconds.

The bigger the devices, the longer it takes.

The cause is that Raid10 currently uses a 512k chunk size, and also uses this value for discard_max_bytes. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests.
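
To put a rough number on this, the sketch below counts how many bios a single discard is split into for a given discard_max_bytes. The function name discard_bio_count is hypothetical, and treating 1.9TB as exactly 1900000000000 bytes is an assumption about the device size:

```shell
#!/bin/sh
# discard_bio_count BYTES MAX_BYTES
# Hypothetical helper: number of bios needed to discard BYTES when each
# bio may cover at most MAX_BYTES (ceiling division).
discard_bio_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Assumed size: 1.9TB taken as 1900000000000 bytes.
discard_bio_count 1900000000000 524288          # Raid10 limit of 512k
discard_bio_count 1900000000000 2199023255040   # NVMe limit reported in sysfs
```

With these assumed inputs, the first call reports roughly 3.6 million bios, against a single bio at the NVMe limit.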

For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once:

$ cat /sys/block/nvme0n1/queue/discard_max_bytes
2199023255040
$ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
2199023255040

Whereas the Raid10 md device supports only 512k:

$ cat /sys/block/md0/queue/discard_max_bytes
524288
$ cat /sys/block/md0/queue/discard_max_hw_bytes
524288
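
The same comparison can be scripted for the array and all of its members. This is a minimal sketch: the helper name show_discard_limits and its optional second argument (an alternative sysfs root, mainly useful for testing) are hypothetical, and the device names are the ones used in this report:

```shell
#!/bin/sh
# show_discard_limits DEVICE [SYSFS_ROOT]
# Hypothetical helper: print the device name followed by its
# discard_max_bytes and discard_max_hw_bytes. SYSFS_ROOT defaults to /sys.
show_discard_limits() {
    name=$1
    root=${2:-/sys}
    queue="$root/block/$name/queue"
    printf '%s %s %s\n' "$name" \
        "$(cat "$queue/discard_max_bytes")" \
        "$(cat "$queue/discard_max_hw_bytes")"
}

# Device names are taken from this report; adjust them for other systems.
# Devices that do not exist on the current machine are skipped.
for dev in md0 nvme0n1 nvme1n1 nvme2n1 nvme3n1; do
    if [ -r "/sys/block/$dev/queue/discard_max_bytes" ]; then
        show_discard_limits "$dev"
    fi
done
```

On an affected kernel, the md0 line should show a much smaller limit than the NVMe lines.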

If we perform a mkfs.xfs operation on the /dev/md0 array, it takes over 11 minutes, and if we examine the stack of the mkfs.xfs process, it is stuck in blkdev_issue_discard():

$ sudo cat /proc/1626/stack
[<0>] wait_barrier+0x14c/0x230 [raid10]
[<0>] regular_request_wait+0x39/0x150 [raid10]
[<0>] raid10_write_request+0x11e/0x850 [raid10]
[<0>] raid10_make_request+0xd7/0x150 [raid10]
[<0>] md_handle_request+0x123/0x1a0
[<0>] md_submit_bio+0xda/0x120
[<0>] __submit_bio_noacct+0xde/0x320
[<0>] submit_bio_noacct+0x4d/0x90
[<0>] submit_bio+0x4f/0x1b0
[<0>] __blkdev_issue_discard+0x154/0x290
[<0>] blkdev_issue_discard+0x5d/0xc0
[<0>] blk_ioctl_discard+0xc4/0x110
[<0>] blkdev_common_ioctl+0x56c/0x840
[<0>] blkdev_ioctl+0xeb/0x270
[<0>] block_ioctl+0x3d/0x50
[<0>] __x64_sys_ioctl+0x91/0xc0
[<0>] do_syscall_64+0x38/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

[Fix]

Xiao Ni has developed a patchset which resolves the block discard performance problems. It is currently in the md-next tree [1], and I am expecting the commits to be merged during the 5.10 merge window.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-next

commit 5b2374a6c221f28c74913d208bb5376a7ee3bf70
Author: Xiao Ni <email address hidden>
Date: Wed Sep 2 20:00:23 2020 +0800
Subject: md/raid10: improve discard request for far layout
Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=5b2374a6c221f28c74913d208bb5376a7ee3bf70

commit 8f694215ae4c7abf1e6c985803a1aad0db748d07
Author: Xiao Ni <email address hidden>
Date: Wed Sep 2 20:00:22 2020 +0800
Subject: md/raid10: improve raid10 discard request
Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=8f694215ae4c7abf1e6c985803a1aad0db748d07

commit 6fcfa8732a8cfea7828a9444c855691c481ee557
Author: Xiao Ni <email address hidden>
Date: Tue Aug 25 13:43:01 2020 +0800
Subject: md/raid10: pull codes that wait for blocked dev into one function
Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=6fcfa8732a8cfea7828a9444c855691c481ee557

commit 6f4fed152a5e483af2227156ce7b6263aeeb5c84
Author: Xiao Ni <email address hidden>
Date: Tue Aug 25 13:43:00 2020 +0800
Subject: md/raid10: extend r10bio devs to raid disks
Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=6f4fed152a5e483af2227156ce7b6263aeeb5c84

commit 7197f1a616caf85508d81c7f5c9f065ffaebf027
Author: Xiao Ni <email address hidden>
Date: Tue Aug 25 13:42:59 2020 +0800
Subject: md: add md_submit_discard_bio() for submitting discard bio
Link: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=7197f1a616caf85508d81c7f5c9f065ffaebf027

The patchset follows a strategy similar to the one implemented for Raid0 in the commit below, which was merged in 4.12-rc2:

commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0
Author: Shaohua Li <email address hidden>
Date: Sun May 7 17:36:24 2017 -0700
Subject: md/md0: optimize raid0 discard handling
Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0

[Testcase]

You will need a machine with at least 4x NVMe drives which support block discard. I use an i3.8xlarge instance on AWS, since it meets these requirements.

$ lsblk
xvda 202:0 0 8G 0 disk
└─xvda1 202:1 0 8G 0 part /
nvme0n1 259:2 0 1.7T 0 disk
nvme1n1 259:0 0 1.7T 0 disk
nvme2n1 259:1 0 1.7T 0 disk
nvme3n1 259:3 0 1.7T 0 disk

Create a Raid10 array:

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

Format the array with XFS:

$ time sudo mkfs.xfs /dev/md0
real 11m14.734s

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk

Optionally, run fstrim:

$ time sudo fstrim /mnt/disk

real 11m37.643s

I built a test kernel based on 5.9-rc6 with the above patches, and we can see that performance dramatically improves:

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

$ time sudo mkfs.xfs /dev/md0
real 0m4.226s
user 0m0.020s
sys 0m0.148s

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk
$ time sudo fstrim /mnt/disk

real 0m1.991s
user 0m0.020s
sys 0m0.000s

The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes to 2 seconds.
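
As a rough check of those figures, the sketch below converts the "real" values reported by time into seconds and computes the speedup. The helper names to_seconds and speedup are hypothetical:

```shell
#!/bin/sh
# to_seconds TIME
# Hypothetical helper: convert a "real" value such as 11m14.734s to seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# speedup BEFORE AFTER
# Hypothetical helper: ratio of two "real" values, rounded to an integer.
speedup() {
    awk -v b="$(to_seconds "$1")" -v a="$(to_seconds "$2")" \
        'BEGIN { printf "%.0f\n", b / a }'
}

speedup 11m14.734s 0m4.226s   # mkfs.xfs, before and after the patches
speedup 11m37.643s 0m1.991s   # fstrim, before and after the patches
```

With the timings above, this gives a speedup of roughly 160x for mkfs.xfs and 350x for fstrim.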

[Regression Potential]

If a regression were to occur, it would affect operations which trigger block discard, such as mkfs and fstrim, on Raid10 only.

Other Raid levels would not be affected, although there is a small risk of regression to Raid0, because one of its functions is refactored and split out for use in both Raid0 and Raid10.

The changes only affect block discard, so only Raid10 arrays backed by SSD or NVMe devices which support block discard will be affected. Traditional hard disks, and SSD devices which do not support block discard, would not be affected.

If a regression were to occur, users could work around the issue by running "mkfs.xfs -K <device>" which would skip block discard entirely.