Performance issue on mdraid5 when the number of devices is more than 4

Bug #2031383 reported by Vladimir Khristenko
This bug affects 1 person

Affects: mdadm (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi there.

I have encountered a significant increase in max latency with a 4k random write workload on mdraid5 when the number of devices in the array is more than 4.

Environment:
OS: Ubuntu 20.04
kernel: 5.15.0-79 (HWE)
NVMe: 5x Solidigm D7-5620 1.6TB (FW: 9CV10410)

The group_thread_cnt and stripe_cache_size parameters are set via a udev rules file:
cat /etc/udev/rules.d/60-md-stripe-cache.rules
SUBSYSTEM=="block", KERNEL=="md*", ACTION=="add|change", ATTR{md/group_thread_cnt}="6"
SUBSYSTEM=="block", KERNEL=="md*", ACTION=="add|change", ATTR{md/stripe_cache_size}="512"
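
For reference, the same values can be checked or applied at runtime through sysfs. This is only a sketch (assuming the array is /dev/md0, as in this report), not part of the original setup:
# verify the values picked up by the udev rules
cat /sys/block/md0/md/group_thread_cnt
cat /sys/block/md0/md/stripe_cache_size
# or set them directly on a running array
echo 6 > /sys/block/md0/md/group_thread_cnt
echo 512 > /sys/block/md0/md/stripe_cache_size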

mdraid5 on top of 4x NVMe drives:
#---------------
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 nvme3n1p1[3] nvme2n1p1[2] nvme1n1p1[1] nvme0n1p1[0]
      4688040960 blocks super 1.2 level 5, 4k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/12 pages [0KB], 65536KB chunk
#---------------
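
The create command for the NVMe arrays is not shown above; assuming it mirrors the RAM-drive commands used later in this report, it would look roughly like this (the options are an assumption on my part, only the device names come from the mdstat output above):
mdadm --create /dev/md0 --level=5 --chunk=4K --bitmap=internal --raid-devices=4 /dev/nvme0n1p1 /dev/nvme1n1p1 /dev/nvme2n1p1 /dev/nvme3n1p1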

Then I run the fio tests:
for i in {1..3}; do echo test "$i"; fio --name=nvme --numjobs=8 --iodepth=32 --bs=4k --rw=randwrite --ioengine=libaio --direct=1 --group_reporting=1 --filename=/dev/md0p1 --runtime=600 --time_based=1 --ramp_time=0; done
fio results:
Test 1:
...
write: IOPS=250k, BW=976MiB/s (1023MB/s)(572GiB/600002msec);
lat (usec): min=58, max=9519, avg=1024.02, stdev=1036.23

Test 2:
...
write: IOPS=291k, BW=1138MiB/s (1193MB/s)(667GiB/600002msec); 0 zone resets
lat (usec): min=43, max=19160, avg=878.25, stdev=820.79

Test 3:
...
write: IOPS=301k, BW=1176MiB/s (1233MB/s)(689GiB/600003msec); 0 zone resets
lat (usec): min=48, max=7900, avg=850.05, stdev=763.24
...

Max latency is 19160 usec (test 2).

mdraid5 on top of 5x NVMe drives:
#---------------
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 nvme4n1p1[4] nvme3n1p1[3] nvme2n1p1[2] nvme1n1p1[1] nvme0n1p1[0]
      6250721280 blocks super 1.2 level 5, 4k chunk, algorithm 2 [5/5] [UUUUU]
      bitmap: 10/12 pages [40KB], 65536KB chunk
#---------------
Running the same test:
for i in {1..3}; do echo test "$i"; fio --name=nvme --numjobs=8 --iodepth=32 --bs=4k --rw=randwrite --ioengine=libaio --direct=1 --group_reporting=1 --filename=/dev/md0p1 --runtime=600 --time_based=1 --ramp_time=0; done

fio results:
Test 1:
...
write: IOPS=375k, BW=1466MiB/s (1537MB/s)(859GiB/600002msec); 0 zone resets
lat (usec): min=78, max=28966k, avg=681.56, stdev=3300.12

Test 2:
...
write: IOPS=390k, BW=1524MiB/s (1598MB/s)(893GiB/600001msec); 0 zone resets
lat (usec): min=77, max=63847k, avg=655.85, stdev=6565.15
...

Test 3:
...
write: IOPS=391k, BW=1526MiB/s (1600MB/s)(894GiB/600002msec); 0 zone resets
lat (usec): min=79, max=60377k, avg=654.74, stdev=6081.22
...

Final:
mdraid5 on top of 4x NVMe drives: max latency - 19160 usec.
mdraid5 on top of 5x NVMe drives: max latency - 63847k usec.

As you can see, the max latency increases significantly, to 63847k usec (test 2).
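
One way to pinpoint when the spikes occur (a suggestion on my side, not part of the original runs) is to let fio write per-interval latency logs for the same workload:
# same workload as above, plus latency logging averaged over 1-second windows
fio --name=nvme --numjobs=8 --iodepth=32 --bs=4k --rw=randwrite --ioengine=libaio --direct=1 --group_reporting=1 --filename=/dev/md0p1 --runtime=600 --time_based=1 --ramp_time=0 --write_lat_log=md0p1-lat --log_avg_msec=1000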

If I increase the runtime to 3600/7200 sec, I see a hung task in dmesg:
...
[11480.292296] INFO: task fio:2501 blocked for more than 120 seconds.
[11480.292320] Not tainted 5.15.0-79-generic #85-Ubuntu
[11480.292341] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11480.292369] task:fio state:D stack: 0 pid: 2501 ppid: 2465 flags:0x00004002
...
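
If it helps with triage, the kernel stack of the blocked task can be captured while the hang is in progress; this is a suggestion rather than output from the original report:
# dump the kernel stack of the blocked fio task (pid 2501 in the trace above)
cat /proc/2501/stack
# or ask the kernel to log all blocked (D-state) tasks to dmesg
echo w > /proc/sysrq-trigger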

To rule out a problem with my NVMe drives, I built an array on RAM drives and got the same behavior.

modprobe brd rd_nr=6 rd_size=10485760
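
Between the 3-, 4- and 5-drive runs the array is torn down and recreated; the commands below are an assumed sequence, not a transcript from the report:
# stop the previous array and clear the md metadata before the next --create
mdadm --stop /dev/md0
mdadm --zero-superblock /dev/ram0 /dev/ram1 /dev/ram2 /dev/ram3 /dev/ram4 /dev/ram5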

mdraid5 on top of 3x RAM drives:
mdadm --create /dev/md0 --level=5 --chunk=4K --bitmap=internal --raid-devices=3 /dev/ram0 /dev/ram1 /dev/ram2
#---------------
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 ram2[3] ram1[1] ram0[0]
      20953088 blocks super 1.2 level 5, 4k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 1/1 pages [4KB], 65536KB chunk
#---------------

for i in {1..3}; do echo test "$i"; date; fio --name=nvme --numjobs=16 --iodepth=32 --bs=4k --rw=randwrite --ioengine=libaio --direct=1 --group_reporting=1 --filename=/dev/md0 --runtime=600 --time_based=1 --ramp_time=0; done

fio results:
Test 1:
...
write: IOPS=497k, BW=1939MiB/s (2034MB/s)(1136GiB/600003msec); 0 zone resets
lat (usec): min=466, max=6171, avg=1030.71, stdev=39.32
...

Test 2:
...
write: IOPS=497k, BW=1941MiB/s (2035MB/s)(1137GiB/600003msec); 0 zone resets
lat (usec): min=461, max=6223, avg=1030.06, stdev=39.38
...

Test 3:
...
write: IOPS=497k, BW=1940MiB/s (2034MB/s)(1136GiB/600002msec); 0 zone resets
lat (usec): min=474, max=6179, avg=1030.68, stdev=39.29
...

Max latency is 6223 usec (test 2).

mdraid5 on top of 4x RAM drives:
mdadm --create /dev/md0 --level=5 --chunk=4K --bitmap=internal --raid-devices=4 /dev/ram0 /dev/ram1 /dev/ram2 /dev/ram3
#---------------
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 ram3[4] ram2[2] ram1[1] ram0[0]
      31429632 blocks super 1.2 level 5, 4k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 1/1 pages [4KB], 65536KB chunk
#---------------

for i in {1..3}; do echo test "$i"; date; fio --name=nvme --numjobs=16 --iodepth=32 --bs=4k --rw=randwrite --ioengine=libaio --direct=1 --group_reporting=1 --filename=/dev/md0 --runtime=600 --time_based=1 --ramp_time=0; done

fio results:
Test 1:
...
write: IOPS=438k, BW=1712MiB/s (1796MB/s)(1003GiB/600002msec); 0 zone resets
lat (usec): min=468, max=6902, avg=1167.45, stdev=46.17
...

Test 2:
...
write: IOPS=438k, BW=1711MiB/s (1794MB/s)(1002GiB/600004msec); 0 zone resets
lat (usec): min=470, max=7689, avg=1168.49, stdev=46.14
...

Test 3:
...
write: IOPS=438k, BW=1712MiB/s (1796MB/s)(1003GiB/600003msec); 0 zone resets
lat (usec): min=479, max=6376, avg=1167.40, stdev=46.18
...

Max latency is 7689 usec (test 2).

mdraid5 on top of 5x RAM drives:
mdadm --create /dev/md0 --level=5 --chunk=4K --bitmap=internal --raid-devices=5 /dev/ram0 /dev/ram1 /dev/ram2 /dev/ram3 /dev/ram4
#---------------
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 ram4[5] ram3[3] ram2[2] ram1[1] ram0[0]
      41906176 blocks super 1.2 level 5, 4k chunk, algorithm 2 [5/5] [UUUUU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
#---------------
for i in {1..3}; do echo test "$i"; date; fio --name=nvme --numjobs=16 --iodepth=32 --bs=4k --rw=randwrite --ioengine=libaio --direct=1 --group_reporting=1 --filename=/dev/md0 --runtime=600 --time_based=1 --ramp_time=0; done

fio results:
Test 1:
...
write: IOPS=452k, BW=1764MiB/s (1850MB/s)(1034GiB/600001msec); 0 zone resets
lat (usec): min=13, max=68868k, avg=1133.11, stdev=79882.97
...

Test 2:
...
write: IOPS=451k, BW=1763MiB/s (1849MB/s)(1033GiB/600001msec); 0 zone resets
lat (usec): min=11, max=45339k, avg=1134.04, stdev=78829.34
...

Test 3:
...
write: IOPS=453k, BW=1770MiB/s (1856MB/s)(1037GiB/600001msec); 0 zone resets
lat (usec): min=12, max=63593k, avg=1129.34, stdev=84268.37
...

Max latency is 68868k usec (test 1).

Final:
mdraid5 on top of 3x RAM drives: max latency - 6223 usec.
mdraid5 on top of 4x RAM drives: max latency - 7689 usec.
mdraid5 on top of 5x RAM drives: max latency - 68868k usec.

I also reproduced this behavior on mdraid4 and mdraid5 in CentOS 7, CentOS 9, and Ubuntu 22.04 with kernels 5.15.0-79 and 6.4 (mainline).

But I can't reproduce this behavior on mdraid6.
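
The exact RAID6 test command is not included here; an assumed equivalent of the 5-drive RAID5 RAM-drive test would be:
# assumed RAID6 counterpart of the 5x RAM-drive RAID5 array above
mdadm --create /dev/md0 --level=6 --chunk=4K --bitmap=internal --raid-devices=5 /dev/ram0 /dev/ram1 /dev/ram2 /dev/ram3 /dev/ram4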

Could you please help me understand why this happens and whether there is any chance to fix it?
Let me know if you need more detailed information about my environment or need me to run more tests.

Thank you in advance.

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: mdadm 4.1-5ubuntu1.2
ProcVersionSignature: Ubuntu 5.15.0-79.86~20.04.2-generic 5.15.111
Uname: Linux 5.15.0-79-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
ApportVersion: 2.20.11-0ubuntu27.27
Architecture: amd64
CasperMD5CheckResult: skip
Date: Tue Aug 15 08:54:39 2023
Lsusb: Error: command ['lsusb'] failed with exit code 1:
Lsusb-t:

Lsusb-v: Error: command ['lsusb', '-v'] failed with exit code 1:
MDadmExamine.dev.sda:
 /dev/sda:
    MBR Magic : aa55
 Partition[0] : 62914559 sectors at 1 (type ee)
MDadmExamine.dev.sda1: Error: command ['/sbin/mdadm', '-E', '/dev/sda1'] failed with exit code 1: mdadm: No md superblock detected on /dev/sda1.
MDadmExamine.dev.sda2:
 /dev/sda2:
    MBR Magic : aa55
MDadmExamine.dev.sda3: Error: command ['/sbin/mdadm', '-E', '/dev/sda3'] failed with exit code 1: mdadm: No md superblock detected on /dev/sda3.
MDadmExamine.dev.sda4: Error: command ['/sbin/mdadm', '-E', '/dev/sda4'] failed with exit code 1: mdadm: No md superblock detected on /dev/sda4.
MachineType: VMware, Inc. VMware Virtual Platform
ProcEnviron:
 LANGUAGE=en_US:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.15.0-79-generic root=/dev/mapper/main-root ro quiet
ProcMounts: Error: [Errno 40] Too many levels of symbolic links: '/proc/mounts'
SourcePackage: mdadm
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 11/12/2020
dmi.bios.release: 4.6
dmi.bios.vendor: Phoenix Technologies LTD
dmi.bios.version: 6.00
dmi.board.name: 440BX Desktop Reference Platform
dmi.board.vendor: Intel Corporation
dmi.board.version: None
dmi.chassis.asset.tag: No Asset Tag
dmi.chassis.type: 1
dmi.chassis.vendor: No Enclosure
dmi.chassis.version: N/A
dmi.ec.firmware.release: 0.0
dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvr6.00:bd11/12/2020:br4.6:efr0.0:svnVMware,Inc.:pnVMwareVirtualPlatform:pvrNone:rvnIntelCorporation:rn440BXDesktopReferencePlatform:rvrNone:cvnNoEnclosure:ct1:cvrN/A:sku:
dmi.product.name: VMware Virtual Platform
dmi.product.version: None
dmi.sys.vendor: VMware, Inc.
etc.blkid.tab: Error: [Errno 2] No such file or directory: '/etc/blkid.tab'
