RAID write performance is horrible. max_sectors_kb is set to the odd value of 127

Bug #1031260 reported by Freaky
This bug affects 1 person
Affects:      linux (Ubuntu)
Status:       Expired
Importance:   Medium
Assigned to:  Unassigned

Bug Description

RAID-5/6 write performance utterly sucks. Initially we were on RAID-6; since write performance was really poor, I converted the array to RAID-5. That conversion alone took around 9 days...

After I pointed out on the LIO mailing list (we export the array as an iSCSI volume through LIO) that write performance was horribly bad, they suggested I have a look at /sys/block/md?/queue/max_sectors_kb (and of course max_hw_sectors_kb). To my surprise both were set to 127, which is horrible performance-wise as it's not a binary multiple.
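
For anyone checking their own box, something like this prints both limits for every md device (a sketch; device names differ per system):

for q in /sys/block/md*/queue; do
  echo "$q: max_sectors_kb=$(cat "$q/max_sectors_kb") max_hw_sectors_kb=$(cat "$q/max_hw_sectors_kb")"
done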

Since max_hw_sectors_kb is 127 as well, I can't set max_sectors_kb any higher. All block devices used in the RAID set have their own values set to 4096. I think the optimum value would be something like (D-P)*512 (or with 1024/2048/4096 in place of 512, depending on chunk size), where D is the number of disks in the RAID and P is the number of parity disks (1 for RAID-5, 2 for RAID-6). Or at least something that's a binary multiple, which 127 definitely is not.
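
To make the arithmetic concrete, a sketch (the variable names and the 64 KiB chunk size are illustrative; only the 8-disk count is from my setup):

D=8; P=1        # 8-disk RAID-5: 7 data disks, 1 parity
CHUNK_KB=64     # assumed chunk size in KiB
echo $(( (D - P) * CHUNK_KB ))   # 448 KiB: one full stripe of data

A limit that is a whole multiple of the data-stripe size lets md issue full-stripe writes instead of read-modify-write cycles; 127 KiB never lines up with any stripe.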

Can't find much about it. Some guy reported his issues were gone once he upgraded the kernel to 3.3. Also saw this: http://www.spinics.net/lists/raid/msg38609.html

In any case, I'm quite shocked this hasn't been noticed/fixed. I get something like 20-30 MiB/s sustained sequential writes on an 8-disk RAID set (when writing over 10 GB with dd, for example; the first ~8 GB go fast thanks to buffering, but once the buffers are full performance collapses). Individual disks each do nearly 100 MiB/s sequential (even when they're all under load at the same time), and the CPUs are hardly loaded at all, so it's not a checksumming bottleneck.
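
For anyone reproducing the numbers: a sketch of the kind of test I mean (destructive, it writes raw to the device; md4p2 is just my partition). conv=fdatasync makes dd flush before reporting, so the figure isn't inflated by the page cache:

# WARNING: overwrites the target device
dd if=/dev/zero of=/dev/md4p2 bs=1M count=10240 conv=fdatasync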

1) 12.04 LTS server 64-bit (upgraded 8.04 -> 10.04 -> 11.?? (had issues with mvsas controllers; they've been replaced with LSI now, not that that helped this issue :)) -> 12.04)
2) 3.2.0-26-generic #41-Ubuntu SMP Thu Jun 14 17:49:24 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
3) Expected: at least 200-300 MiB/s sustained write performance (sequential, of course). Note the "at least" :).
4) Actual: crappy performance, around 20-30 MiB/s sustained sequential.

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: linux-image-3.2.0-26-generic 3.2.0-26.41
ProcVersionSignature: Ubuntu 3.2.0-26.41-generic 3.2.19
Uname: Linux 3.2.0-26-generic x86_64
AlsaDevices:
 total 0
 crw-rw---T 1 root audio 116, 1 Jul 8 14:29 seq
 crw-rw---T 1 root audio 116, 33 Jul 8 14:29 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.0.1-0ubuntu8
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
Date: Tue Jul 31 10:30:29 2012
HibernationDevice: RESUME=UUID=3680d1bc-fe0c-4584-98b3-4cfc8bb50c60
InstallationMedia: Ubuntu-Server 10.04 LTS "Lucid Lynx" - Release amd64 (20100427)
IwConfig:
 lo no wireless extensions.

 eth1 no wireless extensions.

 eth0 no wireless extensions.
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 001 Device 002: ID 8087:0020 Intel Corp. Integrated Rate Matching Hub
 Bus 002 Device 002: ID 8087:0020 Intel Corp. Integrated Rate Matching Hub
 Bus 001 Device 003: ID 0557:2221 ATEN International Co., Ltd
MachineType: Supermicro X8SIL
PciMultimedia:

ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-3.2.0-26-generic root=UUID=3615935e-453f-448c-a6ea-bd595a49da9c ro quiet
RelatedPackageVersions:
 linux-restricted-modules-3.2.0-26-generic N/A
 linux-backports-modules-3.2.0-26-generic N/A
 linux-firmware 1.79
RfKill: Error: [Errno 2] No such file or directory
SourcePackage: linux
UpgradeStatus: Upgraded to precise on 2012-07-06 (24 days ago)
dmi.bios.date: 02/25/2010
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 1.0c
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: X8SIL
dmi.board.vendor: Supermicro
dmi.board.version: 0123456789
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 24
dmi.chassis.vendor: Supermicro
dmi.chassis.version: 0123456789
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1.0c:bd02/25/2010:svnSupermicro:pnX8SIL:pvr0123456789:rvnSupermicro:rnX8SIL:rvr0123456789:cvnSupermicro:ct24:cvr0123456789:
dmi.product.name: X8SIL
dmi.product.version: 0123456789
dmi.sys.vendor: Supermicro

Revision history for this message
Freaky (freaky) wrote :
Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Freaky (freaky) wrote :

After talking in #ubuntu-server I also got a report from someone who has the same issue on 10.04 with the 2.6.32 kernel. So apparently this has been around for a while.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.5 kernel [0] (not a kernel in the daily directory) and install both the linux-image and linux-image-extra .deb packages.
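
For example (a sketch; the exact .deb filenames depend on the build you download):

sudo dpkg -i linux-image-3.5.0-*.deb linux-image-extra-3.5.0-*.deb
sudo reboot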

Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag (only that one tag; please leave the others). This can be done by clicking on the yellow pencil icon next to the tags located at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example if it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5-rc7-quantal/

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key
tags: added: needs-upstream-testing
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Freaky (freaky) wrote :

Hi, I will test with the requested 3.5 kernel shortly.

As already stated, however, I've found posts (on the kernel mailing list, IIRC) where a user stated he resolved the problem by upgrading his kernel to vanilla 3.3.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Maybe a quicker test would be to test v3.3 final. It can be downloaded from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.3-precise/

If we find this bug is fixed there, I can perform a "reverse" bisect to identify the commit that fixed it.

tags: added: needs-bisect
Revision history for this message
Freaky (freaky) wrote :

Hi, I've tested with 3.5 (it's still running). I took a different build than the one you linked, since you linked -rc7 and the official release was on the site too (so plain 3.5.0, no rc/prerelease). It seems somewhat resolved (note: I'm no kernel dev ;). By that I mean the values are much higher now; however, max_hw_sectors_kb has a value of 16383, which is an odd number: 16 MiB - 1 KiB. The minus 1 KiB makes no sense to me whatsoever.

I have done some simple tests with values like 512 (the default after boot) and 2048; most notably 12288 seems to perform best. That's the 64 KiB chunk size times the 6 data disks of my 7-disk RAID-5, times 32 (64*6*32 = 12288), i.e. 32 full data stripes.

That alone doubled my write performance without touching any other settings whatsoever.
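
For reference, the tuning itself is a one-liner as root (md4 is my array; the value resets on reboot, so persisting it would need e.g. a udev rule or boot script):

echo 12288 > /sys/block/md4/queue/max_sectors_kb
cat /sys/block/md4/queue/max_sectors_kb    # verify it took effect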

This is not very extensive, just a copy/paste from my console. I had 'while true; do killall -USR1 dd; sleep 20; done' running in another terminal so that dd printed statistics along the way.

At the time of testing nothing else was accessing any of the disks in the RAID array, and there was sufficient time between the dd runs for the kernel to flush its buffers. I didn't copy the output of free, but I did check the buffer sizes with it before starting each dd; they were no more than a couple of MiB, with over 3 GB of RAM free, which explains the (much) higher initial speeds.
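
For anyone repeating this, one way to get comparable runs (a sketch; dropping caches is harmless but evicts everything cached):

sync                               # flush dirty pages to disk
echo 3 > /proc/sys/vm/drop_caches  # drop page cache, dentries and inodes
free -m                            # confirm buffers/cache are near zero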

root@datavault:/sys/block/md4/queue# cat max_sectors_kb
512
root@datavault:/sys/block/md4/queue# cat max_hw_sectors_kb
16383
root@datavault:/sys/block/md4/queue# dd if=/dev/zero of=/dev/md4p2 bs=1M count=20480
2482+0 records in
2482+0 records out
2602565632 bytes (2.6 GB) copied, 27.5633 s, 94.4 MB/s
3795+0 records in
3795+0 records out
3979345920 bytes (4.0 GB) copied, 47.6073 s, 83.6 MB/s
4980+0 records in
4980+0 records out
5221908480 bytes (5.2 GB) copied, 67.7112 s, 77.1 MB/s
5383+0 records in
5383+0 records out
5644484608 bytes (5.6 GB) copied, 87.6912 s, 64.4 MB/s
5931+0 records in
5931+0 records out
6219104256 bytes (6.2 GB) copied, 107.683 s, 57.8 MB/s
6827+0 records in
6827+0 records out
7158628352 bytes (7.2 GB) copied, 127.683 s, 56.1 MB/s
7920+0 records in
7920+0 records out
8304721920 bytes (8.3 GB) copied, 147.699 s, 56.2 MB/s
8594+0 records in
8594+0 records out
9011462144 bytes (9.0 GB) copied, 167.704 s, 53.7 MB/s
9157+0 records in
9157+0 records out
9601810432 bytes (9.6 GB) copied, 187.723 s, 51.1 MB/s
9709+0 records in
9709+0 records out
10180624384 bytes (10 GB) copied, 207.755 s, 49.0 MB/s
11159+0 records in
11159+0 records out
11701059584 bytes (12 GB) copied, 227.724 s, 51.4 MB/s
12307+0 records in
12307+0 records out
12904824832 bytes (13 GB) copied, 247.744 s, 52.1 MB/s
12867+0 records in
12867+0 records out
13492027392 bytes (13 GB) copied, 267.759 s, 50.4 MB/s
13634+0 records in
13634+0 records out
14296285184 bytes (14 GB) copied, 287.759 s, 49.7 MB/s
14287+0 records in
14287+0 records out
14981005312 bytes (15 GB) copied, 307.784 s, 48.7 MB/s
15118+0 records in
15118+0 records out
15852371968 bytes (16 GB) copied, 327.795 s, 48.4 MB/s
15699+0 records in
15699+0 records out
16461594624 bytes (16 GB) copied, 347.775 s, 47.3 MB/s
16325+0 records in
16325+0 records out
17118003200 bytes (17 GB) copied, 367.835 s, 46.5 MB/s
16938+...


tags: added: kernel-fixed-upstream
removed: needs-upstream-testing
Revision history for this message
Freaky (freaky) wrote :

PS: do note I checked a while ago at home, on my desktop and my laptop. My laptop reports 32767 for my Vertex 3 SSD as max_hw_sectors_kb; I don't recall the exact number on my home desktop, but that one was a nice multiple of MiB instead of N MiB - 1 KiB. So this doesn't seem consistent either, and not only for MD devices.

I can't imagine a disk reporting a value that's not a nice (binary-wise) number. This might be kernel-related too...
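
A quick way to survey it across all block devices (a sketch):

for f in /sys/block/*/queue/max_hw_sectors_kb; do
  echo "$f: $(cat "$f")"
done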

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired