Volume filesystem not optimized for HDFS

Bug #1395699 reported by Adrien Vergé
Affects: Sahara
Status: Fix Released
Importance: Medium
Assigned to: Adrien Vergé
Milestone: 2015.1.0

Bug Description

Default ext4 options and features are not well suited for HDFS, mainly because HDFS is itself another filesystem layered on top of ext4. Some features are simply unneeded, some hurt performance, and some do not work well with large files.

Here is a simple benchmark writing a 10 GB file, first with the default options, then with some options enabled/disabled:

# mkfs.ext4 /dev/vdb
# mount /dev/vdb /volumes/disk1
# dd if=/dev/zero of=/volumes/disk1/test conv=fsync bs=1M count=10000
-> 127 MB/s

# mkfs.ext4 -m 1 -O dir_index,extents,^has_journal /dev/vdb
# mount -o data=writeback,noatime,nodiratime /dev/vdb /volumes/disk1
# dd if=/dev/zero of=/volumes/disk1/test conv=fsync bs=1M count=10000
-> 898 MB/s

Changed in sahara:
assignee: nobody → Adrien Vergé (adrien-verge)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to sahara (master)

Fix proposed to branch: master
Review: https://review.openstack.org/136746

Changed in sahara:
status: New → In Progress
Changed in sahara:
milestone: none → kilo-1
importance: Undecided → Medium
Revision history for this message
Hank Jakiela (hjakiela) wrote :

To understand the performance figures in the description, we need to know what hardware underlies /dev/vdb. Is it a single spinning disk, some sort of RAID set with multiple disks, or an SSD? Assuming it is a single spinning disk, then:

The first case writes 10 GB to the disk, and conv=fsync makes dd wait until all the data has been written to the disk before returning. For a single spinning disk, that would run at about 100-150 MB/s, so 127 MB/s is in the right ballpark. This makes me think it is in fact a single spinning disk.

The second case, 898 MB/s, would be physically impossible for a single spinning disk. So either it's a multi-disk RAID set or an SSD, or data=writeback is allowing ext4 to ignore the conv=fsync, which would be a bug.

If you watch disk I/O traffic with iostat, collectl or a similar tool, you can clearly see whether disk I/O continues after dd finishes.
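
For instance, one could keep iostat running in a second terminal while dd is writing (an illustrative invocation; the exact columns depend on the sysstat version):

# iostat -dmx /dev/vdb 1
-> prints per-second statistics for /dev/vdb; the wMB/s column shows write
   throughput, so any writeback continuing after dd exits is immediately visible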

Revision history for this message
Adrien Vergé (adrienverge) wrote :

Thanks for your feedback, Hank.

It is indeed a single spinning disk. I have upgraded the VM kernel to a newer version (3.10) and now get somewhat different results. Here are the results for /dev/vdb (500 GB, mounted on /volumes/disk1) and /dev/vdc (formatted and mounted with the performance options on /volumes/disk2).

# dd if=/dev/zero of=/volumes/disk1/test bs=1M count=10000
-> approx. 390 MB/s

# dd if=/dev/zero of=/volumes/disk1/test bs=1M count=10000 conv=fsync
-> approx. 150 MB/s

# dd if=/dev/zero of=/volumes/disk2/test bs=1M count=10000
-> approx. 390 MB/s

# dd if=/dev/zero of=/volumes/disk2/test bs=1M count=10000 conv=fsync
-> approx. 390 MB/s

In all cases, iostat reports approx. 400 wkB/s during the copy. In tests 1, 3 and 4, disk I/O stops when dd returns, whereas during test 2 disk I/O stops after ~30 s and dd returns only a minute later.

Do you have any clue?

Revision history for this message
Hank Jakiela (hjakiela) wrote :

It appears that in your last test case, fsync is being ignored. The options conv=fsync and data=writeback contradict each other: one says to wait for all the data to be flushed to disk, the other says not to wait. Something has to give, and it seems fsync is what gets ignored.

However, I don't understand your description of the iostat results. In case 2, you should see all I/O completed before dd returns. In cases 1, 3 and 4, some of the I/O to disk actually takes place after dd returns, by as much as 30 seconds.
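
One way to check this directly (a diagnostic sketch, not part of the proposed fix; the paths are those from the tests above) is to trace the fsync call that dd issues and see how long it blocks:

# strace -T -e trace=fsync,fdatasync dd if=/dev/zero of=/volumes/disk2/test bs=1M count=1000 conv=fsync
-> -T prints the time spent in each syscall; if the fsync returns almost
   immediately while iostat still shows writes in flight, the data was not yet
   on stable storage when dd reported completion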

Revision history for this message
Adrien Vergé (adrienverge) wrote :

Hank, great explanation of conv=fsync + data=writeback. So this is not a bug, but rather incompatible options that make the result of test 4 incoherent.

About the iostat results: in test 2, all I/O is completed 30 seconds before dd returns, whereas I would expect both to happen at the same time. In tests 1, 3 and 4 (3 and 4 being equivalent because conv=fsync is ignored), I would expect what you said (dd returning while I/O continues), but I/O stops the moment dd returns.

I agree this is strange behavior. Could it be because iostat and dd are run in a virtual machine with virtual disks?

Changed in sahara:
milestone: kilo-1 → kilo-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to sahara (master)

Reviewed: https://review.openstack.org/136746
Committed: https://git.openstack.org/cgit/openstack/sahara/commit/?id=3e31824e4b20c35aeb61d1e9fdbb82b1ff21869f
Submitter: Jenkins
Branch: master

commit 3e31824e4b20c35aeb61d1e9fdbb82b1ff21869f
Author: Adrien Vergé <email address hidden>
Date: Fri Nov 21 10:55:19 2014 +0100

    Mount volumes with options for HDFS performance

    Enhance disk performance for volumes used as HDFS back-ends, by
    disabling unneeded filesystem options and enabling features optimized
    for large files.

    Change-Id: I3f919f19b83a6bb048a09a9ead6da35821e4174b
    Closes-Bug: #1395699
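
For reference, the effect of such a change can be checked on a provisioned node by inspecting the filesystem features and mount options (a diagnostic sketch; the device and mount point names are illustrative, and the exact options applied by the patch may differ from those benchmarked in the description):

# tune2fs -l /dev/vdb | grep -i 'features\|reserved block'
-> with the options from the description, this lists dir_index and extent but
   not has_journal, and shows a reduced reserved block count
# grep /volumes/disk1 /proc/mounts
-> with the options from the description, this shows noatime and data=writeback
   among the mount options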

Changed in sahara:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in sahara:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in sahara:
milestone: kilo-2 → 2015.1.0