Slow file extend when posix_fallocate used on SSD file storage.

Bug #1286114 reported by Jan Lindström on 2014-02-28
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona Server moved to
Fix Released
Laurynas Biveinis
Fix Released
Laurynas Biveinis
Fix Released
Laurynas Biveinis

Bug Description

Analysis: posix_fallocate was called with 0 as the offset and the desired total file size as len, so each extension covers the whole file again. This is not optimal for SSDs.

Fix: Call posix_fallocate with the correct offset, i.e. the current file size, and extend the file by len bytes from there.

Suggested fix (for 5.5; 5.6 is also affected):

--- fil0fil.c.ORIG 2014-02-27 18:03:33.135333993 +0200
+++ fil0fil.c 2014-02-27 18:04:21.795335296 +0200
@@ -4953,20 +4953,30 @@

  if (srv_use_posix_fallocate) {
- offset_high = (size_after_extend - file_start_page_no)
- * page_size / (4ULL * 1024 * 1024 * 1024);
- offset_low = (size_after_extend - file_start_page_no)
- * page_size % (4ULL * 1024 * 1024 * 1024);
+ ib_int64_t start_offset = start_page_no * page_size;
+ ib_int64_t end_offset = (size_after_extend - start_page_no) * page_size;
+ ib_int64_t desired_size = size_after_extend*page_size;

- success = os_file_set_size(node->name, node->handle,
- offset_low, offset_high);
+ if (posix_fallocate(node->handle, start_offset, end_offset) == -1) {
+ fprintf(stderr, "InnoDB: Error: preallocating file "
+ "space for file \'%s\' failed. Current size "
+ " %lld, len %lld, desired size %lld\n",
+ node->name, start_offset, end_offset, desired_size);
+ success = FALSE;
+ } else {
+ success = TRUE;
+ }
   if (success) {
    node->size += (size_after_extend - start_page_no);
    space->size += (size_after_extend - start_page_no);
    os_has_said_disk_full = FALSE;
   fil_node_complete_io(node, fil_system, OS_FILE_READ);
   goto complete_io;

tags: added: xtradb
tags: added: contribution


Yes, it does look like the offset is always taken to be 0 unconditionally
(in os_file_set_size). However, regarding the I/O, fallocate (which
posix_fallocate uses unless fallocate is unavailable) is a no-op for the
already written/allocated parts of the file (i.e., it won't zero out any
written data), so the zero offset shouldn't do harm here (unless the filesystem
does something awry). Also, since fallocate doesn't involve
any I/O (due to lazy extent allocation), it shouldn't add
I/O pressure.

However, there is another side to posix_fallocate: it falls
back to pwrite on filesystems/kernels where fallocate is not supported
(say, tmpfs, for which support was added in 2011 or so). There it may
end up doing I/O, but again it shouldn't do anything to
already written data (as per the posix_fallocate specification), so the
I/O should be the same as extending from that offset.
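
The claim above, that posix_fallocate leaves already written data untouched even when the requested range starts at offset 0, can be checked with a small C sketch. The helper name check_fallocate_preserves_data and the temporary file path are hypothetical and are not part of InnoDB. The sketch only checks data integrity and the resulting file size; it says nothing about how much I/O a given filesystem performs for the call.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not InnoDB code.
 * Writes a short marker at offset 0, then calls posix_fallocate() over
 * the range [0, new_size), and verifies that the marker is unchanged
 * and that the file is now new_size bytes long.
 * new_size is assumed to be larger than the marker.
 * Returns 0 if both checks pass, -1 on a setup or I/O error,
 * -2 if a check fails, or the positive error number reported by
 * posix_fallocate(). */
int check_fallocate_preserves_data(off_t new_size)
{
    char path[] = "/tmp/falloc_demo_XXXXXX";   /* assumed scratch location */
    const char marker[] = "innodb";
    char buf[sizeof marker];
    struct stat st;
    int result = -1;
    int err;

    int fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    unlink(path);   /* the file disappears once fd is closed */

    if (pwrite(fd, marker, sizeof marker, 0) != (ssize_t) sizeof marker) {
        goto out;
    }

    /* Offset 0, length = desired total size: the call pattern that
     * the report describes for os_file_set_size. */
    err = posix_fallocate(fd, 0, new_size);
    if (err != 0) {
        result = err;
        goto out;
    }

    if (pread(fd, buf, sizeof buf, 0) != (ssize_t) sizeof buf) {
        goto out;
    }
    if (fstat(fd, &st) != 0) {
        goto out;
    }

    result = (memcmp(buf, marker, sizeof marker) == 0
              && st.st_size == new_size) ? 0 : -2;
out:
    close(fd);
    return result;
}
```

Calling check_fallocate_preserves_data(1 << 20) should return 0 on a filesystem where posix_fallocate behaves as specified.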

In the SSD case, which filesystem/kernel combinations were in use?

Jan Lindström (jplindst) wrote :

This was observed with Fusion-io ioDrive2, driver version 3.3.4, build 5833069, file system nvmfs, Linux 3.4.12. Based on performance tests, fallocate on already written/allocated parts is not just a no-op here (this could be a missing file system feature).

R: Jan


Ah, yes, that explains it: the filesystem support for fallocate may be lacking or incomplete here. (For other common in-tree filesystems, such as ext4, XFS, tmpfs, and btrfs, there shouldn't be any issues.)

Jan, your patch replaced file_start_page_no with start_page_no in offset calculations. Was that intentional?

Jan Lindström (jplindst) wrote :

Yes, it is, but you may change that. The idea is to call posix_fallocate(fd, current_size_of_file, size_to_extent). I had a slightly different version when I first fixed this issue.
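
A minimal C sketch of that idea, preallocating only the new tail of the file, might look as follows. The function name extend_file_by_pages and its parameters are hypothetical; this is not the code that was committed to Percona Server. One detail worth noting: posix_fallocate returns 0 on success and a positive error number on failure (it does not return -1 and set errno), so the sketch compares against 0 rather than -1 as the proposed patch does.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper, not InnoDB code.
 * Extends the file behind fd by n_pages pages of page_size bytes,
 * starting at the current end of file, so that the already allocated
 * part of the file is never passed to posix_fallocate().
 * Returns 0 on success or an errno-style error number on failure. */
int extend_file_by_pages(int fd, off_t n_pages, off_t page_size)
{
    struct stat st;

    if (n_pages <= 0 || page_size <= 0) {
        return EINVAL;
    }

    /* current_size_of_file in the comment above */
    if (fstat(fd, &st) != 0) {
        return errno;
    }

    /* offset = current size, len = number of bytes to add */
    return posix_fallocate(fd, st.st_size, n_pages * page_size);
}
```

For example, extend_file_by_pages(fd, 64, 16384) would request 1 MiB beyond the current end of the file, assuming 16 KiB pages.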

Adjusted the title as Percona Server does not have such a public option.

summary: - Slow file extend when innodb_use_fallocate=1 and SSD file storage.
+ Slow file extend when posix_fallocate used on SSD file storage.

Percona now uses JIRA for bug reports so this bug report is migrated to:
