Fallocate support in innodb

Bug #892831 reported by Raghavendra D Prabhu
This bug affects 1 person
Affects          Status        Importance  Assigned to  Milestone
Percona Server (moved to https://jira.percona.com/projects/PS; status tracked in 5.7)
  5.1            Won't Fix     Wishlist    Unassigned
  5.5            Triaged       Wishlist    Unassigned
  5.6            Triaged       Wishlist    Unassigned
  5.7            Fix Released  Wishlist    Unassigned

Bug Description

Currently InnoDB physically writes zeroes to the file for:

InnoDB tablespace creation (ibdata), log file creation (ib_logfile*), InnoDB single-tablespace creation (ibd), and extension of tablespace files (both ibdata and ibd),

all of which makes these operations really slow. So I decided to add fallocate support to all of the above. Although faster creation of the initial files* is a benefit, the biggest gain should show up in extension, since extension happens while queries are running and also adds mutex overhead. Fallocate is, for practical purposes, an O(1) operation. I have tested it on XFS and ext4 filesystems on my box with small sizes and the results are really good, but it needs to be benchmarked on better systems.
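As a rough illustration of the difference described above, the following Python sketch contrasts creating a file by writing zeroes with creating it through a preallocation call. It is not taken from the patch: the function names and the chunk size are made up for this example, and Python's os.posix_fallocate is used only because it wraps the libc call of the same name. Note that on filesystems without native support, glibc is known to emulate posix_fallocate by touching each block, so the speedup is not guaranteed everywhere.

```python
import os

CHUNK = 1024 * 1024  # illustrative chunk size: write zeroes 1 MiB at a time


def create_by_writing_zeros(path, size):
    """Create a file of `size` bytes by writing zero bytes; cost grows with size."""
    zeros = bytes(CHUNK)
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(CHUNK, remaining)
            f.write(zeros[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())


def create_by_preallocating(path, size):
    """Create a file of `size` bytes by asking the filesystem to reserve the space."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.posix_fallocate(fd, 0, size)  # size must be greater than zero
        os.fsync(fd)
    finally:
        os.close(fd)
```

Both functions leave a file of the requested size that reads back as zeroes; the difference is how much data has to be pushed through the I/O path to get there.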

The code is here (commits 3547 to 3550) -- https://code.launchpad.net/~raghavendra-prabhu/+junk/mysql-server-fallocate and is based on the latest mysql-server tip from here -- bazaar.launchpad.net/%2Bbranch/mysql-server/ . It needs to be built by passing -DWITH_FALLOCATE=ON to cmake, and the system must also support fallocate (I added a feature test for that).

* Earlier, I saw a case where an InnoDB ibdata file was set to 2-3 TB and physically writing the zeroes took hours even on RAID, which adds significant time during downtime or when setting up fresh boxes.

PS: The only caveat so far is that on old ext4 (<= 2009) systems, Direct I/O with fallocate falls back to buffered I/O. XFS doesn't have this issue.

Stewart Smith (stewart)
Changed in percona-server:
importance: Undecided → Wishlist
status: New → Triaged
Stewart Smith (stewart) wrote :

We may not go with fallocate() all the time - as (at least for XFS, which is all anybody cares about) you then get unwritten extents, which means that as you fill up the datafile you're getting filesystem metadata log traffic for converting the unwritten extents to written extents, potentially having a performance impact that one wouldn't expect.

(we'd use posix_fallocate() instead of fallocate() as the posix version is portable)

Raghavendra D Prabhu (raghavendra-prabhu) wrote : Re: [Bug 892831] Re: Fallocate support in innodb

Regarding unwritten extents, I had a doubt about that*. However, after
discussing with the XFS developers, I understood that since unwritten extents became
the default years ago, the performance impact of converting unwritten extents to
written ones is negligible now, far outweighed by the benefits of fallocate and, of
course, better than writing zeroes.

Regarding fallocate, I went with fallocate instead of the posix variant because
posix_fallocate silently falls back to the old legacy behavior on unsupported systems,
which may not be desirable.
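To make the distinction above concrete, here is a minimal sketch of calling fallocate(2) directly so that lack of filesystem support is reported to the caller instead of being hidden by a fallback. It is an illustration only, not the patch's code: the helper name is hypothetical, and the ctypes signature assumes a 64-bit Linux system where off_t is 64 bits wide.

```python
import ctypes
import errno
import os

# Assumption: 64-bit Linux; fallocate(int fd, int mode, off_t offset, off_t len)
# with a 64-bit off_t. CDLL(None) resolves symbols from the running process's libc.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def try_native_fallocate(fd, offset, length):
    """Return True if fallocate(2) reserved the range, False if it is unsupported.

    Hypothetical helper: unlike posix_fallocate, nothing is emulated here, so
    the caller decides what to do when the filesystem cannot preallocate.
    """
    if _libc.fallocate(fd, 0, offset, length) == 0:
        return True
    err = ctypes.get_errno()
    if err in (errno.EOPNOTSUPP, errno.ENOSYS):
        return False
    raise OSError(err, os.strerror(err))
```

A caller could use the False result to log the condition and then choose explicitly between writing zeroes and skipping preallocation.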

* The doubt was that I wanted to test XFS-specific ioctls like XFS_IOC_RESVSP64 and
XFS_IOC_ZERO_RANGE, but their use was not encouraged since fallocate provides
a better interface and a more stable API (and fallocate calls these internally).

I also asked about this (in xfsctl man page)
"
If the XFS filesystem is configured to flag unwritten file extents, performance
will be negatively affected when writing to preallocated space, since extra
filesystem transactions are required to convert extent flags on the range of
the file written."

-- seems this statement no longer applies and will be removed.

* On Mon, Nov 21, 2011 at 09:21:31AM -0000, Stewart Smith <email address hidden> wrote:
>We may not go with fallocate() all the time - as (at least for XFS,
>which is all anybody cares about) you then get unwritten extents, which
>means that as you fill up the datafile you're getting filesystem
>metadata log traffic for converting the unwritten extents to written
>extents, potentially having a performance impact that one wouldn't
>expect.
>
>(we'd use posix_fallocate() instead of fallocate() as the posix version
>is portable)

Stewart Smith (stewart) wrote :

On Mon, 21 Nov 2011 13:43:38 -0000, Raghavendra D Prabhu <email address hidden> wrote:
> Regarding unwritten extents, I had a doubt regarding that*. However, after
> discussing with XFS developers, I understood that since unwritten extents became
> default years ago, the performance impact in converting unwritten extents to
> written one are negligible now, far outweighed by benefits of fallocate and, of
> course, better than writing zeroes.

There's still a performance impact of converting them - it's file system
metadata IO.

>
> Regarding fallocate, I went with fallocate instead of the posix
> variant because
> posix_fallocate fallsback to old legacy behavior on unsupported
> systems silently
> which may not be desirable.

We use posix_fallocate() in NDB because of the portability (IIRC to
Solaris) and just live with the fact that this may not always be
optimal.

By preallocating and then writing zeros you get the best of both worlds:
you tell the allocator that you want huge chunks of disk and you don't
have the performance impact of unwritten extents.

This is mostly only a benefit when doing parallel operations or direct
IO on non-empty file systems. IIRC InnoDB does not do parallel init of files.
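The "preallocate, then write zeros" approach described above could be sketched as follows. This is only an illustration under assumptions: the function name and chunk size are invented for the example, and it is not how NDB or InnoDB actually implement it.

```python
import os


def preallocate_then_zero(path, size, chunk=1024 * 1024):
    """Reserve `size` bytes in one request, then overwrite the range with zeroes.

    The reservation tells the allocator the full size up front; the explicit
    zero writes mean later writes do not hit unwritten extents.
    """
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.posix_fallocate(fd, 0, size)  # size must be greater than zero
        zeros = bytes(chunk)
        offset = 0
        while offset < size:
            n = min(chunk, size - offset)
            # pwrite may write fewer bytes than requested; advance by what it reports
            offset += os.pwrite(fd, zeros[:n], offset)
        os.fsync(fd)
    finally:
        os.close(fd)
```

The trade-off is that this pays the full cost of writing the zeroes, so it keeps the allocation benefit but not the speed of preallocation alone.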

--
Stewart Smith

Raghavendra D Prabhu (raghavendra-prabhu) wrote : Re: Re: [Bug 892831] Re: Fallocate support in innodb

Hi Stewart,

* On Tue, Nov 22, 2011 at 12:49:14AM -0000, Stewart Smith <email address hidden> wrote:
>On Mon, 21 Nov 2011 13:43:38 -0000, Raghavendra D Prabhu <email address hidden> wrote:
>> Regarding unwritten extents, I had a doubt regarding that*. However, after
>> discussing with XFS developers, I understood that since unwritten extents became
>> default years ago, the performance impact in converting unwritten extents to
>> written one are negligible now, far outweighed by benefits of fallocate and, of
>> course, better than writing zeroes.
>
>There's still a performance impact of converting them - it's file system
>metadata IO.
>
>>
>> Regarding fallocate, I went with fallocate instead of the posix
>> variant because
>> posix_fallocate fallsback to old legacy behavior on unsupported
>> systems silently
>> which may not be desirable.
>
>We use posix_fallocate() in NDB because of the portability (IIRC to
>Solaris) and just live with the fact that this may not always be
>optimal.
>
>By preallocating and then writing zeros you get the best of both worlds:
>you tell the allocator that you want huge chunks of disk and you don't
>have the performance impact of unwritten extents.

Thanks, got it. I will add posix_fallocate to that as well.

>
>This is mostly only a benefit when doing parallel operations or direct
>IO on non-empty file systems. IIRC InnoDB does not do parallel init of files.
Currently, InnoDB init of a single file takes a really long time due to repeated
writing/syncing, so a lot of time will be saved there.

Also, init is only one part; the rest deals with the autoextension of shared
tablespace (ibdata) and single tablespace (ibd) files, which is where many
performance issues lurk. Currently the autoextension code, as I read it, sits
behind several layers of mutexes, based on the assumption that extension is
time-consuming or complex when it shouldn't be.

On bugs.mysql I have seen people asking for a separate thread for it, and also for fsyncing only during extension
(facebook/innodb_io patches) and using fdatasync otherwise.

I also noticed that ibd files have no autoextend increment option: they are extended in
small increments up to one extent, and after that in short increments based on the actual
request (a multiple of the extent size -- FSP_FREE_ADD). Again, this can cause
severe fragmentation of the file as a whole on heavily loaded systems. This is where
fallocate can help the most, IMO. Currently InnoDB doesn't provide a
variable to set this size (an autoextend increment can be defined only for
shared ibdata files), so I have added a variable called
innodb_auto_extend_increment_single which, when non-zero, defines the autoextend increment
for these files. Since fallocate is O(1) for practical purposes, the size
shouldn't matter, up to a point.
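The idea of extending a file in fixed increments through a preallocation call, as described above, can be sketched like this. The function and parameter names are hypothetical and are not InnoDB's; `increment` merely stands in for a setting such as innodb_auto_extend_increment_single.

```python
import os


def extend_in_increments(fd, needed_size, increment):
    """Grow the file behind `fd` so it covers `needed_size` bytes.

    The growth is rounded up to a whole number of `increment` bytes, so a
    burst of small requests results in a few large allocations rather than
    many small ones. Returns the resulting file size.
    """
    current = os.fstat(fd).st_size
    if needed_size <= current:
        return current  # already large enough, nothing to do
    grow = needed_size - current
    steps = -(-grow // increment)  # ceiling division
    new_size = current + steps * increment
    os.posix_fallocate(fd, current, new_size - current)
    return new_size
```

With a 64 KiB increment, for example, a request for one more byte grows the file by a full 64 KiB, and subsequent requests within that range return without touching the filesystem.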

Laurynas Biveinis (laurynas-biveinis) wrote :

See https://blueprints.launchpad.net/percona-server/+spec/atomic-writes-beta-5.5, which happens to introduce posix_fallocate() for that one use scenario.

Laurynas Biveinis (laurynas-biveinis) wrote :

5.7 appears to use posix_fallocate() for extending tablespaces.

tags: added: innodb upstream
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-2367
