ext3/4 fdatasync does not flush disk cache

Bug #504632 reported by Miron Cuperman
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Fix Released
Importance: Medium
Assigned to: Surbhi Palande
Milestone: (none)

Bug Description

The following program illustrates the problem. When run on ext3 or ext4 with data=ordered or data=writeback, it completes in under a second. Since each fdatasync should take at least one disk revolution (~10 ms), the total time taken should be ~10 seconds.

The program works as expected (i.e. takes ~10 seconds or more) in the following cases: XFS, reiserfs, and ext4 with data=journal. It also works correctly when the pwrite is replaced with a write, and when the drive's write cache is disabled.

This bug will likely cause database systems to fail to commit transactions in a durable manner.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>   /* pwrite, fdatasync, close */

int main(void) {
  int fd = open("test_file", O_RDWR | O_CREAT | O_TRUNC, 0600);
  if (fd < 0) {
    perror("open");
    return 1;
  }
  char byte;
  int i = 1000;
  while (i-- > 0) {
    byte = i & 0xFF;
    pwrite(fd, &byte, 1, 0);  /* overwrite byte 0: file length never changes */
    fdatasync(fd);            /* should force the data to stable storage */
  }
  close(fd);
  return 0;
}

ProblemType: Bug
Architecture: i386
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: guest1 2176 F.... pulseaudio
CRDA:
 country CO:
  (2402 - 2472 @ 40), (3, 27)
  (5170 - 5250 @ 20), (3, 17)
  (5250 - 5330 @ 20), (3, 23), DFS
  (5735 - 5835 @ 20), (3, 30)
Card0.Amixer.info:
 Card hw:0 'SB'/'HDA ATI SB at 0xf7ff4000 irq 16'
   Mixer name : 'Realtek ALC1200'
   Components : 'HDA:10ec0888,104382fe,00100101'
   Controls : 40
   Simple ctrls : 22
Date: Thu Jan 7 22:41:51 2010
DistroRelease: Ubuntu 9.10
HibernationDevice: RESUME=/dev/sda6
MachineType: System manufacturer System Product Name
NonfreeKernelModules: nvidia
Package: linux-image-2.6.31-16-generic 2.6.31-16.53
ProcCmdLine: root=UUID=305df5de-09e0-4f0f-98fe-eda2d4f95e0a ro
ProcEnviron:
 PATH=(custom, user)
 LANG=en_US.UTF-8
 SHELL=/bin/zsh
ProcVersionSignature: Ubuntu 2.6.31-16.53-generic
RelatedPackageVersions:
 linux-backports-modules-2.6.31-16-generic N/A
 linux-firmware 1.25
RfKill:
 0: phy0: Wireless LAN
  Soft blocked: no
  Hard blocked: no
SourcePackage: linux
Uname: Linux 2.6.31-16-generic i686
WpaSupplicantLog:

dmi.bios.date: 08/28/2008
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 0701
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: M3A78-EM
dmi.board.vendor: ASUSTeK Computer INC.
dmi.board.version: Rev X.0x
dmi.chassis.asset.tag: Asset-1234567890
dmi.chassis.type: 3
dmi.chassis.vendor: Chassis Manufacture
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr0701:bd08/28/2008:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKComputerINC.:rnM3A78-EM:rvrRevX.0x:cvnChassisManufacture:ct3:cvrChassisVersion:
dmi.product.name: System Product Name
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

Revision history for this message
Miron Cuperman (devrandom) wrote :

Note that I have also confirmed this bug in Hardy.

Revision history for this message
Miron Cuperman (devrandom) wrote :

Latest mainline kernel seems to work right, with barrier=1 on both ext3 and ext4.

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Andy Whitcroft (apw)
tags: added: karmic
Revision history for this message
Surbhi Palande (csurbhi) wrote :

@Miron Cuperman, can you please confirm whether the latest Ubuntu kernel fixes this for you?

Revision history for this message
Miron Cuperman (devrandom) wrote :

The latest *mainline* kernel seems to fix it. This is not the ubuntu distribution kernel.

The version I'm referring to is 2.6.33-999.201001071308.

I have not checked other branches.

Revision history for this message
Surbhi Palande (csurbhi) wrote :

@Miron Cuperman, in ext3/4 with data=writeback and data=ordered, only metadata is written to the journal, whereas in data=journal mode both data and metadata are journalled, so there is more writing to do in that mode.
Similarly, with barrier=1 a wait is performed where needed to keep the writes correctly ordered, so barrier=1 is slower than barrier=0. The fsync performance seems appropriate here. Can you point to any specific application that relies on fsync() completion time (and is currently failing)?

Changed in linux (Ubuntu):
importance: Medium → Low
Revision history for this message
Miron Cuperman (devrandom) wrote :

This is not an issue of performance; it is an issue of data durability in the face of power loss.

For fsync and fdatasync to work correctly, they must flush file data to the disk platters before returning, so that the calling application can be confident the transaction is durable. All applications that use these system calls depend on the system to flush file data correctly. This includes all database systems, mail servers, and other software.

Disks rotate at roughly 100 revolutions per second. This means a single-threaded application cannot complete more than about 100 fsync or fdatasync calls per second if the data is actually flushed to the platters.

The fact that the test program runs at more than 1000 calls per second means that data is not being flushed correctly to disk.

Does this clarify the problem?

Revision history for this message
Surbhi Palande (csurbhi) wrote :

@Miron Cuperman, I believe that the files do get written to the disk; the fsync time may not be a measure of that. You can verify this by writing to a file, syncing it, unmounting and remounting the partition (or rebooting the machine), and then rereading the contents of the file. If the files are not getting written at all, then this is a bigger "data loss" bug.

Changed in linux (Ubuntu):
status: Triaged → Invalid
status: Invalid → Triaged
Revision history for this message
Surbhi Palande (csurbhi) wrote :

@Miron Cuperman, can you please verify whether the file data is failing to be written persistently, by actually rereading the data as described above?

Revision history for this message
Surbhi Palande (csurbhi) wrote :

@Miron Cuperman, I see what you are saying. However, if an application wants to ensure that at flush time the data reaches the hard disk rather than the disk cache, then the disk cache should be turned off (as you mentioned in your description, this gives the correct behavior). The only other method is a battery-backed cache.
Please refer to http://xfs.org/index.php/XFS_FAQ for more details.

As I mentioned before, you are probably seeing the speed difference in data=journal mode because more data has to be written. That mode does not ensure that your data reaches the hard disk rather than the disk cache. I do not believe a filesystem should guarantee that data bypasses the disk cache; the same effect can be obtained by disabling the cache, so it need not be implemented in the filesystem. So if a database application MUST have the data on disk, it should simply turn off the disk cache (and take the performance hit) or use a battery-backed cache.

I am marking this bug as Invalid, since data going into the cache while the disk cache is enabled is expected behavior for ext3/4.

Changed in linux (Ubuntu):
status: Triaged → Invalid
Revision history for this message
Miron Cuperman (devrandom) wrote :

All modern disk drives support cache flushing and/or write barriers. For example, if you look at the barrier mount option for ext3/4, you will see that the filesystems guarantee proper cache flushing for journal commits.

fsync and fdatasync are definitely designed to work with disk caches and should commit to the platters. The POSIX man page ( http://www.opengroup.org/onlinepubs/009695399/functions/fsync.html ) says:

"The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk."

Also, the current implementation of fsync works *correctly* when the length of the file changes. The problem I identified happens only when the length of the file is constant.

Mounting with data=journal (and the cache still on) *works around the problem* by triggering different kernel code. Only a few KB are written to the journal. The 10x speed difference is not due to the few extra KB/s; it is because the flush command is correctly sent to the disk in this scenario. The number of transactions per second is then equal to the rotational frequency of the disk (~100 per second).

To summarize, there is a kernel bug when fsync is called on a file whose length does not change (such as a DB file). The kernel fails to send a flush/barrier command to the drive in this scenario (and only in this scenario).

Revision history for this message
Miron Cuperman (devrandom) wrote :

Changed the status to New, since this needs to be looked at again.

Please let me know if you need further clarification about the symptoms / cause.

Changed in linux (Ubuntu):
status: Invalid → New
Revision history for this message
Miron Cuperman (devrandom) wrote :

The latest Karmic kernel (2.6.31-20) solves the issue for ext4. The problem persists in ext3.

The fix for ext4 might be in one of these:

  * ext4: fix cache flush in ext4_sync_file
    - LP: #496816
  * ext4: Make non-journal fsync work properly
    - LP: #496816
  * ext4: Wait for proper transaction commit on fsync
    - LP: #496816

This still needs fixing for ext3.

Revision history for this message
Jeremy Foshee (jeremyfoshee) wrote :

Hi Miron,

If you could also please test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Confirmed
Surbhi Palande (csurbhi)
Changed in linux (Ubuntu):
assignee: nobody → Surbhi Palande (csurbhi)
importance: Low → Medium
status: Confirmed → In Progress
Revision history for this message
Surbhi Palande (csurbhi) wrote :

@Miron Cuperman, thanks for your insightful comments :) I have written a patch for ext3 and submitted it upstream. When it is accepted, we will pull it into Ubuntu.

Revision history for this message
Surbhi Palande (csurbhi) wrote :

@Miron Cuperman, in fact I missed a patch which is already in Lucid and wrote a similar one. Can you kindly check whether ext3 shows the results you expect? This is the commit that is expected to give you those results:
commit 56fcad29d4b3cbcbb2ed47a9d3ceca3f57175417
Author: Jan Kara <email address hidden>
Date: Tue Sep 8 14:59:42 2009 +0200
    ext3: Flush disk caches on fsync when needed

I will need to investigate, if things are not working for you. Thanks!

Revision history for this message
Miron Cuperman (devrandom) wrote :

I tried the Lucid nightly iso, and I'm happy to report that both ext3 and ext4 are working fine.

Thanks.

Surbhi Palande (csurbhi)
Changed in linux (Ubuntu):
status: In Progress → Fix Released