
Ext4 data loss

Reported by Bogdan Gribincea on 2009-01-16
This bug affects 22 people
Affects                          Importance  Assigned to
ecryptfs-utils (Ubuntu)          High        Unassigned
ecryptfs-utils (Ubuntu) Jaunty   High        Unassigned
linux (Ubuntu)                   High        Tim Gardner
linux (Ubuntu) Jaunty            High        Tim Gardner

Bug Description

I recently installed Kubuntu Jaunty on a new drive, using Ext4 for all my data.

The first time I had this problem was a few days ago when, after a power loss, ktimetracker's config file was replaced by a 0-byte version. No idea if anything else was affected; I just noticed ktimetracker right away.

Today, I was experimenting with some BIOS settings that made the system crash right after loading the desktop. After a clean reboot pretty much any file written to by any application (during the previous boot) was 0 bytes.
For example Plasma and some of the KDE core config files were reset. Also some of my MySQL databases were killed...

My ext4 partitions all use the default settings with no performance tweaks: barriers on, extents on, ordered data mode.

I used Ext3 for 2 years and I never had any problems after power losses or system crashes.

Jaunty has all the recent updates except for the kernel, which I don't upgrade because of bug #315006.

ProblemType: Bug
Architecture: amd64
DistroRelease: Ubuntu 9.04
NonfreeKernelModules: nvidia
Package: linux-image-2.6.28-4-generic 2.6.28-4.6
ProcCmdLine: root=UUID=81942248-db70-46ef-97df-836006aad399 ro rootfstype=ext4 vga=791 all_generic_ide elevator=anticipatory
ProcEnviron:
 LANGUAGE=
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.28-4.6-generic
SourcePackage: linux


Changed in linux:
importance: Undecided → High
status: New → Triaged
Ben Hodgetts (enverex) wrote :

I thought it was worth adding this even though I'm running Gentoo, as it seems to be exactly the same issue:

I recently upgraded to ext4 as well. I ran a game in Wine and the system hardlocked (nothing special there with the fglrx drivers). After rebooting, all my Wine registry files were 0 bytes, as were many of my Gnome configuration files. Absolute nightmare. fsck on boot said that it had removed 760+ orphaned inodes.

Mounted as:
/dev/root on / type ext4 (rw,noatime,barrier=1,data=ordered)

Ben Hodgetts (enverex) wrote :

Additional: ext4 was implemented as a clean format, not an upgrade of any sort (backed up, formatted and copied back over).

Matt Drake (mattduckman) wrote :

This has happened to me twice, the first time erasing Firefox settings, and the second time erasing gnome-terminal settings. Both cases were caused by a kernel panic locking up the system. Also, both times the program whose settings were affected was in use during the kernel panic.

An important note is that these data losses have taken place on an ext3 partition that is mounted as ext4 in fstab, so it is not a true ext4 partition.

This is taking place on fully up-to-date Jaunty.

Pavel Rojtberg (rojtberg) wrote :

I also had data loss with ext4. The "feature" responsible for this is delayed allocation.
With delayed allocation on, all disk writes are held back in memory, so if you just cut the power the data is lost.

Basically the old version should still be available, but perhaps ext4 decides that a zeroed file is more "consistent".

Anders Aagaard (aagaande) wrote :

Delayed allocation means skipping the allocation step when writing a file, not keeping data in memory. I'd say this is more likely to be related to barriers, but that's only because of my hate for how ext handles barriers in a non-safe way.

Theodore Ts'o (tytso) wrote :

Ben --- can you tell me what version of the kernel you are using? Since you are a Gentoo user, it's not obvious to me, nor is it obvious whether you have any ext4-related patches installed.

Bogdan --- *any* files written during the previous boot cycle?

I've done some testing, using Ubuntu Intrepid and a stock (unmodified) 2.6.28 kernel on a Lenovo S10 netbook (my crash-and-burn machine; great for doing testing :-). On it, I created a fresh ext4 filesystem on an LVM partition, and I used as a test source a directory /home/worf, a test account that had been used briefly right after I installed it, so it has gnome dot files plus a relatively small number of files in the Firefox cache. Its total size is 21 megabytes.

I then created a ext4 filesystem, and then tested it as follows:

% sudo bash
# cp -r /home/worf /mnt ; sleep 120; echo b > /proc/sysrq-trigger

After the system was forcibly rebooted (the echo b > /proc/sysrq-trigger emulates a crash), I checked the contents of /mnt/worf using cp -r and cfv, and then repeated the test with different sleep times. What I found was that at sleep times above 65 seconds, all of /mnt/worf was safely written to disk. Below 30 seconds, none of /mnt/worf was written to disk. If the sleep 120 was replaced with a sync, everything was written to disk.

How aggressively the system writes things back out to disk can be controlled via some tuning parameters, in particular /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_writeback_centisecs. The latter, in particular, will be adjusted by laptop_mode and other tools that try to extend battery lifespan.

So the bottom line is that I'm not able to replicate any data loss except for very recently written data before a crash, and this can be controlled by explicitly using the "sync" command or adjusting how aggressively the system writes back dirty pages via /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_writeback_centisecs.
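The distinction drawn above can be sketched in Python (a hypothetical illustration, not code from this thread): freshly written data sits in the page cache until the writeback timers fire, unless the application forces it out itself.

```python
import os
import tempfile

# Data written by an application lives only in the page cache until the
# kernel's writeback timers (dirty_expire_centisecs /
# dirty_writeback_centisecs) flush it -- unless we force it out.
path = os.path.join(tempfile.gettempdir(), "recently_written.txt")
with open(path, "w") as f:
    f.write("data at risk until flushed\n")
    f.flush()             # moves data into the page cache, not onto disk
    os.fsync(f.fileno())  # forces this one file's data to stable storage
os.sync()                 # or flush everything, like running "sync"
```

A crash in the window between the write and the fsync()/sync() is exactly where the reports above lose data.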

It would be useful if you could send me the output of "sysctl -a", and if you can tell me whether the amount of data that you are losing is decreased if you explicitly issue the "sync" command before the crash (which you can simulate via "echo b > /proc/sysrq-trigger").

Ben Hodgetts (enverex) wrote :

Kernel is Gentoo's own:

Linux defiant 2.6.28-gentoo #4 SMP Sat Jan 3 21:56:33 GMT 2009 x86_64 Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz GenuineIntel GNU/Linux

The files that were zeroed when my machine hardlocked were, I'd imagine, the ones that were in use; my desktop env is Gnome and I was running a game in Wine. Wine's reg files, which it would have had open, were wiped, and so were my Gnome terminal settings. Not sure how often it would have been writing to them, but it would only have been tiny amounts of data.

Pavel Rojtberg (rojtberg) wrote :

In my case the zeroed files were also just updated, not created.

My "test scenario" was starting an OpenGL application with unsaved source files. Then all graphics froze because of a bug in fglrx.

If I immediately powered off the machine my source file was empty. If I waited a bit everything was saved fine. Likely because I waited long enough for the changes to be written to disk...

Just to clarify, my Ext4 partitions were all 'true' ext4 not converted from ext3.

It happened again. Somehow, when trying to log out, KDM crashed. After rebooting I had zeroed config files in a few KDE apps, plus some log files (Pidgin).
I converted / and /home back to ext3. This is extremely annoying; it reminds me of Windows 9x.

I will have some free time next week and I'll try testing this in a virtual machine.

Andy Whitcroft (apw) wrote :

@Bogdan Gribincea -- am i correct in thinking that you are using the ext4 support in the Intrepid kernel?

@Andy: I am the bug report starter and I attached all the logs generated by the ubuntu-bug command.
And, no, it's Jaunty with its 2.6.28 kernel and the 'stable' ext4 support. Also, the partitions were created as ext4 on a new drive, not converted from ext3.

Kai Mast (kai-mast) wrote :

I can confirm this with AMD64 and Ubuntu Jaunty

Niclas Lockner (niclasl) wrote :

I have experienced some issues with ext4 and data losses too, but more extreme than what you all describe. I installed Jaunty Alpha 3 two days ago and have all the updates installed. In the two days since the install I have lost data on three occasions. The strangest losses were:
* the computer wiped out a whole network share mounted in fstab
* the computer also removed ~/.icons when I emptied the trash

The data losses never happened after a crash or power failure.

Veovis (masterkedri) wrote :

I was just browsing the forums on Ubuntu ( http://ubuntuforums.org/showthread.php?t=1055176 ), where I read of a bug involving symbolic links. The bug is that if you delete a folder that is a symbolic link, it will delete the contents of the target folder as well as the symbolic link itself, as if it were not a symbolic link.

Does this sound like it could have been the situation?

Ben Hodgetts (enverex) wrote :

No Veovis, please read the bug; that has nothing to do with the actual report here.

Just a question: would data=journal in /etc/fstab be a workaround until this bug is fixed?

(Unfortunately I cannot set this option in fstab for the root partition, because initramfs does not support that feature!) But you may try it with your home and other partitions, if you have them.

data=journal deactivates delalloc. It should put both data and metadata into the journal, so I hope that recently opened files would not end up with 0-byte size. I won't comment further; it's just an idea.

Andy Whitcroft (apw) wrote :

Talking to Ted on this one we believe that the trigger for the data loss has been found and included in a new upstream stable update. The patches for ext4 have been picked up and applied to the jaunty kernel. Kernel 2.6.28-7.19 should contain those. If any of you are able to test and report back that would be most helpful.

Changed in linux:
assignee: nobody → apw
status: Triaged → Incomplete
Steve Langasek (vorlon) wrote :

Should we mark this bug as 'fix released' unless someone shows otherwise?

Hi,

just updated my system. While this was in process, I tried to switch from compiz to metacity (checking for another bug). The X server froze and I switched to tty2 to stop gdm. This took a long time, and afterwards even the Xorg process hung. I entered reboot, but the system could not get past the last third of the shutdown procedure. I used SysRq+S, +U, +B.

After reboot I saw: /home was unmounted unclean, check forced.

When I came into Gnome again, my compiz settings were partially cleared and my gnome-terminal settings were lost. I cannot say whether the files were zero bytes; maybe e2fsck had corrupted some files.

The updates that had taken place did not include compiz or gnome-terminal, so at this point I cannot see a connection between the updates and the lost information.

Ok, thunderbird settings are gone, too. So this seems ext4 related?

pablomme (pablomme) wrote :

@Christian: I understand that your system hung _before_ you rebooted into the updated kernel? If so, the changes wouldn't have taken effect, and the data loss was caused by the original kernel.

Tim Gardner (timg-tpi) wrote :

This issue should be fixed with 2.6.28-7.18. I cherry-picked a number of patches that Ted Ts'o is submitting for stable kernel updates, which he says fix this data loss problem. Please confirm.

Changed in linux:
assignee: apw → timg-tpi
status: Incomplete → Fix Released

pablomme wrote:
> @Christian: I understand that your system hung _before_ you rebooted
> into the updated kernel? If so, the changes wouldn't have taken effect,
> and the data loss was caused by the original kernel.

Well, there has not been any kernel update here so far. I have been on 2.6.28-7.20 for several days now. The problem I spoke about happened today, with that kernel.

So are the patches applied to this version, or to a later one which has not arrived here yet?

Otherwise, the problem persists.

Peter Clifton (pcjc2) wrote :

I'm using 2.6.28-8-generic, and a crash just zeroed out a _load_ of important files in my git repository, in which I'd recently rebased a patch series.

Not impressed (TM).

Oh well... anyway, I don't think this problem is fixed.

André Barmasse (barmassus) wrote :

For testing I installed ext4 together with Jaunty Alpha 4 as the standard root file system on my Sony Vaio. Since then I have had four hardlocks, two of them completely destroying my Gnome desktop. So far, this only happens within Gnome while upgrading the system with apt-get in a shell AND at the same time running and working with other programs (like Quod Libet, Firefox, Thunderbird, Bluefish etc.).

As for the Gnome desktop destructions, in one case apt-get unfortunately was just installing some xorg-server files and, in the other case, configuring the Gnome desktop when the hardlock happened. The sad part is that I didn't find a way to repair the broken system with apt-get, dpkg or aptitude, as the size of some needed configuration files was set to zero by the crash. So, for now I am switching back to ext3, releasing this warning:

DON'T DO ANYTHING WHILE UPGRADING UBUNTU WITH EXT4!

David Tomaschik (matir) wrote :

Looks like the data loss bug may still exist. Setting back to confirmed.

Changed in linux:
status: Fix Released → Confirmed
pablomme (pablomme) wrote :

I think this bug is in desperate need of a systematic test, so I've attached a script which attempts to do precisely that. You need to run the script like this:

 ./write_stuff <directory-under-ext4-volume>

The script will open 5 files (named 'file_<i>') under the specified directory and start appending one line per second to each of them, until you stop it with Ctrl-C. If the script is re-run, lines are appended to the previous contents.

If instead of stopping the script you turn off or reboot your computer by force (say with SysRq+B, or holding the power button), you would be reproducing the conditions under which the bug seems to occur.

My / partition is ext4 (but not my /home, so I haven't suffered this bug as much as others have). Running the script on '/test' without any initial files and rebooting with SysRq+B gave:

 - rebooting in 30 seconds resulted in all 5 files zeroed
 - rebooting in 45 seconds resulted in 4 files having 40 lines and one having 41
 - rebooting in 60 seconds resulted in 4 files having 55 lines and one having 56

I would think that the first data flush on the initially-empty files takes too long to occur. This would explain the problems other people are having if the configuration files they mention are deleted and rewritten from scratch, and the system crashes before the first flush. Or maybe I'm completely wrong in my interpretation, so go ahead and do your own tests.

Hope this helps!
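The attached script itself isn't reproduced in this thread, but its described behaviour (5 files, one line appended per second, stop with Ctrl-C) can be sketched in Python; the function name and parameters below are my own reconstruction, not the original script.

```python
import os
import time

def write_stuff(directory, n_files=5, interval=1.0, max_lines=None):
    # Append one line per interval to file_1 .. file_<n_files> in
    # directory, until Ctrl-C (or max_lines, added here for testing).
    files = [open(os.path.join(directory, "file_%d" % i), "a")
             for i in range(1, n_files + 1)]
    written = 0
    try:
        while max_lines is None or written < max_lines:
            for f in files:
                f.write("line %d at %.0f\n" % (written + 1, time.time()))
                f.flush()  # reaches the page cache only, not the disk
            written += 1
            time.sleep(interval)
    except KeyboardInterrupt:
        pass
    finally:
        for f in files:
            f.close()
    return written
```

Because nothing here calls fsync(), everything written in the last writeback window is exactly what a forced power-off throws away.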

I never had any trouble, but I installed Jaunty last week, after the fix
was released. Is it possible this bug now only affects those that
installed it prior to the fix? When did you guys install it?

I've moved over 100GB of files since I installed it. I had at least two
hard crashes, everything seems to be intact.


pablomme (pablomme) wrote :

@Jeremy: I've never had a problem either, but I haven't had any crashes at all so this bug hasn't had a chance to show up. However my script above does reproduce the problem - have you tried it?

I installed Jaunty Alpha 4 on February the 6th. I would suppose that this is equivalent to what you've done, since you only get the updates after having installed the system. (Alpha 5 is not out yet, is it?)

I've not experienced the bug, but I've not had a chance to try
the script yet.

I was wondering if anyone knows of a terminal command I could run that
would give me a list of all the files on my system that are 0 KB. As far
as I know, I've never experienced this, but then again it may have
happened to a file I don't use often. I don't care if I lose anything
since I have everything backed up but a command to list files that may
have been affected would be nice, if anyone knows of one.

pablomme (pablomme) wrote :

> I was wondering if anyone knows of a terminal command I could run that
> would give me a list of all the files on my system that are 0 KB.

There's

find / -type f -size 0

but there are very many files that have zero length under normal conditions, so it'll be very hard to tell if any file has been affected this way.

Wade Menard (wade-ezri) wrote :

find / -size 0b should be enough. Please keep further discussion not related to fixing this bug on a forum or mailing list.

Thank you. Please understand that my question was related to this bug,
as such a command will help me determine if this bug is affecting me,
then I could give more info that would help the fix.


There are a couple files that are 0b, so this bug is affecting me. Is
there any information I can provide to help the developers?


Michael Rooney (mrooney) wrote :

Jeremy, as pablomme said: "there are very many files that have zero length
under normal conditions, so it'll be very hard to tell if any file has been
affected this way."

Many people are reporting trashed gnome sessions so it should be fairly
obvious whether it is or not. A 0b file is definitely not indicative of
this.

The two files I have that are 0b are jpg images.


kubrentu (brentkubuntu) wrote :

Same data loss problem.

Installed Kubuntu Jaunty Alpha 4. ext4 as root / partition. Did all the updates.

$ uname -a
Linux andor 2.6.28-8-generic #26-Ubuntu SMP Wed Feb 25 04:28:54 UTC 2009 i686 GNU/Linux

I ran the "write_stuff" script and held down the power button after about 5-10 seconds.

brent@andor:~/test$ ls -l
total 4
-rw-r--r-- 1 brent brent 0 2009-02-26 13:38 file_1
-rw-r--r-- 1 brent brent 0 2009-02-26 13:38 file_2
-rw-r--r-- 1 brent brent 0 2009-02-26 13:38 file_3
-rw-r--r-- 1 brent brent 0 2009-02-26 13:38 file_4
-rw-r--r-- 1 brent brent 0 2009-02-26 13:38 file_5
-rw-r--r-- 1 brent brent 1411 2009-02-26 13:32 write_stuff

All 0B files.

I'm happy to try other tests that people may suggest.

Ack... had a power outage and ran into this one today too. Several configuration files from programs I was running ended up trashed. This also explains the corruption I've seen of my BOINC/SETI files when hard-rebooting in past weeks.

System: Linux mars 2.6.28-8-generic #26-Ubuntu SMP Wed Feb 25 04:27:53 UTC 2009 x86_64 GNU/Linux

I'm running RAID1 dmraid mirroring w/ an Asus Striker Formula II MB, in case it matters.

Christoph Korn (c-korn) wrote :

I can confirm this data loss in jaunty.
I installed all updates before trying the script in this comment:
https://bugs.launchpad.net/ubuntu/jaunty/+source/linux/+bug/317781/comments/29

I am testing jaunty in virtualbox.

I have put the script on the desktop.
I started it and turned the virtual machine off after
some seconds.

After reboot, all the files (including the script itself) are 0 bytes
and I cannot even open them with gedit.

Jared Biel (sylvester-0) wrote :

Hello all - I've experienced this problem also. I forked my sudoers file (didn't use visudo to edit it) so I had to boot into recovery mode to edit it. As soon as I ran visudo the system hard locked and had to be shut down. Upon next bootup /etc/sudoers was a 0B file.

I guess I'll try not to crash my system for now ;)

nyarnon (cabal) wrote :

Well, that's it anyway: we're trying to debug something that shouldn't happen in the first place :-) Can you imagine what the answer would be if you were on Windows :-)

"Always close the system through the Start menu and wait for the machine to shut down, even if it takes 24 hours and has a bluescreen."

Anyway, I tested for 0-byte files and just had a few in /proc and in /lib; here's the tail, for brevity:

/proc/31078/sessionid
/proc/31078/coredump_filter
/proc/31078/io
/lib/init/rw/.ramfs
/lib/modules/2.6.28-8-generic/volatile/.mounted

Nothing special.

I'm on a 64-bit Intel P4 HD; my other machine with Jaunty, a 32-bit Asus Eee PC 1000H, shows similar behaviour.

Conte Zero (contez) wrote :

Hello everyone,
I've had the same problem with Jaunty on 64-bit, kernel 2.6.28-8-generic x86_64, with everything else up to date.
The problem is, as for other people here, first of all the machine locking up completely (still image, mouse frozen too, no HD activity at all), and then ext4 data corruption in the form of wiped-out (0-byte) files (as pointed out, the ones that were open when it locked up, or closed briefly before).

The problem manifested itself with a partition converted from ext3 to ext4 and also on a newly formatted ext4 partition (both created and used with a plethora of kernel versions, all under Jaunty, including the latest one).

There's no identified action that triggers the lockup, but it happens quite often, at least once or twice a day (the machine is a home server, always powered on, with only / and swap on an ext4-formatted drive; data are on a RAID5 mdraid XFS-formatted set, which has never suffered any problem). It seems to be triggered by big file (or directory) transfers (tens of GB), usually when another process actively accesses the disk.
E.g.:
One of the most common cases is while transferring some GB from/to the ext4 disk to/from the XFS RAID set, when I try to do an apt-get update and upgrade.
It also happened while downloading Steam games in Wine (1.1.16) (after which a reg file was corrupted) and while watching an HD video, both on the ext4 partition.

Disabling trackerd doesn't resolve the problem, yet it seems to alleviate it partially by reducing the frequency of the crashes (not by much anyway; say from 2-3 a day on average to 1-2).

Let me know if I can be of any help, I'll be glad to test or provide what I can.

P.S.: By the way, for anyone with apt trouble after a crash during an upgrade: in my experience this is usually caused by the .preinst, .postinst, .prerm and .postrm scripts under /var/lib/dpkg/info. These should be executable scripts launched before or after installing or removing a given package, but they get zeroed and are unrecognized. The quick, trouble-prone solution is to delete the offending script; the right solution is to recover the script from another up-to-date Jaunty machine; the perfect solution... well, you probably have other files zeroed too... ready for a reinstall?

Theodore Ts'o (tytso) wrote :

So, I've been aware of this problem, and have been working on a solution, but since I'm not subscribed to this bug, I wasn't aware of the huge discussion going on here until Nullack prodded me and asked me to "take another look at bug 317781". The short answer is (a) yes, I'm aware of it, (b) there is a (partial) solution, (c) it's not yet in mainline, and as far as I know not in an Ubuntu kernel, but it is queued for integration at the next merge window, after 2.6.29 releases, and (d) this is really more of an application design problem than anything else. The patches in question are:

http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=3bf3342f394d72ed2ec7e77b5b39e1b50fad8284
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=6645f8c3bc3cdaa7de4aaa3d34d40c2e8e5f09ae
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=dbc85aa9f11d8c13c15527d43a3def8d7beffdc8

So, what is the problem? POSIX fundamentally says that what happens if the system is not shut down cleanly is undefined. If you want to force things to be stored on disk, you must use fsync() or fdatasync(). There may be performance problems with this, which is what happened with Firefox 3.0 [1] --- but that's why POSIX doesn't require that things be synced to disk as soon as the file is closed.

[1] http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/

So, why wasn't this a problem in the past? Well, ext3 by default has a commit interval of 5 seconds, and has data=ordered. What does this mean? Every 5 seconds, the ext3 journal is committed; this means that any changes made since the last commit are now guaranteed to survive an unclean shutdown. The journalling mode data=ordered means that only metadata is written in the journal, but data is ordered: before the commit takes place, any data blocks associated with inodes that are about to be committed in that transaction will be forced out to disk. This is primarily done for security reasons; if it were not done, then any newly allocated blocks might still contain previous data belonging to some other file or user, and after a crash, accessing that file might result in a user seeing someone else's mail or p0rn, and that's unacceptable from a security perspective.

However, this had the side effect of essentially guaranteeing that anything that had been written was on disk after 5 seconds. (This is somewhat modified if you are running on batteries and have enabled laptop mode, but we'll ignore that for the purposes of this discussion.) Since ext3 became the dominant filesystem for Linux, application writers and users have started depending on this, and so they become shocked and angry when their system locks up and they lose data --- even though POSIX never really made any such guarantee. (We could be snide and point out that they should have been shocked and angry about crappy proprietary, binary-only drivers that no one but the manufacturer can debug, or angry at themselves for not installing a UPS, but that's not helpful; expectations are expectations, and it's hard to get people to ...


Theodore Ts'o (tytso) wrote :

Oops, one typo in the above missive:

In the paragraph which begins "So the difference between 5 seconds and 60 seconds (the normal time if you're writing huge data sets)", the parenthetical comment should read, "(the normal time if you're NOT writing huge data sets)".

Sorry for any confusion this may cause.

>> Another solution is to make sure your system is reliable.

Not really an option, since I've seen huge data centers have UPS breakage or other circuit problems that take down all systems unexpectedly. Same for home servers, since you never know when your MB, CPU, PS, etc. is going to bite the dust.

I love the speed of Ext4, but if it's going to truncate files every time a non-redundant component dies, then I can't think of any situation where I would recommend it. It's too bad there's no mechanism between the FS and RAID such that the truncated files could be automagically recovered.

But let me be the first to thank you so much for your work investigating this and working toward a resolution. I honestly appreciate the hard work you devs are tackling.

Michael B. Trausch (mtrausch) wrote :

@3vi1:

There are better solutions for certain things like protecting against data center issues or home server problems. Myself, I keep redundant copies on multiple machines in multiple locations. For business, that's essential: database replication, file synchronization, etc.; for home servers, if you have friends who are willing to provide space for you on their machines (usually in exchange for you doing the same for them), you're covered in near-real-time and better equipped to deal with things. You may have to buy your friend a server if they don't already have one, but that's a very small investment to protect your data if you consider it important.

After all, you never know when your data is going to go up in flames or be slaughtered by a flood.

@Michael

Oh, no doubt what you say is true.

However, I don't envy anyone who has to comb through 1000 servers in a data center to figure out how many 0-byte files there are and then recover them.

But we digress. Let's drop the talk here and leave this space for issue/resolution discussion. If anyone wants to talk about more general points, feel free to e-mail me.

Theodore Ts'o (tytso) wrote :

@3vi1

If you really want to make sure the data is on disk, you have to use fsync() or fdatasync(). Even with ext3, if you crash at the wrong time, you will also lose data. So it's not the case with ext4 that "it's going to truncate files *every time* a non-redundant component dies". It's not *every time*. If you fdatasync() or fsync() the file, once the system call returns you know it will be safely on disk. With the patches, the blocks will be forcibly allocated in the case where you are replacing an existing file, so if you crash, you'll either get the old version (if the commit didn't make it) or the new version (if the commit did make it). If you really care, you could write a program which runs sync() every 5 seconds, or even every 1 second. Your performance will be completely trashed, but that's the way things break.

Or you can be smart about how you write your application, so you fsync() or fdatasync() at critical points, and you get a good tradeoff between performance and data being reliably written to disk; very often it's not necessary that data always be written to disk at that very instant, just under the right controlled circumstances. And if it's too much of a performance hit, then you can be more clever about how you write your application, or you can make your system more redundant. There's an old saying: "fast", "good", "cheap". Choose two. In this particular case, replace "good" with "reliable", and that's the fundamental tradeoff. With the patches, we are as close as possible to ext3 for the common workloads of crappy desktop applications that like to constantly rewrite hundreds of dotfiles. Editors like emacs already call fsync() when they save a file. And Mail Transfer Agents also call fsync() before they return the SMTP code meaning, "I've got the e-mail, now it's my responsibility". So most programs do the right thing. But if you want to make sure every single write is guaranteed to make it onto disk the moment the program writes it, change your application to open the file with O_SYNC. Performance will be crap, but that's the tradeoff.
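The O_SYNC option mentioned above looks like this in Python's os wrappers (a sketch; the filename is made up, and O_SYNC requires a POSIX system):

```python
import os

# With O_SYNC, each write() returns only after the data has reached
# stable storage, so a crash cannot zero what was already written --
# at a heavy cost in throughput.
fd = os.open("critical.log",
             os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_SYNC,
             0o600)
try:
    os.write(fd, b"durable before write() returns\n")
finally:
    os.close(fd)
```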

Theodore Ts'o (tytso) wrote :

@3vil,

One more thought. You say "you never know when your MB, CPU, PS" may bite the dust. Sure, but you also never know when your RAID controller will bite the dust and start writing data blocks whenever it's supposed to be reading from the RAID (yes, we had an Octel voice mailbox server fail in just that way at MIT once). And you never know when a hard drive will fail. So if you have those sorts of very high reliability requirements, then you will probably be disappointed with any commodity hardware solution. I can direct you to an IBM salesperson who will be very happy to sell you an IBM mainframe, however.

At the end of the day, the best we can do about surviving unplanned crashes in the absence of explicit fsync() requests is best efforts. This is true for all file systems, although it is true that slower file systems may be more robust. The patches are the best I can do without completely sacrificing performance; but hey, if they're not good enough for you, you're free to keep using ext3.

@Theo

I respect your knowledge in this area and most certainly don't expect every single write to be synced to disk instantaneously, but I also don't expect commonly used apps to have all their files truncated every time a system goes down hard. If the lack of commit left them in their previous, non-truncated non-updated state, it would be a lot more livable.

I'm not a Linux programming guru by any measure, so I don't understand what is probably entirely obvious to you: Why the truncation seems to be committed independent of the other (not committed before the hard-reset) changes. It seems like you enterprising devs would find a way to cache the truncation of the file so that it's not done until the next change is committed.

BTW: no thanks on the IBM mainframe - I got tired of programming IBM mainframe/minicomputers 10 years ago... if the FS code was in RPG, I'd be able to contribute some help, LOL. :)

Bill Smith (bsmith1051) wrote :

@Theo
So files that are being updated are immediately marked on-disk as zero-byte but then the actual replacement write is delayed? Doesn't that guarantee data-loss in the event of a crash? Or, have I massively misunderstood your explanation for how this is all the fault of those 'crappy' System programmers?

Theodore Ts'o (tytso) wrote :

OK, so let me explain what's going on a bit more explicitly. There are application programmers who are rewriting application files like this:

1.a) open and read file ~/.kde/foo/bar/baz
1.b) fd = open("~/.kde/foo/bar/baz", O_WRONLY|O_TRUNC|O_CREAT) --- this truncates the file
1.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
1.d) close(fd)

Slightly more sophisticated application writers will do this:

2.a) open and read file ~/.kde/foo/bar/baz
2.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
2.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
2.d) close(fd)
2.e) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")

What emacs (and very sophisticated, careful application writers) will do is this:

3.a) open and read file ~/.kde/foo/bar/baz
3.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
3.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
3.d) fsync(fd) --- and check the error return from the fsync
3.e) close(fd)
3.f) rename("~/.kde/foo/bar/baz", "~/.kde/foo/bar/baz~") --- this is optional
3.g) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")

The fact that series (1) and (2) work at all is an accident. Ext3 in its default configuration happens to have the property that 5 seconds after (1) and (2) complete, the data is safely on disk. (3) is the ***only*** sequence which is guaranteed not to lose data. For example, if you are using laptop mode, the 5 seconds is extended to 30 seconds.

Now the one downside with (3) is that fsync() is a heavyweight operation. If your application is stupid, and has hundreds of dot files in your home directory, each one taking up a 4k disk block even though it is only storing 4 to 12 bytes of data in each singleton dot file, and you have to repeat (3) for each of your one hundred dot files --- and worse yet, your application for some stupid, unknown reason is writing all of these hundred+ dot files every few seconds --- then (3) will be very painful. But it is painful because the application is stupidly written, not because of any fundamental filesystem fault. It's like if you had a robot which was delivering mail to mail box numbers 1, 2, 3, 4, 5, and crossing the street for each mail box; on a busy road, this is unsafe, and the robot was getting run over when it kept on jaywalking --- so you can tell the robot to only cross at crosswalks, when the "walk" light is on, which is safe, but slow --- OR, you could rewrite the robot's algorithms so it delivers the mail more intelligently (i.e., do one side of the street, then cross safely at the crosswalk, and then do the other side of the street).

Is that clear? The file system is not "truncating" files. The application is truncating the files, or is constantly overwriting the files using the rename system call. This is a fundamentally unsafe thing to do, and ext3 just happened to paper things over. But *both* XFS and ext4 do delayed allocation, which means that data blocks don't get allocated right away, and they don't get written right away. Btrfs will be doing delayed allocation as well; all modern filesystems will d...


@Theo

Thank you very much for taking the time to write that very illuminating explanation. I can now see why you characterize this as a problem with the way the applications are written.

That said, I'm kind of skeptical that every app programmer will get to the same mental place and do things the "right" way. Though I promise to try, now knowing what I know. :)

Is there no way to cache any file open call that uses O_TRUNC such that the truncation never happens until the file is committed? I'm sure there are considerations that are not immediately obvious to me.

Theodore Ts'o (tytso) wrote :

>Is there no way to cache any file open call that uses O_TRUNC such that the truncation never
>happens until the file is committed? I'm sure there are considerations that are not immediately
>obvious to me.

In practice, not with 100% reliability. A program could truncate the file, then wait two hours, and then later write data and close the file. We can't hold a transaction open for hours. But in practice, what we can do is remember that the file descriptor was truncated, either when the file was originally opened with the O_TRUNC flag, or truncated to zero via the ftruncate() system call; if so, when the file is closed, we can force the blocks to be allocated right away. That way, when the journal commits, the files are forced out to disk right away. This causes ext4 to effectively have the same results as ext3 in the case where the application does the open w/ O_TRUNC, writes the new file contents, and closes in quick succession. And in fact that is what the patches I referred to earlier do:

http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=3bf3342f394d72ed2ec7e77b5b39e1b50fad8284
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=6645f8c3bc3cdaa7de4aaa3d34d40c2e8e5f09ae
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=dbc85aa9f11d8c13c15527d43a3def8d7beffdc8

These patches cause ext4 to behave much the same way ext3 would in the case of series (1) and (2) that I had described above. That is, where the file is opened via O_TRUNC, and where the file is written as foo.new and then renamed from foo.new to foo, overwriting the old foo in the process.

It is not fool-proof, but then again ext3 was never fool-proof, either.

It also won't help you if you are writing a new file (instead of replacing an existing one). But the creation of new files isn't *that* common, and people tend to be most annoyed when an existing file disappears. Again, the 30 to 150 second delay before files are written out isn't really that bad. The chance that a newly written file will disappear, or appear to be "truncated to zero" (in reality, it was simply never allocated), is actually not that high --- after all, how often do machines crash? It's not like they are constantly crashing *all* the time. (This is why I really object to the characterization of needing to check "thousands of servers in a data center for zero-length files"; if "thousands of servers" are crashing unexpectedly, something is really wrong, and the sysadmins would be screaming bloody murder. Unexpected crashes are ***not*** the common case.)

The reason why people were noticing this on desktops is because crappy applications are constantly rewriting various config files, in some cases every minute or worse yet, every few seconds. This causes a massive write load, which destroys battery life, and consumes write cycles on SSD. These are badly written applications, which hopefully are not the common case. If the application is only writing its state files to dot files in the user's home directory every hour or two, what are the odds that you will get unlucky and have the crash happen in less than 30 s...


Dustin Kirkland  (kirkland) wrote :

I've seen this happen twice now, both times in the ext4 filesystem beneath my encrypted home directory using ecryptfs.

Tyler Hicks and I spent about 4 or 5 hours trying to track down the bug in ecryptfs.

In my dmesg, I was seeing the following (recording here in case other ecryptfs users experience the same problem):

    Valid eCryptfs headers not found in file header region or xattr region

The file was an ecryptfs file, with an encrypted filename. However, there is no such thing as 0-byte files in ecryptfs. Even empty files are padded and encrypted. When ecryptfs encounters (tries to read or write) an empty file, it cannot find any valid headers (as the dmesg says), which means that it can't decode the fek with the fnek, and <barf>.

I was able to track these files down in the underlying encrypted directory, and clean them out with the following:

 $ umount.ecryptfs_private
 $ cd $HOME/.Private
 $ mount.ecryptfs_private
 $ find . -size 0c | xargs -i rm -f {}

I'm going to mark this bug as "affecting" ecryptfs-utils, but mark it "invalid", such that search results for users with the same problem might find themselves here.

Ted, thanks for the detailed explanations.

:-Dustin

Changed in ecryptfs-utils:
importance: Undecided → High
status: New → Invalid
nyarnon (cabal) wrote :

@theo

One thing is not clear to me, Theo. Is it true that a larger memory cache, being faster, might also prevent a lot of problems? I've got 4 GB on my box; might that be an explanation for why I'm less likely to encounter problems?

@Theo:

Ahh, that makes sense. Thanks for the addition info and the time you've spent on the patches to minimize the impact.

I, and several of the other people here, are probably just feeling more pain than the average data center ever would see; A lot of us are alpha-testing an alpha operating system, and other are running daily SVNs of Wine with proprietary video drivers.... so crashes and hard resets aren't really unusual at this point. :)

Unfortunately for me, they'll probably never be strangers - because I like to test bleeding-edge stuff. Hopefully your patches make the truncation less evident for me. If not, I'll take your suggestion and go back to Ext3 or some other FS that's better suited to what I'm doing.

I think I'll take what I've learned here, do some additional testing, and suggest that the people working on BOINC change the way they rewrite their client_state files - because the way they're doing it now makes both the current *and* the _prev version prone to truncation and hours of lost work.

Theodore Ts'o (tytso) wrote :

>One thing is not clear to me theo, I just expect it to. Is it true that memory
>cache, being faster, might also prevent a lot of problems? I got 4Gb on my
>box, might that be an explanation that I'm less likely to encounter problems?

@nyarnon,

No, having more or less memory won't make a difference; the issue here isn't a matter of being faster or slower, it's about how quickly data gets pushed out to disk. Normally the VM subsystem only starts writing data to disk after 30 seconds have gone by, for performance reasons --- in case the file gets modified a second time, there's no point writing it once, and then a second time 30 seconds later. How long data may stay dirty is controlled by /proc/sys/vm/dirty_expire_centisecs, and how often you scan is controlled by /proc/sys/vm/dirty_writeback_centisecs. The defaults for these two values are 3000 (30 seconds) and 500 (5 seconds), respectively. These values can be adjusted by laptop_mode (to be higher), and in fact I've been playing with having these values much higher (60 minutes and 1 minute), mainly because I wanted to try to reduce writes to an SSD.

What happened is that ext3 in ordered data mode forced *all* blocks that were dirty to be pushed out during the next journal commit (which takes place by default every 5 seconds, more if you have laptop mode turned on). So something else you can do is to simply adjust these values down, in particular the dirty_expire_centisecs, to be something like 300. I wouldn't do this on a system where I was going to be running ext4 benchmarks, but it will automatically sync things out more aggressively.

nyarnon (cabal) wrote :

@theo

Thanks, that makes things much clearer, and I will keep an eye on those values. I don't like SSDs, for obvious reasons, so creating fewer writes isn't an issue for me; I'd rather stick to good ol' (new SATA :-) hard drives, they have proven their value :-) I was just wondering because of the statement you made about dirty pages in cache slowing things down. That's how I figured that a memory cache would probably be faster than a disk cache.

Theodore Ts'o (tytso) wrote :

@3vi1,

At least for files which are constantly being replaced, the patches which are queued to be merged into mainline at the next merge window should cause ext4 to work no worse than ext3, and that seems to be what most people were complaining about on this bug. Most editors, like vi and emacs, are actually pretty good about calling fsync() when you save a file. I just checked OpenOffice, and it uses fsync() when it is done saving a file as well. So that tends to take care of most of the other common cases. (Things like object files in a build directory generally aren't considered precious, since you can always do a "make clean; make" to regenerate them.)

Personally, I test bleeding edge kernels all the time, and I've never had a problem simply because I usually know before I'm about to do something that might be dangerous, and so I just use the "sync" command in a terminal beforehand.

The other thing that probably helps is that I also avoid hardware that requires proprietary video drivers like the plague. Why settle for machines that crash all the time? There are enough hardware options out there that don't require proprietary video drivers, I don't understand why folks would even consider buying cards that need binary-only video drivers. There's a reason why kernel developers refuse to debug kernels that have been tainted by binary drivers; at least a few years ago, Nvidia drivers had enough wild pointer errors that would randomly corrupt kernel memory and cause hard-to-debug kernel oops and panics in completely unrelated subsystems that they were pretty much single-handedly responsible for the kernel "taint" flag infrastructure; kernel developers were wasting too much time trying to debug buggy binary-only video-drivers.

Finally, if you are really paranoid, you can mount your filesystem with the "sync" option; this works for all filesystems, and will force writes out to disk as soon as they are issued --- you can also toggle this on and off by remounting the filesystem. i.e., "mount -o remount,sync /mntpt" and "mount -o remount,async /mntpt". This will work for any filesystem, as "sync" and "async" are generic mount option flags.

Kai Mast (kai-mast) wrote :

@Theodore,

The whole explanation seems logical to me, but I had data loss with OpenOffice too, so maybe it is not using fsync(), or not using it properly?

Olli Salonen (daou) wrote :

I'm experiencing something similar, but with a twist.

After anywhere from a few hours to two days of use, my ext4 /home partition suddenly becomes read-only. I usually close my session, unmount home, run fsck on the partition, and remount. Occasionally it leads to data loss, depending on what I was doing.

I'm currently on 2.6.28-6 because the last upgrade attempt led to an unbootable system (also running ext4 on the root partition), so I don't know if this is a fixed issue.

Theodore Ts'o (tytso) wrote :

@Kai,

Well, I straced the OpenOffice 1:2.4.1-11ubuntu2.1 that ships with Ubuntu Jaunty, and I can tell you it calls fsync() after it finishes writing out the files it is saving. When you say data loss, was this with a file you had just saved using OpenOffice? Details are important in understanding what may have happened.

Theodore Ts'o (tytso) wrote :

@Olli,

If the filesystem has gone read-only, then it means that the kernel has detected filesystem corruption of some kind. Use dmesg to try to get the kernel logs, or you can go through /var/log/messages for older kernel messages from previous boot sessions. Feel free to file a separate bug report for such bugs (this bug is getting pretty long, and what you are describing is a distinctly separate issue).

I will warn you that there are a very large number of filesystem corruption bugs which we have fixed since 2.6.26, and in fact at this point we are only doing backports to 2.6.27 (and there are patches queued up for 2.6.27 that are waiting for Greg K-H to do another stable kernel series release). If you are willing to compile a vanilla 2.6.29-rc7 kernel, you will probably have the best luck (and the best performance). Which, by the way, is another reason for not using proprietary binary-only kernel modules; they very often aren't available for the latest bleeding-edge kernel.

I understand that some people are hesitant about putting pre-release kernels on stable systems --- but quite frankly, back in the 2.6.26 and 2.6.27 days we were warning people that ext4 was still being stabilized, and to think twice before putting it on production systems. Even for people putting it on their laptops, there was always a "we who are about to die salute you" attitude; early testing is critical, since that's how we get our bug reports so we can fix bugs, and people who tested early ext4 versions did us and the Linux community a huge service by reporting bugs that I wasn't seeing given my usage patterns. (For example, one bug was much more likely to show up if you were using BitTorrent, and I'm not a big BitTorrent user.) Of course, once the bugs are fixed it's important to get folks moved up to newer kernels, which can sometimes be hard for them.

I really wish Ubuntu had a "kernel of the week" or which provided the latest development kernel pre-packaged up, much like Fedora has. It would make it a lot easier to recommend that people try a newer kernel package.

Tom Jaeger (thjaeger) wrote :

It's easy to lay the blame on "crappy" applications, but the fact of the matter is that it is really the interface that sucks here. Using the notation of comment #54, what applications really want is to execute (1) atomically. Then it shouldn't matter whether they rewrite their files every minute or every second, because the file system could still decide when to actually commit this to disk. I always assumed, naively perhaps, that (2) did essentially that, though that is of course not the case if the file system decides to commit the result of (2e) before (2c) and (2d).

So what is an application that rewrites a file (possibly not as the result of direct user action) supposed to do? You suggest (3), but I can see a few drawbacks:
* If the file is overwritten twice in short succession, you'll get gratuitous disk writes. This is the least of the problems and can be worked around inside the app by using a timer.
* fsync is expensive. If your application can't afford to make a system call that can potentially block for on the order of a second, you're going to have to offload fsync to a separate thread. Not impossible, but not entirely trivial either.
* Calling fsync on a C++ ofstream looks like a major pita.

Tom Jaeger (thjaeger) wrote :

Packaged upstream rc kernel builds are available here:

https://wiki.ubuntu.com/KernelMainlineBuilds

@Theo,

>> I don't understand why folks would even consider buying cards that need binary-only video drivers.

The same reason they use Ext4: performance (haha <ducking>). I'm not aware of any open 3D drivers that perform even relatively close... *yet*. But I'll gladly switch horses when there are.

Thanks again for the code and the time you took to give us this information. I look forward to testing Ext4 further in future kernels.

Jisakiel (jisakiel) wrote :

Just for the record, I'm guessing over here that some of the data loss might be avoided by making the four-finger salute (as in Ctrl-Alt-SysRq-S) to force the kernel to sync instead of hard resetting when hung --- although it depends a lot on how hung it is; with the shitty nvidia 8400 I use, it usually works... In fact I use SUB (sync-unmount-reboot) in those cases.

Dustin Kirkland  (kirkland) wrote :

On Sat, Mar 7, 2009 at 12:03 PM, Theodore Ts'o <email address hidden> wrote:
> I really wish Ubuntu had a "kernel of the week" or which provided the
> latest development kernel pre-packaged up, much like Fedora has.  It
> would make it a lot easier to recommend that people try a newer kernel
> package.

@Ted-

We do!

http://kernel.ubuntu.com/~kernel-ppa/mainline/

:-Dustin

Danny Daemonic (dannydaemonic) wrote :

If POSIX is undefined in this regard then we should also be opening bug reports with the developers who have this problem.

I see people here complaining about GNOME, I looked through their bugs and couldn't find any one reporting this. For people who would want to run with the special ext4 mount option that would speed up their system, it might be a good idea to open a bug.

Following that train of thought, Theo says KDE has the same problem, although no one here has mentioned it. I use KDE and if this is a design flaw I'd like it fixed so ext4 can be mounted with the faster mount option. I searched through the KDE bugs and only found people complaining that KDE fdatasyncs too often (preventing their laptop hard drive from spinning down). I also ran across closed reports of people saying it fsyncs too often. Has anyone seen the 0 byte truncation issue with KDE, or is Theo just assuming since GNOME does it, KDE must also?

@Danny:

As it is a conflict between the way some apps are (unreliably) rewriting files and the way Ext4 needs them to do it (only so far as reliability is concerned), yes: it affects all desktops equally.

I'm running KDE, for the record.

> OK, so let me explain what's going on a bit more explicitly. There are application programmers who are rewriting application files like this:
>
> 1.a) open and read file ~/.kde/foo/bar/baz
> 1.b) fd = open("~/.kde/foo/bar/baz", O_WRONLY|O_TRUNC|O_CREAT) --- this truncates the file
> 1.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
> 1.d) close(fd)
>
> Slightly more sophisticated application writers will do this:
>
> 2.a) open and read file ~/.kde/foo/bar/baz
> 2.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
> 2.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
> 2.d) close(fd)
> 2.e) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")
>
> What emacs (and very sophisticated, careful application writers) will do is this:
>
> 3.a) open and read file ~/.kde/foo/bar/baz
> 3.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
> 3.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
> 3.d) fsync(fd) --- and check the error return from the fsync
> 3.e) close(fd)
> 3.f) rename("~/.kde/foo/bar/baz", "~/.kde/foo/bar/baz~") --- this is optional
> 3.g) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")
>
> The fact that series (1) and (2) works at all is an accident. Ext3 in its default configuration happens to have the property that 5 seconds after (1) and (2) completes, the data is safely on disk. (3) is the ***only*** thing which is guaranteed not to lose data. For example, if you are using laptop mode, the 5 seconds is extended to 30 seconds.

Variant (1) is unsafe by design: data can be lost due to a software failure. But variant (2) is correct. Both the application developer and ext3 assume the following logic behind the scenes:

2.a) open and read file ~/.kde/foo/bar/baz
2.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)

transaction_start(fd); // Hidden logic

2.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
2.d) close(fd)
2.e) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")

transaction_finish(fd); // Hidden logic

While ext4 and XFS assume the following logic:

2.a) open and read file ~/.kde/foo/bar/baz
2.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
2.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
2.d) close(fd)

transaction_start(); // Hidden logic

2.e) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")

transaction_finish(); // Hidden logic

Because of that, such problems might happen in many other areas. It cannot be fixed easily just by adding a call to fsync(fd) (which is not available in every programming language, BTW).

IMHO, ext4 should respect these hidden transactions. I.e., it should not reorder file and filesystem operations that come from the same process.

dunnow (thiofrou) wrote :

>such as Americans [...] getting shocked and angry when gasoline hit $4/gallon

They're one to talk. At its peak in Q2 2008, gasoline was something like $9/gal in, what was it, Germany...

Theodore Ts'o (tytso) wrote :

@Volodymyr M. Lisivka,

You can opine all you want, but the problem is that POSIX does not specify anything about "hidden transactions", and certainly does not make any guarantees like this. As I said, most modern file systems are doing delayed allocation for speed reasons, so you can expect this to become the norm. The patch which is going into 2.6.30 will do this by default when you are replacing files, mostly because I know most application programmers are going to continue to rely on this behavior. However, it's a bad idea to do so.

If you really care about making sure something is on disk, you have to use fsync() or fdatasync(). If you are worried about the performance overhead of fsync(), fdatasync() is much less heavyweight, provided you can arrange to make sure that the size of the file doesn't change often. You can do that via a binary database that is grown in chunks and rarely truncated.

I'll note that I use the GNOME desktop (which means the gnome panel, but I'm not a very major desktop user), and "find .[a-zA-Z]* -mtime 0" doesn't show a large number of files. I'm guessing it's certain badly written applications which are creating the "hundreds of dot files" that people are reporting become zero length, and if they are seeing it happen a lot, it must be because the dot files are getting updated very frequently. I don't know what the bad applications are, but the people who complained about large numbers of state files disappearing should check into which applications were involved, and try to figure out how often they are getting modified. As I said, if a large number of files are getting frequently modified, it's going to be bad for SSDs as well, so there are multiple reasons to fix badly written applications, even if 2.6.30 will have a fix for the most common cases. (Although some server folks may mount with a flag to disable it, since it will cost performance.)

Michael Rooney (mrooney) wrote :

Theodore, do you think we should try to be proactive about this and
encourage people to file bugs against applications which do this, and
tagging them with a specific tag for this?

Theodore Ts'o (tytso) wrote :

@Michael,

The trick is being able to determine which applications are being too aggressive with writing files. Section 5 in the laptop-mode's FAQ: http://samwel.tk/laptop_mode/faq has some good tips about how to detect which applications are writing indiscriminately to the disk. However, some amount of work will be necessary to determine how much writing the applications are doing, and whether it is "justified" or "unjustified". What we really need is a powertop-like program that tracks write and inode activity so that application writers can be shamed into fixing their applications.

"echo 1 > /proc/sys/vm/block_dump" works, but unfortunately you have to shut down sysklogd first, and it doesn't summarize information very well. Ideally we would also track fsync vs. fdatasync calls. Creating a custom ftrace module in the kernel plus a userspace application, and then promulgating it as the next step in trying to reduce battery usage and promote SSD-friendly applications, is probably what we need to do.

My concern with encouraging people to file bugs against applications is whether or not people will accurately file j'accuse! statements against the correct applications. If there are too many false positives and/or false negatives, it might end up being counterproductive. The advantage of creating a powertop-like tool is that once you have something which can be measured, application authors have something they can optimize against --- and as the old saying goes, you get what you optimize for.

> The patch which is going into 2.6.30 will do this, and by default, when you are replacing files, mostly because I know most application programmers are going to continue to rely on this.

Nice to hear. Thank you. :-)

Brian Rogers (brian-rogers) wrote :

@Theo:

Something is bothering me... I like laptop mode's ability to buffer writes to memory while keeping the hard drive spun down. I have 4 GB of RAM and wrote a script to cache a ton of system and user files to memory so I can start programs, then edit and save files with the disk remaining off most of the time.

Are you saying there's no safe way to save over a file without spinning up the disk immediately? And that every file editor should call fsync when saving? I don't want a spin-up to occur every time I save a file. There's already a limit set for how long my data can be held in memory before being written, so I'm not worried about losing ten minutes of work in the rare instance of a particularly bad crash where I can't sync before rebooting. But I am worried about the possibility of losing the entire file.

So I want the update of the file to be safe, but I don't want to throw away the benefits of laptop mode to obtain that safety. Ext3 has never failed me in this regard and given me a zero-byte file, even though I've allowed it to hold changes in memory for long intervals.

I see one of your patches forces the file data to be flushed out alongside a rename when a file is replaced. Would the opposite be feasible? That is, instead of flushing the file data earlier, hold off on committing the rename until everything the file contained at the time of the rename is flushed to disk. Hopefully doing it that way could retain the performance of delayed allocation. Programs already use the write-and-rename pattern to atomically create or replace files, and it'd be nice if this atomicity was preserved even in the event of a crash.

Theodore Ts'o (tytso) wrote :

@Brian,

We can't hold off the rename but not other file system activities. What you can do is simply not save files to disk in your editor until you are ready to save them all --- or, you can extend the commit time to longer than 5 seconds; laptop mode extends the commit time to 30 seconds, if I recall correctly.

In practice, note that ext3 generally ended up spinning up the disk anyway when you saved out the file, given that (a) it would need to read in the bitmap blocks to do the non-delayed allocation, and (b) it would end up spinning up the disk 5-30 seconds later when the commit timer went off.

The current set of ext4 patches queued for 2.6.29 does force the data blocks out right away, as opposed to merely allocating the data blocks, and not actually flushing the data blocks out until the commit. The reason for this was simply lack of time on my part to create a patch that does things right, which would be a much more complicated thing to do. Quoting from the patch:

+ /*
+ * We do something simple for now. The filemap_flush() will
+ * also start triggering a write of the data blocks, which is
+ * not strictly speaking necessary (and for users of
+ * laptop_mode, not even desirable). However, to do otherwise
+ * would require replicating code paths in:
+ *
+ * ext4_da_writepages() ->
+ * write_cache_pages() ---> (via passed in callback function)
+ * __mpage_da_writepage() -->
+ * mpage_add_bh_to_extent()
+ * mpage_da_map_blocks()
+ *
+ * The problem is that write_cache_pages(), located in
+ * mm/page-writeback.c, marks pages clean in preparation for
+ * doing I/O, which is not desirable if we're not planning on
+ * doing I/O at all.
+ *
+ * We could call write_cache_pages(), and then redirty all of
+ * the pages by calling redirty_page_for_writeback() but that
+ * would be ugly in the extreme. So instead we would need to
+ * replicate parts of the code in the above functions,
+ * simplifying them because we wouldn't actually intend to
+ * write out the pages, but rather only collect contiguous
+ * logical block extents, call the multi-block allocator, and
+ * then update the buffer heads with the block allocations.
+ *
+ * For now, though, we'll cheat by calling filemap_flush(),
+ * which will map the blocks, and start the I/O, but not
+ * actually wait for the I/O to complete.
+ */

It's on my todo list to get this right, but given that I was getting enough complaints from users about losing dot files, I figured that it was better to get the patch in.

And again, let me stress that the window was never more than 30-60 seconds, and people who were paranoid could always manually use the sync command. The fact that so many people are complaining is what makes me deeply suspicious that there may be some faulty applications out there which are constantly rewriting existing files regularly enough that people are seeing this --- either that, or the crappy proprietary drivers are much more crash-prone than I thought, and people are used to Linux machines crashing all the time --- both of which are very bad, and very unfortunate. Hopefully neither is tru...


Carey Underwood (cwillu) wrote :

@Theodore

iotop (in the repos already) may or may not be useful for this.

Theodore Ts'o (tytso) wrote :

@Brian,

One other thought... emacs (and all other competently implemented editors) will use fsync() and *should* use fsync because for networked filesystems, fsync() may be the only way that the editor will know whether or not a file will be written to stable storage. For example, if AFS returns a quota error, or the NFS server has disappeared because of a network outage, the OS may not try to contact the fileserver when calling write(2), and perhaps not even when close(2) is called. The only way to be certain of receiving error return codes from file systems is to call fsync() on the file, before it is closed. Given the semantics of fsync(), that will wake up the hard drive, and there's not much that can be done about that.

If you really don't like that, not saving your buffers will have the same net effect, and the same downside risks (namely, that after a crash, you'll lose data that hasn't been safely written to disk). Of course, I use carefully selected hardware (an X61s with integrated graphics) and I've only rarely had crashes that would have lost me data --- and I can count the times when I've lost data due to delayed allocation on the fingers of one hand --- but again, I use the "sync" command before I do something which I think might trigger a kernel crash, and this has largely never been a problem for me.

Michael Rooney (mrooney) wrote :

On Tue, Mar 10, 2009 at 10:21 PM, Theodore Ts'o <email address hidden> wrote:
> The fact that so many people are complaining is what
> makes me deeply suspicious that there may be some faulty applications
> out there which are constantly rewriting existing files regularly
> enough that people are seeing this --- either that, or the crappy
> proprietary drivers are much more crash-prone than I thought, and people
> are used to Linux machines crashing all the time

@Theodore: Keep in mind the context of many of these reports, at least
here: an alpha OS (Jaunty) and not-well-tested, newly pulled-in
versions of proprietary drivers. I wouldn't say Linux users are used
to crashes all the time, but on an alpha with new proprietary drivers,
kernel panics aren't THAT rare until things are worked out. New
upstream application versions could also be doing silly things; they
haven't been thoroughly tested yet, and that's the point of testing
them :) Just thought this reminder might help; it could be easy to
overlook if you are using EXT4 in a more stable environment. Having
said that, Alpha 5 has been awesome for me and I am now using it full
time with EXT4.

> Hopefully neither is true, but in that case, the chances of a file getting replaced by a zero-length file are very small indeed.

I expect that Ext4 performs much better than Ext3 and will save me about 1 minute per day on average (6 hours per year - about 1 additional working day), which is very good.

On the other hand, I can lose a few hours per data-corruption incident, which is comparable. My notebook has some problems resuming from hibernate (no proprietary code on my notebook at all), so unexpected hangups a few minutes after resume are common for me (2-3 per month), thus I could lose a few days per year with Ext4 - about a working week.

Formula to calculate the actual benefit: benefit_value*benefit_probability - loss_value*loss_probability [ - loss_value*loss_probability ]..., where benefit_probability is the probability of not hitting a failure.

For Ext3: benefit_value is zero compared to Ext4 (Ext3 is slower than Ext4), but loss_value is small too - about 1 minute per failure.

Ext3_benefit = 0*(1-k) - 1m*k; where k is probability of failure per working day;
Ext4_benefit = 1m*(1-k) - 2h*k;

If you see failures less than twice a year, then Ext4 is better for you. If you see failures more than twice a year, then Ext3 is better.

> And again, I will note that XFS has been doing this all along, and other newer file systems will also be doing delayed allocation, and will be subject to the same pitfalls. Maybe they will also encode the same hacks to work around broken expectations, and people with crappy proprietary binary drivers. But folks really shouldn't be counting on this....

I, personally, have had a very bad experience with XFS. We used it on the linux.org.ua site and I spent a few days of my short life fixing corrupted files manually after a few power failures in the data centre (the files had all been created or modified recently, so backup was not helpful in this case). I recommend staying away from XFS and similar filesystems in favour of Ext3, which has an optimal balance between speed and robustness.

I used a crash test in 2003 to check the maturity of the Ext3 filesystem. I set up a computer to soft-reset itself every 5 minutes while executing filesystem-intensive operations, and then left it for a few days (Thursday-Monday). Ext3 passed that test just fine.

Can we create a few test cases with common filesystem usage patterns, run them continuously in QEMU on a raw device, and then use "pkill -9 qemu; qemu &" to simulate a crash and restart? Such a crash test would help much more than talk about this problem. Run it for a few days to gather statistics about the number of data-corruption problems per failure.

Theodore Ts'o (tytso) wrote :

>I wouldn't say Linux users are used
>to crashes all the time, but on an alpha
>with new proprietary drivers,
>kernel panics aren't THAT rare until worked out.

@Michael,

I use bleeding edge kernels all the time -- including 2.6.X-rc1 kernels, right after the merge window has closed and all sorts of new code has been dumped into the mainline kernel --- and with open source drivers, I don't see that many problems.... :-P

Theodore Ts'o (tytso) wrote :

@Volodymyr,

If you or someone else is going to run a statistical analysis, it's better to wait until the patches queued for 2.6.30 make it into an Ubuntu kernel. Also, when you did your test, did you check for file loss, or just that the filesystem was self-consistent?

Also, I can guarantee you that for certain classes of files, and for certain workloads, you will lose data with ext3 if you put the system under enough memory pressure. Heck, with ext3 (since barriers are disabled), if you are running a very heavy workload that puts the system under continuous memory pressure, the filesystem can be corrupted badly enough on an unclean powerdown that it requires fsck to fix it. Chris Mason proved that a few months ago. The fact that you didn't detect that just means your test wasn't rigged in a way that happened to catch that combination.

I can tell you what you will find with 2.6.30, which is that as long as the newly written file is replacing an existing file using O_TRUNC or a file rename, the newly written file will be forced to disk before the transaction commit is allowed to proceed. However, for files that are newly written and not yet fsync'ed, the data might not be saved until 45-120 seconds after it was written; and for files that are extended via O_APPEND, or via an open(O_RDWR), lseek(), write(), the newly written blocks again may not be allocated until 45-120 seconds have gone by. However, these files are generally either log files or database files, and databases are usually competently written by people who understand the value of using fdatasync() or fsync().

For both ext3 (and for files replacing other files in ext4) if the transaction commit has not taken place, the new file contents will obviously not be there, or be present with the "file.new" filename. So even for ext3, if your editor doesn't use auto-save files, and you are doing a multi-hour long editing session without saving the file at intermediate points, and then you save the file, and the editor doesn't do an fsync(), and then before the five second transaction commit happens, you fire up Google Earth or some other 3D application, *and* you are using a crappy proprietary driver that causes a crash --- you could lose hours of work. Ultimately, there's only so much incompetence and bad practices that a file system can protect you against without completely sacrificing performance....

Theodore Ts'o (tytso) wrote :

@Volodymyr,

Oh, about your scenario --- the hibernate scripts *should* be doing a sync before putting your laptop to sleep, so you shouldn't be losing any files if the problem is failing to wake up after a hibernate. (Me, I'm still paranoid about whether the hibernation scripts work correctly, so I tend to quiesce the filesystem -- i.e., ^Z any compiles -- and manually run sync by hand.) In any case, if your only concern is crashes caused by hibernate, you should be OK --- again, assuming the hibernate scripts are properly calling sync before going to sleep --- and if they aren't, that's a bug that should be fixed.

Your probability analysis also doesn't take into account the probability of a file write happening 45-120 seconds before the crash, using an application that isn't properly using fsync(), such that you lose a file that takes "hours" to recover. People to date have complained about state files written by desktop applications --- in some cases the failure is as simple as not remembering the size and position of the window where the application was last opened.

Peter Clifton (pcjc2) wrote :

.files aside, it completely wrecked a git repository (with stgit) which I was working from. Perhaps git / stgit needs to fsync some of its log files.

Colin Sindle (csindle) wrote :

I can confirm that Subversion (on update) and "aptitude update" (or the underlying dpkg?) can also be left in an inconsistent/broken state when this occurs. Subversion ends up with many truncated ".entries" files leaving the working copy unusable. Aptitude/dpkg ends up with truncated versions of the files being upgraded (e.g. libpng12.so.0.27.0).

pablomme (pablomme) wrote :

I think the causes of the data loss problem are clear by now. From a more pragmatic point of view, I would like to ask the Ubuntu kernel people: are the patches Tytso mentions going to be backported to the Jaunty kernel? The answer to this question will determine my choice of filesystem for the final installation, and probably that of others listening to this bug as well.

And a related question: which kernel version is going to be shipped with Jaunty according to current plans?

Theodore Ts'o (tytso) wrote :

@Peter

For git, you should add to your ~/.gitconfig file:

[core]
         fsyncobjectfiles = true

@Colin,

For Aptitude/dpkg, there really should be a call to sync before the program exits, if it has installed or upgraded any packages. I can't speak to Subversion because I don't know how the .entries files are written. If the files are being appended to, the 2.6.30 patches won't help; and if they are entirely new files, it's quite possible they would be corrupted with ext3 as well if you crash at the wrong time. Again, if you care about things being on disk, you have to use fsync(). I'd suggest filing bugs with the programs involved --- and in the meantime, run sync before you do risky things that might cause your kernel to crash.

Raine (rainefan) wrote :

@TheoTso
Well, I think we should be "thanking" these "crappy" application paradigms for letting you find these nasty "bugs/unfortunate behaviors" in such a short time... And I would not like to read on Slashdot and elsewhere things like "Linux is not reliable", "Linux loses data", "uninstall Linux", and similar anti-Linux nonsense FUD...

@KDE/Gnome people
Please consider "fixing" this behavior by employing a more centralized and consistent/fast/easy/etc. paradigm...pls

Thnx for the attention ;)

Theodore Ts'o (tytso) wrote :

@Raine,

Well, if the applications were written correctly, there wouldn't be any data loss problems. I suppose we could thank the unreliable proprietary binary drivers that were causing all of these crashes, so we could find out about the buggy application code. :-)

It is a twist, though, to say that buggy driver code that causes lots of random crashes is actually a good thing! Heh.

Martin W (mw-apathia) wrote :

I installed a clean Jaunty Alpha 5 about a week ago, and I chose ext4.
The machine crashed a few days ago, but I didn't take much notice as to why.
Today when I installed VMware Workstation it crashed 3 times.
Each time there is an ext4 error in dmesg and it remounts the filesystem as read-only.
The OS itself does not crash.
Please see the submitted output from dmesg.

Martin W (mw-apathia) wrote :

.. I forgot to say, it forces me to reboot of course, and then it forces me to do a manual fsck where it has to do a bunch of repairs.

Aigars Mahinovs (aigarius) wrote :

While ext3's design in this regard might be considered accidental, it would be wise to carry it over to ext4 in order to go 'above and beyond' POSIX in compatibility with previous behaviour. Specifically, truncation of a file needs to be made a regular write operation, so that it is cached with other write operations and flushed to disk in a regular batch.

Given the following code:

1.a) open and read file ~/.kde/foo/bar/baz
1.b) fd = open("~/.kde/foo/bar/baz", O_WRONLY|O_TRUNC|O_CREAT) --- this truncates the file
1.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
1.d) close(fd)

Assuming that less than 30 seconds pass between 1.b and 1.c, these two operations should be executed in the same write cycle, without allowing a significant window of opportunity for major data loss.

2.a) open and read file ~/.kde/foo/bar/baz
2.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
2.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
2.d) close(fd)
2.e) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")

It is even clearer here - why would the rename operation change the destination file before the previous operations have completed? It should not - the rename must be an atomic operation, even if POSIX does not demand it. This is the expected behaviour for an extN filesystem, and ext4 needs to document and honor it.

I understand that a program cannot be certain that data will reach the disk unless some sort of fsync() is called. But destroying old data and then delaying the writing of the new version _is_ an ext4 bug, regardless of what POSIX says.

And as a sidenote - maybe programmers feel differently, but system administrators much prefer a bunch of small text files that we can edit with text editors and all kinds of scripts over SQL database stores for application configuration. A configuration registry is a cool principle, but a horrible practice, even in the best implementations.

pablomme (pablomme) wrote :

@Raine: too late... http://linux.slashdot.org/article.pl?sid=09/03/11/2031231 . Links straight here...

Ethan Tira-Thompson (ejtttje) wrote :

I strongly agree with Aigars Mahinovs in comment 98: the problem here is not the delayed sync to disk; the problem is the significant gap in time between truncating the old data and syncing the new data. Particularly in case (2), the nice thing to do is to make renaming a file depend on flushing the file's pending data: either delay syncing the rename, or force the flush of the data.

As a developer, I can understand how (1) would be risky if power loss occurred between the truncate and data flush. But (2) should be *made to be* a safe operation (even if not POSIX required), particularly if this can be done without a huge performance loss. By delaying the rename until after the data flush, this shouldn't be a problem.

Tom Jaeger (thjaeger) wrote :

Theodore Ts'o wrote:
> @Raine,
>
> Well, if the applications were written correctly, there wouldn't be any
> data loss problems. I suppose we could thank the unreliable proprietary
> binary drivers that were causing all of these crashes, so we could find
> out about the buggy application code. :-)

Well, maybe if the kernel actually exposed the API that applications
need (namely, in this case, atomically replacing the contents of a file
without forcing it [or in case of ext3 the whole file system] to be
written to disk), then application would have an easier time "doing
things right".

Anyone want to venture a guess as to how many laptop hard drives have
died due to excessive wear caused by firefox calling fsync multiple
times on basically any user action to ensure data integrity?

I think the best solution is to move EXT3's behavior in this regard to
EXT4, then ask developers to use the new style, to wean them off of the
old style little by little. Asking all applications to change isn't
going to happen, and using the EXT3 save style is the only foreseeable
way right now.

Tom Jaeger wrote:
> Theodore Ts'o wrote:
>> @Raine,
>>
>> Well, if the applications were written correctly, there wouldn't be any
>> data loss problems. I suppose we could thank the unreliable proprietary
>> binary drivers that were causing all of these crashes, so we could find
>> out about the buggy application code. :-)
>
> Well, maybe if the kernel actually exposed the API that applications
> need (namely, in this case, atomically replacing the contents of a file
> without forcing it [or in case of ext3 the whole file system] to be
> written to disk), then application would have an easier time "doing
> things right".
>
> Anyone want to venture a guess as to how many laptop hard drives have
> died due to excessive wear caused by firefox calling fsync multiple
> times on basically any user action to ensure data integrity?
>

Ariel Shkedi (aslaunchpad) wrote :

@Theo

   Question:

Something I don't understand from your earlier explanation is why 1.b hits the disk, but 1.c does not. Or even more: why does 2.e hit before 2.c? At least in example 1 things happen in order, but in example 2 they don't. I can understand data loss if the system crashes exactly in between 1.b and 1.c, but if it's after 1.c, why would someone lose anything? (I'm guessing the journal is written, but not the data?)

Someone talked about hidden transactions. Shouldn't there be an assumption that for any one particular file, any operations on it must hit the disk exactly in the order they were made?

   Suggestion:

It seems to me that the correct fix to the problem is roughly the reverse of the patch you mentioned: don't allow an operation on a file to hit the _journal_ (for example, the truncate in 1.b) until any outstanding write requests for that file have hit the disk, no matter when they were made, before or after the journaled operation. If necessary, withhold placing the truncate or delete in the journal until the blocks have hit the disk (rather than forcing the blocks to disk whenever the journal is written, which is what your patch does).

I think, and if I'm wrong please tell me, that doing this would prevent data loss (at worst you'd have the old version) and also keep performance good (since you never force a write), without having to resort to heuristics like your patch does.

Steve Langasek (vorlon) wrote :

Tim, are there any plans to backport the fixes Ted mentions in comment 45 to the jaunty kernel?

If not, we should adjust the bug state here to make it clear that this is the behavior ext4 will ship with in 9.04.

If 9.04 ships with this bug, EXT4 should be altogether removed or at
least not advertised in the feature list. Shipping a broken EXT4 with
9.04 would be a VERY bad move, and the press would have a field day.

Steve Langasek wrote:
> Tim, are there any plans to backport the fixes Ted mentions in comment
> 45 to the jaunty kernel?
>
> If not, we should adjust the bug state here to make it clear that this
> is the behavior ext4 will ship with in 9.04.
>

Tim Gardner (timg-tpi) wrote :

Steve - I've no plan to add these patches as they are not part of the stable updates.

Tim, does that mean that the patches required to make EXT4 stable won't
be in the final release of Jaunty?

Tim Gardner wrote:
> Steve - I've no plan to add these patches as they are not part of the
> stable updates.
>

r6144 (rainy6144) wrote :

@Theodore:

If I understand correctly, ext3's data=ordered mode should make the create/rename method safe even without fsync: after a crash either the old or the new content is visible under the old name. It is unsafe in data=writeback mode, and the O_TRUNC method is always unsafe.

Now the problem is apparently due to ext4 not having ext3's data=ordered behavior, by having delay-allocated data blocks written after journal commit. I believe most ext3 users who choose data=ordered over data=writeback, like me, prefer this mode due to better data integrity in the create/rename case above, not security concerns.

Some applications should be fixed to reduce updates and make use of f*sync(), sure, but even given enough time, not all of them can be fixed because some applications are simply too low-level to have any knowledge on the importance of their output and thus the necessity to fsync(). Therefore, in the foreseeable future, many users will probably rely on the integrity guarantees of ext3's data=ordered mode, even though it is not specified by the POSIX spec. If I want performance, I can lengthen the journal commit interval to e.g. one minute, and fsync() is necessary only when losing one minute's changes (NOT the whole file) is unacceptable.

Does your patch get ext3's data=ordered behavior back entirely, when this option is used?

David Tomaschik (matir) wrote :

@Jeremy:

If you look at the kernel docs, ext4 is considered stable. I personally
haven't seen a problem with several systems running ext4. It only becomes
an issue if your system is unstable and you're using software that "assumes"
ext3 (and older) behavior.

@all:

I believe that having ext4 "mimic" the behavior of ext3 makes it ext3 with
extents. I believe that would negate all of the purpose of delayed
allocation. Theo would be able to explain it better.

Well, what I was asking was whether the patch that negates the issues
described in this bug report is going to be part of 9.04 or not. We
can't expect applications to change their behavior overnight; it can
take years. Certainly not even by Jaunty+3.

Would forcing all dot files to be written to disk immediately be a
suitable fix?

David Tomaschik wrote:
> @Jeremy:
>
> If you look at the kernel docs, ext4 is considered stable. I personally
> haven't seen a problem with several systems running ext4. It only becomes
> an issue if your system is unstable and you're using software that "assumes"
> ext3 (and older) behavior.
>
> @all:
>
> I believe that having ext4 "mimic" the behavior of ext3 makes it ext3 with
> extents. I believe that would negate all of the purpose of delayed
> allocation. Theo would be able to explain it better.
>

dnyaga (daniel-nyaga) wrote :

This one bit me a few weeks ago, but I did not know what had hit me. Essentially, when running Virtualbox 2.1.4 (the binary .deb from virtualbox.org, shame shame) AND fglrx [two "crimes"] AND copying large amounts of data between ext4 partitions, my system would lock.

After hard resetting the system, my painstakingly laid out KDE 4 desktop layout would be nuked. I couldn't decide what to blame: fglrx, Virtualbox or KDE 4.2, so I simply stopped using fglrx, and switched to GNOME. I have marked https://bugs.launchpad.net/ubuntu/+bug/334581 as invalid.

As for the data I was copying - I did not check to see if any files were zeroed. I simply copied again.

Michael Rooney (mrooney) wrote :

I just want to add another voice saying that I have been running ext4
in Jaunty with no problems. It has been excellent so far! Having a
backup is wise, but that is nothing new.

The fundamental problem is that there are two similar but different operations an application developer can request:

1. open(A)-write(A,data)-close(A)-rename(A,B): replace the contents of B with data, atomically. I don't care when or even if you make the change, but whenever you get around to it, make sure either the old or the new version is in place.

2. open(A)-write(A,data)-fsync(A)-close(A)-rename(A,B): replace the contents of B with data, and do it now.

In practice, operation 1 has worked as described on ext2, ext3, and UFS with soft updates, but fails on XFS and unpatched ext4. Operation 1 is perfectly sane: it's asking for atomicity without durability. KDE's configuration is a perfect candidate. Browser history is another. For a mail server or an interactive editor, of course, you'd want operation 2.

Some people suggest simply replacing operation 1 with operation 2. That's stupid. While operation 2 satisfies all the constraints of operation 1, it incurs a drastic and unnecessary performance penalty. By claiming operation 1 is simply operation 2 spelled incorrectly, you remove an important word from an application programmer's vocabulary. How else is he supposed to request atomicity without durability?

(And using a "real database" isn't a good enough answer: then you've just punted the same problem to a far heavier system, and for no good reason. As another commenter mentioned, it's a lot easier to administer a set of small, independent text files. There is no reason a filesystem in 2009 should be unable to cope with a few hundred small files.)

The fixes in 2.6.30 seem to make operation 1 work correctly, and that's good enough for me. I don't recommend application developers insert fsync calls everywhere; that will kill performance. Just use operation 1 and complain loudly when filesystems break it. While it may not be guaranteed by POSIX, operation 1's atomicity is nevertheless something any sane filesystem should provide.

"While it may not be guaranteed by POSIX, operation 1's atomicity is nevertheless something any sane filesystem should provide."

That's very misguided. It's /not/ guaranteed by POSIX, and going "above and beyond" POSIX in every respect is a surefire recipe for terrible performance.

Anders Aagaard (aagaande) wrote :

A new SATA standard with a tiny battery, to ensure buffers are written, and fsync implemented as a no-op, is very high on my wishlist...

The idea is free ;)

@Matthew: I reject your premise. ZFS preserves ordering guarantees between individual writes. UFS maintains a dependency graph of all pending filesystem operations. Both these filesystems perform rather well, especially the former.

Brian Rogers (brian-rogers) wrote :

@Matthew:

No, rewriting virtually every application to do fsync when updating files is a surefire recipe for terrible performance. Atomic updates without fsync are how we ensure good performance.

Theodore Ts'o (tytso) wrote :

So a couple of things, since this has gone totally out of control.

First of all, ZFS also has delayed allocation (also called "allocate on flush"). That means it will suffer similar issues if you crash without doing an f*sync(), and the page cache hasn't been flushed out yet. Solaris partisans will probably say, "But Solaris is so reliable, it doesn't crash", which is an acceptable response. It's true of Linux too, but apparently Ubuntu has too many unstable proprietary drivers. :-)

Secondly, you can turn off delayed allocation for ext4. If you mount the filesystem with the nodelalloc mount option, you will basically get the old ext3 behaviour with data=ordered. You will also lose many of the performance gains of ext4, and the files will tend to be more fragmented as a result, but if you have crappy drivers, and you're convinced your system will randomly crash, that may be your best move. Personally, that's not the Linux I use, but I'm not using proprietary drivers.

As far as including the 3 queued-for-2.6.30 patches in question: they are very low-risk, so I think it's fair for Ubuntu to consider including them. If not, and if you don't have confidence in the stability of your kernel, probably your best bet is to include nodelalloc as a default mount option for now, or patch ext4 to use nodelalloc as the default, and allow people who want delayed allocation to request it via the mount option delalloc. This used to be the default, until we were confident in the delalloc code.

Finally, I'll note that Fedora folks haven't really been complaining about this, so far as I know. Which should make people ask the question, "why is Ubuntu different"?

pablomme (pablomme) wrote :

@Theo: could you comment on the points made above (e.g., comment #98), namely why the truncate operation is immediate while the write operation is delayed? I think that's a very good point; if both operations were delayed, no (old) data would be lost, while still achieving top performance.

> Finally, I'll note that Fedora folks haven't really been complaining about this, so far as I know. Which should make people ask the question, "why is Ubuntu different"?

Well:
- Release cycle timing: alpha 1 for Ubuntu Jaunty was released last November, while the alpha for Fedora 11 was released in February. Fedora 11 will have ext4 by default, I'm pretty sure there _will_ be complaints if they don't address this problem, just give it time.
- Feature list: Fedora 11 is including experimental btrfs support. The truly adventurous will be trying that. The not-so-adventurous may as well wait for the beta to try Fedora 11 out.
- Number of users: the more distribution users there are, the more comments and complaints you'll get for a given problem. I've no data to back this up, but in my perception Ubuntu does have more users than Fedora.

Martina H. (m-hummelhausen) wrote :

I've been following this "bug" for a while now and I must say it's quite amusing. However, I do find it quite strange that people keep referring to this as a "bug" when it has been clearly and more than once explained that it really isn't. So I thought I'd chime in and repeat it in nice and friendly letters for everyone to see:

THIS IS NOT A BUG!

HTH,

Martina

pablomme (pablomme) wrote :

> THIS IS NOT A BUG!

No, it's a feature. You automatically get a clean desktop configuration once in a while, because your desktop effects were probably too fancy and your desktop was too cluttered. Entirely as intended.

Seriously, though, this IS a bug, and it involves data loss, so it's an important one. The point of the discussion is whether it's a bug in the kernel or in the applications, and in the latter case, whether it makes sense to modify the kernel behaviour to cover for the applications' fault. And BTW, neither question is closed so far.

Adam Goode (agoode) wrote :

Perhaps off topic and/or well known, but this paper claims to solve all these problems? (At least compared to ext3)
http://www.usenix.org/events/osdi06/tech/nightingale.html

Hiten Sonpal (hiten-sonpal) wrote :

@Theo,

Appreciate everything you've done for ext filesystems and Linux in general. A few comments:

> Slightly more sophisticated application writers will do this:
>
> 2.a) open and read file ~/.kde/foo/bar/baz
> 2.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
> 2.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
> 2.d) close(fd)
> 2.e) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")

> The fact that series (1) and (2) works at all is an accident. Ext3 in its default configuration happens to have the property that 5 seconds after (1) and (2) completes, the data is safely on disk.

I offer that 0-length files only appear when 2.c) happens after 2.e). This is a sequencing error - I don't know where it happens, but a crash anywhere between 2.a) and 2.e) should result in only five states:

A. ~/.kde/foo/bar/baz was not touched at all
B. ~/.kde/foo/bar/baz was not touched at all, ~/.kde/foo/bar/baz.new exists with no data
C. ~/.kde/foo/bar/baz was not touched at all, ~/.kde/foo/bar/baz.new exists with some data
D. ~/.kde/foo/bar/baz was not touched at all, ~/.kde/foo/bar/baz.new exists with all the data
E. ~/.kde/foo/bar/baz contains the data previously written to baz.new

If ~/.kde/foo/bar/baz exists with no data, it means that the rename has been moved up in sequence in the disk and step 2.c) did not actually happen on disk before the crash.

> So, what is the problem. POSIX fundamentally says that what happens if the system is not shutdown cleanly is undefined. If you want to force things to be stored on disk, you must use fsync() or fdatasync(). There may be performance problems with this, which is what happened with FireFox 3.0[1] --- but that's why POSIX doesn't require that things be synched to disk as soon as the file is closed.

That's not what we are saying. No one has a problem with fsync() being required to force items to disk. The issue is that not using fsync() causes us to lose data that was already on the disk, because of long windows between out-of-order updates to the disk.

Atomic transactions like renames should happen in-order with related transactions to make sure that we do not have unexpected data corruption.

Thanks for quickly creating a fix,
-- Hiten

r6144 (rainy6144) wrote :

I agree with Daniel's comment #113. The data integrity semantics of ext3's data=ordered are indeed useful in practice, and ext4 should not introduce different semantics (essentially no safer than data=writeback) for an option with the same name. The current behavior should be called "data=ordered_for_allocated", and described as a mode where files updated via create/rename without fsync() can be totally lost in case of a power failure, but where no random, possibly sensitive data from other users can end up in them.

Thank you, Theodore, for explaining that turning off delayed allocation solves the problem. Is it possible to delay allocation until transaction commit time?

Kai Krakow (hurikhan77) wrote :

> THIS IS NOT A BUG!

I would consider it a bug. As far as I understand, the problem is that flushing to the filesystem does not occur in the correct order. Metadata should be flushed after data has been flushed to ensure transactional integrity, but that is exactly what Ext4 currently does not do. Hence it truncates the files, instead of leaving them alone when it cannot be sure about their content.

On the other hand: surely many apps need fixing. But you can imagine what happens to fs performance if every application fsyncs after every write or before every close - performance would suffer badly. fsync() is needed for critical phases like writing important/central configuration files, database stuff etc. Both sides need fixing. In the spirit of POSIX, of course, the current behaviour is perfectly okay as far as I understand, but it's not very wise to do it that way.

I'd prefer that Ext4 leaves my files with the old content instead of simply truncating them because it cannot ensure integrity after a crash if the file's data has only been partly flushed.

Other file systems handle this case better - even NTFS, for example. Problems like these just feed the trolls - so let's fix it.

In my eyes close() should ensure transactional integrity between write() and close(), while fsync() should ensure logical integrity between writes or updates of multiple files related to each other. This may not be technically correct, but it is how I understand the tools given by the filesystem API. In reality it may be more complicated.

PS: I intentionally chose to compare to a non-POSIX fs...

ubuntu 9.04 alpha amd64, ext4

I play World Of Goo 1.4, and every time I press "Quit" it stops working, and the only way to return to Ubuntu is a 'hard' reboot (turning the power switch off).

So, one time after such a reboot I lost my save file from the game, Pidgin displayed 'can't read buddy.lst', and many other programs couldn't find their files.

Now all windows lost their borders, maybe it's because some config files are missing.

I confirm that there is a problem with ext4 on power loss.

Chris Schanck (chris-schanck) wrote :

> Finally, I'll note that Fedora folks haven't really been complaining about this, so far as I know.
> Which should make people ask the question, "why is Ubuntu different"?

This, to me, gets to the root of the loggerheads displayed in this bug. The reason Ubuntu is different is that it is *more* likely (I suspect) to be deployed as a personal-use desktop. Since it gets thrown onto any number of laptops and such, it is used with a large array of relatively new hardware (I am on a Dell D830 right now -- about a year old). Binary drivers? Fact of life. Without the nvidia driver I'd not use Linux. Sorry, the user experience matters.

This use model conflicts with the "no binary drivers" ivory-tower mentality. Sorry, but under Windows this stuff just works. If you accept that Linux is a serious desktop OS, you'll have to live with the reality of binary drivers, lousy programmers, etc. While ext4 has some wonderful performance and behavior, and can claim to be "correct" as far as POSIX goes, it's a bit much to be a spec lawyer when usage models that used to work cease to do so.

I understand your point, and even agree that by the spec it is an app error. But this puts users in the situation of having their machines fail where they didn't use to, as soon as they move to ext4. Accurate explanations won't change that impression.

Brett Alton (brett-alton) wrote :

So servers, presumably running a CLI rather than GNOME or KDE, should be relatively unaffected by this, correct?

I've been using ext4 on my desktop (/, not /home) for quite some time and have seen no problems. I like the performance boost it gives me and would like to give it to my servers as well...

Theodore Ts'o (tytso) wrote :

@Brett,

Servers generally run on UPS's, and most server applications generally are set up to use fsync() where it is needed. So you should be able to use ext4 in good health. :-) I'm using a GNOME desktop myself, but I use a very minimalistic set of GNOME applications, and so I don't see a large number of dot files constantly getting rewritten. So it's probably not even fair to say that all GNOME/KDE desktops are affected; mine certainly has not been.

Theodore Ts'o (tytso) wrote :

@Kai,

>But you can imagine what happens to fs performance if
>every application does fsyncs after every write or before
>every close. Performance would suffer badly.

Note that the "fsync causes performance problems" meme got started precisely because of ext3's "data=ordered" mode, which is what causes all dirty blocks to be pushed out to disk on an fsync(). Using ext4 with delayed allocation, or ext3 with data=writeback, fsync() is actually quite fast. Again, all modern file systems will be doing delayed allocation, one way or another. Even the much-vaunted ZFS does delayed allocation. So really, the right answer is to use fsync() or fdatasync(), as appropriate.

@Pablomme,

The reason why the write operation is delayed is because of performance. It's always going to give you performance benefits to delay writes for as long as possible. For example, if you end up deleting the file before it ever gets staged out for writing, then you might not need to write it at all. There's a reason why all modern filesystems use delayed allocation as a technique. Now, maybe for desktop applications, we need to have modes that sacrifice performance for better reliability given unreliable device drivers, and applications that aren't explicitly requesting fsync() when they really should.

It comes as an absolute shock to me, for example, that using a system which requires a hard reset whenever you exit "World of Goo", would ever be considered acceptable; I'd use an ATI or Intel chipset before I would accept that kind of reliability. Maybe that's an ivory-tower attitude; I dunno. But for a system which is that unreliable, disabling delayed allocation does make sense. Maybe we can do something such as what r6144 has suggested, where we have a data=alloc-on-commit mode. There will still be a performance hit associated with such a mode, since (like ext3's data=ordered mode) it effectively means an implied fsync() for every single inode involved with the transaction commit --- which will hurt; there's no way around it.

pablomme (pablomme) wrote :

> The reason why the write operation is delayed is because of performance.

Yup, I understand that and I'm all for it. Delay writing for hours if that improves performance further, that's great. But the question remains: why is the _truncate_ operation not delayed as well? The gap between the truncate and the write is what creates a window where crashing the system leads to data loss. That gap should be closed, not by "un-delaying" the write (which reduces performance), but by delaying the truncate (which should, if anything, improve performance).

PowerUser (i-am-sergey) wrote :

As for configuration registry:

Filesystems are about files in exactly the same way that sqlite and other databases are about records. Files could actually be treated as a sort of record in a very specific kind of database (if we ignore some specifics).

And we, the users, expect BOTH databases and file systems (at least those with a journal) to care about our data and its integrity. If a file system does not want to care about data integrity, and instead tries to push the data integrity question onto another layer like a database rather than taking care of it itself, why should I trust such a file system? Am I really expected to store my valuable data on a file system that prefers speed over data integrity?

As for me, I want the file system to provide data integrity on its own, without REQUIRING extra layers like sqlite in applications. If a file is written and closed, it has to be on disk. And the gain of less fragmentation, and some gain in speed from temp files in RAM, is not worth possible data loss due to over-aggressive caching (and ALL apps will NEVER be rewritten to use extra bloat like an sqlite database just to keep data integrity).

Sorry if some words are offensive or wrong, but the offer to use sqlite for data integrity REALLY HURTS and RAISES THE QUESTION: why should I trust my data to such a filesystem? For now I'll probably have to stick to ext3; even if it costs some speed, at least it does not lose data.

P.S. I'm also using XFS, _but_ only on computers where performance is valued over data integrity, and only with UPSes. And I'm unable to supply UPSes to each and every computer. So, in short: users need RELIABLE file systems which provide reliability without extra layers like sqlite. Please do not disregard this simple fact. Sorry once more if this sounds offensive.

Kai Krakow (hurikhan77) wrote :

@Theodore,

> Note that the "fsync causes performance problems meme got" started
> precisely because of ext3's "data=ordered" mode. This is what causes
> all dirty blocks to be pushed out to disks on an fsync(). Using ext4 with
> delayed allocation, or ext3 with data=writeback, fsync() is actually quite fast.

While that may be true (and I suppose it is ;-)), what happens to all those users sticking to ext3 or similar filesystems when "suddenly" all apps fsync() on every occasion?

It may not hurt ext4 performance that much but probably other fs' performance.

I still think the solution lies somewhere between "fix the apps" and "fix the fs". Correct me if I am wrong, but as I read it, Ext4 currently does (for as-yet-unexplained reasons) out-of-order flushing between data and metadata, which hopefully can be fixed without affecting performance too much while improving integrity on crashes. Still, it is important to fix the apps that rely too much on Ext3's behaviour.

/End of @Theodore

Side note:

I'm using XFS, so Ext4 isn't my preference, but this is still interesting for me, as it looks like XFS and Ext4 share the same oddities that lead to truncated configs in e.g. KDE4. I lost my kmailrc to this several times, including all my filters, account settings, folder settings etc. (btw: a situation which could be improved if KMail didn't work with a single monolithic config). The situation was already present with KDE3 and got worse with KDE4 and the latest NVIDIA drivers, which is currently a not-so-stable combination on some hardware (rock-solid on my Intel-based system at home, pretty unstable on my AMD/VIA-based system at work). And note: an open-source driver could equally have made the system freeze - this is not solely the fault of closed-source drivers. So some of the arguments here are just irrelevant. That also holds true for the config-database vs. tiny-config-files arguments.

Jan Larres (majutsushi) wrote :

@Theo

Sorry, but you seem to avoid the actual point people are making. No one says that delayed allocation is bad in general or questions its benefits; the point is that reordering the operations on a file - delaying the data-writing part but not the renaming part - is prone to serious data loss if anything happens between those operations. This has nothing to do with binary drivers; something like that could happen for a lot of reasons. If you keep these operations in order, and delay them both if you want to, everything is fine, and the worst thing that could happen is that you lose a few minutes of updates. This is obviously much better than losing the entire file(s).

Jan Claeys (janc) wrote :

@Theodore:
Please stop blaming this on binary drivers, they are not the only reason for this happening; open source drivers aren't magically bug-free, power losses happen and hardware breaks or starts to behave flaky...

Theodore Ts'o (tytso) wrote :

@Kai,

>While that may be true (and I suppose it is ;-)) what happens
>to all those users sticking to ext3 or similar fs' when "suddenly"
>all apps to fsync() on every occassion?
>
>It may not hurt ext4 performance that much but probably other
>fs' performance.

Actually, the problem with fsync() being expensive was pretty much exclusive to ext3's data=ordered mode. No other filesystem had anything like it, and all modern filesystems are using delayed allocation. So in some sense, this is a "get used to it, this is the wave of the future". (And in many ways, this is also "back to the future", since historically Unix systems sync'ed metadata every 5 seconds and data every 30 seconds.) Basically, modern application writers (at least under Linux) have gotten lazy. Older programs (like emacs and vi), and programs that need to work on other legacy Unix systems, will tend to use fsync(), because that is the only safe thing to do.

>Correct me if I am wrong but I read, currently Ext4
>does (for yet unknown reasons) out-of-order flushing
>between data and meta data which hopefully can be
>fixed without affecting performance too much while
>improving integrity on crashes.

Well, that's not how I would describe it, although I admit that in practice it has that effect. What's happening is that the journal is still being committed every 5 seconds, but dirty pages in the page cache do not get flushed out if they don't have a block allocation assigned to them. I can implement an "allocate on commit" mode, but make no mistake --- it ***will*** have a negative performance impact, because fundamentally it's the equivalent of calling fsync() on dirty files every five seconds. If you are copying a large file, such as a DVD image file, which takes longer than five seconds to write, forcing an allocation in the middle of the write could very well result in a more fragmented file. On the other hand, it won't be any worse than ext3, since that's what happens under ext3.

>I'm using XFS so Ext4 isn't my preference but this
>is still interesting for me as it looks like XFS and Ext4
>share the same oddities that lead to truncated configs

Yes, and as I've said multiple times already, "get used to it"; all modern filesystems are going to be doing this, because delayed allocation is a very powerful technique that provides better performance, prevents file fragmentation, and so on. It's not just "oddities" in XFS and ext4; it's also in btrfs, tux3, reiser4, and ZFS.

>in e.g. KDE4. I lost my kmailrc due to this several
>times, including all my filters, account settings, folder
>settings etc... (btw: a situation which could be improved
> if KMail wouldn't work with a single monolithic config)

Yeah, that's a good example for what I mean by a broken application design --- why is it that KMail is constantly updating its config? Is it doing something stupid such as storing the last location and size of the window in a monolithic config, which it is then constantly rewriting out each time you drag the window around on the screen? If you are going to be using a single monolithic config, then you really want to fsync() it each time you write it out. ...


@Theodore,

As a scalable server developer with 25 years experience, I am fully aware of the purpose of fsync, fdatasync and use them if and only if the semantics I want are "really commit to disk right now". To use them at any other time would be an implementation error.

I further agree that delayed allocation is a good thing, and believe that application developers who use the first command sequence you describe above get what they deserve, and that it is a mistake for the filesystem to perform an implicit sync in that case.

Where I strongly disagree with you is for the open-write-close-rename call sequence (your second scenario). It is very common for an application to need "atomic replace, defer ok" semantics when updating a file (more common, in fact, than cases where fsync is really needed). The only way to express that semantic is open-write-close-rename, and furthermore that semantic is the only useful interpretation of that call sequence. Adding an fsync expresses a different and less useful semantic. For example, when I do "atomic replace, defer ok" twice in a flush interval I would expect an optimal filesystem to discard the intermediate version without ever committing it to disk. So I find the workaround you've implemented undesirable as it results in non-optimal and unnecessary disk commits.

Now your not-useful interpretation of open-write-close-rename is Posix compliant under a narrow interpretation. But I can interpret any standard in a not-useful way. An IMAP server that delivers all new mail to a mailbox "NEWMAIL" and has no "INBOX" would be strictly compliant with the spec and also not useful. Any reasonable IMAP client vendor will simply state they don't support that server. And that's exactly what will happen to EXT4, XFS and other filesystems that interpret the open-write-close-rename call sequence in a not useful way. You will find applications declare your filesystem unsupported because you interpret a useful call sequence in a not-useful fashion.

The right interpretation of open-write-close-rename is "atomic replace, defer ok". There is no reason to spin up the disk or fsync until the next flush interval. What's important is that the rename is not committed until after the file data is committed.

If you disagree, I invite you to suggest how you would express "atomic replace, defer ok" using Posix APIs when writing an application.

@Theodore,

> Well, that's not how I would describe it, although I admit in practice it has
> that effect. What's happening is that the journal is still being committed
> every 5 seconds, but dirty pages in the page cache do not get flushed out if
> they don't have a block allocation assigned to them.

I think everyone understands why it's a bad idea to write data pages immediately (thanks for your detailed and clear explanations). But why can't the metadata writes be delayed as well? Why do they have to be written every five seconds instead of much later, whenever the data happens to get written?

Ariel Shkedi (aslaunchpad) wrote :

Theo, I have tremendous respect for the work you did, but you are wrong.

> If you are going to be using a single monolithic config, then you really want to
> fsync() it each time you write it out. If some of the changes are
> bullsh*t ones where it really doesn't matter if you lose the last
> location and size of the window, then write that to a separate dot file
> and don't bother to fsync() it.

No. If I overwrite a file the filesystem MUST guarantee that either the old version will be there or the new one. That is one of the main selling points of a journaling file system - if the write did not complete (crash) you can go back to the old version.

There should be NO case where you end up with a zero byte file. Telling people to call fsync constantly is wrong. The filesystem should make sure not to truncate the file until it's ready to write the replacement. (Yes, there are corner cases where it commits exactly in between the truncate and the write, but that is not what is happening here.) Even a crash in between the truncate and the overwrite should not lose anything, since the journal should be rolled back to the old version of the file.

Telling people to use sqlite is also not the right answer - you are essentially saying the fs is broken, so use this app to work around the bugs. I might as well use sqlite on a raw partition!

> I can implement a "allocate on commit" mode, but make no mistake
> --- it ***will*** have a negative performance impact, because
> fundamentally it's the equivalent of calling fsync() on dirty files
> every five seconds.

No Theo, that is not what people are asking for. People simply want the filesystem not to commit the truncate before committing the data.

I have no idea if that is hard to do, I assume it is because you seem to be resisting the idea, but it needs to be done for ext4 to be a reliable filesystem.

Tim (aardbeiplantje) wrote :

@Hiten

I agree with your comment, I think. I was about to make that same post.

I would even dare to say that fsync(fd) is an evil call that should never be used by any application. The reason for this is very simple: it doesn't make a difference. If fsync(fd) needs to write 100MB to disk and a power loss occurs, the temp file will still be in state B, C or D. Your application can't even use that file at the next startup, as it has no way of telling whether the file is complete or not. fsync(fd) just might improve things, but the performance hit doesn't justify its use.

On ext3, fsync(fd) is a gigantic performance hit too when there's one 'big consumer' on the same machine. Something as simple as copying two files around can make saving a file in vim lag behind by 30s. Luckily, vim has ':se nofsync' and ':se swapsync='.
Because of this performance problem I wanted to test ext4, but after reading this bug it looks to me like it won't make any difference - I hope I'm wrong on that. Unless ext4 is smarter and fsync(fd) only flushes that file's data, instead of the 'everything that came first must go into the transaction log first'-style algorithm that is in ext3 (which amounts to a lot when you're copying a file of ~1G).

Again, calling fsync(fd) to make the rename() appear after the close() is IMHO bad coding. It's fixing someone else's problem. Using a tempfile + rename is good application design for avoiding half-written files; it would be nice if it stayed that way, without waiting 30s for an fsync() when you're copying :-).

@Ariel

> No. If I overwrite a file the filesystem MUST guarantee that either the old version will be there or the new one.

Err, no it's perfectly fine for a filesystem to give you a zero-byte file if you truncate, then write over the truncated file. Why should the filesystem try to guess the future and hold off on that truncate? As long as the relative ordering of the truncate and write is preserved, you're fine.

Why is it okay for the filesystem to give you a zero-byte file between a truncate() and a write()? Because the filesystem gives you a facility for asking for an atomic commit instead: write to a scratch file and rename() that scratch file over the original. That's been the unix technique since time immemorial, and it works fine.

1. When you need neither atomicity nor durability, truncate() and write().
2. When you need atomicity but not durability, write() to a temporary and rename()
3. When you need both atomicity and durability, write() to a temporary, fsync the file, rename, and fsync the directory.
4. When you need just durability, truncate(), write(), and fsync().

The problem isn't a zero-length file in cases 1 and 4. That's an expected danger. You asked for those semantics without using an atomic rename, so you can deal with them.

The real insidious problem is getting a zero-byte file under scenarios 2 and 3. rename() on top of an existing file should *always* make sure the data blocks for the new file are committed before the record of the rename itself is. It's absolutely critical that this work correctly because there's no other good way of achieving atomicity.

Theodore Ts'o (tytso) wrote :

@Chris

I hate to keep repeating myself, but the 2.6.30 patches will cause open-write-close-rename (what I call "replace via rename") to have the semantics you want. They do that by forcing a block allocation on the rename, so that when you do the journal commit, it will block waiting for the data writes to complete. So it will do what you want. Please note that this is an ext4-specific hack; there is no guarantee that btrfs, ZFS, tux3, or reiser4 will implement anything like it. And all of these filesystems do implement delayed allocation, and will have exactly the same issue. You and others keep talking about how this is a MUST implement, but the reality is that it is not mandated by POSIX, and implementing these sorts of things will hurt benchmarks and real-life server workloads. So don't count on other filesystems implementing the same hacks.

@CowbowTim,

Actually ext4's fsync() is smarter; it won't force out other files' data blocks, because of delayed allocation. If you write a new 1G file, thanks to delayed allocation, the blocks aren't allocated, so an fsync() of some other file will not cause that 1G file to be forced out to disk. What will happen instead is that the VM subsystem will gradually dribble out that 1G file over a period of time controlled by /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_writeback_centisecs.
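For reference, the two VM writeback knobs Ted names can be inspected like this (values are in hundredths of a second; the defaults noted in the comments are typical Linux defaults, not guaranteed):

```shell
# how old dirty data must be before the flusher threads write it out
cat /proc/sys/vm/dirty_expire_centisecs      # typically 3000, i.e. 30 s

# how often the flusher threads wake up to check
cat /proc/sys/vm/dirty_writeback_centisecs   # typically 500, i.e. 5 s
```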

This problem you describe with fsync() and ext3's data=ordered mode is unique to ext3; no other filesystem has it. Fortunately or unfortunately, ext3 is the most commonly used filesystem, so people have gotten used to its quirks, and worse yet, seem to assume that they are true for all other filesystems. One of the reasons why we implemented delayed allocation was precisely to solve this problem. Of course, we're now running into the issue that there are people who have been avoiding fsync() at all costs thanks to ext3, so now we're trying to implement some hacks so that ext4 will behave somewhat like ext3 in at least some circumstances.

The problem here is really balance; if I implement a data=alloc-on-commit mode, it will have all of the downsides of ext3 with respect to fsync() being slow due to "entangled writes" (where you have both a large file which you are copying and a small file which you are fsync()'ing). So it will encourage the same bad behaviour, which means people will still have the same bad habits when they decide they want to switch to some new, more featureful filesystem, like btrfs. The one good thing about "alloc-on-replace-via-truncate" and "alloc-on-replace-via-rename" is that they handle the most annoying set of problems (an existing file being rewritten turning into a zero-length file on a crash) without necessarily causing an implied fsync() on commit for all dirty files (which is what ext3 was doing).

It's interesting that some people keep talking about how the implied fsync() is so terrible, while simultaneously arguing that ext3's behaviour is what they want --- what ext3 was doing was effectively a forced fsync() of all dirty files at each commit (which happens every 5 seconds by default) --- maybe people didn't realize that w...


Theodore Ts'o (tytso) wrote :

>But why can't the metadata writes be delayed as
>well? Why do they have to be written every five seconds
>instead of much later, whenever the data happens to get written?

Fundamentally the problem is "entangled commits". Normally there are multiple things happening all at once in a filesystem. One process is truncating a file and rewriting it, another process is creating a new file and allocating blocks, and so (for example) both processes might touch the block allocation bitmap as they do their various operations. So it's not as simple as "delaying the truncate"; you can delay committing all operations in the journal, but you can't just delay one transaction and not another. This is the case with SQL as well; you can issue various commands, such as an SQL "INSERT" and an SQL "DROP", but you can't delay one SQL statement beyond another, although you can control when you send the "COMMIT" statement.

So you can change the journal commit interval from 5 seconds to say 30 seconds, or 600 seconds. Laptop mode for example will by default change the journal commit time to 30 seconds. That will do part of what you want; if you make the journal commit interval much larger than the default writeback time, that will achieve most of what you want. However, various disk buffers will get pinned in memory until the commit takes place, so extending commits may end up chewing up more memory used by the kernel. TNSTAAFL.

pablomme (pablomme) wrote :

@Theo: thanks for addressing this point.

> So it's not as simple as "delaying the truncate"; you can delay committing all operations in the journal, but you can't just delay one transaction but not another.

I think this is the overall idea people are expressing here, delaying the entire journal operation stack, not swapping operations.

> However, various disk buffers will get pinned in memory until the commit takes place, so extending commits may end up chewing up more memory used by the kernel. TNSTAAFL.

If I understand you correctly, you are implying that delaying the journal commits is more expensive memory-wise than delaying the data writes? Otherwise I don't understand your concern; the data writes are already delayed for longer...

Is it possible to exactly synchronize the journal commits and the data writes (i.e., one immediately after the other)? Would this prevent data loss for the programs that rely on posixly-unsafe file rewrites?

Carey Underwood (cwillu) wrote :

Theo, does that then imply that setting the writeback time to the
journal commit time (5 seconds) would also largely eliminate the
unpopular behavior?

How much of the benefit of delayed allocation do we lose by waiting a
couple seconds rather than minutes or tens of seconds? Any large
write could easily be happening over a longer period than any
reasonable writeback time, and so those cases should already be
allocating their eventual size immediately (think torrents or a long
running file copy).

On Thu, Mar 12, 2009 at 4:21 PM, Theodore Ts'o <email address hidden> wrote:
> <snip>
>
> So you can change the journal commit interval from 5 seconds to say 30
> seconds, or 600 seconds.  Laptop mode for example will by default change
> the journal commit time to 30 seconds.  That will do part of what you
> want; if you make the journal commit interval much larger than the
> default writeback time, that will achieve most of what you want.
> However, various disk buffers will get pinned in memory until the commit
> takes place, so extending commits may end up chewing up more memory used
> by the kernel.   TNSTAAFL.

Theodore Ts'o (tytso) wrote :

@pablomme,

Well, until the journal has been committed, none of the modified meta-data blocks are allowed to be written to disk --- so any changes to the inode table, block allocation bitmaps, inode allocation bitmaps, indirect blocks, extent tree blocks, directory blocks, all have to be pinned in memory and not written to disk. The longer you hold off on the journal commit, the more file system meta-data blocks are pinned into memory. And of course, you can't do this forever; eventually the journal will be full, and a new journal commit will be forced to happen, regardless of whether the data blocks have been allocated yet or not.

Part of the challenge here is that normally the VM subsystem decides when it's time to write out dirty pages, and the VM subsystem has no idea about ordering constraints based on the filesystem journal. And in practice, there are multiple files which will have been written out, and the moment one of them is fsync()'ed, we have to do a journal commit for all files, because we can't really reorder filesystem operations. All we can do is force the equivalent of an fsync() when a commit happens.

So the closest approximation to what you want is a data=alloc-on-commit mode, with the commit interval set to some very large number, say 5 or 10 minutes. In practice the commit will happen sooner than that, especially if there are lots of filesystem operations taking place, but hopefully most of the time the VM subsystem will gradually push the pages out before the commit takes place; if the commit takes place first, the alloc-on-commit mode will force any remaining pages to disk on the transaction commit.

Theodore Ts'o (tytso) wrote :

@Carey,

>Theo, does that then imply that setting the writeback time to the
>journal commit time (5 seconds) would also largely eliminate the
>unpopular behavior?

You'd need to set it to be substantially smaller than the journal commit time (half the commit time or smaller), since the two timers are not correlated. Furthermore, the VM subsystem doesn't write dirty pages as soon as the expiration time goes off. It stages the writes over several writeback windows, to avoid overloading the hard drive with background writes (which are intended to be asynchronous). So the answer is yes, you could probably do it by adjusting timers, but you'd probably need to bump up the journal commit time as well as decrease the dirty_writeback and dirty_expire timers.

>How much of the benefit of delayed allocation do we lose by waiting a
>couple seconds rather than minutes or tens of seconds? Any large
>write could easily be happening over a longer period than any
>reasonable writeback time, and so those cases should already be
>allocating their eventual size immediately (think torrents or a long
>running file copy).

Well, yes, but that means modifying more application code (which most people on this thread seem to think is a hopeless cause :-P). Also, it's only in the latest glibc in CVS that there is access to the fallocate() system call. Current glibc has posix_fallocate(), but the problem with posix_fallocate() is that it tries to simulate fallocate() on filesystems (such as ext3) which don't support it by writing zeros into the file. So posix_fallocate() is a bit of a performance disaster on filesystems that don't support fallocate(). If you use the fallocate() system call directly, it will return an error (ENOTSUPP, if I recall correctly) if the file system doesn't support it, which is what you want in this case.

The reality is that almost none of the applications which write big files are using fallocate() today. They should, especially bittorrent clients, but most of them do not --- just as many applications aren't calling fsync() even though POSIX demands it if there is a requirement that the file be written onto stable storage.

I created lame test case to catch the bug. Numbers:

Filesystem, Method, Performance, Percentage of data loss
ext3, (1), 0.50, 1% (one file is partial)
ext3, (2), 0.44, 0% (one temporary file is partial)
ext3, (3), 0.37, 0% (one temporary file is partial)
ext4, (1), 0.50, 102% (all files are zeroed, including my scripts)
ext4, (2), 0.44, 101% (all files are zeroed, including one .tmp file)
ext4, (3), 0.29, 0% (ext4 is, actually, slower than ext3).

BTW: I see no way to call fsync() in bash, so I used the "sync" command instead in method (3).

> Finally, I'll note that Fedora folks haven't really been complaining about this, so far as I know.

I am a Fedora user. Feel the difference.

@Volodymyr,

You can only fsync given a file descriptor, but I think writing an fsync binary that opens the file read-only, fsync on the descriptor, and close the file, should work.

> You can only fsync given a file descriptor, but I think writing an fsync binary that opens the file read-only, fsync on the descriptor, and close the file, should work.

Use this little program to verify your assumptions (I have no time right now):

#include <fcntl.h> /* open(), O_RDONLY */
#include <unistd.h> /* fsync(), close() */
#include <errno.h> /* errno */
#include <string.h> /* strerror() */
#include <stdio.h> /* fprintf(), stderr */

int
main(int argc, char **argv)
{
  char *file_name; /* Name of the file to sync */
  int fd; /* File descriptor */
  int exit_code=0;
  int i;

  /* For each argument, except the program name itself */
  for(i=1; i<argc; i++)
  {
    file_name=argv[i];

    /* Open the file in read-only mode */
    fd=open(file_name, O_RDONLY);

    if(fd==-1)
    {
      fprintf(stderr,"Cannot open file \"%s\": %s\n",file_name, strerror(errno));

      exit_code=1; /* Return non-zero exit code to indicate a problem. */
      continue;
    }

    if(fsync(fd)==-1)
    {
      fprintf(stderr,"Cannot fsync file \"%s\": %s\n",file_name, strerror(errno));

      exit_code=1; /* Return non-zero exit code to indicate a problem. */
    }

    /* Close the descriptor so we don't leak one per argument */
    close(fd);
  }

  return exit_code;
}

I wrote something similar, but with one change -- it turns out you must have write access to the file you want to fsync (or fdatasync).

It seems to work, but I have not had time to do a power loss simulation. Would be useful performance-wise on any system but ext3 (where calling this is identical in outcome to doing a full sync).

mkluwe (mkluwe) wrote :

Just in case this has not been done yet: I have experienced this »data loss problem« with XFS, losing the larger part of my gnome settings, including the evolution ones (uh-oh).

Alas, filesystems are not databases. Obviously, there's some work to be done in application space.

@Michel Salim

> You can only fsync given a file descriptor, but I think writing an fsync binary that opens the file read-only, fsync on the descriptor, and close the file, should work.

Wouldn't that only guarantee the updates through that descriptor (none) are synced?

@Theodore Ts'o

> 3.a) open and read file ~/.kde/foo/bar/baz
> 3.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
> 3.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
> 3.d) fsync(fd) --- and check the error return from the fsync
> 3.e) close(fd)
> 3.f) rename("~/.kde/foo/bar/baz", "~/.kde/foo/bar/baz~") --- this is optional
> 3.g) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")

> (3) is the ***only*** thing which is guaranteed not to lose data.

But it's not right either.
It assumes you have permission to write the .new file.
It assumes this file doesn't exist already (or can be overwritten).
It uses an fsync, which may not be required (if atomicity but not durability is desired).
It doesn't retain the permissions of the old file.
If the target is a symlink, it gets replaced by a normal file.
If you do 3.f, there is a window where no file exists at all.
It's too complex, so it needs to be wrapped in library functions.

I think a concept like atomic updates (O_ATOMIC?) is needed. This would guarantee other apps and the disk (after a crash) either see the old file or the new file, but nothing else.

@Olaf

from the manpage, fsync() transfers "all modified in-core data of the file referred to by the file descriptor fd". So it should really be all pending writes, not just the writes that take place using the current fd.

I cannot really reboot any of my machines right now, but it does make sense. This use case:

1. program A writes a file, closes it without fsync
2. program B writes to the same file, fsyncs, closes

Program B should rightly expect the file to have the same on-disk content as its view of the file at the time it called fsync(), which means any prior non-committed parts of the file that B sees should also be committed, modulo B's changes.

Jan Claeys (janc) wrote :

@mkluwe:
Filesystems are databases by definition (they are structured sets of data). But of course not all databases are/work equal, because they serve different purposes...

Maybe it would be good to amend POSIX to provide an (optional?) mechanism for guaranteed transactional integrity for some operations though... (examples why this might be useful were given earlier in this thread).

Adrian Cox (adrian-humboldt) wrote :

@Bogdan Gribincea

While the discussion here has concentrated on rewriting config files, you also report a loss of MySQL databases. What table configuration were you using, and were the underlying files corrupted or reduced to 0 length?

InnoDB is intended to be ACID compliant, and takes care to call fsync at appropriate places. The mechanism for loss may be different to the loss of config files discussed throughout this bug report.

@Adrian Cox

I'm sorry but I was in the middle of work so I just quickly restored a backup without really looking at what happened.

Some MYD files were truncated to 0 but I didn't take the time to investigate the cause. It was a standard Jaunty MySQL 5.0.x install using MyISAM tables.

Yves Glodt (yglodt) wrote :

I wonder if KDE really rewrites its config files on startup? Why write anything at all on startup? Maybe a KDE dev can comment on this...

Chris Cheney (ccheney) wrote :

Ted,

I am not sure if this was covered yet but if so I apologize in advance for beating a dead horse.

The current opinion is that if you want to make metadata (e.g. the out-of-order rename case) reliable you must use fsync... OK, that would require a huge number of changes to many programs, but it is at least theoretically doable. However, in doing so aren't you rendering laptop mode essentially useless? As far as I know (and mjg59 confirmed this is the case) you can't delay syncs. So from now on, any time an application wants its data to be reliable, if only for the metadata cases, it will cause the drive to spin up, and battery life goes down the tubes. Yes, SSDs are much more power efficient and are the wave of the future, but from where I am it looks like they will continue to be the wave of the future, what with even the Intel SSDs being rated for only 20GB/day of data transfer.

Also, you mentioned before that attempting to track metadata ordering for files in memory would cause entanglement, forcing a write of all data to disk. Is that just due to the way the current kernel filesystem layer works, or a problem that is not fixable? If it were fixable, then syncs in code would only be needed for critical writes, since metadata being out of order would no longer cause the serious problems it does today in cases with no sync calls. This would allow much more efficient power management of hard drives, since a laptop could potentially go for a very long time without needing to spin up the hard drive (or bring the SSD out of low power mode).

You also mentioned that in the case of saving to a remote filesystem, sync is actually needed to determine whether it is even possible to save to it. With the way the desktop worked in the past (mounting remote filesystems through the kernel and using POSIX APIs), it was unknown what the user would be saving to, so apps might just sync to be safe. However, I think this case may not be as big an issue anymore, on the desktop at least, due to the move to APIs such as gio, which already know whether they are saving to a remote filesystem and so can sync only in those cases, as I understand it.

I updated my "Lame test case" to include more scenarios and added a version implemented in C.

You can use it to estimate the reliability of the ext3/4 filesystems.

I will post my results (much) later --- I will be busy with sport dancing Sun-Mon.

Can anybody prepare a small QEMU or VMware image with recent patched and unpatched kernels and small ext3 and ext4 partitions for testing?

helios (martin-lichtvoll) wrote :

For the application side of things I filed a bug report for KDE:

https://bugs.kde.org/show_bug.cgi?id=187172

Tim (aardbeiplantje) wrote :

@helios

I fail to see how this could ever be a KDE bug. fsync() only *helps*; it will never make sure that things are solved permanently for good. The reason a rename() with a temp file is used is that *nothing* can give 100% durability (not even fsync). App developers want atomicity (via rename), because atomicity is something a good filesystem design and implementation can actually deliver. 100% durability is impossible: a power failure during an fsync will still result in a half-written file, and it will still hurt if the rename() has already been done at that point.

Hence, as it is impossible to solve the problem of durability with fsync and because fsync is slow, it should not even be used IMHO.

@CowBoyTim

Power failure during fsync() will result in a half-written file, but that's why the correct sequence is

1) Create new temp file
2) Write to new temp file
3) fsync() new temp file
4) rename() over old file

If there's a power failure before or during step 3, the temp file will be partially written or not at all, but you'll still have the old data intact. A power failure during step 4 is no problem due to journaling. Therefore this really does give 100% assurance of durability, unless of course the hardware fails. But "it's not perfect, therefore it's worthless" is flawed logic anyway.

Tom B. (tom-bg) wrote :

@Volodymyr

I did some experimenting with your test cases. My results so far are:

File System Method Performance (Typical, Minimum, Maximum) #Lost %Lost

ext3 1 0.43 0.42 0.50 1 1.00%
ext3 2 0.32 0.30 0.33 0 0.00%
ext3 3 0.19 0.16 0.20 0 0.00%
ext3 4 0.25 0.20 0.25 0 0.00%
ext3 5 0.25 0.20 0.25 0 0.00%
ext3 6 0.44 0.33 0.46 0 0.00%
ext4 1 0.45 0.44 0.50 100 100.00%
ext4 2 0.33 0.33 0.33 100 100.00%
ext4 3 0.20 0.20 0.21 0 0.00%
ext4 4 0.25 0.25 0.33 0 0.00%
ext4 5 0.25 0.25 0.26 0 0.00%
ext4 6 0.44 0.33 0.41 0 0.00%

Ext4 zero-lengths all the files with test cases 1 and 2. I am working on downloading a revised kernel with the patch. This is my first time doing a recompile for a jaunty release, so this may take a while.

Tom B. (tom-bg) wrote :

@CowBoyTim

I agree with you. I work with real-time industrial systems, where the shop floor systems are considered unreliable. We have all the same issues as a regular desktop user, except our users have bigger hammers. The attraction of ext3 was the journalling with the ordered data mode. If power was cut, it was possible to reassemble things to a recent point in time, with only the most recent data lost. This bug in ext4 results in zero-length files, and not only in the most recent files either.

All fsync() does is bypass one layer of write-back caching. This just makes the window of data loss smaller, in the specific case of infrequent fsync() calls. By itself, fsync() does nothing to guarantee data integrity. I think this is why Bogdan was complaining about defective MySQL databases. Given the benchmarks, it is likely that the file system zero-lengthed the entire database file. Specifically, fsync() guarantees the data is on the disk; it doesn't guarantee the file system knows where the file is. As such, one could call fsync() and still not be able to get at the data after a reboot.

The arguments against telling every application developer to use fsync() are:
1. Under heavy file I/O, fsync() could potentially decrease your average I/O speed by defeating the write-back caching. This could make the window of data loss larger, especially with a real-time system where the incoming data rate is fixed.
2. Repeated calls to fsync() would be very rough on laptop mode and on SSDs (Solid State Disks).
3. Repeated calls to fsync() will limit maximum file system performance for desktop applications. Eventually, the file system developers will replace fsync() with an empty function, just like Apple did.
4. If everyone will want fsync(), why don't we just modify the close() function to call fsync()?
5. There is a strong correlation between user activity and system crashes. Not using fsync() leads to much more understandable system behavior.

Imagine a typical self-inflicted system crash. This can be caused either directly: "Press Save then turn off the Computer," or indirectly: "edit video game config, hit play, and then watch the video driver crash."

If the write-back cache is enabled, and fsync() is not used, the program will write data to the cache, cause a bunch of disk reads, and then during idle time, the data will be written to disk. If the user generated activity results in disk reads, then the write-back cache will "protect" the old version of the file. The user will learn that crashing the machine results in him losing his most recent changes.

On the other hand, if fsync() is used to disable the write back cache, then programmers will start calling fsync() and close() from background threads. This will result in a poor user experience, as the hard disk will be thrashing during program startup (when all the disk reads are happening), and anything could happen when the system crashes during the fsync().

In the case that system crashes correlate to user activity, it is really tempting from a software point of view, to try to get the fsync() to happen before the system crash occurs. Unfortunately, in practice t...


Tim (aardbeiplantje) wrote :

@Tom

I might not be good at making my point sometimes, but you clearly sum things up very well. Way better than I do.

@Aryeh

In Ext3, too many applications use fsync; I think that comes from the ext2 days, when not syncing could lead to corrupt filesystems, not just empty files. Same with ufs on Solaris - what a PITA that can sometimes be. FAT on USB sticks is even always synced for that reason.

Even gvim does it, firefox too (actually sqlite3). Sometimes I get so frustrated by the performance problems when copying a large file and not being able to surf that I kill -9 firefox (it hangs forever, since fsync is a blocking call). But with open-temp+fsync+rename, the rename then never happens, as the kill -9 is handled right after the fsync(), hence no new file. If the whole thing had taken a mere 1s to complete, nobody would kill -9; you'd be too late, the new file would already be there.

Under ext4, things will improve as Theodore pointed out. However, fsync means real I/O, and hard disks are just painfully slow; stupid applications fsync-ing too much can and will hurt a machine's performance while not solving the problem of durability. That problem is best solved by atomicity with a rename - given that the ordering stays correct.

Atomicity is a simple, performance-friendly alternative to fsync() for me on a journaled filesystem.

Kai Krakow (hurikhan77) wrote :

To conclude everything:

Probably no distribution should ship with Ext4 as default, because Ext3's behaviour was broken and everyone relies on it. And it should not ship with Ext4 as default for the same reasons people are warned away from XFS: potential data loss on crashes.

So if this means never-ever install XFS on a laptop, we should now also say never-ever install Ext4 on a laptop. So what future will Ext4 have? Of course a bright one, because everyone will use it anyway: ext* has always been the default and always will be. But in my eyes Ext4 should share the same fate as XFS, ReiserFS and the like: don't use it, because it's bad in one way or another.

These may be hard words, but they are the logical consequence. KDE actually has a fix for this (KDE_EXTRA_FSYNC=1) but it's not on by default, because of Ext3's broken behaviour and the bad performance impact of doing that. So Theodore is right: applications need fixing (putting aside the commit-ordering bug, which I still consider a bug and which will be fixed in .30? Not sure). But how to get out of this dilemma? One cannot use Ext4 before the applications are fixed, but applications won't be fixed until Ext4 is rolled out and many people actually experience this data loss. In the meantime, one way or the other, newbies won't be impressed: either the software will be slow or it will lose data - either is bad.

Actually, Ext3's bad performance was why I switched to ReiserFS. But ReiserFS has bad performance on SMP systems, so I switched to XFS. And now there will be a second player with the same issues: Ext4. Hopefully a way out of this dilemma will be found quickly.

KDE has a framework for reading and writing application settings, so the solution should be simple: switch on the fsync call in the same Ubuntu release where ext4 becomes the default file system. Does anybody know the situation in the GNOME environment? Does a similar switch exist?
Of course other core apps must be fixed too. Firefox and OpenOffice are using fsync, as far as I know, so no problem there. Fixing dpkg is a must; it rewrites files in huge numbers and is essential to have working correctly.
I think moving to ext4 for Karmic would be a good idea. Ubuntu 10.04 will be an LTS, so a bad target for switching to a new file system. If Karmic were ext4 by default, it would contain at least the .30 kernel, which has patches to improve the situation, and the core applications could be patched.

Guys, see comment 45 and comment 154. A workaround is going to be committed to 2.6.30 and has already been committed to Jaunty. The bug is fixed. There will be no data loss in these applications when using ext4, it will automatically fsync() in these cases (truncate then recreate, create new and rename over old).

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.28-10.32

---------------
linux (2.6.28-10.32) jaunty; urgency=low

  [ Amit Kucheria ]

  * Delete prepare-ppa-source script

  [ Andy Isaacson ]

  * SAUCE: FSAM7400: select CHECK_SIGNATURE
  * SAUCE: LIRC_PVR150: depends on VIDEO_IVTV
    - LP: #341477

  [ Ayaz Abdulla ]

  * SAUCE: forcedeth: msi interrupt fix
    - LP: #288281

  [ Brad Figg ]

  * Updating armel configs to remove PREEMPT

  [ Catalin Marinas ]

  * Fix the VFP handling on the Feroceon CPU

  [ Huaxu Wan ]

  * SAUCE: (drop after 2.6.28) [Jaunty] iwlagn: fix iwlagn DMA mapping
    direction

  [ Ike Panhc ]

  * squashfs: correct misspelling
    - LP: #322306

  [ Theodore Ts'o ]

  * SAUCE: (drop after 2.6.28) ext4: add EXT4_IOC_ALLOC_DA_BLKS ioctl
  * SAUCE: (drop after 2.6.28) ext4: Automatically allocate delay allocated
    blocks on close
  * SAUCE: (drop after 2.6.28) ext4: Automatically allocate delay allocated
    blocks on rename
    - LP: #317781

  [ Tyler Hicks ]

  * SAUCE: (drop after 2.6.28) eCryptfs: Don't encrypt file key with
    filename key
    - LP: #342128

  [ Upstream Kernel Changes ]

  * ALS: hda - Add support of iMac 24 Aluminium
  * USB: fix broken OTG makefile reference
  * ALSA: hda - add another MacBook Pro 3,1 SSID
  * ALSA: hda - Add model entry for HP dv4
  * x86-64: fix int $0x80 -ENOSYS return
    - LP: #339743

 -- Tim Gardner <email address hidden> Thu, 12 Mar 2009 19:16:07 -0600

Changed in linux:
status: Fix Committed → Fix Released
Tom B. (tom-bg) wrote :

@Volodymyr

I finished recompiling the kernel with Theodore Ts'o patches, and reran Volodymyr's test cases with the patched kernel. The results are:

File System Method Performance (Typical, Minimum, Maximum) #Lost %Lost

ext4patch 1 0.44 0.41 0.50 1 1.00%
ext4patch 2 0.32 0.32 0.40 0 0.00%
ext4patch 3 0.20 0.18 0.20 0 0.00%
ext4patch 4 0.25 0.25 0.25 0 0.00%
ext4patch 5 0.26 0.26 0.33 0 0.00%
ext4patch 6 0.41 0.33 0.42 0 0.00%

Essentially, the patches work. Ext4, with the patch, has the same data loss as the ext3 file system.

Adding fsync() to the code results in a significant decrease in loop speed. As such, application writers should only add fsync() when they want to be really sure the data has been written to disk, as when writing a database.

Theodore Ts'o (tytso) wrote :

There have been a lot of the same arguments (and more than few misconceptions) that have been made over and over again in this thread, so I've created a rather long blog post, "Don't fear the fsync!", which will hopefully answer a number of the questions people have raised:

      http://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/

It will probably be more efficient to attach comments to that post than to add more comments to this (already extremely long) Launchpad bug, since many of the comments are discussing philosophical issues, and as some folks have already pointed out, patches to deal with forcing out blocks on rename (just like ext3 does) have already been backported into the Ubuntu kernel, and will be in 2.6.30. I understand that btrfs will also be adding a flush-on-rename patch in 2.6.30. (Note that XFS at this point only has flush-when-closing-a-previously-truncated-file, which, yes, I implemented after Eric Sandeen from Red Hat pointed out that XFS had done it. However, XFS does not, as of this date, have flush-blocks-on-rename behaviour --- this was my innovation, to reward applications that were at least making an attempt to do things right --- although given the notoriety of this discussion, I wouldn't be surprised if Eric or some other XFS developer adds this to XFS for 2.6.30.)

However, if you are an application writer, please, please, PLEASE read my blog post "Don't fear the fsync!".

Wow, this thing sure is being actively discussed. I might as well weigh in:

- I side with those that say "lots of small files are a good idea" -- the benefits of having many small human-readable config files are self-evident to anyone who's ever had to deal with the windows registry. Suggesting that linux move to a similar approach is... well, I just can't wrap my head around it. It's crazy. If we take a performance hit on the filesystem to maintain our small file configs, fine, because if we try to do something like the windows registry we'll surely suffer the same (or worse) performance hit later on when it starts to become cluttered, fragmented, and corrupted.

- I'm guessing that most of the cases described by users here as their system "locking up" and needing a hard reset were not actually total lockups. I used to have similar problems with my previous nvidia card -- but really it was only the X server that was locking up (and taking the keyboard with it). Some have already alluded to the solution -- the alt+sysrq keyboard combinations. Pressing alt+sysrq+r during an X lockup will give you keyboard control back*, and then you can use ctrl+alt+f1 to switch to a text console. From there, you can log in and give X a swift kick in the arse without having to resort to the power button, and you won't encounter this filesystem problem. (You can also force a full sync to disk with alt+sysrq+s, as I think someone else mentioned.)

*: I don't know why, but in a lot of distros the alt+sysrq stuff is disabled by default. Google it to figure out how to turn it on, and what all the other shortcuts are.

Well, at least it seems that in tweaking the kernel to make ext4 behave nicely, something has been broken on the JFS side...

My box uses JFS for the root FS, and all worked OK until at least two days ago (2.6.28-9, IIRC).
With both 2.6.28-10 and 2.6.28-11, all sorts of filesystem corruption started to pop up, forcing me into a clean reinstall of jaunty (I had been upgrading since hardy). And yes, smartctl says no errors on the disk, AND formatting with a badblocks check drags up nothing.
No problem, as I have full backups, but it is annoying that to make something new work, developers just break something (yes, old) which had been working correctly for years.

As all is lost, only my memory can be of use; the messages were something like:

d_ino != inode number

and the file gets wiped (at least with jfs, with its full name!). AND we are not speaking about Gnome configuration files, but a whole system update gone with the wind. And then apt and dpkg stop working...

Geeez....

Graziano

Harik (ubunto-dan) wrote :

Theodore, you're a bright guy, but this really means that you can't use EXT4 for anything at all.

fsync() is slow on many(most?) filesystems, and it grinds the entire system to a halt. What you're saying is that those applications have to know the details of the filesystem implementation, and only call fsync on ext4 where it's required and presumably isn't still just an alias for sync().

In your set of examples, #1 (ftruncate/write) is just broken and I can live with those applications dying. If there's a sane way to protect them from their own idiocy, that's fine. I'd prefer they remain as a guaranteed data loss so their bugs get fixed. #2 (write/rename) should be preserved. The application writer is merely saying 'One or the other of these should survive'. That's lots of little things - AIM buddy lists, desktop color, mp3 playlist position, firefox history - if every single application that I have is required to fsync() before rename or face guaranteed data loss of BOTH copies, that's a massive performance hit.

From a purely practical standpoint, you're not going to reverse about a decade's worth of advice - I can't count the number of times I've been told or seen people say "Write a temp file, rename over the other. If the system crashes, at worst you'll be left with the old copy.". And if that's a good enough guarantee, then non-critical applications should make use of it. I do NOT want everything constantly thrashing my disk for every tiny update. I can live with my browsing history losing my most recent entry, or my playlist going back a few songs. I can't live with every desktop application that's done anything in the two minutes prior to a crash having to be configured from scratch.

The fact that the standard says 'undefined' doesn't mean it's OK to force every application to use a higher-level guarantee than it actually needs. If a crash happens and the rename doesn't go through, that's good enough for 99% of what people do.

It sure beats essentially mounting your filesystem sync...
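The write-a-temp-file-then-rename idiom being debated above can be sketched as a small Python helper (hypothetical code, not from any application in this thread; the `durable` flag controls the contested fsync() call):

```python
import os
import tempfile

def atomic_write(path, data, durable=False):
    """Replace the file at `path` with `data` via write-then-rename.

    On POSIX, rename() within a single filesystem atomically replaces the
    target, so after a crash a reader sees either the old contents or the
    new contents. durable=True adds the fsync() this thread argues about.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix='.tmp-')
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
            if durable:
                f.flush()
                os.fsync(f.fileno())  # force the data to disk before the rename
        os.rename(tmp, path)  # atomic replace of the target
    except BaseException:
        os.unlink(tmp)  # don't leave the temp file behind on failure
        raise
```

Without `durable=True` this is exactly the "old contents or new contents, never zero length" expectation the thread describes; whether that expectation holds without the fsync() is the point in dispute.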

Harik (ubunto-dan) wrote :

Shit, not used to launchpad, I didn't see all the comments it hid by default, I read the entire first page and didn't see my comment was answered. Ignore what I wrote, it's been covered already.

Jon Spencer (jonfspencer) wrote :

In this post, Ts'o writes: "Since there is no location on disk, there is no place to write the data on a commit; but it also means that there is no security problem." Well, this means that the specific security problem identified - exposure of information to those who are not authorized to see it, or more importantly, introduction of a covert storage channel - has been eliminated. However, the lack of a guarantee of the order of writing data can introduce other security issues, such as an incomplete audit trail or inconsistent data (both of which can be exploited). Following the POSIX recommendations will close these security holes for trusted applications.

nicobrainless (nicoseb) wrote :

I am experiencing the exact same problem... I was using an ext3 partition converted to ext4 and ended up reinstalling everything, as data loss killed all of /etc...
My fresh Jaunty on a fresh ext4 has already given me 'read-only file system' twice, and now I don't know what to do...

   Olli Salonen wrote on 2009-03-07: (permalink)

   I'm experiencing something similar, but with a twist.

   After a few hours to two days of use, my ext4 /home partition suddenly becomes read-only. I usually close my session, unmount home, run fsck on the partition, and remount. Occasionally it leads to data loss, depending on what I was doing.

   I'm currently on 2.6.28-6 because the last upgrade attempt led to an unbootable system (also running ext4 on the root partition), so I don't know if this is a fixed issue.

Sorry if someone gave a solution already - this bug is getting too long and I didn't find it... I am not new to Linux, and I am used to systems I've broken myself, but here I don't get what is going on and I don't want to start over without an idea of how to repair it...
please help!

Thanks


nicobrainless (nicoseb) wrote :

BTW I am running on a fully updated jaunty with kernel 2.6.28-11..and it started about 5 days ago

Carey Underwood (cwillu) wrote :

@nicobrainless: Sounds like a hardware failure to me. I'd suggest
investigating the smartctl utility (in the package 'smartmontools') to
check on the general health of the drive.

Note that this isn't a troubleshooting forum, nor is 'too many
comments' really a good excuse for not reading them.

nicobrainless (nicoseb) wrote :

@Carey

Sorry I did read a fair amount of the comments but realized that my problem was slightly different...

I already investigated the hardware side, and all kinds of tests (short, long, and I don't remember the third word) returned with no errors! I also reinstalled everything again on ext3... the same random crashes, but not nearly as often... In my case it looks like it was a combination of that ext4 crash and a weird other one...

I'll keep searching... that's kind of killing me...!

Thanks anyway

The problem seems to be 2.6.28-11. My system is stable with 2.6.28-9. I have reported bug #346691.

Yves Glodt (yglodt) wrote :

Linus made some comments about the filesystem's behaviour:

http://lkml.org/lkml/2009/3/24/415
http://lkml.org/lkml/2009/3/24/460

helios (martin-lichtvoll) wrote :

Daniel Phillips, developer of the Tux3 filesystem, wants to make sure that renames happen after the file is written, even when delayed writing of metadata is introduced to it:
http://mailman.tux3.org/pipermail/tux3/2009-March/000829.html

Jamin W. Collins (jcollins) wrote :

I know this report claims that a fix is already in Jaunty for this issue. However, I just found myself with a 0 byte configuration file after a system lockup (flashing caps lock).

$ uname -ra
Linux odin 2.6.28-11-generic #37-Ubuntu SMP Mon Mar 23 16:40:00 UTC 2009 x86_64 GNU/Linux

Theodore Ts'o (tytso) wrote :

@189: Jamin,

The fix won't protect against a freshly written new file (configuration or otherwise); it only protects against a file which is replaced via rename or truncate. But if it was a file that previously didn't exist, then you can still potentially get a zero-length file --- just as you can crash just before the file was written out.

Jamin W. Collins (jcollins) wrote :

@Theo
The file in question was a previously existing configuration file for my IM client (gajim). All IM accounts and preferences were lost. Not a huge deal, but definitely a preexisting file. The system kernel panicked (flashing caps lock) while chatting. The kernel panic is a separate issue that's been reported previously.

Rocko (rockorequin) wrote :

@Theo: I vote for what (I think) lots of people are saying: if the file system delays writing of data to improve performance, it should delay renames and truncates as well so you don't get *complete* data loss in the event of a crash... Why have a journaled file system if it allows you to lose both the new *and* the old data on a crash rather than just the new data that couldn't be written in time?

It's true that this situation won't happen if the system never crashes, and it's great that this is true of your system - but in that case, why not just use ext2?

If ext3 also allows this, I'd say there's a problem with ext3 too.

Incidentally, I just ended up with a ton of trashed object files due to a kernel panic in the middle of a build. But I wouldn't say gcc is a crappy application!

PS. Other than this bug, ext4 rocks.

Theodore Ts'o (tytso) wrote :

@Rocko,

If you really want this, you can disable delayed allocation via the mount option, "nodelalloc". You will take a performance hit and your files will be more fragmented. But if you have applications which don't call fsync(), and you have an unstable system, then you can use the mount option. All I can say is that I don't see these data loss problems, but everyone has different usage patterns.
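The nodelalloc option Ts'o mentions is applied at mount time. A sketch of what the corresponding /etc/fstab entry could look like (the UUID here is just the one from this bug's description, standing in for your own root filesystem):

```
# <file system>                            <mount point>  <type>  <options>            <dump>  <pass>
UUID=81942248-db70-46ef-97df-836006aad399  /              ext4    defaults,nodelalloc  0       1
```

A reboot (or a remount of the filesystem) is then needed for the option to take effect.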

In terms of trashed object files in the middle of the build, those object files are non-precious files. How often do you crash in the middle of a build? Should you slow down all builds just to handle the rare case where your system crashes in the middle of the build? Or would it be better to run "make clean", and rebuild the tree in the case where you have trashed object files? It's not like a kernel rebuild takes that long. OTOH, if your system is crashing all the time, there's something else seriously wrong; Linux systems shouldn't be that unstable.

Theodore Ts'o (tytso) wrote :

@Jamin,

We'd have to see how gajim is rewriting the application file. If it is doing open/truncate/write/close, there will always be the chance that the file will be lost if you crash right after the truncate. This is true with both ext3 and ext4. With the workaround, the chances of losing the file on ext4 when the application does the fundamentally broken replace-via-truncate are the same as with ext3. We can't do better than that.

Jamin W. Collins (jcollins) wrote :

@Theo,

I've been digging through the source to track down how it does it, and managed to find it. It does use a central, consistent method, which does use a tempfile. However, it does not (as of yet) force a sync. I'm working on getting that added to the code now. Here's the Python routine it uses:

self.__filename: the full path to the user's configuration file.
self.__tempfile: the same path and filename but with a dot prefix

 def write(self):
     (base_dir, filename) = os.path.split(self.__filename)
     self.__tempfile = os.path.join(base_dir, '.' + filename)
     try:
         f = open(self.__tempfile, 'w')
     except IOError, e:
         return str(e)
     try:
         gajim.config.foreach(self.write_line, f)
     except IOError, e:
         return str(e)
     f.close()
     if os.path.exists(self.__filename):
         # win32 needs this
         try:
             os.remove(self.__filename)
         except Exception:
             pass
     try:
         os.rename(self.__tempfile, self.__filename)
     except IOError, e:
         return str(e)
     os.chmod(self.__filename, 0600)

That looks like it removes the file before it does the rename, so it misses the special overwrite-by-rename workaround. This is slightly unsafe on any filesystem, since you might be left with no config file with the correct name if the system crashes in a small window, fsync() or no. Seemingly Python on Windows doesn't support an atomic rename operation at all.

It might be simplest for it to only do the remove if rename throws an OSError, or only if the platform is Windows. Ideally it should call fsync() as well, of course.
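For illustration, a sketch of how a routine like that could avoid the unconditional remove (hypothetical code, not the actual gajim patch; `write_contents` is a stand-in for gajim.config.foreach, and the fsync() placement follows the suggestion above):

```python
import os
import sys

def write_config(filename, write_contents):
    """Atomically replace `filename` via a dot-prefixed temp file.

    write_contents is a callable that takes the open file object and
    writes the configuration into it.
    """
    base_dir, name = os.path.split(filename)
    tempfile_path = os.path.join(base_dir, '.' + name)
    with open(tempfile_path, 'w') as f:
        write_contents(f)
        f.flush()
        os.fsync(f.fileno())  # make sure the new contents precede the rename
    os.chmod(tempfile_path, 0o600)
    if sys.platform == 'win32' and os.path.exists(filename):
        # Windows os.rename() refuses to replace an existing file,
        # so only there do we fall back to remove-then-rename.
        os.remove(filename)
    os.rename(tempfile_path, filename)
```

On POSIX the remove is skipped entirely, so the rename stays atomic and the overwrite-by-rename workaround in the kernel applies.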

What that code does is stupid, yes. It shouldn't remove the original unless the platform is win32. *Windows* (except with Transactional NTFS) doesn't support an atomic rename, so it's no surprise that Python under Windows doesn't either.

You're seeing a zero-length file because Ts'o's fix for ext4 only applies to files being renamed on top of other files. The filesystem should be fixed to allocate blocks on *every* commit, not just ones overwriting existing files.

As for the program -- fsync should *not* be inserted. (Though the unconditional os.remove() should be changed.) It's a bad thing to ritually fsync every file before the rename for a host of reasons described upthread. Just fix the filesystem.

Theodore Ts'o (tytso) wrote :

@Daniel,

Note that if you don't call fsync(), and you hence you don't check the error returns from fsync(), your application won't be notified about any possible I/O errors. So that means if the new file doesn't get written out due to media errors, the rename may also end up wiping out the existing file. This can be an issue with some remote file systems, like AFS, where you'll miss quota errors unless you fsync() and check the error returns on both fsync() and close(). But hey, if you don't care about that, feel free to write your applications any way you want.

"The filesystem should be fixed to allocate blocks on *every* commit, not just ones overwriting existing files."

alloc_on_commit mode has been added. Those who want to use it (and take the large associated performance hit) can use it. It's a tradeoff that is and should be in the hands of the individual system administrator. Personally, my machine almost never crashes, so I'd prefer the extra performance.

What the application is doing in this case is broken anyway, and if it fixed that there would be no problem on ext4.

"As for the program -- fsync should *not* be inserted. (Though the unconditional os.remove() should be changed.) It's a bad thing to ritually fsync every file before the rename for a host of reasons described upthread."

fsync() should preferably be used for config file updates, assuming those are reasonably rare, "for a host of reasons described upthread". Otherwise, the user will click "Save" and then the preference change won't actually take effect if the system crashes shortly thereafter. This is true in any filesystem. On some filesystems (not just ext4: XFS certainly, maybe NFS?), you might also get some kind of other bad stuff happening. Explicit user saving of files/preferences/etc. should therefore invoke an fsync() in most cases: you want to make sure the change is committed to stable storage before giving confirmation to the user that it's saved. Text editors already do this, and no one seems to have complained.

If Gaim updates its config file very often for some reason, though, they'd have to weigh the added reliability of fsync() against the performance hit (especially on filesystems like ext3).

If you accept that it makes sense to allocate on rename commits for overwrites of *existing* files, it follows that it makes sense to commit on *all* renames. Otherwise, users can still see zero-length junk files when writing a file out for the first time. If an application writes out a file using the atomic rename technique, it should expect just as good a consistency guarantee when the file doesn't already exist as when it does. Anything else just adds extra complexity.

Before your knee jerks out "performance," consider that brand-new throwaway files aren't renamed. gcc doesn't write a file out only to rename it immediately. Only files for which atomicity matters are renamed that way -- which are precisely the files that would get the commit-on-rename treatment in other circumstances. The performance impact of committing on *all* renames would be minimal over the existing rename code.

We keep talking in circles: if you're going to make a commitment to application reliability, go all the way and commit on all renames. Anything else is just a subtle gotcha for application programs. Yes, POSIX them harder, will you?

NFS is a special case in that 1) it's widely known to have strange semantics, and 2) many applications explicitly don't support NFS for that reason. NFS semantics are *not* the ones we should be striving to emulate! Besides, the kind of inconsistency you see with NFS doesn't result in corrupt configurations in the same way the ext4 bug does.

As for AFS: it has a special place in Hell. AFS doesn't even meet basic POSIX guarantees with regard to permissions. Its mind-bendingly stupid quota behavior is just icing on the cake. It's crap as a unix filesystem, and I sure as hell wouldn't consider using it except on a specially-prepared system. I'm not going to make my application jump through hoops to support your antiquated hack. Every other filesystem checks quotas on write and close; why does yours have to be different?

"If you accept that it makes sense to allocate on rename commits for overwrites of *existing* files, it follows that it makes sense to commit on *all* renames."

Renaming a new file over an existing one carries the risk of destroying *old* data. If I create a new file and don't rename it to anything, it's possible I will lose *the new file only*, on any filesystem (unless I fsync()). This is universally considered an acceptable risk: losing up to a couple of minutes' work (but nothing earlier) in the event of a system crash. This is the exact risk carried by renaming a file to a name that doesn't exist -- unless you gratuitously delete the old file first, which is completely pointless on Unix and obviously destroys any hope of atomicity (if the system crashes/app dies/etc. between delete and rename).

"Only files for which atomicty matters are renamed that way -- which are precisely the files that would get the commit-on-rename treatment in other circumstances."

Virtually all users of this atomicity technique appear to rename over the existing file, which is why almost all problems disappeared when users applied Ted's patches. Gaim only did otherwise as a flawed attempt to work around a quirk of the Windows API, in a way that wasn't atomic anyway, and that can be expected to be fixed in Gaim.

The risk isn't data loss; if you forgo fsync, you accept the risk of some data loss. The issue that started this whole debate is consistency.

The risk here is of the system ending up in an invalid state with zero-length files *THAT NEVER APPEARED ON THE RUNNING SYSTEM* suddenly cropping up. A zero-length file in a spot that is supposed to be occupied by a valid configuration file can cause problems --- an absent file might indicate default values, but an empty file might mean something completely different, like a syntax error or (famously) "prevent all users from logging into this system."

What applications *really* do is create a temporary file, write data to it, and rename that temporary file to its final name, regardless of whether the original exists. If the filesystem doesn't guarantee consistency for a rename to a non-existent file, the application's expectations will be violated in unusual cases, causing hard-to-discover bugs.

Why should an application that atomically updates a file have to check whether the original exists to get data consistency?

Allocate blocks before *every* rename. It's a small change from the existing patch. The performance downsides are minimal, and making this change gives applications the consistency guarantees they expect.

Again: if you accept that you can give applications a consistency guarantee when using rename to update the contents of a file, it doesn't make sense to penalize them the first time that file is updated (i.e., when it's created.) Unless, of course, you just want to punish users and application developers for not gratuitously calling fsync.

Chow Loong Jin (hyperair) wrote :

On Fri, 2009-03-27 at 22:55 +0000, Daniel Colascione wrote:
> The risk isn't data loss; if you forgo fsync, you accept the risk of
> some data loss. The issue that started this whole debate is consistency.
>
> The risk here is of the system ending up in an invalid state with zero-
> length files *THAT NEVER APPEARED ON THE RUNNING SYSTEM* suddenly
> cropping up. A zero-length file in a spot that is supposed to be
> occupied by a valid configuration file can cause problems --- an absent
> file might indicate default values, but an empty file might mean
> something completely different, like a syntax error or (famously)
> "prevent all users from logging into this system."
A syntax error usually prevents the whole program from running, I should
think. And I'm not sure about the whole "prevent all users from logging
into this system" bit. I've never even heard of it, so I don't know how
you can consider it famous.

> When applications *really* do is create a temporary file, write data to
> it, and rename that temporary file to its final name regardless of
> whether the original exists. If the filesystem doesn't guarantee
> consistency for a rename to a non-existing file, the application's
> expectations will be violated in unusual cases causing hard-to-discover
> bugs.
It is guaranteed. When you *rename onto an existing file*. If you delete
the original *before* renaming, then I see it as "you have agreed to
forgo your atomicity".
>
> Why should an application that atomically updates a file have to check
> whether the original exists to get data consistency?
Um, no, I don't think it needs to. See this:
Case 1: File already exists.
1. Application writes to file.tmp
2. Application closes file.tmp
3. Application renames file.tmp to file.
** If a crash happens, you either get the original, or the new.

Case 2: File doesn't already exist.
1-3 as above.
** If a crash happens, you either get the new file, or a zero-length
file.

Considering that in case 2 there wasn't a file to begin with, I don't
think it's much of an issue in getting a zero-length file. Unless your
program crashes when you get zero-length configuration files, in which
case I think your program sucks and you suck for writing it with that
assumption.

>
> Allocate blocks before *every* rename. It's a small change from the
> existing patch. The performance downsides are minimal, and making this
> change gives applications the consistency guarantees they expect.
I wholeheartedly agree with "Allocate blocks before renames over
existing files", but "Allocate blocks before *every* rename" is
overdoing it a little.
>
> Again: if you accept that you can give applications a consistency
> guarantee when using rename to update the contents of a file, it doesn't
> make sense to penalize them the first time that file is updated (i.e.,
> when it's created.) Unless, of course, you just want to punish users and
> application developers for not gratuitously calling fsync.
Again, I don't see exactly how an application is being penalized the
first time the file is updated.

--
Chow Loong Jin

First of all, the program under discussion got it wrong. It shouldn't have unlinked the destination filename. But the scenario it unwittingly created is *identical* to the first-time creation of a filename via a rename, and that's a very important case. EVERY program will encounter it the first time it creates a file via an atomic rename. If the system dies at the wrong time, the program will see a zero-length file in place of the one it just wrote.

This is your scenario two. This is *NOT* about data loss. If the program cared about data loss, it'd use fsync(), dammit. This is about consistent state.

The program didn't put that zero-length file there. Why should it be expected to handle it? It's perfectly reasonable to barf on a zero-length file. What if it's XML and needs a root element? What if it's a database that needs metadata? It's unreasonable to expect every program and library to be modified to not barf on empty files *it didn't write* just like it's unreasonable to modify every program to fsync gratuitously. Again -- from the point of view of the program on a running system, there was at *NO TIME* a zero-length file. Why should these programs have to deal with them mysteriously appearing after a crash?

Okay, and now what about XFS? XFS fills files with NULL instead of truncating them down to zero length (technically, it just makes the whole file sparse, but that's beside the point.) Do programs need to specially handle the full-of-NULLs case too? How many hoops will they have to go through just to pacify sadistic filesystems?

A commit after every rename has a whole host of advantages. It rounds out and completes the partial guarantee provided by a commit after an overwriting rename. It completely prevents the appearance of a garbage file regardless of whether a program is writing the destination for the first or the nth time. It prevents anyone from having to worry about garbage files at all.

It's far better to fix a program completely than to get it right 99% of the time and leave a sharp edge hiding in some dark corner. Just fix rename.

And what's the downside anyway? High-throughput applications don't rename brand-new files after they've just created them anyway.

As for no users being able to log in -- I was referring to an old BSD network daemon. But for a more modern example, how about cron.deny? If cron.deny does not exist, only root can use cron. If cron.deny exists *AND IS EMPTY*, all users can use cron.

Rocko (rockorequin) wrote :

I agree with Daniel - consistency should be a primary objective of any journaling file system.

Would it be possible to do something like store both the old and new inodes when a rename occurs, and to remove the old inode when the data is written? This way it could operate like it is currently, except that after a system crash it would be able to see that the new inode is invalid and restore the old one instead.

Jamin W. Collins (jcollins) wrote :

@Theo
Sorry for the false alarm. Filed it as soon as I found the 0 byte file while still investigating the source. I've since created and submitted a patch (via launchpad, https://bugs.launchpad.net/ubuntu/+source/gajim/+bug/349661) that I believe should correct gajim's behavior in this area.

Rocko (rockorequin) wrote :

@Theo: would it be hard to implement something like I suggested, ie storing rename backup metadata for crash recovery? I think in the discussion on your blog someone says that reiserfs already does this via 'save links' (comment 120).

Alternatively, if there were a barrier on all renames instead of just ones that overwrite an existing file, would that stop the zero-length-files issue, i.e. make the file system consistent in the event of a crash? I imagine this would only impact performance for applications that create very large files and then rename them before the data is written to disk, which seems a very unusual case.

André Barmasse (barmassus) wrote :

Hello everyone

Just reporting some observations after making a brand-new installation of Ubuntu 9.04 with ext4 as the default file system on my Sony Vaio VGN-FS195VP. Since the installation some days ago I have again had four hard locks, but luckily - despite my experiences some weeks ago - without any data loss. All of them happened with the standard installation of Ubuntu on the Gnome desktop.

One hard lock happened when listening to internet radio with quodlibet in the background and trying to update Ubuntu via Synaptic. Another one when trying to rip a DVD with wine and dvdshrink in the background while opening other applications (Firefox, Bluefish, gFTP) almost at the same time. The other two happened when trying to remove some ISO files of DVDs (together maybe about 12 GB of data) from the trash. The trash icon on the desktop turned empty (actually a good sign), but about five seconds later the entire system crashed.

The kernel running on my system is 2.6.28-11-generic. Since I am not a very technical guy, I have not applied any of the above-mentioned remedies. But as I am very happy with ext4 as my default file system (and have not yet experienced data loss!) I will keep it, hoping that there will be some fixes in the next kernel.

Thanks for all your explanations about ext4, Theodore Ts'o, and keep up the good work!

Rocko (rockorequin) wrote :

@André: you might be experiencing one or two different bugs that are possibly related to ext4 in the Jaunty kernel - see https://bugs.launchpad.net/ubuntu/+source/linux/+bug/348731 and https://bugs.launchpad.net/ubuntu/+source/linux/+bug/330824. The latter happens when you try and delete lots of files from an ext4 partition.

To try to avoid the hard lockups, I've installed the 2.6.30-rc3 kernel from the weekly Ubuntu kernel builds, since it has the patches from this bug applied (to stop truncated files on a crash) and the file deletion bug is fixed. So far so good.

André Barmasse (barmassus) wrote :

Hi

Thanks for your answers, Rocko. Today I installed the Karmic Koala Alpha 1 with kernel 2.6.30-5-generic, and it seems that all the former problems with ext4 are gone. For testing purposes I created 5 big DVD iso files (together about 30 GB of data), moved them around the system, copied and deleted them three or four times, and - as a final barrier - emptied the trash with, by then, around 120 GB of data in it. Everything went smoothly without the system tottering for even one second!! Great work, guys!!

Rocko (rockorequin) wrote :

No worries, André! Some more feedback on 2.6.30: I've been using 2.6.30-rc3 then 2.6.30-rc5 without problems in Jaunty for several weeks now (I used to get kernel panics at least twice a week with 2.6.28) and am now trying 2.6.30-rc6. Still so far so good.

I just subscribed to this bug as I started seeing this behaviour with 2.6.30-rc6 on my Aspire One. First it was the 0-length files after a crash (the latest intel drivers still hang sometimes at suspend/resume or at logout/shutdown, and only the magic REISUB gets you out of it), and once I saw my /home mounted read-only because of an ext4 error (unfortunately I didn't save dmesg) and after the fsck again had 0-byte files (mostly in my firefox profile, as I was web browsing at the time). Next time I hit this bug I'll post the dmesg here.
Some possibly relevant points:
I formatted both my / and my /home partitions clean with mkfs.ext4.
I have / on the internal ssd, and /home on a sdhc 8GB card.
I have 1.5GB RAM and no swap (to save some wear and tear on the flash memory).

This time I didn't even have a crash, but on reboot my kdewallet.kwl file was empty. I removed it, and in syslog I got the following:
"EXT4-fs warning (device mmcblk0p1): ext4_unlink: Deleting nonexistent file (274), 0"

After another reboot some more problems with kwallet, here is dmesg.

And after another clean shutdown and reboot, I finally had to reformat my home partition and restore it from a backup, as fsck gave a huge number of errors and unlinked inodes. I've gone back to ext3 and will wait for 2.6.30 final before new tests. Here is the final dmesg up to just after the fsck. As with the previous one, I just removed my AP MAC address for privacy reasons.

Theodore Ts'o (tytso) wrote :

Jose, please open a separate bug, as this is an entirely different problem. (I really hate Ubuntu bugs that have a generic description, because it seems to generate "Ubuntu Launchpad Syndrome" --- a problem which seems to cause users to search for bugs, see something that looks vaguely similar, and then they add a "me too" to a bug, instead of opening a new bug.)

Launchpad simply doesn't scale, and it's painful as all heck to deal with a bug with 200 comments. And this is entirely unrelated to the problem that people were dealing with --- and which has been solved in the Ubuntu kernels and in 2.6.29.

The errors you are reporting are entirely consistent with this which I found earlier in your dmesg:

[ 7.531490] EXT4-fs warning: mounting fs with errors, running e2fsck is recommended

I'm guessing you didn't set up your /etc/fstab correctly so that the filesystem on your /dev/mmcblk0p1 (i.e., your SD card) would have e2fsck run on reboot when it had errors. That would certainly be consistent with the dmesg log which you showed.

As for what caused the problem, I'm not entirely sure. One thing is for sure, though --- you need to fix up your filesystem first, and I would recommend that you check your /etc/fstab to make sure that the filesystem is checked at boot-up if it needs it. That means the fsck pass field of /etc/fstab needs to be non-zero. In any case, please open a new bug, and put a pointer to the new bug in this launchpad bug so people who are interested can follow you there. Thanks!
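The fsck pass field is the sixth column of an /etc/fstab line. For example, a /home entry that fsck will check at boot (pass 2; the root filesystem conventionally gets pass 1) might look like:

```
LABEL=Home  /home  ext4  relatime  0  2
```

A pass value of 0 tells fsck to skip the filesystem entirely, which is how errors can survive a reboot unchecked.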

Ok, I'll try installing 2.6.30 final for Ubuntu and report a new bug. As for the fsck, that was the only time I didn't boot into single-user mode and run fsck by hand. My fstab entry is simple - "LABEL=Home /home ext4 relatime,defaults 0 0" - and most of the errors I had were data corruption after a crash/hang, like most reporters here, which is why I reported it here. I've now changed the 6th field to 2 to make sure it is checked at boot if needed.
Anyway, as I said I reformatted my sd card as ext3 and won't try ext4 on it until 2.6.30 final, so until then I'll keep quiet.

corneliupa (corneliupa) wrote :

Would it be possible to create sync policies (per distribution, per user, per application) and in this way ensure a flexible compromise every user could choose or change?

Ted Ts'o:

"You can opine all you want, but the problem is that POSIX does not specify anything ..."

I'll opine that POSIX needs to be updated.

The use of the create-new-file-write-rename design pattern is pervasive, and it is expected that after a crash either the new contents or the old contents of the file will be found there; zero length is unacceptable. This is the behavior that we saw with ext2, where the metadata and data writes could get re-ordered and result in zero-length files. With the 800 servers that I was maintaining then, it meant that the perl scripts for our account management software would zero-length out /etc/passwd, along with other corruption, often enough that we were rebuilding servers every week or two.

As the site grew and roles and responsibilities grew, that meant that with 30,000 linux boxes, even with 1,000-day uptimes, there were 30 server crashes per day (even without crappy graphics drivers, a linux server busy doing apache and a bunch of mixed network/cpu/disk-io seems to have about this average uptime -- i'm not unhappy with this, but at large numbers of servers, the server crashes catch up with you). And while I've never seen this result in data loss, it does result in churn in rebuilding and reimaging servers. It could also cause issues where a server is placed back into rotation looking like it is working (nothing so obvious as /etc/passwd corrupted), but is still failing on something critical after a reboot. You can jump through intellectual hoops about how servers shouldn't be put back into rotation without validation, but even at the small site that I'm at now, with 2,000 servers and about 300 different kinds of servers, we don't have good validation, don't have the resources to build it, and rely on servers being able to be put back into rotation after they reboot without worrying about subtle corruption issues.

There is now an expectation that filesystems have transactional behavior. Deal with it. If it isn't explicitly part of POSIX, then POSIX needs to be updated to reflect the actual realities of how people are using Unix-like systems these days -- POSIX was not handed down from God to Linus on the Mount. It can and should be amended. And this should not damage the performance benefits of doing delayed writes. Just because you have to be consistent doesn't mean that you have to start doing fsync()s for me all the time. If I don't explicitly call fsync()/fdatasync() you can hold the writes in memory for 30 minutes and abusively punish me for not doing so myself. But just delay *both* the data and metadata writes so that I either get the full "transaction" or I don't. And stop whining about how people don't know how to use your precious filesystem.
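The create-write-rename pattern this comment relies on can be sketched in a few lines of Python. This is my own minimal sketch, not code from this bug report; `atomic_write` is a name I made up. The fsync before the rename is exactly the step the delayed-allocation debate is about: without it, the rename can reach disk before the data does.

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace `path` with `data` so that after a crash the file holds
    either the old contents or the new contents, never zero bytes."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # force the data to disk first...
        os.rename(tmp, path)       # ...then atomically commit the rename
    except BaseException:
        os.unlink(tmp)
        raise
```

Creating the temporary file in the same directory matters: rename() is only atomic within one filesystem.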

Steffen Neumann (sneumann) wrote :

Hi,

I am also bitten by the above ecryptfs messages slowly filling my /var/log and
have a followup question to the cleanup workaround presented by Dustin in comment #57
of this bug:

Is there any way to determine (=decrypt) which files have been messed up,
so I know if there is anything important, which I have to grab from the backup
before that expires and gets overwritten ? In other words:

 $ umount.ecryptfs_private
 $ cd $HOME/.Private
 $ mount.ecryptfs_private
 $ find . -size 0c | xargs ecryptfs-decrypt-filename {}
                                      ^^^^^^^^^^^^^^^^^^^^^^

Yours,
Steffen

Steffen Neumann (sneumann) wrote :

I have added a separate bug for the problem of (de-)crypting filenames,
see https://bugs.launchpad.net/ecryptfs/+bug/493779

Yours,
Steffen

Steffen Neumann (sneumann) wrote :

Hi,

I found a workaround to the problem of determining the cleartext filenames.
*Before* you delete the zero-byte files, back 'em up:

1) find .Private -size 0b | xargs tar -czvf zerofiles.tgz
2) Unmount your encrypted home
3a) Temporarily move the "good" files away:
       mv .Private .Private-real
3b) and restore the "broken" ones:
     tar xzvf zerofiles.tgz
4) remount your encrypted home
    The files will not be usable, but at least you know their names

5a) Unmount your unusable encrypted home
5b) restore the "good" encrypted files:
       mv .Private .Private-broken
       mv .Private-real .Private

6) Remount and continue.
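Step 1 above can also be done without the find/xargs pipeline (which breaks on filenames containing whitespace). A Python sketch of the same backup step, assuming you run it from the directory containing .Private; `backup_zero_byte_files` is my own name:

```python
import os
import tarfile

def backup_zero_byte_files(private_dir, archive="zerofiles.tgz"):
    """Archive every zero-byte file under private_dir, preserving
    relative paths so a plain `tar xzvf` restores them in place."""
    with tarfile.open(archive, "w:gz") as tar:
        for root, _dirs, files in os.walk(private_dir):
            for name in files:
                path = os.path.join(root, name)
                if os.path.getsize(path) == 0:
                    tar.add(path)
    return archive
```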

My last problem: I *still* have 5 files for which I get
the "Valid eCryptfs headers not found in file header region or xattr region"
error. Since I purged all -size 0b files (verified!), I'd like to know
how to track those down. Is there another find expression
that can nail them down? Any other debugging option I could/should
enable to find these 5 files?

Yours,
Steffen

Rgpublic (rgpublic) wrote :

Installed Karmic with ext4 on a new PC today. Installed FGLRX afterwards. All of a sudden the PC froze completely. No mouse movement, no keyboard. Hard reset. After reboot, lots of configuration files that had recently been changed had zero length. The system became unusable because of this (lots of error messages in dpkg etc.). Installed again on ext3. No problems ever since. I wonder why this is installed by default, as Ubuntu is supposed to be a user-friendly distro. Is it really necessary to squeeze out the last bit of extra performance at the expense of data security? This is certainly not desired for a desktop system. At the very least, an explicit warning that this could happen should appear during installation.

Jobo (arkazon) wrote :

Can someone point me toward documentation for "data=alloc_on_commit"?

I am getting 0 byte files after system freezes on Ubuntu 10.04.01 (amd64) with kernel version 2.6.32-25. Just want to understand how one uses alloc_on_commit and how it works before I use it, and I can't find any proper documentation for it, just a few brief mentions in articles and forum postings.

Thanks.

Lukas (lukas-ribisch) wrote :

As far as I understand, the problem has been fixed for overwriting by rename and overwriting by truncate. Is it an issue at all for just overwriting part of a file, without truncating it first?

I realize that there are basically no guarantees when fsync() is not used, but will such writes to already allocated blocks be written on every commit interval (by default 5 seconds)?

Or can the changes possibly remain in the cache much longer, since there is no chance of metadata corruption? (It would seem that the inode wouldn't have to change except for modification time, and unlike for newly allocated blocks, there is also no security issue, since the owner of the file can only get access to his own stale data after a crash, not somebody else's, as it would be with newly allocated blocks.)
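For in-place overwrites, the only portable way to remove the guesswork about writeback timing and commit intervals is to fsync() the file yourself after the write. A minimal sketch (my own illustration, not from this thread):

```python
import os

def overwrite_in_place(path, offset, data):
    """Overwrite part of an existing file and force it to stable storage.
    Without the fsync, when the data reaches disk is left to the kernel's
    writeback and the filesystem's commit interval."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
        f.flush()              # push Python's buffer to the kernel
        os.fsync(f.fileno())   # push the kernel's cache to disk
```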
