Comment 123 for bug 317781

Hiten Sonpal (hiten-sonpal) wrote :

@Theo,

Appreciate everything you've done for ext filesystems and Linux in general. A few comments:

> Slightly more sophisticated application writers will do this:
>
> 2.a) open and read file ~/.kde/foo/bar/baz
> 2.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
> 2.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
> 2.d) close(fd)
> 2.e) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")

> The fact that series (1) and (2) works at all is an accident. Ext3 in its default configuration happens to have the property that 5 seconds after (1) and (2) completes, the data is safely on disk.

I'd suggest that 0-length files only appear when 2.c) reaches the disk after 2.e). This is a sequencing error. I don't know where it happens, but a crash anywhere between 2.a) and 2.e) should leave only five possible states:

A. ~/.kde/foo/bar/baz was not touched at all
B. ~/.kde/foo/bar/baz was not touched at all, ~/.kde/foo/bar/baz.new exists with no data
C. ~/.kde/foo/bar/baz was not touched at all, ~/.kde/foo/bar/baz.new exists with some data
D. ~/.kde/foo/bar/baz was not touched at all, ~/.kde/foo/bar/baz.new exists with all the data
E. ~/.kde/foo/bar/baz contains the data previously written to baz.new

If ~/.kde/foo/bar/baz exists with no data, it means the rename was moved ahead in the on-disk sequence and the write in step 2.c) had not actually reached the disk before the crash.

> So, what is the problem. POSIX fundamentally says that what happens if the system is not shutdown cleanly is undefined. If you want to force things to be stored on disk, you must use fsync() or fdatasync(). There may be performance problems with this, which is what happened with FireFox 3.0[1] --- but that's why POSIX doesn't require that things be synched to disk as soon as the file is closed.

That's not what we are saying. No one has a problem with fsync() being required to force items onto disk. The issue is that not using fsync() causes us to lose items that were already on the disk, because of long windows between out-of-sequence updates to the disk.

Atomic transactions like renames should happen in-order with related transactions to make sure that we do not have unexpected data corruption.
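
For reference, here is a minimal C sketch of sequence (2) with an explicit fsync() added before the rename, which is the form that does not depend on the filesystem ordering the data write ahead of the rename. The helper name replace_file and its parameters are my own, purely for illustration, and the caller is assumed to pass already-expanded paths, since open() does not expand "~".

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `buf` by writing `tmp_path`
 * and renaming it over `path`. Returns 0 on success, -1 on failure. */
static int replace_file(const char *path, const char *tmp_path,
                        const char *buf, size_t len)
{
    /* 2.b) create or truncate the temporary file */
    int fd = open(tmp_path, O_WRONLY | O_TRUNC | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    /* 2.c) write the new contents, allowing for short writes */
    size_t off = 0;
    while (off < len) {
        ssize_t n = write(fd, buf + off, len - off);
        if (n < 0) {
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        off += (size_t)n;
    }

    /* Not part of the original sequence: force the data to disk
     * before the rename can make it visible under the final name. */
    if (fsync(fd) < 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }

    /* 2.d) close the temporary file */
    if (close(fd) < 0) {
        unlink(tmp_path);
        return -1;
    }

    /* 2.e) atomically replace the old file with the new one */
    if (rename(tmp_path, path) < 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

Dropping the fsync() call gives exactly steps 2.b) through 2.e) as quoted above. My point is that even then, a crash should leave one of the five states listed earlier, not a 0-length baz.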

Thanks for quickly creating a fix,
-- Hiten