Comment 137 for bug 317781

Chris Newman (chris-greenhillroad) wrote :

@Theodore,

As a scalable server developer with 25 years' experience, I am fully aware of the purpose of fsync and fdatasync, and I use them if and only if the semantics I want are "really commit to disk right now". To use them at any other time would be an implementation error.
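To make "really commit to disk right now" concrete, here is a minimal sketch in Python, whose os module is a thin wrapper over the POSIX calls. The function name append_and_commit and the journal-style use case are hypothetical, chosen only for illustration.

```python
import os


def append_and_commit(path, record):
    """Append record (bytes) to path and return only after fsync succeeds.

    Hypothetical helper: the name and the append-only use case are
    illustrative assumptions, not taken from any particular server.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(record)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push this file's data to the storage device
        # before returning. os.fdatasync(fd) is an alternative on
        # platforms that provide it.
        os.fsync(fd)
    finally:
        os.close(fd)
```

This is the case where paying for a disk commit is the whole point, for example acknowledging a transaction only once it is durable.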

I further agree that delayed allocation is a good thing. I believe application developers who use the first command sequence you describe above get what they deserve, and that it is a mistake for the filesystem to perform an implicit sync in that case.

Where I strongly disagree with you is for the open-write-close-rename call sequence (your second scenario). It is very common for an application to need "atomic replace, defer ok" semantics when updating a file (more common, in fact, than cases where fsync is really needed). The only way to express that semantic is open-write-close-rename, and furthermore that semantic is the only useful interpretation of that call sequence. Adding an fsync expresses a different and less useful semantic. For example, when I do "atomic replace, defer ok" twice in a flush interval I would expect an optimal filesystem to discard the intermediate version without ever committing it to disk. So I find the workaround you've implemented undesirable as it results in non-optimal and unnecessary disk commits.
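For concreteness, the open-write-close-rename sequence I mean looks roughly like the sketch below, written in Python using the os module's wrappers over the POSIX calls. The function name atomic_replace_deferred and the ".tmp-" prefix are hypothetical. Note that there is deliberately no fsync anywhere in it.

```python
import os
import tempfile


def atomic_replace_deferred(path, data):
    """Replace path with data (bytes) via open-write-close-rename.

    Hypothetical helper illustrating "atomic replace, defer ok":
    no fsync is issued, so when the data reaches the disk is left
    to the filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the target's directory so the rename
    # stays within one filesystem. mkstemp creates it with mode 0600.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        try:
            view = memoryview(data)
            while view:
                # os.write may write fewer bytes than requested.
                written = os.write(fd, view)
                view = view[written:]
        finally:
            os.close(fd)
        # rename() atomically swaps the name: other processes see either
        # the old file or the new one, never a mixture.
        os.rename(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Calling this twice on the same path within one flush interval is exactly the case where I would want the intermediate version never to touch the disk. Whether a crash can leave behind a zero-length or partial file is decided by the filesystem, not by anything in this code, which is the point of the disagreement.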

Now, your not-useful interpretation of open-write-close-rename is POSIX compliant under a narrow interpretation. But I can interpret any standard in a not-useful way. An IMAP server that delivers all new mail to a mailbox "NEWMAIL" and has no "INBOX" would be strictly compliant with the spec and also not useful. Any reasonable IMAP client vendor will simply state they don't support that server. And that's exactly what will happen to EXT4, XFS and other filesystems that interpret the open-write-close-rename call sequence in a not-useful way. You will find applications declare your filesystem unsupported because you interpret a useful call sequence in a not-useful fashion.

The right interpretation of open-write-close-rename is "atomic replace, defer ok". There is no reason to spin up the disk or fsync until the next flush interval. What's important is that the rename is not committed until after the file data is committed.

If you disagree, I invite you to suggest how you would express "atomic replace, defer ok" using POSIX APIs when writing an application.