Comment 219 for bug 317781

Ted Ts'o:

"You can opine all you want, but the problem is that POSIX does not specify anything ..."

I'll opine that POSIX needs to be updated.

The create-new-file-write-rename design pattern is pervasive, and the expectation is that after a crash either the new contents or the old contents of the file will be found there; a zero-length file is unacceptable. This is the behavior that we saw with ext2, where the metadata and data writes could get re-ordered and result in zero-length files. With the 800 servers that I was maintaining then, it meant that the perl scripts for our account management software would zero-length out /etc/passwd, along with other corruption, often enough that we were rebuilding servers every week or two. As the site grew, and its roles and responsibilities grew with it, that meant that with 30,000 Linux boxes, even with 1,000-day uptimes, there were 30 server crashes per day (even without crappy graphics drivers, a Linux server busy doing apache and a bunch of mixed network/cpu/disk-io seems to have about this average uptime -- I'm not unhappy with this, but with large numbers of servers, the crashes catch up with you). And while I've never seen this result in data loss, it does result in churn in rebuilding and reimaging servers. It could also cause issues where a server is placed back into rotation looking like it is working (nothing so obvious as a corrupted /etc/passwd), but is still failing on something critical after a reboot. You can jump through intellectual hoops about how servers shouldn't be put back into rotation without validation, but even at the small site that I'm at now, with 2,000 servers and about 300 different kinds of servers, we don't have good validation, don't have the resources to build it, and rely on servers being able to be put back into rotation after they reboot without worrying about subtle corruption issues.

There is now an expectation that filesystems have transactional behavior. Deal with it. If it isn't explicitly part of POSIX then POSIX needs to be updated in order to reflect the actual realities of how people are using Unix-like systems these days -- POSIX was not handed down from God to Linus on the Mount. It can and should be amended. And this should not damage the performance benefits of doing delayed writes. Just because you have to be consistent doesn't mean that you have to start doing fsync()s for me all the time. If I don't explicitly call fsync()/fdatasync() you can hold the writes in memory for 30 minutes and abusively punish me for not doing that explicitly myself. But just delay *both* the data and metadata writes so that I either get the full "transaction" or I don't. And stop whining about how people don't know how to use your precious filesystem.
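
For anyone following along, here is a rough sketch of the create-new-file-write-rename pattern I'm talking about, in Python and assuming a POSIX system. The names replace_file and durable are made up for illustration; this is not the code from the scripts mentioned above. With durable=False it is the pattern as applications commonly write it, which relies on the filesystem not committing the rename ahead of the data. With durable=True it adds the explicit fsync() calls that the filesystem side is asking for.

```python
import os
import tempfile


def replace_file(path, data, durable=True):
    """Replace `path` with `data` via a temp file and rename (hypothetical helper).

    The temp file is created in the same directory as `path`, because
    rename() is only atomic within a single filesystem.
    Note: mkstemp() creates the file with mode 0600, so a real script
    would also need to copy the owner and mode of the original file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()  # push Python's buffer down to the kernel
            if durable:
                # Force the data blocks to disk before the rename is issued.
                os.fsync(tmp.fileno())
        # Atomically swap the new file into place (rename(2) on POSIX).
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    if durable:
        # Sync the directory so the rename itself survives a crash.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

The durable=False path is the one at issue in this bug: nothing in it asks for durability, only that the rename never becomes visible on disk before the data it points at.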