XFS leaves garbage in file if app does write-new-then-rename without f(data)sync

Bug #37435 reported by Stewart Smith on 2006-03-31
Affects Status Importance Assigned to Milestone
linux-source-2.6.15 (Ubuntu)

Bug Description

update-alternatives, like many other programs, uses the standard technique of writing updated data to a new file, and then renaming the new file over the top of the old one.

XFS commits metadata (the rename) to disk before data (the file contents), so an unfortunately-timed crash leaves the file containing garbage (i.e. whatever those disk blocks happened to contain before).

Either XFS needs to not do this, or all programs that use the above-described technique to update files (practically all of them) have to be changed to call f(data)sync at an appropriate point - otherwise there is a risk that those programs will break later because their data files contain nonsense. Note that the latter change would probably significantly slow down installation and upgrade as well as other functions and would involve the editing of many applications.
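The technique at issue can be sketched as follows. This is a minimal Python illustration of the write-new-then-rename idiom, not dpkg's actual code (which is Perl and C); the file name is hypothetical:

```python
import os
import tempfile

def replace_file(path, data):
    """Write-new-then-rename: the common 'atomic update' idiom."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new contents to a temporary file in the same
    # directory, so the final rename() stays within one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # Note: no fsync here.  This is the gap the bug report is
        # about: if the filesystem commits the rename before the file
        # data, a crash leaves `path` pointing at unwritten blocks.
        os.rename(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise

target = os.path.join(tempfile.gettempdir(), "alternatives.db")
replace_file(target, b"manual entries\n")
result = open(target, "rb").read()
```

On success the old file is replaced atomically; readers see either the old contents or the new, never a partial write. The crash-safety question in this bug is only about what is on disk, not what readers see.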

Stewart Smith (stewart) wrote :

setting to critical as corruption prevents a lot of packages from upgrading and there is no obvious way to fix said corruption.

Colin Watson (cjwatson) wrote :

update-alternatives in fact does write a new file and rename it over the top of the old one; the code is very careful about that. (There's no need to sync in between, since if your filesystem processes the rename before finishing with the write then it is fatally buggy anyway.)

Are you sure that this is not just filesystem corruption caused by the hard crash? What filesystem do you use?

Stewart Smith (stewart) wrote :


I don't see why the filesystem would have to flush the data before the rename. It only has to flush the inode to disk for the on-disk fs to be consistent.

That is why an fsync before the rename is needed.

Gary Coady (garycoady) wrote :

There's a mail from Chris Wedgwood (who seems to be one of the main XFS developers) in 2004 on the subject at http://marc.theaimsgroup.com/?l=linux-xfs&m=109030382432326&w=2

----- quote ------
sounds like:

  old file foo on disk          all safe

  new file bar is written       metadata on disk, file
                                data in ram

  [*]

  rename bar to foo             old file unlinked, new
                                file in place but data
                                not flushed yet

now, if there was an fsync at [*] it would work just fine
----- end quote ------

I had a look at the POSIX semantics for rename() at http://www.opengroup.org/onlinepubs/009695399/functions/rename.html and there is no requirement that the renamed file's data has been written to persistent storage. XFS only claims to follow POSIX semantics.

I used to run XFS on my desktop a few years ago, and had quite a few instances of data corruption - its delayed allocation DOES seem to make problems like this more likely to occur than with other systems (I saw lots of NULs in corrupted files after a crash).

This patch does the minimum required to sync the alternatives file when writing. IO::File is in perl-base, so no new dependencies.

If you think this approach is viable, I can have a look and do something similar with dpkg-statoverride and dpkg-divert, but I'll wait for comments on this first.
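The approach the patch takes can be sketched like this. The real patch is Perl using IO::File; this is a hedged Python equivalent of the same idea, with hypothetical names:

```python
import os
import tempfile

def replace_file_durably(path, data):
    """Write-new, fsync, then rename: the fix this patch proposes."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force the file *data* to disk first
        os.rename(tmp, path)      # only now publish the new contents
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    # Optionally fsync the directory too (Unix only), so the rename
    # itself is durable rather than just ordered after the data.
    dirfd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)

target = os.path.join(tempfile.gettempdir(), "durable.db")
replace_file_durably(target, b"synced contents\n")
result = open(target, "rb").read()
```

The extra fsync guarantees the data reaches disk before the rename is visible, at the cost of a synchronous flush per updated file; that cost is exactly the performance objection raised elsewhere in this thread.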

Changed in dpkg:
status: Unconfirmed → Confirmed

Patch available for review

Changed in dpkg:
assignee: nobody → ijackson

I'm afraid I agree with Colin Watson and disagree with the approach
taken in the patch. The quote from SuS isn't really helpful since SuS
doesn't ever specify under what circumstances corruption is allowed to
happen after a crash.

Changing update-alternatives to sync is only one small fragment of
this issue: very very many programs do write-to-new-file-and-rename.
If we change them all to do fsync before rename, then the whole system
will become significantly slower when it is doing many of these kind
of operations - for example, during installation and upgrades.

The correct change here is to have the filesystem not write the
metadata until the data is also committed. For example, quoting the
mount(8) manpage, in the section for ext3:

   data=journal / data=ordered / data=writeback
          Specifies the journalling mode for file data. Metadata is
          always journaled. To use modes other than ordered on the root
          file system, pass the mode to the kernel as boot parameter,
          e.g. rootflags=data=journal.

          journal
                 All data is committed into the journal prior to being
                 written into the main file system.

          ordered
                 This is the default mode. All data is forced directly
                 out to the main file system prior to its metadata being
                 committed to the journal.

          writeback
                 Data ordering is not preserved - data may be written
                 into the main file system after its metadata has been
                 committed to the journal. This is rumoured to be the
                 highest-throughput option. It guarantees internal file
                 system integrity, however it can allow old data to
                 appear in files after a crash and journal recovery.

Note that the default mode for ext3 is exactly what is required, and
that update-alternatives works correctly with that. It still isn't
guaranteed, of course, after a crash, that everything is consistent,
but the update-alternatives databases will not be corrupt.

See also similar but slightly gnomic comments in the mount(8) entry
for jfs.


Changing file system behaviour isn't a fix.

Software should be written to the standards. While ext3 is common, its behaviour here is not indicative of most file systems.

The patch looks good and should prevent this problem.

What ext3 does is close to what this patch does, so there shouldn't be much of a speed difference.

Besides, IMHO correctness is more important than speed. It's better to take more time in an upgrade and have a system that's recoverable (or more likely to be recoverable) in the event of a crash than to end up with something that's nearly completely hosed.

Stewart Smith writes ("[Bug 37435] Re: maintainer of /var/lib/dpkg/alternatives/ doesn't do updates safely"):
> Changing file system behaviour isn't a fix.

The file system behaviour is buggy.


Based on Ian and Colin's comments, reassigning as a possible kernel bug to linux-source-2.6.15.

Stewart Smith (stewart) wrote :

It is not a file system bug.

There is no guarantee that the data hits disk without an explicit fsync (or clean shutdown).

Nowhere in POSIX does it specify that rename will sync data. That is what fsync is for.

(Even with fsync, without additional checks, data is not guaranteed to have hit disk... but that's another discussion.)

Stewart Smith writes ("[Bug 37435] Re: maintainer of /var/lib/dpkg/alternatives/ doesn't do updates safely"):
> There is no guarantee that the data hits disk without an explicit
> fsync (or clean shutdown).

But update-alternatives doesn't care whether the data hits the disk.
It would be just fine if neither the file data nor the rename were
committed. Indeed, this is the desired behaviour, because committing
to disk is comparatively slow and installation/upgrade performance is
important.

The bug in the filesystem isn't that it doesn't sync the data. It's
that it writes the metadata without writing the contents, and has no
means for fixing this up after a hard crash. It leaves the disk and
the system in a state where the file contains garbage.

Note that _almost any_ application which updates a file via the UNIX
API _must_ do exactly what update-alternatives does: write out new
data to a new file, and rename it over the old version. Are you
saying that all of those applications are wrong and should always sync
the data before renaming ?


The file won't have garbage, at least for XFS.

If these applications want to ensure that the data is on disk, then they should sync before rename, yes.

See the previously quoted message from Chris Wedgwood at:

Ian Jackson (ijackson) on 2006-05-10
description: updated
Gary Coady (garycoady) wrote :

For some more context, this appears to be the last major thread on lkml on the topic: http://www.ussg.iu.edu/hypermail/linux/kernel/0407.1/0359.html

I'm afraid that your response leads me to conclude that further
discussion here is not going to be productive. We seem to be going
round in circles, and an attempt I just made at a substantive response
had me simply repeating myself. I think my communication with you has
broken down.

I'm going to reassign this to the kernel again; we will see what the
Ubuntu kernel developers think. Please do not undo my reassignment
without consulting them.


Gary Coady writes ("[Bug 37435] Re: XFS leaves garbage in file if app does write-new-then-rename without f(data)sync"):
> For some more context, this appears to be the last major thread on lkml on the topic:
> http://www.ussg.iu.edu/hypermail/linux/kernel/0407.1/0359.html

Thank you for doing the research; it's always good to know what
upstream views are.

Unfortunately in this case that thread is largely full of drivel from
people who don't know what they're talking about. I think that shows
only that this topic is now stale and none of the competent people
want to talk about it any more.


It should be strongly noted that XFS will NOT leave garbage in the file. The only thing you'll ever get that isn't your data is NULs.

see here http://www.ussg.iu.edu/hypermail/linux/kernel/0407.1/0630.html

(better to show NULL than bogus data).

On the whole fsync issue,

(Chris is one of the guys who does know what he's talking about).

The followup message is hpa agreeing.

I also feel like I'm going in circles here. I've provided the correct semantics that need to be followed and backed them up (as have others), yet still you claim that file system semantics that follow the standard are somehow buggy.

Also, considering that dpkg runs on other systems (some where "fixing" the kernel simply isn't an option) it's best to fix dpkg, no?

Note that you'd possibly have the same bug with ext2 as well, possibly reiser 3 and UFS without soft updates.

I agree with Colin and Ian here.

Any filesystem that processes file operations out of order is BUGGY!

Scott, you are wrong. Out of order processing of file operations is a massive performance boost for the general case. It lets the operating system work out an optimal order to write things to disk to avoid seeks (and batch things up).

How can anybody advocate that dpkg should only work correctly on "linux ext3 with data=ordered" ? (ext3 also has data=writeback - should this option be removed because it's "buggy"?)

Ian Jackson (ijackson) wrote :

This bug report is perhaps not the right place to have this discussion. Nevertheless, saying that this is a bug in dpkg is declaring it to be a bug in nearly every other piece of software ever written (at least, all the other pieces of software that don't have worse reliability problems already).

This will be fixed by fixing the kernel or not at all.

Changed in linux-source-2.6.15:
assignee: ijackson → nobody

Hi All,

This bug was opened quite a while ago and there hasn't been any recent activity. The Hardy Heron Alpha series is currently under development and contains an updated version of the kernel. It would be helpful if you could test the latest Hardy Alpha release: http://www.ubuntu.com/testing and verify if this is still an issue. Thanks.

Changed in linux-source-2.6.15:
status: Confirmed → Incomplete

I'm closing this until we get verification that this issue exists or not in the actively developed kernel. A decision regarding a stable release update for 2.6.15 can be made at that time. For more information regarding Stable Release Updates, please refer to https://wiki.ubuntu.com/StableReleaseUpdates . Thanks.

Changed in linux-source-2.6.15:
status: Incomplete → Won't Fix

This is not a bug. It is a feature. :-)

A write in XFS immediately grows the file, and if a crash occurs between the write and the data being flushed to disk, the extra space is filled with zeroes. This is explained in this text from http://madduck.net/blog/2006.08.11:xfs-zeroes/ pasted below:

This must be the most misunderstood feature of XFS. What happens is that XFS logs all metadata changes to the journal, except for the inode size, which gets flushed to disk immediately for performance reasons [*]_. At this point, the file will actually be a sparse file, which is nothing more than a file whose metadata lists a file as being of a size different than it currently is (I realise the “sparse” does not really apply when the file is “overfull”, i.e. when the metadata lists it as smaller than it really is, but I am lacking a good word for that). The disk extents get allocated only when the data actually hits the disk (that’s XFS’s famous delayed allocation mechanism). If the power fails before the data was flushed to disk and the journal entry cleared, XFS will serve zeroes, rather than the potentially random or sensitive data that is actually on disk. This is a good thing.

.. [*] Since *almost every* write() changes the file size, it would be a massive performance hit if every size change was logged. However, XFS actually violates its own journaling rules by doing this.

You can run into more or less the same problem with any journaling filesystem; the others just don’t serve zeroes. Instead, they give you the data that’s physically on the medium. Imagine the situation when the corrupt /etc/motd suddenly becomes a window to your previous /etc/shadow contents… I really prefer how XFS handles that. Sometimes you do get the old data back with the other filesystems, but this is because the filesystems may reuse the blocks of the old file. So it’s a trade-off, and your choice between security and, uh, convenience.

The only way to protect against this is to use “physical-block journaling” (as opposed to “logical journaling”), which is only supported by ext3 as far as I know (option data=journal), at a massive performance loss. See this mailing list post (http://linuxmafia.com/faq/Filesystems/reiserfs.html) by Theodore Ts’o for more info.
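The "overfull" file described above (metadata size ahead of the written data) is the mirror image of an ordinary sparse file, where the size is set but no data blocks exist; in both cases a POSIX filesystem serves zeroes for the unwritten range rather than stale disk contents. A small illustration of that sparse-file behaviour, assuming a Unix-like filesystem:

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "sparse.bin")

# Set the file size without writing any data blocks: a sparse file.
with open(path, "wb") as f:
    f.truncate(4096)

# Reading the unwritten range yields zeroes, not whatever the
# underlying disk blocks happen to contain.
with open(path, "rb") as f:
    contents = f.read()

hole_is_zero = contents == b"\x00" * 4096
```

This is the same guarantee XFS is preserving after a crash: the range the metadata claims but the data never reached is read back as NULs.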

jt (jt1234567) wrote :

For what it's worth, https://lwn.net/Articles/638546/#The%20near%20future claims this was actually considered a bug by XFS developers and fixed. So next time it might be worth asking them about it.
