Replaying journals of other OS's filesystems, by mounting them, is unsafe

Bug #41624 reported by Ian Jackson
This bug affects 3 people
Affects                            Status        Importance  Assigned to   Milestone
iso-scan (Ubuntu)                  Triaged       High        Unassigned
linux (Ubuntu)                     Won't Fix     Undecided   Unassigned
lupin (Ubuntu)                     Triaged       Low         Unassigned
os-prober (Debian)                 Fix Released  Unknown
partman-basicfilesystems (Ubuntu)  Fix Released  High        Colin Watson

Bug Description

I have just done a clean install of recent dapper (20060426.1 live i386) on my main testbed machine.

The automatic volume discovery system has not only found the filesystems from various of the other installations (which is not quite so bad) but has dug into my LVM system and found the fs for a frozen Xen image !

This kind of thing can cause serious data loss. Modern journalling filesystems go even more badly wrong than traditional fs's if they are accessed by two running systems in an interleaved fashion, which is what results if Dapper automatically finds and mounts these filesystems, replaying the journal, while a frozen (whether by a VM like Xen or by ordinary hibernation) image has them mounted.

In the current setup I think it would be easy to cause disaster simply by installing dapper twice on the same machine and then continuously hibernating one while using the other. More complex schemes are also possible.

All of these filesystems discovered in this way should be made read-only unless it can be somehow known that it's safe to make them r/w.

Tags: cft-2.6.27

Revision history for this message
Matthew Garrett (mjg59) wrote :

Mounting a filesystem read-only doesn't prevent the journal from being replayed.

Revision history for this message
Colin Watson (cjwatson) wrote :

I think the best approach here is probably to keep putting journalling filesystems into /etc/fstab, but mark them noauto; traditional Unix-ish filesystems can be marked ro.

The main use for this feature is automatically making your Windows filesystems available. I really don't want to disable this if I can possibly avoid it since it's a very popular feature (we got a lot of flak for *not* doing it pre-Breezy). I think that case is probably relatively safe compared to the case of Unix filesystems; having a Windows installation that hasn't been properly shut down before doing an Ubuntu installation seems like it's going to be exceedingly rare.

Changed in ubiquity:
status: Unconfirmed → Confirmed
Revision history for this message
Ian Jackson (ijackson) wrote : Re: [Bug 41624] Re: aggressive volume discovery => data loss

Matthew Garrett writes ("[Bug 41624] Re: aggressive volume discovery => data loss"):
> Mounting a filesystem read-only doesn't prevent the journal from being
> replayed.

What, to disk, rather than to the buffer cache ? That's insane !

Ian.

Revision history for this message
Ian Jackson (ijackson) wrote :

Colin Watson writes ("[Bug 41624] Re: aggressive volume discovery => data loss"):
> The main use for this feature is automatically making your Windows
> filesystems available. I really don't want to disable this if I can
> possibly avoid it since it's a very popular feature (we got a lot of
> flak for *not* doing it pre-Breezy). I think that case is probably
> relatively safe compared to the case of Unix filesystems; having a
> Windows installation that hasn't been properly shut down before
> doing an Ubuntu installation seems like it's going to be exceedingly
> rare.

Is NTFS journalling and does Linux replay the journal when
mounting it ? I thought the Linux NTFS support was entirely
read-only, which is quite safe.

Ian.

Revision history for this message
Matthew Garrett (mjg59) wrote : Re: aggressive volume discovery => data loss

NTFS is journalling, but Linux doesn't replay it. I don't think Linux has any way to support flagging parts of the buffer cache as "writable but non-flushable" - it might be a reasonably elegant way of working, but it's also possible for the journal to be relatively large compared to the filesystem size.

Revision history for this message
Colin Watson (cjwatson) wrote : Re: [Bug 41624] Re: [Bug 41624] Re: aggressive volume discovery => data loss

On Tue, May 02, 2006 at 11:09:12AM -0000, Ian Jackson wrote:
> Is NTFS journalling and does Linux replay the journal when
> mounting it ? I thought the Linux NTFS support was entirely
> read-only, which is quite safe.

NTFS is journalled, yes. I've checked the kernel code and it indeed
doesn't seem to replay the journal when mounting read-only.

As Matthew says, ext3 replays the journal even when mounting read-only:

        if (EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER)) {
                if (sb->s_flags & MS_RDONLY) {
                        printk(KERN_INFO "EXT3-fs: INFO: recovery "
                                        "required on readonly filesystem.\n");
                        if (really_read_only) {
                                printk(KERN_ERR "EXT3-fs: write access "
                                        "unavailable, cannot proceed.\n");
                                return -EROFS;
                        }
                        printk (KERN_INFO "EXT3-fs: write access will "
                                        "be enabled during recovery.\n");
                }
        }

Having looked through other bits of kernel code, ReiserFS and XFS do the
same (although XFS has a norecovery mount option to suppress replay);
all of these three filesystems explicitly avoid replaying the journal if
the underlying block device is read-only. I can't find the log replay
code for JFS.
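
A minimal sketch of the decision the ext3 excerpt above encodes, written in Python for brevity rather than as kernel code. It covers only read-only mount requests; the names `Outcome` and `ro_mount_outcome` are hypothetical and exist only for illustration.

```python
from enum import Enum


class Outcome(Enum):
    NO_REPLAY = "journal is clean; no recovery is attempted"
    REPLAY_TO_DISK = "write access enabled during recovery; journal replayed to disk"
    REFUSED_EROFS = "mount fails with EROFS"


def ro_mount_outcome(needs_recovery: bool, device_read_only: bool) -> Outcome:
    """Model what a read-only ext3 mount does, per the quoted excerpt.

    needs_recovery   -- EXT3_FEATURE_INCOMPAT_RECOVER is set in the superblock
    device_read_only -- the block device itself is read-only (really_read_only)
    """
    if not needs_recovery:
        return Outcome.NO_REPLAY
    if device_read_only:
        # Matches the branch that prints "write access unavailable,
        # cannot proceed" and returns -EROFS.
        return Outcome.REFUSED_EROFS
    # MS_RDONLY was requested, but the device is writable, so the
    # kernel replays the journal onto the disk anyway.
    return Outcome.REPLAY_TO_DISK
```

The middle case, a read-only mount of a dirty filesystem on a writable device, is the one this bug is about.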

Revision history for this message
Ian Jackson (ijackson) wrote : Re: [Bug 41624] Re: aggressive volume discovery => data loss

Matthew Garrett writes ("[Bug 41624] Re: aggressive volume discovery => data loss"):
> NTFS is journalling, but Linux doesn't replay it. I don't think Linux
> has any way to support flagging parts of the buffer cache as "writable
> but non-flushable" - it might be a reasonably elegant way of working,
> but it's also possible for the journal to be relatively large compared
> to the filesystem size.

Then it can't mount devices read only and should say ENOSYS or some
such. Indeed, ideally, the block device would be opened in a way
that would prevent the fs from accidentally writing.

Ian.

Revision history for this message
Matt Zimmerman (mdz) wrote :

Mount options in fstab don't seem like a complete solution; doesn't os-prober mount filesystems as well?

Revision history for this message
Phillip Susi (psusi) wrote :

Sounds like this is a kernel bug then? Mounting a filesystem r/o should NEVER modify the disk. Should another bug be created for that and filed upstream?

Ian Jackson (ijackson)
Changed in partman-basicfilesystems:
assignee: nobody → ijackson
Revision history for this message
Colin Watson (cjwatson) wrote :

I would suggest teaching parted_server (in partman-base) about a new command to tell whether the fs on a given partition is dirty, and making use of that in partman-basicfilesystems where it decides whether to automount things.
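
As a rough illustration of what such a "is this filesystem dirty?" check could look like for ext2/3/4, here is a Python sketch that reads the superblock directly instead of mounting. It is not the parted_server command proposed above; the function names are hypothetical, and the field offsets are the standard ext superblock layout (magic at byte 56, feature words starting at byte 92).

```python
import struct

SUPERBLOCK_OFFSET = 1024             # superblock starts 1 KiB into the device
EXT_MAGIC = 0xEF53                   # s_magic
FEATURE_COMPAT_HAS_JOURNAL = 0x0004  # s_feature_compat bit
FEATURE_INCOMPAT_RECOVER = 0x0004    # s_feature_incompat bit ("needs_recovery")


def read_superblock(path):
    """Read the raw superblock from a device or image, opened read-only."""
    with open(path, "rb") as dev:
        dev.seek(SUPERBLOCK_OFFSET)
        return dev.read(1024)


def journal_state(sb: bytes) -> str:
    """Classify a raw ext superblock.

    Returns 'not-ext', 'no-journal', 'clean' or 'needs-recovery'.
    """
    if len(sb) < 104:
        return "not-ext"
    (magic,) = struct.unpack_from("<H", sb, 56)
    if magic != EXT_MAGIC:
        return "not-ext"
    compat, incompat = struct.unpack_from("<II", sb, 92)
    if not compat & FEATURE_COMPAT_HAS_JOURNAL:
        return "no-journal"
    if incompat & FEATURE_INCOMPAT_RECOVER:
        return "needs-recovery"
    return "clean"
```

For example, `journal_state(read_superblock("/dev/sdXN"))` (device name illustrative) returning 'needs-recovery' would be a reason not to automount. The flag is also set while a filesystem is mounted read-write, so the filesystems of a hibernated or frozen system should show up as needing recovery.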

As far as os-prober goes, there was discussion about this recently upstream in http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=417407; there was a suggestion here to "protect" block devices by using 'blockdev --setro', which ought to convince the kernel not to do anything at all to the contents of the block device for the duration. I think this ought to require resurrecting the port we used to have of blockdev to busybox, rather than creating a new blockdev-udeb.
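
A sketch of how the 'blockdev --setro' protection could be wrapped around a probe, assuming the util-linux `blockdev` tool is available and the caller has sufficient privileges. The helper name `kernel_readonly` and the injectable `run` parameter are hypothetical, added only so the logic can be exercised without a real device.

```python
import subprocess
from contextlib import contextmanager


def _blockdev(args, run):
    """Invoke blockdev with the given arguments and return the result."""
    return run(["blockdev", *args], check=True, capture_output=True, text=True)


@contextmanager
def kernel_readonly(device, run=subprocess.run):
    """Mark `device` read-only for the duration of the block.

    The previous state is queried with --getro first, so a device that
    was already read-only is left read-only afterwards.
    """
    was_ro = _blockdev(["--getro", device], run).stdout.strip() == "1"
    if not was_ro:
        _blockdev(["--setro", device], run)
    try:
        yield device
    finally:
        if not was_ro:
            _blockdev(["--setrw", device], run)
```

Any probing mount would then happen inside `with kernel_readonly("/dev/sdXN"):` (device name illustrative). Note that a filesystem whose journal needs replaying may simply refuse to mount while the device is protected this way.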

Changed in os-prober:
status: Unknown → Unconfirmed
Revision history for this message
Phillip Susi (psusi) wrote :

That sounds like an acceptable workaround but I still maintain that this is a kernel bug. Read only means you no touch.

Revision history for this message
TJ (tj) wrote :

Marking the Gutsy-allocated part of this report Incomplete and assigning to kernel-team to remove it from the list of outstanding new/undecided bugs against Gutsy. Hopefully at some point a decision can be made over what to do with this, but it looks like as far as the kernel is concerned that might be a long way off, if at all.

I've noted Phillip's comments in the LKML discussion thread on this at http://lkml.org/lkml/2007/4/8/97

Changed in linux-source-2.6.22:
assignee: nobody → ubuntu-kernel-team
status: New → Incomplete
Revision history for this message
Ian Jackson (ijackson) wrote : Re: [Bug 41624] Re: Replaying journals of other OS's filesystems, by mounting them, is unsafe

TJ writes ("[Bug 41624] Re: Replaying journals of other OS's filesystems, by mounting them, is unsafe"):
> I've noted Phillip's comments in the LKML discussion thread on this at
> http://lkml.org/lkml/2007/4/8/97

Having read that thread, I'm deeply unimpressed by the head in the
sand attitude displayed by some participants. Phillip Susi is of
course absolutely right.

Ian.

Revision history for this message
TJ (tj) wrote :

Ian, yes it does seem a bit pedantic, although to be fair there was a devil's-advocate stance :)

I can see both arguments: On the one hand read-only should mean just that, it should have the same effect as read-only media. On the other hand a journalled file-system does need to replay the log to look consistent, even if it is only replayed to RAM.

On balance, I'd say a read-only file-system shouldn't have the log file replayed (to RAM or disk) no matter if it appears inconsistent at that time. When the file-system is next mounted read-write (in this scenario, by the OS that 'owns' it) the file system will be consistent.

Revision history for this message
Phillip Susi (psusi) wrote :

Exactly. If you do a read only mount of an inconsistent filesystem, you /expect/ to get inconsistent results from trying to read an inconsistent filesystem. The whole idea though, is that you make your best effort to access the data without modifying it and possibly causing more damage.

Since the lkml seems to have their head in the sand on this one, what are the odds on Ubuntu diverging?

Ian Jackson (ijackson)
Changed in partman-basicfilesystems:
assignee: ijackson → nobody
Revision history for this message
Phillip Susi (psusi) wrote :

Marking as Triaged since the report is complete.

Changed in linux-source-2.6.22:
status: Incomplete → Triaged
Revision history for this message
Colin Watson (cjwatson) wrote :

partman-basicfilesystems no longer automounts by default, as of Hardy, which takes care of that part of this bug:

partman-basicfilesystems (56ubuntu4) hardy; urgency=low

  * Disable automounting unless partman/automount is preseeded to true. This
    makes LP #106209 much less likely to occur, since future installations
    are less likely to format a partition whose UUID we have in /etc/fstab.

 -- Colin Watson <email address hidden> Wed, 09 Apr 2008 08:18:47 +0100

Changed in partman-basicfilesystems:
assignee: nobody → kamion
status: Confirmed → Fix Released
Revision history for this message
Colin Watson (cjwatson) wrote :

See also bug 230703.

Changed in iso-scan:
importance: Undecided → High
status: New → Triaged
Changed in lupin:
importance: Undecided → Low
status: New → Triaged
Revision history for this message
Colin Watson (cjwatson) wrote :

This definitely also affects iso-scan. lupin is affected in theory, but in practice I think it's quite unlikely that somebody will start Wubi and then hibernate Windows rather than simply rebooting.

Revision history for this message
Colin Watson (cjwatson) wrote :

... and of course the Wubi installation process really does require writing to the Windows filesystem, so the only thing that could be done in lupin would be to refuse to function at all.

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

The Ubuntu Kernel Team is planning to move to the 2.6.27 kernel for the upcoming Intrepid Ibex 8.10 release. As a result, the kernel team would appreciate it if you could please test this newer 2.6.27 Ubuntu kernel. There are two ways you should be able to test:

1) If you are comfortable installing packages on your own, the linux-image-2.6.27-* package is currently available for you to install and test.

--or--

2) The upcoming Alpha5 for Intrepid Ibex 8.10 will contain this newer 2.6.27 Ubuntu kernel. Alpha5 is set to be released Thursday Sept 4. Please watch http://www.ubuntu.com/testing for Alpha5 to be announced. You should then be able to test via a LiveCD.

Please let us know immediately if this newer 2.6.27 kernel resolves the bug reported here or if the issue remains. More importantly, please open a new bug report for each new bug/regression introduced by the 2.6.27 kernel and tag the bug report with 'linux-2.6.27'. Also, please specifically note if the issue does or does not appear in the 2.6.26 kernel. Thanks again, we really appreciate your help and feedback.

Revision history for this message
Phillip Susi (psusi) wrote :

Which kernel you are using does not really matter because the Linux kernel developers consider this to be working as intended. I have tried arguing with them on the LKML a few times with no success. They seem to think that the read-only mount flag does not mean "do not write to this disk" but rather "do not allow files to be opened for write access".

If we want this fixed, we're going to have to fix it ourselves it looks like.

Revision history for this message
Ian Jackson (ijackson) wrote :

Phillip Susi writes ("[Bug 41624] Re: Replaying journals of other OS's filesystems, by mounting them, is unsafe"):
> Which kernel you are using does not really matter because the linux
> kernel developers consider this to be working as intended. I have tried
> arguing with them on the LKML a few times with no success. They seem to
> think that the read only mount flag does not mean "do not write to this
> disk" but rather "do not allow files to be opened for write access".

Phillip is correct.

> If we want this fixed, we're going to have to fix it ourselves it looks
> like.

I think we should do so.

Just this week I was helping someone recover a machine which was
already damaged at the time and was made worse when they typed
   mount -o ro /dev/mapper/volumegroup-logicalvolume-real /mnt
which causes ext3 to write the journal back into the snapshotted
volume bypassing the LVM system.

You can say "don't do that then" but
   mount -o ro
is exactly what every administrator reaches for in time of trouble,
and they expect it to do no harm.

Ian.

Revision history for this message
Phillip Susi (psusi) wrote : Re: [Bug 41624] Re: Replaying journals of other OS's filesystems, by mounting them, is unsafe

Ian Jackson wrote:
> You can say "don't do that then" but
> mount -o ro
> is exactly what every administrator reaches for in time of trouble,
> and they expect it to do no harm.

That was exactly the point I argued on the LKML but they don't seem to
see it that way. Maybe Ben Collins can try to knock this sense into
them, or fix it for Ubuntu?

Revision history for this message
Launchpad Janitor (janitor) wrote : Kernel team bugs

Per a decision made by the Ubuntu Kernel Team, bugs will no longer be assigned to the ubuntu-kernel-team in Launchpad as part of the bug triage process. The ubuntu-kernel-team is being unassigned from this bug report. Refer to https://wiki.ubuntu.com/KernelTeamBugPolicies for more information. Thanks.

Revision history for this message
Stefan Bader (smb) wrote :

Closing this as won't fix. Upstream will not change this behavior and have their arguments for that. A deviation for Ubuntu is unmaintainable. The only way to prevent write access by Linux is to set the device access rights to read-only.

Changed in linux (Ubuntu):
status: Triaged → Won't Fix
Changed in os-prober (Debian):
status: New → Fix Released
Revision history for this message
Colin Watson (cjwatson) wrote :

What we're going to end up with here in the installer is using grub-mount rather than mount, which guarantees a true read-only mount via GRUB's filesystem drivers plus FUSE. os-prober has already switched to this, and we'll switch the rest of the installer over as time permits.

Revision history for this message
Glenn Washburn (crass) wrote :

Over a decade later... but this might help someone. There is a way around this mess, with some caveats. You can set the block device read-only using blockdev. This works as desired a lot of the time. However, sometimes (or perhaps every time certain filesystems need to replay the journal), this will cause the kernel to refuse to mount the filesystem. The way around this is, as mentioned above, to replay the journal to RAM. As far as I know, no filesystems support this natively. What can be done is to use dm-snapshot to create a snapshot of the block device with the COW file residing on a tmpfs, then mount the snapshot device's filesystem read-only. In this case the log will be replayed to RAM.
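
The snapshot setup described above could be scripted roughly as follows. This is a sketch under several assumptions: `/dev/shm` is a tmpfs, the `blockdev`, `losetup`, `dmsetup` and `mount` tools are installed, and the caller is root. The function names, the 256 MiB COW size, the `ro-snap` mapping name and the injectable `run` parameter are all hypothetical choices for illustration.

```python
import subprocess


def snapshot_table(origin, cow_dev, sectors, chunk_sectors=8):
    """Build a dm-snapshot table line with a non-persistent ('N') COW store."""
    return f"0 {sectors} snapshot {origin} {cow_dev} N {chunk_sectors}"


def mount_via_ram_snapshot(origin, mountpoint, name="ro-snap",
                           cow_file="/dev/shm/ro-snap.cow",
                           cow_bytes=256 << 20, run=subprocess.run):
    """Mount a snapshot of `origin` whose writes land in a tmpfs-backed file."""
    def out(*cmd):
        result = run(list(cmd), check=True, capture_output=True, text=True)
        return result.stdout.strip()

    # Size of the origin device in 512-byte sectors.
    sectors = int(out("blockdev", "--getsz", origin))

    # Sparse file on tmpfs that will hold the copy-on-write data.
    with open(cow_file, "wb") as cow:
        cow.truncate(cow_bytes)
    cow_dev = out("losetup", "--find", "--show", cow_file)

    out("dmsetup", "create", name, "--table",
        snapshot_table(origin, cow_dev, sectors))

    # Any journal replay now writes to the snapshot, i.e. to RAM.
    snap_dev = f"/dev/mapper/{name}"
    out("mount", "-o", "ro", snap_dev, mountpoint)
    return snap_dev
```

Teardown (unmount, `dmsetup remove`, `losetup -d`, deleting the COW file) is left out for brevity. The COW store has to be large enough to absorb the replay; if it fills up, the snapshot becomes invalid.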

The caveat here is that, while in this state, the underlying block device should not be written to. So effectively the block device is locked up until the snapshot device is removed. A scenario where this might be an issue is as follows. You have a hard drive with ISOs on an ext4 filesystem, and you use GRUB's ISO loopback mount to boot from them. When booting one of the ISOs, you use the snapshot setup with this filesystem because you don't know whether it's part of a hibernation image. When the live CD is up and running, you will not be able to modify the filesystem (you can write to it, but all changes are in RAM and will be lost). That's as it should be if the filesystem is part of a hibernation image, but it might be more likely that it's not. And in that case, it might be confusing why this restriction exists.

What would be good is to have code that can detect whether the system has a hibernation image or, even better, whether the filesystem was mounted when a hibernation happened (not sure if that's possible). Then the snapshot workaround could be applied only when needed.
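
For the first, weaker check (does a Linux hibernation image exist at all?), one possible heuristic is to look at the signature at the end of the first page of each swap area. The sketch below assumes 4 KiB pages and assumes that the in-kernel hibernation code replaces the usual swap signature with "S1SUSPEND" while an image is stored. It does not cover other hibernation implementations or Windows, and it cannot tell which filesystems were mounted at the time. The function name is hypothetical.

```python
PAGE_SIZE = 4096   # assumption: 4 KiB pages
SIG_LEN = 10       # the signature occupies the last 10 bytes of the first page

SWAP_SIGNATURES = (b"SWAPSPACE2", b"SWAP-SPACE")
HIBERNATE_PREFIX = b"S1SUSPEND"  # assumed marker for a stored hibernation image


def swap_page_state(first_page: bytes) -> str:
    """Classify the first page of a swap area.

    Returns 'hibernation-image', 'swap' or 'unknown'.
    """
    if len(first_page) < PAGE_SIZE:
        return "unknown"
    sig = first_page[PAGE_SIZE - SIG_LEN:PAGE_SIZE]
    if sig.startswith(HIBERNATE_PREFIX):
        return "hibernation-image"
    if sig in SWAP_SIGNATURES:
        return "swap"
    return "unknown"
```

The page itself would be read from the swap device opened read-only, for example `open("/dev/sdXN", "rb").read(4096)` with an illustrative device name.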
