fsck should do some sanity checks to avoid damaging an ext3/ext4 file system

Bug #1375640 reported by Michael Mess
This bug affects 1 person
Affects: e2fsprogs (Ubuntu)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

After some crashes caused by a broken nvidia driver, the system didn't boot and dropped me into a shell to repair the file system manually.
Without thinking, I typed fsck without the -t ext4 parameter, and the program asked me many questions about whether I wanted to fix something, all of which I answered with y.
After some time I remembered that I had forgotten to specify the -t ext4 parameter, but by the time I pressed Ctrl-C it was too late: the filesystem was already broken.
I also didn't consider that there could already have been severe file system corruption, which made things even worse.

Thus I would recommend that a warning message be displayed when fsck is called manually with the wrong filesystem type.
A prominent note that this is the right time to make an image backup of the disk could also be helpful for people like me, who always tell others how important backups are before working on something, but sometimes fail to make one themselves because they forget to turn their brain on before starting work. ;-)

I would expect fsck, when called on a filesystem, to determine the filesystem type rather than just use defaults.
Then a big scary warning message should say something like this:

"
You have called fsck on an ext4 filesystem.
For an ext4 file system you *must* specify the filesystem type with the option "-t ext4", otherwise severe damage to the file system will occur.
NOTE: It is highly recommended that you make a backup image of your disk device before you continue, so that you have something to recover from if things go wrong.
This is especially important if the disk drive hardware you are trying to recover the filesystem on may be dying.
You can use dd_rescue or similar tools to try to get a disk image, so as to recover as much of your data as possible from a damaged hard drive.

Are you really sure you want to continue (and probably damage the file system)? (yes/NO)
"

Michael Mess (michael-michaelmess) wrote :

It should be necessary to type "yes" to continue. Just "y" or hitting enter should not do it; either should abort the command.

Theodore Ts'o (tytso) wrote :

You're jumping to conclusions here that the damage was caused by the lack of "-t ext4" to fsck. #1, the fsck program will determine the file system type based on the file system type found in /etc/fstab. So if the /etc/fstab file indicates that you were using ext4, then it will do the equivalent of running "fsck -t ext4". #2, even if that was wrong, all that would have happened was that the wrong fsck program would have been called. If /etc/fstab had claimed it was a btrfs or xfs file system, fsck would have run fsck.btrfs or fsck.xfs, and they would have rejected the file system as not a btrfs or xfs. If the /etc/fstab wrongly thought it was an ext2 or ext3, then fsck would have run /sbin/fsck.ext2 or /sbin/fsck.ext3 --- but /sbin/fsck.ext* are all symlinks to the same program, /sbin/e2fsck. So it wouldn't have made a difference.
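To illustrate the lookup described above, here is a minimal Python sketch of how a wrapper could map a device to a checker name via /etc/fstab. It is a simplified model, not the actual fsck source; the function names and the sample fstab contents are made up for illustration, and it covers only the fstab lookup.

```python
def fstab_fstype(fstab_text, device):
    """Return the filesystem type recorded for `device`, or None if absent."""
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        # fstab fields: device, mount point, type, options, dump, pass
        if len(fields) >= 3 and fields[0] == device:
            return fields[2]
    return None


def helper_name(fstype):
    """Name of the per-filesystem checker the wrapper would hand off to."""
    return f"fsck.{fstype}"


# Hypothetical fstab contents, for illustration only.
SAMPLE_FSTAB = """\
# <device>  <mount point>  <type>  <options>          <dump> <pass>
/dev/sda1   /              ext4    errors=remount-ro  0      1
/dev/sda2   /home          ext3    defaults           0      2
"""

fstype = fstab_fstype(SAMPLE_FSTAB, "/dev/sda1")
print(fstype, helper_name(fstype))
```

For the sample above, both /dev/sda1 and /dev/sda2 end up at a fsck.ext* name, which, as noted, is the same e2fsck program either way.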

So the "severe file system damage" had nothing to do with the use or lack thereof of "-t ext4", but rather happened because the file system was badly damaged by the Nvidia-induced crash. If the closed source kernel module managed to scribble on kernel memory, it could have caused metadata blocks to be written to the wrong place on disk, or corrupted critical metadata blocks. Or if you had to hard power cycle the system, and you were using an SSD that doesn't have power fail protection (most consumer grade SSDs don't), then that might also explain the severe file system corruption.
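One way to see why the name of the checker does not matter for ext2/ext3/ext4: the features in use (journal, extents, and so on) are recorded as flags in the on-disk superblock, so a tool can read them from the disk itself. The Python sketch below reads two of those flags from a raw image. It is an illustration only, not how e2fsck is implemented; the function name is made up, and the demo uses a fabricated in-memory superblock rather than a real file system.

```python
import struct

SUPERBLOCK_OFFSET = 1024     # primary superblock starts 1024 bytes into the device
EXT_MAGIC = 0xEF53           # s_magic, shared by ext2, ext3 and ext4
COMPAT_HAS_JOURNAL = 0x0004  # s_feature_compat bit: journal present
INCOMPAT_EXTENTS = 0x0040    # s_feature_incompat bit: extent-mapped files


def read_ext_features(image: bytes):
    """Return journal/extents flags from an ext superblock, or None if not ext."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    if len(sb) < 0x64:
        return None  # too short to hold the fields read below
    (magic,) = struct.unpack_from("<H", sb, 0x38)
    if magic != EXT_MAGIC:
        return None
    # s_feature_compat at 0x5C, s_feature_incompat at 0x60 (little-endian u32)
    compat, incompat = struct.unpack_from("<II", sb, 0x5C)
    return {
        "has_journal": bool(compat & COMPAT_HAS_JOURNAL),
        "extents": bool(incompat & INCOMPAT_EXTENTS),
    }


# Fabricated 2 KiB image with only the fields of interest filled in.
fake = bytearray(2048)
struct.pack_into("<H", fake, SUPERBLOCK_OFFSET + 0x38, EXT_MAGIC)
struct.pack_into("<II", fake, SUPERBLOCK_OFFSET + 0x5C,
                 COMPAT_HAS_JOURNAL, INCOMPAT_EXTENTS)
print(read_ext_features(bytes(fake)))
```

On a real system, only ever run something like this against a copy of the disk image, opened read-only.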

Michael Mess (michael-michaelmess) wrote :

Thank you for the information.
I had thought that I destroyed the extents and other features of the ext4 file system by using a tool that is not aware of the special ext4 features, leading to filesystem corruption.
What happens if I hit CTRL-C during fsck? Can that leave the filesystem in an even more inconsistent state?

I used an SSD, so it is likely that resetting the system could have damaged the file system.
It is probably a good idea to keep the /home directory on a separate partition.
This is what I used to do; when there was file system corruption in the past (mostly dying disks), the root filesystem got broken, but the home partition had no or only minor issues, so recovery was always easy.
But it is not the default when installing Ubuntu, and for this installation I didn't bother with it.

The nvidia driver could also have caused the damage by overwriting memory, which finally led to the crash.
But if it was memory corruption by the nvidia driver, it is likely that it could also have hit a separate home partition or any other mounted drive.

Is there a way to recover the file system manually? There is still hope that the areas with data are mostly still present. I still keep a copy of the disk image and I am still trying some recovery tools on a copy of that image.

After running fsck.ext4, some data still exists and some files have gone to lost+found, but many files seem to be lost (the partition now appears to be mostly unused).
It seems that much space has been unallocated, and as a result some files could have been partially overwritten during system boot, when the system writes to the disk.

As far as I know, to improve speed, fsck.ext4 does not scan the whole disk for files to be linked to lost+found, but only the areas that have been marked as used.
Is it possible to force a search of the whole disk image for existing fragments of the file system that are not allocated as used and no longer have any linkage to the existing filesystem?
I know that this could turn up much garbage (very old files that were deleted or overwritten long ago and are of no use), but it could probably help to recover some useful data whose linkage to the file system has been lost and which has been unallocated by the filesystem corruption.

Could you provide me a link to current documentation of ext4 and fsck? Where can I get the sources?

Can you recommend some good open source recovery tools?

Theodore Ts'o (tytso) wrote :

1) e2fsck (which is what fsck.ext[234] is hard linked to) is designed to be safe when getting interrupted by ^C

2) All kernel code uses the same address space. So if a wild pointer in the Nvidia driver scribbles on random kernel memory, literally anything can happen. There is a reason why upstream kernel developers will generally refuse to debug a kernel crash where the kernel is "tainted" by using a proprietary closed source kernel module. The kernel oops has a taint flag specifically because buggy video drivers in particular have historically been responsible for a huge number of failures that were at first not obviously related to the video driver, and upstream kernel developers wasted a lot of time before concluding: try again without any closed source modules loaded; we refuse to debug a tainted kernel.

And before you ask why not put the Nvidia driver in a separate address space --- Microsoft tried this a long time ago with the first version of Windows NT --- and the resulting video performance was sooooo awful that the first version of Windows NT was a complete failure, and they had to reverse course and allow video drivers to run in the same address space. Linux cares about performance as well, and besides, we have zero interest in enabling crappy closed-source drivers, especially those from video card companies. Personally, I use laptops with Intel video chips, since that allows me to use a 100% open source kernel.

3) It may be possible to recover the data by using e2fsck and debugfs, if you can find someone with ext4 skills who is willing to give you some free consulting time. Alternatively, you can try using photorec, which is one of the better tools for scanning the raw partition and finding and recognizing various data formats. It was originally designed to find image files (hence photorec), but it has since been extended to support a huge number of document types, including Libreoffice, Microsoft Office, etc., etc.
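The general idea of scanning a raw partition for recognizable data formats, as mentioned in 3), can be sketched in a few lines of Python. This is a toy illustration of signature-based carving for one format (JPEG), not photorec's actual logic, which is far more thorough; the function name is made up, and the sketch assumes each file is stored contiguously and ignores nested or fragmented data.

```python
JPEG_SOI = b"\xff\xd8\xff"  # byte sequence that opens a JPEG file
JPEG_EOI = b"\xff\xd9"      # JPEG end-of-image marker


def carve_jpegs(raw: bytes):
    """Return (start, end) byte offsets of JPEG-looking runs found in `raw`."""
    found = []
    pos = 0
    while True:
        start = raw.find(JPEG_SOI, pos)
        if start == -1:
            break  # no further start signature
        end = raw.find(JPEG_EOI, start + len(JPEG_SOI))
        if end == -1:
            break  # start signature without a matching end marker
        end += len(JPEG_EOI)
        found.append((start, end))
        pos = end  # continue scanning after this candidate
    return found


# Fabricated "disk" contents: padding, one JPEG-like run, more padding.
blob = b"\x00" * 16 + JPEG_SOI + b"\xe0fake image data" + JPEG_EOI + b"\x00" * 8
for start, end in carve_jpegs(blob):
    print(f"candidate at bytes {start}..{end}")
```

As with any recovery attempt, this kind of scan belongs on a copy of the disk image, never on the original device.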

Michael Mess (michael-michaelmess) wrote :

Thank you for your recommendations.
I was able to restore some of the data that had been lost.

The nvidia graphics card was causing all the trouble; after replacing it with Intel graphics, I have never experienced any issues again.
It is likely that the closed source driver also caused other issues, which didn't always lead to crashes but made for an unreliable system where things got corrupted silently.

But this is surely not an issue in tools like fsck, which merely produce a lot of I/O load and thus have a greater probability of being hit by the buggy nvidia driver, so I can close the issue.

Since then I have always avoided buying equipment from firms which do not provide open source drivers, and I have never experienced any issues with data corruption any more.

So I call out to the world: If you are a hardware manufacturer and want me to buy your product: Provide good open source drivers!

Changed in e2fsprogs (Ubuntu):
status: New → Invalid