fsck.ext4 marks FS clean when still damaged

Bug #615899 reported by vandyswa on 2010-08-10
This bug affects 1 person
Affects: e2fsprogs (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

Binary package hint: e2fsprogs

During repair of a corrupted root filesystem (a whole other story), on a whim due to my old UNIX experiences, I re-ran fsck.ext4 on the filesystem a second time. It found more corruption. Because I was running without my root filesystem available, I was unable to save a log of the fsck output, but the corruption generally involved blocks allocated to more than one file, with blocks being cloned during the repairs.

I suggest that regression testing for fsck.ext4 include running it a second time (with -f, natch) to verify that the FS is truly clean. In this case, the incomplete repair made it much harder to detect new corruption while debugging the original corruption bug, since it turns out some of what I was seeing may have been residual corruption left over from the first repair. More generally, it means that you'll mark a FS clean and put it back into operation while it still has problems.
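The two-pass check suggested above can be sketched in Python as a minimal example, assuming e2fsprogs is installed and the check runs against an unmounted device or a dd copy of one. The exit-status values come from the e2fsck(8) man page; the function names and the injectable `run` parameter are hypothetical, added only so the logic can be exercised without a real disk.

```python
import subprocess
from typing import Callable, Sequence

# Exit-status bits as documented in e2fsck(8); e2fsck ORs them together.
FSCK_OK = 0            # no errors
FSCK_NONDESTRUCT = 1   # file system errors corrected
FSCK_REBOOT = 2        # errors corrected, system should be rebooted
FSCK_UNCORRECTED = 4   # file system errors left uncorrected
FSCK_ERROR = 8         # operational error
FSCK_USAGE = 16        # usage or syntax error
FSCK_CANCELED = 32     # canceled by user request
FSCK_LIBRARY = 128     # shared library error

# Hypothetical hook: takes e2fsck arguments, returns its exit status.
Runner = Callable[[Sequence[str]], int]


def run_e2fsck(args: Sequence[str]) -> int:
    """Run the real e2fsck binary and return its exit status."""
    return subprocess.run(["e2fsck", *args], check=False).returncode


def repair_and_verify(image: str, run: Runner = run_e2fsck) -> bool:
    """Repair an unmounted file system, then re-check it without changes.

    Returns True only if the second forced pass finds nothing to fix.
    """
    # Pass 1: force a full check and answer "yes" to every repair.
    first = run(["-f", "-y", image])
    failure_bits = (FSCK_UNCORRECTED | FSCK_ERROR | FSCK_USAGE
                    | FSCK_CANCELED | FSCK_LIBRARY)
    if first & failure_bits:
        return False  # the repair pass itself did not complete cleanly
    # Pass 2: force a full check again, read-only (-n answers "no" to all).
    # Any leftover damage shows up as a nonzero status (typically 4).
    second = run(["-f", "-n", image])
    return second == FSCK_OK
```

Run against a copy taken with dd (for example a hypothetical `/tmp/root-copy.img`), this flags exactly the situation described in the report: a first pass that exits with "errors corrected" followed by a second pass that still finds problems.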

Theodore Ts'o (tytso) wrote :

The regression testing for e2fsck already involves running it a second time to make sure the file system is truly clean.

That is certainly something we strive for. If you get errors after running e2fsck -y (i.e., you say yes to fix all corruptions), then one of the following is true: (a) the hardware is buggy, (b) you ran e2fsck while the file system was mounted, or (c) there is a bug in e2fsck.

Unfortunately I need a reproducible test case to be able to do much more with this report. We already have regression tests covering blocks claimed by multiple files, where the blocks are cloned to fix the problem.

vandyswa (ajv-cauriumbin) wrote :

Thanks. This was certainly run standalone (the root fsck failed during the boot sequence), so it is either a HW or a SW bug. I will be swapping the server out in about a week, and will look at putting a second drive in so I can dd an image before fsck'ing. Then, assuming it's SW and not HW, I should be able to reproduce the behavior on demand. I'm happy to hear about your thorough regression testing! Please feel free to close this bug and I'll post something new if I can get a reproducible situation for your consideration.

Regards,
Andy Valencia

Theodore Ts'o (tytso) wrote :

There are around 100 tests in e2fsprogs which test various file system corruptions. Please see the e2fsprogs/tests/f_* subdirectories. Each one includes a small file system image, plus the expected output after the first and second fsck runs.
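A test case laid out as described (a small file system image plus the expected output of each fsck pass) could be checked with something like the following. This is a hypothetical sketch, not the actual e2fsprogs test driver: it assumes the expected transcripts are stored as `expect.1` and `expect.2` in the test directory, and it skips the output filtering a real harness would need (for example, stripping version banners and image paths).

```python
import difflib
from pathlib import Path


def diff_against_expected(test_dir: Path, pass_no: int, actual: str) -> list[str]:
    """Return unified-diff lines between expect.<pass_no> and the actual output.

    An empty list means the pass produced exactly the recorded output.
    (The expect.N file naming is an assumption of this sketch.)
    """
    expected_file = test_dir / f"expect.{pass_no}"
    expected = expected_file.read_text().splitlines(keepends=True)
    return list(difflib.unified_diff(
        expected,
        actual.splitlines(keepends=True),
        fromfile=str(expected_file),
        tofile=f"pass {pass_no} output",
    ))


def check_test_case(test_dir: Path, outputs: dict[int, str]) -> bool:
    """A case passes only if both the repair pass and the re-check match."""
    return all(not diff_against_expected(test_dir, n, outputs[n]) for n in (1, 2))
```

Because both transcripts must match, a second pass that still reports problems fails the case even when the first pass looked correct.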

Note that Ubuntu saves fsck output in /var/log/fsck/, although if it is a failed root fs check, it probably won't be possible to write the output to /var/log/fsck. This is why I recommend that most people use a small root file system, and keep most of their data on file systems which are then mounted on top of that small root file system. It makes it easier to capture fsck logs, and otherwise debug things when they go south.

Theodore Ts'o (tytso) wrote :

Without more information, should we close this particular bug as unreproducible?

vandyswa (ajv-cauriumbin) wrote :

Please close, as that server no longer exists and it doesn't happen elsewhere. It very possibly could just have been hardware issues. Thanks.
