Comment 5 for bug 556621

Theodore Ts'o (tytso) wrote : Re: [Bug 556621] Re: lazy_itable_init not on by default

On Thu, Apr 08, 2010 at 04:38:44PM -0000, Colin Watson wrote:
> Thanks for that information, Ted!
>
> We can certainly tell in the partitioner whether we're reformatting an
> existing partition or not; maybe not 100% reliably if somebody deletes a
> partition, commits, and re-creates another one in the same place, but
> with some reliability. That said I'm not especially keen on creating
> bugs at this point even if we think we can get it right.

Hi Colin,

I've thought about adding checks like this in mke2fs. The problem
comes if someone deletes a partition and recreates it _slightly_
_shifted_, such that the superblock isn't in the same place. The
problem is that stale blocks in the inode table can look like valid
inodes. When e2fsck scans the disk, it normally stops at the last
block marked as containing valid inode table in each block group
descriptor. However, if the block group checksum is invalid, e2fsck
can't trust the "last valid inode" field, so it needs to scan the
whole inode table, stale blocks included.
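
As a rough illustration of that decision (a sketch only, not e2fsck's
actual code; GroupDesc, last_valid_inode, checksum_ok and
inodes_to_scan are made-up names standing in for the real on-disk
fields and logic):

```python
from dataclasses import dataclass


@dataclass
class GroupDesc:
    """Hypothetical, simplified view of one block group descriptor."""
    inodes_per_group: int   # total inode slots in this group's inode table
    last_valid_inode: int   # slots beyond this index were never initialized
    checksum_ok: bool       # did the descriptor's checksum verify?


def inodes_to_scan(gd: GroupDesc) -> int:
    """Return how many inode slots a checker has to examine in this group."""
    if gd.checksum_ok:
        # The descriptor is trustworthy, so stop at the recorded boundary.
        return gd.last_valid_inode
    # The boundary can't be trusted: every slot must be examined, including
    # slots that may still hold inodes from a previous file system.
    return gd.inodes_per_group
```

With a valid descriptor only the initialized part of the table is
read; with a bad checksum the whole table is read, which is where
leftover inodes from an old file system get picked up.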

The problem is very similar to reiserfs's "scan the whole disk looking
for things that _look_ like reiserfs b-tree blocks", where if you have
file system images of reiserfs file systems (for KVM, or VMware, for
example) in a reiserfs filesystem, and reiserfs decides it needs to
rebuild the top level filesystem's b-tree, it scans the whole disk and
finds the b-tree blocks from the image files, and Hilarity Ensues.

With ext3/4, e2fsck at least only has to look at where the inode table
blocks are located. (With reiserfs, if you reuse a partition or a
part of a partition, you need to completely zero out the whole disk to
avoid this problem since its b-tree blocks can be located anywhere.)

There are two solutions planned for ext4. The first, which doesn't
require any file system format changes, is one where we zero out the
inode table blocks in the background via a kernel thread. This is much
like how md rebuilds a RAID 1 mirror after a system crash. The system
is a little vulnerable and a little slower than normal while this is
going on, but it's considered an acceptable tradeoff, since the user
can start using the machine immediately, and usually the user doesn't
have a lot of precious data immediately after the install. (There
would be a mount option to suppress the background thread, so the
installer can complete quickly.)
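
A user-space toy model of the idea, assuming the "disk" is just a
bytearray and the inode table locations are given as (offset, length)
pairs. The function names are invented for illustration; the real
thing would be a kernel thread working on block devices:

```python
import threading


def zero_inode_tables(disk: bytearray, tables, done: threading.Event,
                      chunk: int = 4096) -> None:
    """Zero every inode table range on the toy disk, one chunk at a time."""
    for start, length in tables:
        end = start + length
        for off in range(start, end, chunk):
            n = min(chunk, end - off)
            disk[off:off + n] = bytes(n)   # same-length overwrite with zeros
    done.set()                             # signal that the tables are clean


def start_lazy_init(disk: bytearray, tables):
    """Start zeroing in the background and return (thread, done_event)."""
    done = threading.Event()
    worker = threading.Thread(target=zero_inode_tables,
                              args=(disk, tables, done), daemon=True)
    worker.start()
    return worker, done
```

The caller can carry on using the "file system" right away and only
needs to check the event if it wants to know whether the window of
vulnerability has closed.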

The second solution planned is to add inode table block checksums into
the inode table, where the checksum would include a filesystem unique
seed stored in the superblock. That way, e2fsck can validate the
checksum and skip the inode table block from a previous file system,
since the checksum will be incorrect in that case.
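
A minimal sketch of how such a seeded checksum could work. CRC32 from
Python's zlib is used purely as a stand-in; the actual checksum
algorithm, the seed format, and whether the block number is mixed in
are all assumptions here, not a description of a finished on-disk
format:

```python
import zlib


def block_checksum(fs_seed: bytes, block_no: int, payload: bytes) -> int:
    """Checksum an inode table block, mixing in the per-filesystem seed."""
    crc = zlib.crc32(fs_seed)                               # seed from the superblock
    crc = zlib.crc32(block_no.to_bytes(8, "little"), crc)   # bind to its location
    return zlib.crc32(payload, crc)                         # then the block contents


def block_belongs_to_fs(fs_seed: bytes, block_no: int,
                        payload: bytes, stored: int) -> bool:
    """True if the stored checksum matches this file system's seed."""
    return block_checksum(fs_seed, block_no, payload) == stored
```

A block written under the old file system's seed fails the check once
mke2fs has picked a new seed, so the checker can skip it even though
its contents still look like perfectly good inodes.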

Neither of these is that hard, but it's a matter of finding the time
to implement them.... Unfortunately, it's not something $DAYJOB is
going to pay me to do on company time, so it's going to come either if
I can find some reliable OSS minions, or when I can find my own
personal spare time.

      - Ted