Bug #556621 “lazy_itable_init not on by default” : Bugs : e2fsprogs package : Ubuntu

Phillip Susi (psusi) on 2010-04-06

Changed in e2fsprogs (Ubuntu):
assignee:	nobody → Phillip Susi (psusi)
status:	New → In Progress

Phillip Susi (psusi) wrote on 2010-04-08:

#1

Just tested installing lucid beta 2 to a 1.5 TB WD15EARS drive from a liveusb and setting this option in /etc/mke2fs.conf cut the install time in HALF. Without the option 5.5-6 minutes are spent sitting at 5% complete with no visible progress while mkfs runs, for an 11+ minute total install. After setting the option, the time spent in mkfs drops to ~15 seconds and gives just over 6 minutes total time to install.

If this option is not set as a default in mke2fs.conf, then ubiquity should at least consider specifying it when calling mke2fs.

Colin Watson (cjwatson) wrote on 2010-04-08:

#2

(We don't use the ubiquity upstream project for bug tracking. I've opened a distribution task on partman-ext3 instead.)

Changed in ubiquity:
status:	New → Invalid

Theodore Ts'o (tytso) wrote on 2010-04-08: Re: [Bug 556621] Re: lazy_itable_init not on by default

#3

On Thu, Apr 08, 2010 at 04:00:34AM -0000, Phillip Susi wrote:
> Just tested installing lucid beta 2 to a 1.5 TB WD15EARS drive from a
> liveusb and setting this option in /etc/mke2fs.conf cut the install time
> in HALF. Without the option 5.5-6 minutes are spent sitting at 5%
> complete with no visible progress while mkfs runs, for an 11+ minute
> total install. After setting the option, the time spent in mkfs drops
> to ~15 seconds and gives just over 6 minutes total time to install.
>
> If this option is not set as a default in mke2fs.conf, then ubiquity
> should at least consider specifying it when calling mke2fs.

It's safe to use lazy_itable_init on brand-spanking-new disks. It's
safe if you are reformatting an existing partition, AND you never have
any errors in the block group descriptors that cause the block group
checksums to be invalid during the life of the file system.

If there are invalid block group checksums and a previous file system
is reformatted using lazy_itable_init, e2fsck can get confused by
inodes from previous file systems. This is why the default is to zero
out the entire inode table.

Given that Ubuntu users tend to be, ah, less sophisticated, and _very_
loud about complaining on Launchpad when things go wrong in confusing
ways, I can't really recommend enabling lazy_itable_init by default in
the Ubuntu installer at this time. There is some kernel development
work that I have planned that will make it safe, but that work
hasn't happened yet.

- Ted

Colin Watson (cjwatson) wrote on 2010-04-08:

#4

Thanks for that information, Ted!

We can certainly tell in the partitioner whether we're reformatting an existing partition or not; maybe not 100% reliably if somebody deletes a partition, commits, and re-creates another one in the same place, but with some reliability. That said I'm not especially keen on creating bugs at this point even if we think we can get it right.

So, I'm quite tempted to add this but make it configurable using an off-by-default preseed question, so you could say partman-ext3/lazy_itable_init=true as a boot parameter. That would make testing on my 1TB external disk more bearable.

Theodore Ts'o (tytso) wrote on 2010-04-08:

#5

On Thu, Apr 08, 2010 at 04:38:44PM -0000, Colin Watson wrote:
> Thanks for that information, Ted!
>
> We can certainly tell in the partitioner whether we're reformatting an
> existing partition or not; maybe not 100% reliably if somebody deletes a
> partition, commits, and re-creates another one in the same place, but
> with some reliability. That said I'm not especially keen on creating
> bugs at this point even if we think we can get it right.

Hi Colin,

I've thought about adding checks like this in mke2fs. The problem
comes if someone deletes a partition and recreates it _slightly_
_shifted_, such that the superblock isn't in the same place. The
problem is that if there are blocks in the inode table that look like
valid inodes, then when e2fsck scans the disk, it normally stops at
the last block marked as containing valid inode table in each block
group descriptor. However, if the block group checksum is invalid,
e2fsck can't trust the "last valid inode" field, so it needs to scan
the whole inode table.
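
The scanning rule described here can be modelled with a small Python toy. This is only a sketch under stated assumptions: `BlockGroup`, its fields, and `inodes_to_scan` are hypothetical names made up for illustration and are not e2fsck's real data structures.

```python
from dataclasses import dataclass


@dataclass
class BlockGroup:
    # All field names are hypothetical; they only mirror the description above.
    inodes_per_group: int   # total inode slots in this group's inode table
    unused_inodes: int      # slots at the end of the table marked as never used
    checksum_valid: bool    # whether the group descriptor checksum verifies


def inodes_to_scan(bg: BlockGroup) -> int:
    """Return how many inode slots a checker would have to read in this group."""
    if bg.checksum_valid:
        # The descriptor can be trusted: stop at the last inode marked valid.
        return bg.inodes_per_group - bg.unused_inodes
    # The descriptor cannot be trusted: scan the whole table, including slots
    # that were never zeroed and may still hold inodes from an older filesystem.
    return bg.inodes_per_group


good = BlockGroup(inodes_per_group=8192, unused_inodes=8000, checksum_valid=True)
bad = BlockGroup(inodes_per_group=8192, unused_inodes=8000, checksum_valid=False)
print(inodes_to_scan(good))  # 192
print(inodes_to_scan(bad))   # 8192
```

With a valid checksum only the 192 initialized slots are read; with an invalid one all 8192 slots are read, which is where stale inodes can show up.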

The problem is very similar to reiserfs's "scan the whole disk looking
for things that _look_ like reiserfs b-tree blocks", where if you have
file system images of reiserfs file systems (for KVM, or VMware, for
example) in a reiserfs filesystem, and reiserfs decides it needs to
rebuild the top level filesystem's b-tree, it scans the whole disk and
finds the b-tree blocks from the image files, and Hilarity Ensues.

With ext3/4, e2fsck at least only has to look at where the inode table
blocks are located. (With reiserfs, if you reuse a partition or a
part of a partition, you need to completely zero out the whole disk to
avoid this problem since its b-tree blocks can be located anywhere.)

There are two solutions planned for ext4. The first, which doesn't
require any file system format changes, is one where we zero out the
inode table blocks in the background via a kernel thread. This is much
like how md rebuilds a raid 1 mirror after a system crash. The system
is a little vulnerable and a little slower than normal while this is
going on, but it's considered an acceptable tradeoff, since the user
can start using the machine immediately, and usually the user doesn't
have a lot of precious data immediately after the install. (There
would be a mount option to suppress the background thread, so the
installer can complete quickly.)

The second solution planned is to add inode table block checksums into
the inode table, where the checksum would include a filesystem unique
seed stored in the superblock. That way, e2fsck can validate the
checksum and skip the inode table block from a previous file system,
since the checksum will be incorrect in that case.
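
A rough Python sketch of the seeded-checksum idea follows. Everything in it is an assumption made for illustration: the block layout (a 4-byte checksum appended to the payload), the helper names, and the use of `zlib.crc32` as a stand-in for whatever checksum a real implementation would pick.

```python
import zlib

BLOCK_SIZE = 4096  # illustrative block size


def block_checksum(fs_seed: bytes, payload: bytes) -> int:
    """Checksum of an inode table block, keyed by a per-filesystem seed."""
    # Feeding the seed in first makes the result depend on the filesystem.
    return zlib.crc32(payload, zlib.crc32(fs_seed))


def make_block(fs_seed: bytes, payload: bytes) -> bytes:
    """Hypothetical layout: payload followed by a 4-byte little-endian checksum."""
    return payload + block_checksum(fs_seed, payload).to_bytes(4, "little")


def block_belongs_to(fs_seed: bytes, block: bytes) -> bool:
    """True if the stored checksum matches this filesystem's seed."""
    payload, stored = block[:-4], int.from_bytes(block[-4:], "little")
    return block_checksum(fs_seed, payload) == stored


old_seed = b"old-filesystem-uuid"
new_seed = b"new-filesystem-uuid"
stale = make_block(old_seed, b"\x01" * (BLOCK_SIZE - 4))
fresh = make_block(new_seed, b"\x02" * (BLOCK_SIZE - 4))

print(block_belongs_to(new_seed, fresh))  # True
print(block_belongs_to(new_seed, stale))  # False
```

A block written by the previous filesystem fails verification against the new seed, so a checker could skip it without having to zero it first.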

Both of these are not that hard, but it's a matter of finding the time
to implement them.... Unfortunately, it's not something $DAYJOB is
going to pay me to do on company time, so it's going to either come if
I can find some reliable OSS minions, or when I can find my own
personal spare time.

- Ted

Phillip Susi (psusi) wrote on 2010-04-08:

#6

On 4/8/2010 9:17 AM, Theodore Ts'o wrote:
> If there are invalid block group checksums and a previous file system
> is reformatted using lazy_itable_init, e2fsck can get confused by
> inodes from previous file systems. This is why the default is to zero
> out the entire inode table.

Isn't that why there are multiple copies of the block group descriptor
table? If one got corrupted, wouldn't the backup then be consulted
which would have the uninitialized flag set correctly so fsck would know
to ignore the inode table?

Theodore Ts'o (tytso) wrote on 2010-04-10:

#7

On Thu, Apr 08, 2010 at 07:02:54PM -0000, Phillip Susi wrote:
> On 4/8/2010 9:17 AM, Theodore Ts'o wrote:
> > If there are invalid block group checksums and a previous file system
> > is reformatted using lazy_itable_init, e2fsck can get confused by
> > inodes from previous file systems. This is why the default is to zero
> > out the entire inode table.
>
> Isn't that why there are multiple copies of the block group descriptor
> table? If one got corrupted, wouldn't the backup then be consulted
> which would have the uninitialized flag set correctly so fsck would know
> to ignore the inode table?

We don't update the backup copies of the block group descriptor
tables; we use the backup copies to retrieve static data so we can
recover the file system, and how many of the inodes in a block group
have been used/initialized is dynamic data.

I suppose we could try to use the timestamps in the inodes to see if
the data is stale, but as we know from loud complaints on Launchpad,
people are incapable of keeping their system clocks set correctly, so
that's out....

- Ted

Phillip Susi (psusi) wrote on 2010-04-11:

#8

On Sat, 10 Apr 2010 12:04:33 -0000
Theodore Ts'o <email address hidden> wrote:
> We don't update the backup copies of the block group descriptor
> tables; we use the backup copies to retrieve static data so we can
> recover the file system, and how many of the inodes in a block group
> have been used/initialized is dynamic data.

Oh dear, so the kernel and fsck never update the backup copies? Seems
to defeat the purpose doesn't it? I mean if they are never updated,
why have them? Their initial content could be regenerated by mkfs -n
couldn't it?

So if fsck finds the main bg descriptor corrupt and isn't sure if the
inode table was zeroed or not, and happens to find old data that looks
like inodes in the uninitialized table, wouldn't it be highly likely
that such inodes would claim blocks that are either marked as free, or
allocated by other inodes? And these inodes would not have any
directory entries pointing to them, so wouldn't the worst case then be
that fsck places copies of garbage in lost+found, but no actual data
would be lost?

If the worst case scenario in the unlikely event of the bg descriptors
being corrupted is that fsck finds garbage data and puts it in
lost+found, then I don't see a downside to enabling this option.

Theodore Ts'o (tytso) wrote on 2010-04-11:

#9

On Apr 10, 2010, at 10:45 PM, Phillip Susi wrote:

> Oh dear, so the kernel and fsck never update the backup copies? Seems
> to defeat the purpose doesn't it? I mean if they are never updated,
> why have them? Their initial content could be regenerated by mkfs -n
> couldn't it?

*If* you remember exactly all of the command-line parameters you gave to mkfs, *and* if you have exactly the same contents in /etc/mke2fs.conf, and *if* you've never resized the file system, and *if* you are using exactly the same version of mke2fs ---- then why yes, you could find these parameters by using mkfs -n. (Or if you are really brave, you could try restoring the world using mke2fs -S.)

> So if fsck finds the main bg descriptor corrupt and isn't sure if the
> inode table was zeroed or not, and happens to find old data that looks
> like inodes in the uninitialized table, wouldn't it be highly likely
> that such inodes would claim blocks that are either marked as free, or
> allocated by other inodes? And these inodes would not have any
> directory entries pointing to them, so wouldn't the worst case then be
> that fsck places copies of garbage in lost+found, but no actual data
> would be lost?
>
> If the worst case scenario in the unlikely event of the bg descriptors
> being corrupted is that fsck finds garbage data and puts it in
> lost+found, then I don't see a downside to enabling this option.

Ah, if only it were so simple. The other problem is that these inodes may point at blocks used by the current "real" inodes in the file system. This means invoking pass 1b/1c/1d processing, which is slow and requires asking the user whether they would like to clone or delete the multiply-claimed blocks. For very large disks, if users don't have enough memory or don't have enough swap enabled, e2fsck could run out of memory during passes 1b/1c/1d.
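
To illustrate what "multiply-claimed blocks" means, here is a small Python sketch. It is not how e2fsck implements passes 1b/1c/1d; the `multiply_claimed` helper and the inode and block numbers are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List


def multiply_claimed(inodes: Dict[int, List[int]]) -> Dict[int, List[int]]:
    """Map each block claimed by more than one inode to the inodes claiming it."""
    claims: Dict[int, List[int]] = defaultdict(list)
    for ino, blocks in inodes.items():
        for blk in blocks:
            claims[blk].append(ino)
    return {blk: owners for blk, owners in claims.items() if len(owners) > 1}


# Inodes that really belong to the current filesystem.
live = {12: [100, 101, 102], 13: [200, 201]}
# A leftover inode from a previous filesystem, found in an unzeroed table.
stale = {9001: [101, 102, 500]}

conflicts = multiply_claimed({**live, **stale})
print(conflicts)  # {101: [12, 9001], 102: [12, 9001]}
```

The stale inode collides with a live file on two blocks, so the checker cannot simply drop it into lost+found; it has to resolve who owns those blocks. The per-block bookkeeping also hints at why memory use grows with disk size.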

So again, I really do not recommend enabling lazy_itable_init at this time.

-- Ted

Launchpad Janitor (janitor) wrote on 2010-04-15:

#10

This bug was fixed in the package partman-ext3 - 58ubuntu3

---------------
partman-ext3 (58ubuntu3) lucid; urgency=low

  * Add preseedable partman-ext3/lazy_itable_init question, which if true
    runs mkfs.ext* with '-E lazy_itable_init', greatly speeding up mkfs on
    large drives (LP: #556621). This defaults to false since it is
    currently unsafe for use on areas of disk that previously contained a
    filesystem.
-- Colin Watson <email address hidden> Thu, 15 Apr 2010 00:52:48 +0100
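
As a rough illustration of the behaviour this changelog entry describes, the following Python sketch builds the mkfs command line from an off-by-default flag. The `mkfs_command` function and the device path are hypothetical, and this is not the actual partman-ext3 code; only the `-E lazy_itable_init` option comes from the changelog.

```python
from typing import List


def mkfs_command(fstype: str, device: str, lazy_itable_init: bool = False) -> List[str]:
    """Build an argv list for mkfs.<fstype>; lazy init is off unless requested."""
    argv = [f"mkfs.{fstype}"]
    if lazy_itable_init:
        # Only added when the (preseeded) answer is true, as in the changelog.
        argv += ["-E", "lazy_itable_init"]
    argv.append(device)
    return argv


# /dev/sdX1 is a placeholder device name.
print(mkfs_command("ext4", "/dev/sdX1"))
# ['mkfs.ext4', '/dev/sdX1']
print(mkfs_command("ext4", "/dev/sdX1", lazy_itable_init=True))
# ['mkfs.ext4', '-E', 'lazy_itable_init', '/dev/sdX1']
```

Keeping the default at false means nothing changes for users who do not set the preseed question.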

Changed in partman-ext3 (Ubuntu):
status:	New → Fix Released

Scott James Remnant (Canonical) (canonical-scott) wrote on 2010-04-22:

#11

From the sounds of it, the only e2fsprogs bug then is that the man page is wrong by claiming it's the default?

Changed in e2fsprogs (Ubuntu):
status:	In Progress → Triaged
assignee:	Phillip Susi (psusi) → nobody
importance:	Undecided → Low

Phillip Susi (psusi) on 2010-04-22

description:	updated

Phillip Susi (psusi) wrote on 2010-04-22:

#12

Actually I misunderstood the man page. What it meant is that if you specify lazy_itable_init, without a value, then the VALUE defaults to =1. Changing this to wishlist and I guess we should leave things as they are until the kernel gets its thread to zero the tables in the background or fsck is improved.
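
The man page reading described here (an option given without a value means value 1, while an option not given at all stays off) can be sketched in Python. This is a toy parser, not mke2fs's own; `parse_extended_options` is a hypothetical helper, and treating every bare option as "1" is an assumption made to keep the example short.

```python
from typing import Dict


def parse_extended_options(spec: str) -> Dict[str, str]:
    """Parse a comma-separated 'name[=value]' list; a bare name means value '1'."""
    options: Dict[str, str] = {}
    for item in spec.split(","):
        if not item:
            continue
        name, sep, value = item.partition("=")
        # No '=' present: the VALUE defaults to 1, which is what the man page meant.
        options[name] = value if sep else "1"
    return options


print(parse_extended_options("lazy_itable_init"))    # {'lazy_itable_init': '1'}
print(parse_extended_options("lazy_itable_init=0"))  # {'lazy_itable_init': '0'}
print(parse_extended_options(""))                    # {}
```

The last call is the case that matters for this bug: when the option is not passed at all, nothing enables lazy initialization.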

I have run into the problem Ted feared while working on e2defrag where the crc on the group descriptors was wrong, and fsck did indeed ignore the inode allocation map and scan the entire inode table and found old inodes that looked valid, then got upset that they appeared to have multiply claimed blocks. This could possibly be resolved sanely by having fsck compare the ctime of the inode with the creation time of the filesystem and toss out the inode created before the filesystem. Of course, this relies on having working real time clocks.

Changed in e2fsprogs (Ubuntu):
importance:	Low → Wishlist

Theodore Ts'o (tytso) wrote on 2010-04-22:

#13

... and as we've discovered, too many people, including embedded hardware manufacturers (which Canonical seems to care about), don't seem to care about reliable clocks, and apparently virtualization managers can't be bothered to deal with mapping the time zone correctly before setting up the real-time clock in the guest OS.... and instead of filing bugs against the hardware manufacturers and the virtualization managers, they get filed against the file system.

I've essentially despaired at this point that we can get people to care about setting time and time zones correctly with Linux. :-(

In fact, Scott has basically argued successfully that we need a "clock totally broken" option to e2fsck. :-( :-( :-(

Phillip Susi (psusi) wrote on 2010-04-22:

#14

On 4/22/2010 12:04 PM, Theodore Ts'o wrote:
> I've essentially despaired at this point that we can get people to care
> about setting time and time zones correctly with Linux. :-(
>
> In fact, Scott has basically argued successfully that we need a "clock
> totally broken" option to e2fsck. :-( :-( :-(

Yes, I've seen some of that. If the clock really is totally broken then
a sane recovery from this situation may not be possible. As long as the
clock at least is never set to a value < the fs creation time, it should
not be a problem though. Maybe these systems with broken clocks could
at least set them to the creation time of the root fs when mounting?

If they do that, then any inodes created since the fs would at least
have a ctime after the fs creation timestamp, and therefore you could
identify inodes left over from before mkfs and ignore them.

Phillip Susi (psusi) wrote on 2010-05-28:

#15

clear-old-inodes.patch (3.2 KiB, text/plain)

I have come up with the attached patch to fix e2fsck to handle the old inodes left behind when you use lazy_itable_init. It simply checks if ctime < s_mkfs_time and offers to clear the inode. The check is skipped if the clock is detected to be broken. The first prompt informs you that "Inodes with ctime older than the mkfs time detected. Either you have a broken clock, or they are leftover from a previous fs." If you answer yes to clear the inode, the answer is latched and all subsequent inodes with the same problem are also cleared.
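
The logic described in this comment can be approximated in Python as follows. This is a sketch of the description only, not the attached patch (which is C code for e2fsck); `Inode`, `find_stale_inodes`, and the `ask` callback are hypothetical names, and asking again after a "no" answer is an assumption, since the comment only says that a "yes" is latched.

```python
from dataclasses import dataclass
from typing import Callable, List

PROMPT = ("Inodes with ctime older than the mkfs time detected. Either you "
          "have a broken clock, or they are leftover from a previous fs.")


@dataclass
class Inode:
    number: int
    ctime: int  # seconds since the epoch


def find_stale_inodes(inodes: List[Inode], mkfs_time: int, clock_broken: bool,
                      ask: Callable[[str], bool]) -> List[int]:
    """Return the numbers of inodes the user agreed to clear."""
    if clock_broken:
        return []          # the check is skipped when the clock is known bad
    cleared: List[int] = []
    latched_yes = False
    for inode in inodes:
        if inode.ctime >= mkfs_time:
            continue       # created after mkfs, so it belongs to this fs
        if latched_yes or ask(PROMPT):
            latched_yes = True   # a "yes" applies to all later stale inodes
            cleared.append(inode.number)
    return cleared


prompts: List[str] = []


def always_yes(message: str) -> bool:
    prompts.append(message)
    return True


inodes = [Inode(11, 1_000), Inode(12, 5_000), Inode(13, 2_000)]
cleared = find_stale_inodes(inodes, mkfs_time=3_000, clock_broken=False, ask=always_yes)
print(cleared, len(prompts))  # [11, 13] 1
```

Two inodes predate the mkfs time and are cleared, but the user is prompted only once because the first "yes" is latched.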

I feel that this addresses the issue and should allow us to turn on lazy_itable_init by default. What do you all think?

Changed in e2fsprogs (Ubuntu):
assignee:	nobody → Phillip Susi (psusi)
status:	Triaged → In Progress

Brian Murray (brian-murray) on 2010-05-29

tags:	added: patch

Phillip Susi (psusi) wrote on 2010-06-13:

#16

Got any comments Ted?

Theodore Ts'o (tytso) wrote on 2010-06-13:

#17

Other people from Canonical have been filing bugs complaining about how they can't count on the time being correct.

In particular there are certain embedded devices which apparently Canonical has been building for where the time reliably resets itself to some time in the distant past every single time you reboot. Sure, if you happen to be on the network hopefully ntp or some other time daemon will skew the time back to reality, but what if that doesn't happen, and there are inodes which are created back in the 1900's? A patch to e2fsck which unceremoniously offers to clear them is likely going to cause Ubuntu users to go up in arms.

I suppose we could code in some hard-coded dates. (If the time is before when Linux was invented, clearly it's bogus.) However, I'm concerned that such heuristics aren't going to catch them all.

The best way to fix this is to have a kernel patch which clears uninitialized inodes in the background, so it's not done as a blocking activity during mke2fs. That is a much safer thing to do, IMHO.

Phillip Susi (psusi) wrote on 2010-06-14:

#18

On 6/13/2010 6:23 PM, Theodore Ts'o wrote:
> there are inodes which are created back in the 1900's? A patch to
> e2fsck which unceremoniously offers to clear them is likely going to
> cause Ubuntu users to go up in arms.

Couldn't/shouldn't systems with broken real time clocks be fixed to
force the system clock up to mkfs time before mounting the root fs, and
wouldn't that take care of that and other problems?

> I suppose we could code in some hard-coded dates. (If the time is
> before when Linux was invented, clearly it's bogus.) However, I'm
> concerned that such heuristics aren't going to catch them all.

That would tell you that the time is broken, but would not tell you
whether the inode belongs to this fs or not.

> The best way to fix this is to have a kernel patch which clears
> uninitialized inodes in the background, so it's not done as a blocking
> activity during mke2fs. That is a much safer thing to do, IMHO.

I'd rather avoid the need to zero the table completely since that has
negative consequences other than just using up disk IO in the
background. For instance, if the fs is on a snapshot, thin provisioned
san disk, or SSD, the writes cause allocations that aren't needed just
to hold zeroes, which reads there would already return.

Theodore Ts'o (tytso) wrote on 2010-06-14:

#19

>Couldn't/shouldn't systems with broken real time clocks be fixed to
>force the system clock up to mkfs time before mounting the root fs, and
>wouldn't that take care of that and other problems?

Unfortunately, it's not so simple. What about people who run old systems, like Ubuntu LTS, and then upgrade to a newer e2fsprogs? What if someone else from Ubuntu doesn't know about this dependency and releases a version of e2fsprogs with this change?

More seriously, what if there is more than one filesystem? What about a USB storage device containing an extN filesystem which is hotplugged in after system boot?

What about someone running a Live CD system with a crappy clock far in the future, who then formats the filesystem? What if they install a system using an Ubuntu installation CD with the date far into the future?

I agree with you that it would be highly convenient if we could count on the system clock. I come from the Unix world, where we could, and it makes life much easier. In the good old days, Multics simply wouldn't allow the system to come up at all if it would result in time going backwards, and then they utilized this property in all sorts of really cool ways. (Heh, Multics could even allow you to detect a file system corruption, and then repair a live mounted filesystem with that error, all while the file system was mounted. It could even survive a third of its memory suddenly disappearing, and it would only kill the processes which had memory disappear after the janitor plugged the floor waxer into the wrong circuit, blowing an electrical breaker and disabling one of the cabinets containing some of the system's memory.)

Unfortunately, we don't live in that world. We live in the world of clueless users, people who want to dual boot Windows, crap hardware with CMOS crystals that are off by plus or minus 20%, so that simply keeping a crapola tablet or embedded device turned off while it is shipped from Taiwan to the US in a container ship will cause the clock to be at some random time. We live in the world of crappy virtualization manager software which sets the CMOS time from the Unix system time and then doesn't bother to do the time zone virtualization. (I'm looking at you, Ubuntu; the bug was filed in Launchpad a while back IIRC --- no, not against the virtualization manager, but in e2fsprogs; it's always e2fsprogs' fault when something goes wrong because people can't deal with system clock bugs.) And of course, the clock could be bad in the factory in Taiwan where the system is installed.

And because of this, the sorts of problems are legion. The heuristic I proposed won't handle the case where the s_mkfs time is set into the future because the clock was bad at installer time, and then the system boots, and then the time gets warped back to the correct time by NTP, and the hardware clock is set correctly, and on the next reboot, e2fsck with your proposed patch goes wild and starts deleting inodes as "belonging to the previous filesystem format".
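
The failure case in the last paragraph can be worked through with a few numbers. The Python below is a sketch with invented timestamps; `looks_stale` is a hypothetical helper that applies the `ctime < s_mkfs_time` comparison from the proposed patch.

```python
YEAR = 365 * 24 * 3600


def looks_stale(ctime: int, mkfs_time: int) -> bool:
    """The proposed test: an inode older than the filesystem must be leftover."""
    return ctime < mkfs_time


real_now = 1_276_000_000              # an arbitrary timestamp in mid-2010

# The installer ran with a clock five years fast, so mkfs time is in the future.
mkfs_time = real_now + 5 * YEAR

# After the first boot NTP corrects the clock; a new file is created an hour later.
ctime_after_ntp = real_now + 3600

print(looks_stale(ctime_after_ntp, mkfs_time))  # True
```

A perfectly valid inode, written after the clock was fixed, is classified as belonging to the previous filesystem, which is exactly the misfire described above.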

>I'd rather avoid the need to zero the table completely since that has
>negative consequences other than just using up disk IO in the
>background. For instance, if ...

Affects	Status	Importance	Assigned to
e2fsprogs	Unknown	Unknown	sf #2982730
ubiquity	Invalid	Undecided	Unassigned
e2fsprogs (Ubuntu)	Fix Released	Wishlist	Unassigned
partman-ext3 (Ubuntu)	Invalid	Undecided	Unassigned

Changed in partman-ext3 (Ubuntu):
status:	Fix Released → New
