Comment 7 for bug 2036467

Krister Johansen (kmjohansen) wrote: Re: superblock checksum mismatch in resize2fs

Thanks for all the responses. I'm not sure how quickly I'll be able to get to this either, so I'm hesitant to commit to fixing it myself. That said, if I can find time to send patches before your team gets to it, I'll do my best.

To answer the question about how frequently we see this: it was about 4-5 times a day until I applied the patches to our forked version of e2fsprogs.

A few other things to note about what's going on here. In 1.45.7, e2fsprogs added some additional retries to the checksum validation path on open:

https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=6338a8467564c3a0a12e9fcb08bdd748d736ac2f

I picked up this patch as well, and found that it helped a bit, but I was still able to reproduce the problem with the reproducer that I shared.
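To illustrate the general idea behind retrying on a checksum mismatch, here is a minimal sketch. It is not the e2fsprogs code. The names `read_superblock_with_retry`, `read_sb`, and `stand_in_checksum` are hypothetical, the retry count is arbitrary, and `zlib.crc32` is only a placeholder for the crc32c that ext4 actually uses. The only layout facts assumed are that the superblock is 1024 bytes and that its checksum occupies the last 4 bytes.

```python
import zlib
from typing import Callable

SB_SIZE = 1024      # size of the ext4 superblock in bytes
CSUM_OFFSET = 1020  # the checksum field is the last 4 bytes


def stand_in_checksum(payload: bytes) -> int:
    # Assumption: placeholder only. ext4 uses crc32c, not zlib's CRC-32.
    return zlib.crc32(payload) & 0xFFFFFFFF


def checksum_ok(sb: bytes) -> bool:
    # Compare the stored little-endian checksum against one computed
    # over everything that precedes the checksum field.
    stored = int.from_bytes(sb[CSUM_OFFSET:SB_SIZE], "little")
    return stored == stand_in_checksum(sb[:CSUM_OFFSET])


def read_superblock_with_retry(read_sb: Callable[[], bytes],
                               max_tries: int = 3) -> bytes:
    # read_sb is a hypothetical callable that returns one raw superblock
    # read. A reader racing with a writer may see a torn block, so a
    # mismatch is retried a few times before giving up.
    for _ in range(max_tries):
        sb = read_sb()
        if len(sb) == SB_SIZE and checksum_ok(sb):
            return sb
    raise OSError("superblock checksum mismatch after %d tries" % max_tries)
```

Retrying only narrows the window: if the writer keeps updating the superblock, every attempt can still observe an inconsistent block, which would be consistent with the problem remaining reproducible after the patch.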

My team is running on the linux-aws-5.15 HWE kernel that's from jammy but shipped to focal. There's a kernel fix that may help with this problem too, and it has been present since 5.10. That said, I haven't tested this on systems that are running <= 5.4. (We don't have very many of these anymore.)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=05c2c00f3769abb9e323fcaca70d2de0b48af7ba

Commit 05c2c00f3769 ("ext4: protect superblock modifications with a buffer lock") may help ensure that the superblock contents on disk are always consistent before the DIO read, since the direct I/O path writes out any dirty cached sb pages before issuing the read.

Additionally, there's another known issue with consecutive online resize attempts:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=a408f33e895e455f16cf964cb5cd4979b658db7b

We've gotten the fix for this in linux-aws-5.15 from Ubuntu, but it may be germane when testing on older releases.