Resizing cloud-images occasionally fails due to superblock checksum mismatch in resize2fs

Bug #2036467 reported by Krister Johansen
This bug affects 2 people
Affects                    Status       Importance  Assigned to      Milestone
cloud-images               New          Critical    Unassigned
e2fsprogs (Ubuntu)         In Progress  Critical    Matthew Ruffell
e2fsprogs (Ubuntu Trusty)  Won't Fix    Critical    Matthew Ruffell
e2fsprogs (Ubuntu Xenial)  Won't Fix    Critical    Matthew Ruffell
e2fsprogs (Ubuntu Bionic)  Won't Fix    Critical    Matthew Ruffell
e2fsprogs (Ubuntu Focal)   In Progress  Critical    Matthew Ruffell
e2fsprogs (Ubuntu Jammy)   In Progress  Critical    Matthew Ruffell
e2fsprogs (Ubuntu Lunar)   Won't Fix    Critical    Matthew Ruffell
e2fsprogs (Ubuntu Mantic)  In Progress  Critical    Matthew Ruffell
e2fsprogs (Ubuntu Noble)   In Progress  Critical    Matthew Ruffell

Bug Description

[Impact]

This is a long-running bug plaguing cloud-images, where on rare occasions resize2fs fails and the image does not resize to fill the entire disk.

Online resizes would fail due to a superblock checksum mismatch, where the superblock read into memory differs from what is currently on disk because of ongoing changes to the filesystem.

$ resize2fs /dev/nvme1n1p1
resize2fs 1.47.0 (5-Feb-2023)
resize2fs: Superblock checksum does not match superblock while trying to open /dev/nvme1n1p1
Couldn't find valid filesystem superblock.

Changing the superblock read to use Direct I/O solves the issue.
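
As a rough illustration of the difference (not part of the fix; the device path matches the testcase below), the block containing the primary superblock can be read both through the page cache and with O_DIRECT while the filesystem is being modified, and the two copies can briefly disagree:

    # Illustration only: the primary ext4 superblock lives at bytes 1024-2047
    # of the partition. O_DIRECT needs block-aligned I/O, hence one 4 KiB read.
    dd if=/dev/nvme1n1p1 of=/tmp/sb.cached bs=4096 count=1 status=none
    dd if=/dev/nvme1n1p1 of=/tmp/sb.direct bs=4096 count=1 iflag=direct status=none
    cmp /tmp/sb.cached /tmp/sb.direct || echo "cached view differs from on-disk view"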

[Testcase]

Start a c5.large instance on AWS and attach a 60 GB gp3 volume for use as a scratch disk.

Run the following script, courtesy of Krister Johansen and his team:

   #!/usr/bin/bash
   set -euxo pipefail

   while true
   do
           parted /dev/nvme1n1 mklabel gpt mkpart primary 2048s 2099200s
           sleep .5
           mkfs.ext4 /dev/nvme1n1p1
           mount -t ext4 /dev/nvme1n1p1 /mnt
           stress-ng --temp-path /mnt -D 4 &
           STRESS_PID=$!
           sleep 1
           growpart /dev/nvme1n1 1
           resize2fs /dev/nvme1n1p1
           kill $STRESS_PID
           wait $STRESS_PID
           umount /mnt
           wipefs -a /dev/nvme1n1p1
           wipefs -a /dev/nvme1n1
   done

Test packages are available in the following ppa:

https://launchpad.net/~mruffell/+archive/ubuntu/lp2036467-test

If you install the test packages, the race no longer occurs.
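
To install them, the standard PPA workflow applies (a sketch assuming an Ubuntu cloud image with software-properties-common available):

    sudo add-apt-repository ppa:mruffell/lp2036467-test
    sudo apt-get update
    sudo apt-get install --only-upgrade e2fsprogs
    dpkg -s e2fsprogs | grep ^Version    # confirm the test build is installed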

[Where problems could occur]

We are changing how resize2fs reads the superblock from underlying disks.

If a regression were to occur, resize2fs could fail to resize offline or online volumes. As all cloud-images are resized online during their initial boot, a regression could have a large impact on public and private clouds.

[Other info]

Upstream mailing list discussion:
https://<email address hidden>/
https://<email address hidden>/

This was fixed in the below commit upstream:

commit 43a498e938887956f393b5e45ea6ac79cc5f4b84
Author: Theodore Ts'o <email address hidden>
Date: Thu, 15 Jun 2023 00:17:01 -0400
Subject: resize2fs: use Direct I/O when reading the superblock for
 online resizes
Link: https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84

The commit has not yet been tagged in any release. All supported Ubuntu releases require this fix, and it needs to be published in the standard non-ESM archives to be picked up by cloud images.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in e2fsprogs (Ubuntu):
status: New → Confirmed
Revision history for this message
Ye Lu (luye98) wrote :

Hi, we were seeing similar issues when bootstrapping AWS EC2 hosts in our service. We applied the upstream patch internally and it resolved the filesystem resize errors we had previously encountered. It would be helpful to also backport the patch to Ubuntu and make it generally available for the focal and jammy releases.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

@foundations & EDM can we have this backported all the way to trusty? Xenial for sure.

tags: added: rls-mm-incoming
information type: Public → Public Security
Changed in cloud-images:
importance: Undecided → Critical
Revision history for this message
Julian Andres Klode (juliank) wrote :

@Krister If you are interested in driving the process to get the patch landed, you can follow the procedure at

https://packaging.ubuntu.com/html/fixing-a-bug.html

And

https://wiki.ubuntu.com/StableReleaseUpdates

To prepare updates for all releases. Feel free to ask for help on IRC.

If not, no worries, we'll get to it, I tagged it foundations-todo for the team to do! But if you want to gain experience in packaging this is a good place to start!

tags: added: foundations-todo
removed: rls-mm-incoming
Revision history for this message
Robby Pocase (rpocase) wrote :

@Krister additionally, can you clarify "occasionally"? It's clearly frequent enough to prioritize upstreaming a fix. Knowing the frequency could aid in prioritization internally.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

@rpocase please check internally; I believe we have multiple affected Azure & AWS customers, as per previous Salesforce escalations.

Revision history for this message
Krister Johansen (kmjohansen) wrote (last edit ):

Thanks for all the responses. I'm not sure how quickly I'll be able to get to this either, so I'm hesitant to commit to fixing it myself. That said, if I can get time to send patches before your team gets to fixing it, I'll do my best.

To answer the question about how frequently we see this: it was about 4-5 times a day until I applied the patches to our forked version of e2fsprogs.

A few other things to note about what's going on here. In 1.45.7, e2fsprogs added some additional retries to the checksum validation path on open:

https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=6338a8467564c3a0a12e9fcb08bdd748d736ac2f

I picked up this patch as well, and found that it helped a bit, but I was still able to reproduce the problem with the reproducer that I shared.
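
(As a caller-side analogue of those retries, and only as a stopgap until fixed packages land, the resize itself can be re-run a few times; this is a hypothetical sketch, not part of e2fsprogs, and it narrows the window without closing the race:)

    # Hypothetical workaround: retry the online resize on transient
    # superblock checksum failures. The Direct I/O fix is what actually
    # removes the race.
    for attempt in 1 2 3 4; do
        resize2fs /dev/nvme1n1p1 && break
        echo "resize attempt ${attempt} failed, retrying" >&2
        sleep 1
    done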

My team is running on the linux-aws-5.15 HWE kernel that's from jammy but shipped to focal. There's a kernel fix that may help with this problem too, and it has been present since 5.10. That said, I haven't tested this on systems that are running <= 5.4. (We don't have very many of these anymore.)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=05c2c00f3769abb9e323fcaca70d2de0b48af7ba

Commit 05c2c00f3769 ("ext4: protect superblock modifications with a buffer lock") may help ensure that the superblock contents are always consistent on disk prior to the DIO read, since the direct I/O path writes out any dirty cached superblock pages before issuing the read.

Additionally, there's another known issue with consecutive online resize attempts:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=a408f33e895e455f16cf964cb5cd4979b658db7b

We've gotten the fix for this in linux-aws-5.15 from Ubuntu, but it may be germane for testing on older releases.

Changed in e2fsprogs (Ubuntu Mantic):
status: Confirmed → In Progress
Changed in e2fsprogs (Ubuntu Lunar):
status: New → In Progress
Changed in e2fsprogs (Ubuntu Jammy):
status: New → In Progress
Changed in e2fsprogs (Ubuntu Focal):
status: New → In Progress
Changed in e2fsprogs (Ubuntu Bionic):
status: New → In Progress
Changed in e2fsprogs (Ubuntu Xenial):
status: New → In Progress
Changed in e2fsprogs (Ubuntu Trusty):
status: New → In Progress
Changed in e2fsprogs (Ubuntu Mantic):
importance: Undecided → Critical
Changed in e2fsprogs (Ubuntu Lunar):
importance: Undecided → Critical
Changed in e2fsprogs (Ubuntu Jammy):
importance: Undecided → Critical
Changed in e2fsprogs (Ubuntu Focal):
importance: Undecided → Critical
Changed in e2fsprogs (Ubuntu Bionic):
importance: Undecided → Critical
Changed in e2fsprogs (Ubuntu Xenial):
importance: Undecided → Critical
Changed in e2fsprogs (Ubuntu Trusty):
importance: Undecided → Critical
Changed in e2fsprogs (Ubuntu Mantic):
assignee: nobody → Matthew Ruffell (mruffell)
Changed in e2fsprogs (Ubuntu Lunar):
assignee: nobody → Matthew Ruffell (mruffell)
Changed in e2fsprogs (Ubuntu Jammy):
assignee: nobody → Matthew Ruffell (mruffell)
Changed in e2fsprogs (Ubuntu Focal):
assignee: nobody → Matthew Ruffell (mruffell)
Changed in e2fsprogs (Ubuntu Bionic):
assignee: nobody → Matthew Ruffell (mruffell)
Changed in e2fsprogs (Ubuntu Xenial):
assignee: nobody → Matthew Ruffell (mruffell)
Changed in e2fsprogs (Ubuntu Trusty):
assignee: nobody → Matthew Ruffell (mruffell)
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for e2fsprogs on mantic which fixes this issue.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for e2fsprogs on lunar which fixes this issue.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for e2fsprogs on jammy which fixes this issue.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for e2fsprogs on focal which fixes this issue.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for e2fsprogs on bionic which fixes this issue.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for e2fsprogs on xenial which fixes this issue.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for e2fsprogs on trusty which fixes this issue.

summary: - superblock checksum mismatch in resize2fs
+ Resizing cloud-images occasionally fails due to superblock checksum
+ mismatch in resize2fs
description: updated
tags: added: sts
Revision history for this message
Julian Andres Klode (juliank) wrote :

trusty and xenial receive bug fixes via Pro and not via the main archive anymore; you'll have to get SEG to add bug tasks for Pro and prepare +esm updates with them.

Changed in e2fsprogs (Ubuntu Trusty):
status: In Progress → Won't Fix
Changed in e2fsprogs (Ubuntu Xenial):
status: In Progress → Won't Fix
Revision history for this message
Julian Andres Klode (juliank) wrote :

@mruffell did you mean to get sponsorship for the patches? You might then want to subscribe ~ubuntu-sponsors so this can be merged by the patch pilots.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

@juliank I'm just doing a little bit more testing for the moment, as I really want to make sure this isn't going to cause any issues in the cloud images. It would be nice to have this bug fixed though, I have seen a few cases related to it over the years.

I'll ask my SEG colleagues for help with sponsoring in a day or two.

description: updated
Changed in e2fsprogs (Ubuntu Bionic):
status: In Progress → Won't Fix
Revision history for this message
Krister Johansen (kmjohansen) wrote :

Hi,
Just wanted to check back to see if the reproducer and fix worked in your testing environments. I was also curious if it were possible to share any plans around when an update that contains this fix might be released. Thanks again.

Revision history for this message
Krister Johansen (kmjohansen) wrote :

@mruffell just wanted to check back to see if the instructions in the report worked to reproduce the problem for you. If so, do you have any estimate of when packages with the patch will be made available? Thanks!

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a V2 patch for mantic with a different version number, since mantic is no longer the devel release.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a patch for noble that solves this issue.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Hi Krister,

I apologise for the delay. The main issue I have been having with testing is that it reproduces significantly faster on some releases than others, and I still haven't managed to reproduce it even once on some releases. I'll set up some fresh reproducers now and leave them running.

If you want to help test, there are test packages for all releases in:
https://launchpad.net/~mruffell/+archive/ubuntu/lp2036467-test

Regardless, I'll try to move this forward.

Thanks,
Matthew

Revision history for this message
Krister Johansen (kmjohansen) wrote :

Hi Matthew,
Thanks for the update. I went ahead and tested your updated packages on a Focal, Jammy, and Noble image in EC2 this evening. With the latest packages installed, I was unable to reproduce the problem on any of the three installs. I'm uncertain which builds were inconsistent about triggering the problem for you, but it might be worth noting that the version of the package after Focal got an additional partial fix for the superblock checksum mismatch. In those cases, it'll re-try the read of the block up to 3 times before returning a failure. In my previous testing, this would increase the amount of time before one hits the problem, but not eliminate it entirely.

Thanks again for your help with getting these patches in. It's much appreciated!

-K

Revision history for this message
Brian Murray (brian-murray) wrote :

Ubuntu 23.04 (Lunar Lobster) has reached end of life, so this bug will not be fixed for that specific release.

Changed in e2fsprogs (Ubuntu Lunar):
status: In Progress → Won't Fix
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Hi Krister,

I have finally seen this occur in real life with my own two eyes!

You are absolutely correct, the 4-retry doesn't seem to be sufficient sometimes.

The reproducer triggers on Focal and earlier in about 20 minutes, so it's easy to see the issue on Focal. But Focal and earlier don't retry at all.

On Jammy, Mantic and Noble, it took about a week of continuous running, but I managed to get it to trigger on each of them.

Start
----------------------------
Tue Jan 16 01:57:20 UTC 2024
Tue Jan 16 02:18:53 UTC 2024

End
----------------------------
Tue Jan 23 20:12:28 UTC 2024
Tue Jan 23 14:32:08 UTC 2024

The 4-retry does help, and helps quite a lot really.

Anyway, I upgraded my test environment to the test packages, and I will leave them running for a week.

If things look good then, I'll get these patches sponsored for SRU.

Sorry for the delay, but I really wanted to see it fail on Jammy, Mantic and Noble before we go patching them.

Thanks,
Matthew

Revision history for this message
Krister Johansen (kmjohansen) wrote :

Hi Matthew,
Thanks for the update. I'm glad this finally reproduced in your environment. I don't have a great explanation for why it took so much longer there. I did observe that it seemed more likely to occur in us-west-2 during the 9a-5p window in the local timezone. The timing may be subtly affected by overall EBS utilization. Just a guess, though.

Thanks again,

-K

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Hi Krister,

Fascinating. I'm in New Zealand, so I use ap-southeast-2 in Sydney, Australia for all my instances, and I never gave any thought to the idea that this could depend on how busy EBS is in the availability zone.

I'll move my instances to us-west-2.

Thanks,
Matthew

Revision history for this message
Matthew Ruffell (mruffell) wrote :

I have been running the test packages on AWS with the reproducer going for 20 days now, and they are still running great. The change to Direct I/O really does fix this issue, and my testing has removed any concerns about it causing a regression.

Previously, Focal wouldn't last more than 20 minutes, and Jammy onward, about a week.

I will get these patches sponsored now. Sorry for the delay Krister.

Revision history for this message
Krister Johansen (kmjohansen) wrote :

Thanks for the update, Matthew. We're looking forward to getting this fix from Ubuntu. My team patched our version of the e2fsprogs package from Focal about a year ago, before we submitted the fix upstream. Since that patch, we haven't had any recurrences of the problem. It used to show up about 4-5 times a day for us. Glad that the fix is working in your tests as well.

Revision history for this message
Krister Johansen (kmjohansen) wrote :

Hi Matthew,
It's been a couple months. We'd really love to get the fix for Focal, Jammy, and Noble. Any chance this could get sponsored and approved soon?

I also checked up on upstream and it appears that they're preparing a 1.47.1 release of e2fsprogs that should include this fix. It hasn't been tagged yet, but they're starting the process: https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=3fcbc9ffbeaa0df3dd06113b61f9b3bed4efb92e

Revision history for this message
Lucas Kanashiro (lucaskanashiro) wrote :

@Matthew I took a look at your debdiffs (I hope they are updated) and they look good in general; I checked the debdiffs for Focal, Jammy, Mantic and Noble. The Noble debdiff requires a rebase: Noble now has version 1.47.0-2.4~exp1ubuntu4, so we want version 1.47.0-2.4~exp1ubuntu4.1 with your changes (it will be an SRU for Noble as well at this point).

This will need to be fixed in the next development release (OO series) to avoid any future regression. But at the moment the archive is not yet open for that.

Please fix that, and someone can sponsor the uploads targeting all supported releases at once.

I am unsubscribing ~ubuntu-sponsors; once you address the comment above, please subscribe it again and someone will take a look.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a V2 patch for Noble for e2fsprogs.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Hi Krister,

Thanks for the heads-up about 1.47.1 upstream; it does indeed look like a release is coming soon.

It seems Debian unstable already has 1.47.1-rc1:
https://packages.debian.org/sid/e2fsprogs

When the Ubuntu archive opens for OO, we will merge 1.47.1~rc1-1 from debian unstable, and then submit the patches for SRU to noble, mantic, jammy and focal. Should be a few days.

Thanks,
Matthew

Revision history for this message
Theodore Ts'o (tytso) wrote :

n.b. If Ubuntu hasn't taken the 32-bit time_t change (which is in Debian unstable) I have a commit which backs out this change for building e2fsprogs on older systems, e.g., for backports.
