update-grub giving errors and apparently not locating /boot on correct zfs pool after upgrade to Ubuntu Mantic

Bug #2041739 reported by Danny
This bug affects 6 people
Affects         Status     Importance  Assigned to  Milestone
grub            Unknown    Unknown
grub2 (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

After upgrading to Ubuntu Mantic, grub failed to load vmlinuz from my zfs bpool giving me the following message:

  error: file `/BOOT/ubuntu_07flvq@/vmlinuz-6.5.0-0-9-generic' not found.
  error: you need to load the kernel first.

  Press any key to continue..._

After this I rebooted from an Ubuntu Live USB, imported the zfs bpool and rpool pools, entered a chroot (with bind mounts), and ran update-grub.
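
For anyone repeating the recovery steps above, here is a minimal sketch of the chroot preparation. It is not the exact procedure used in this report. It assumes the pool names bpool and rpool and the dataset suffix ubuntu_07flvq seen here, /mnt as the alternate root, and no native encryption (an encrypted rpool would also need `zfs load-key` before mounting). The `run` and `prepare_chroot` helpers and the DRY_RUN variable are names made up for this sketch.

```sh
#!/bin/sh
# Sketch: prepare a chroot from a live USB so update-grub can be re-run.
# Assumptions: pools named bpool/rpool, unencrypted, alternate root /mnt.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

prepare_chroot() {
    suffix="$1"          # dataset suffix, e.g. ubuntu_07flvq
    target="${2:-/mnt}"  # alternate root for the imported pools

    # Import both pools rooted at $target (-R) without mounting anything (-N).
    run zpool import -f -N -R "$target" rpool
    run zpool import -f -N -R "$target" bpool

    # Mount the root dataset first, then the boot dataset on top of it.
    run zfs mount "rpool/ROOT/$suffix"
    run zfs mount "bpool/BOOT/$suffix"

    # Bind the virtual filesystems that update-grub needs inside the chroot.
    for fs in dev proc sys; do
        run mount --rbind "/$fs" "$target/$fs"
    done
}
```

After `prepare_chroot ubuntu_07flvq /mnt` succeeds, `chroot /mnt update-grub` re-runs the step described above. Running it once with DRY_RUN=1 shows the commands before anything is touched.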

The update-grub log showed that it found multiple linux and initrd images on rpool and in its snapshots, but none on bpool. The log also contained the following error:

  /usr/sbin/grub-probe: error: compression algorithm inherit not supported

I deleted all old snapshots of /boot from the zfs bpool and rpool pools. I am not sure how there ended up being a /boot on rpool as well as on bpool. After re-running update-grub I was still getting the error about the zfs 'inherit' compression flag, so I explicitly set the zfs compression property to "on" on the bpool, bpool/BOOT and bpool/BOOT/ubuntu_07flvq datasets. After re-running update-grub the warning was still there.

I installed the grub-common v2.12~rc1-10ubuntu4 source package and searched for the warning. It seems grub reads the zfs compression flag from the block itself (./grub-core/fs/zfs/zfs.c, zio_read() function, line 1853). An AI chat assistant (Bing, Phind.com or perplexity.ai, I can't remember which) said the flag may be set per block, and that changing the property on the dataset doesn't necessarily update the flag on existing blocks. I don't have time to dig further to verify this, and I'm not sure if I'm on the right track.

I tried rebooting again but still got the same error, so I rebooted from the Ubuntu Live USB again and re-entered the chroot. This time I backed up the bpool/BOOT/ubuntu_07flvq dataset to /tmp/boot_backup, destroyed the bpool pool on /dev/sda3, re-created it with the proper structure and flags, copied the boot backup back to /boot and ran update-grub again. From memory the compression flag warning went away, but my memory is hazy and I wasn't taking notes. After rebooting, grub booted successfully.

However, several days later I noticed during an apt upgrade that update-grub was giving the same warnings and had again found more linux/initrd images in snapshots on rpool. I checked the zfs "com.sun:auto-snapshot" property on all the bpool datasets and it was set to false, so, as expected, zfs-auto-snapshot had not created auto-snapshots on bpool. However, there was a /boot directory under the "rpool/ROOT/ubuntu_07flvq" dataset, which was being included in its zfs-auto-snapshot snapshots. Even after unmounting bpool, removing this /boot directory and remounting bpool, update-grub still claims it is finding its linux and initrd images on rpool/ROOT/ubuntu_07flvq. And when I browse the rpool/ROOT/ubuntu_07flvq snapshots under /.zfs I can't see any linux or initrd images under /boot.

So the update-grub messages have me mystified. Are the messages incorrect? Is grub-probe just hopelessly confused? Does it only think the images are on rpool because bpool/BOOT/ubuntu_07flvq is mounted at /boot?

I hope someone can understand all this because it's a big mystery to me. Is this the right place to post it?

(base) danny@envy:/boot/grub$ lsb_release -rd
No LSB modules are available.
Description: Ubuntu 23.10
Release: 23.10

(base) danny@envy:/boot/grub$ apt-cache policy grub-common
grub-common:
  Installed: 2.12~rc1-10ubuntu4
  Candidate: 2.12~rc1-10ubuntu4
  Version table:
 *** 2.12~rc1-10ubuntu4 990
        990 mirror+file:/etc/apt/mirrorlist-ubuntu.txt mantic/main amd64 Packages
        100 /var/lib/dpkg/status

ProblemType: Bug
DistroRelease: Ubuntu 23.10
Package: grub-common 2.12~rc1-10ubuntu4
ProcVersionSignature: Ubuntu 6.5.0-9.9-generic 6.5.3
Uname: Linux 6.5.0-9-generic x86_64
NonfreeKernelModules: zfs
ApportVersion: 2.27.0-0ubuntu5
Architecture: amd64
CasperMD5CheckResult: pass
CurrentDesktop: GNOME
Date: Sun Oct 29 16:37:56 2023
InstallationDate: Installed on 2022-06-05 (511 days ago)
InstallationMedia: Ubuntu 22.04 LTS "Jammy Jellyfish" - Release amd64 (20220602)
SourcePackage: grub2
UpgradeStatus: Upgraded to mantic on 2023-04-21 (191 days ago)
modified.conffile..etc.grub.d.05_debian_theme: [deleted]

Revision history for this message
Danny (dannyp777) wrote :
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in grub2 (Ubuntu):
status: New → Confirmed
Mathias Aerts (mathias-aerts) wrote :

Had the same issue after upgrading to Ubuntu 23.10. Tried repairing grub multiple times in chroot after booting from a live USB. Even tried to roll back to a zfs snapshot from before the upgrade, which also did not work.

Was finally able to recover from this after finding this bug report with the mention of recreating the bpool.

I created a script to prepare the chroot to be able to speed up the tries to fix grub. Since I'm running zfs on root with encryption, it was taking too much effort to do this manually every time. I've included the script which might be useful for anyone else encountering this issue. You will have to be aware of your partition layout and adjust the script accordingly.

Danny (dannyp777) wrote :

I think this is related to several of the other reported bugs related to zfs:

https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1982897 - 26/07/2022
https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1632694 - 12/10/2016
https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1635115 - 20/10/2016
https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1688424 - 05/05/2017
https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1867542 - 15/03/2020

What I don't understand is how update-grub worked with zfs on prior versions of Ubuntu. And why does the problem temporarily resolve itself if I recreate bpool from scratch? And why hasn't it been fixed already, given the problem has been causing issues for up to 7 years? I thought ZFS was supposed to be quite well supported by Linux distros, or are people moving away from it now? It has great features.

The comments on bug 1867542 imply grub doesn't like zsys snapshots. Or maybe zsys changes some flag on the pool that grub doesn't like? Or maybe grub doesn't like some zfs feature flag enabled on one of my zpools? I have attached a file showing the currently enabled flags.

I have been without my personal laptop for over 7 weeks now, as I took a break from looking at this problem because it was doing my head in. After the break I have now pulled all the latest Ubuntu Mantic v23.10 updates and will reboot to see if they made any difference. Otherwise I will try to go through all my zpool flags and see if there is anything that doesn't look compatible.

Danny (dannyp777) wrote :

Here are the zfs properties on the bpool datasets

Danny (dannyp777) wrote :

I have deleted my bpool and recreated it with the following commands:

zpool create -d \
-o compatibility=grub2,ubuntu-22.04 \
-O devices=off \
bpool \
/dev/sda3

zfs create -o canmount=off -o mountpoint=none bpool/BOOT
zfs create -o canmount=noauto -o mountpoint=/boot bpool/BOOT/ubuntu_07flvq
zfs set mountpoint=none bpool

Will restart and see what happens.

Danny (dannyp777) wrote :

Ok, working so far. Something must have set the incorrect flags on bpool at some point. Not sure what.

Mathias Aerts (mathias-aerts) wrote :

These ZFS and grub issues appear to be related as well:
https://github.com/openzfs/zfs/issues/15261
https://github.com/openzfs/zfs/issues/13873
https://savannah.gnu.org/bugs/index.php?64297

I could not believe simply creating a snapshot of the top level bpool would cause this, so I was curious enough to try it again and indeed this causes grub-probe to stop recognizing /boot as being zfs. Deleting the snapshot does not help, and I had to use my repair script to fix the boot again.

Before I fixed the bpool again, I saved zpool and zfs properties for the bpool and compared them afterwards. The only differences are the following:

zpool properties:
---
broken state:
bpool feature@extensible_dataset active

working state:
bpool feature@extensible_dataset enabled

zfs properties:
---
broken state:
bpool snapshots_changed di dec 19 10:57:22 2023 -

working state:
(property does not exist)

I used to have sanoid enabled to periodically create snapshots of my system, which I have now excluded bpool from to prevent this from reoccurring.
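
The before/after comparison described above could be scripted roughly as follows. This is a sketch, not the script actually used in this comment. It assumes each dump was saved with `zpool get -H all bpool` or `zfs get -H all bpool` (tab-separated: name, property, value, source) and covers a single pool or dataset. The function name `props_changed` is made up for illustration.

```sh
# props_changed BEFORE_FILE AFTER_FILE
# Prints every property whose value differs between two saved dumps,
# including properties that exist in only one of them.
props_changed() {
    awk -F '\t' -v before_file="$1" '
        BEGIN {
            # Load the first dump: property name -> value.
            while ((getline line < before_file) > 0) {
                split(line, f, "\t")
                before[f[2]] = f[3]
            }
        }
        {
            seen[$2] = 1
            if (!($2 in before))       print $2 ": (absent) -> " $3
            else if (before[$2] != $3) print $2 ": " before[$2] " -> " $3
        }
        END {
            # Properties that disappeared in the second dump.
            for (p in before)
                if (!(p in seen)) print p ": " before[p] " -> (absent)"
        }
    ' "$2"
}
```

For example, after saving `zpool get -H all bpool > working.txt` on a healthy pool and `zpool get -H all bpool > broken.txt` once grub-probe fails, `props_changed working.txt broken.txt` lists only the properties that changed.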

Mathias Aerts (mathias-aerts) wrote :

I've been testing this some more and it seems that any enabled zpool feature that requires or also enables the extensible_dataset feature will cause this issue. 'extensible_dataset' is listed in grub2 compatibility mode (/usr/share/zfs/compatibility.d/grub2) but it doesn't seem that this is correct, or at least this bug is causing it to not be fully compatible.

When I create the bpool with every feature disabled that would also enable extensible_dataset, I can create a snapshot on bpool and grub-probe still recognizes it as zfs (test script attached):

zpool create \
    -o ashift=12 \
    -o autotrim=on \
    -o compatibility=grub2 \
    -o feature@extensible_dataset=disabled \
    -o feature@bookmarks=disabled \
    -o feature@filesystem_limits=disabled \
    -o feature@large_blocks=disabled \
    -o feature@large_dnode=disabled \
    -o feature@sha512=disabled \
    -o feature@skein=disabled \
    -o feature@edonr=disabled \
    -o feature@userobj_accounting=disabled \
    -o feature@encryption=disabled \
    -o feature@project_quota=disabled \
    -o feature@obsolete_counts=disabled \
    -o feature@bookmark_v2=disabled \
    -o feature@redaction_bookmarks=disabled \
    -o feature@redacted_datasets=disabled \
    -o feature@bookmark_written=disabled \
    -o feature@livelist=disabled \
    -o feature@zstd_compress=disabled \
    -o feature@zilsaxattr=disabled \
    -o feature@head_errlog=disabled \
    -o feature@blake3=disabled \
    -o feature@vdev_zaps_v2=disabled \
    -O devices=off \
    -O acltype=posixacl -O xattr=sa \
    -O compression=lz4 \
    -O normalization=formD \
    -O relatime=on \
    -O canmount=off \
    bpool /dev/nvme0n1p3

Enabling any of the features in the command above will cause grub not to recognize /boot as zfs again when a snapshot is created on bpool.
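
A quick way to check the symptom described above after taking a snapshot is to ask grub-probe which filesystem it detects at /boot. The sketch below wraps `grub-probe --target=fs`. The function name `boot_is_zfs` and the GRUB_PROBE override (used only to substitute another command when testing) are assumptions of this sketch, not part of grub or of the attached test script.

```sh
# boot_is_zfs [PATH]
# Succeeds when grub-probe reports the filesystem at PATH (default /boot)
# as zfs, and fails when grub-probe errors out or reports anything else.
boot_is_zfs() {
    fs="$("${GRUB_PROBE:-grub-probe}" --target=fs "${1:-/boot}" 2>/dev/null)" || return 1
    [ "$fs" = "zfs" ]
}
```

Typical use, run as root: `boot_is_zfs /boot && echo "grub still sees zfs" || echo "grub-probe no longer recognizes /boot"`.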

Tim K. (tkubnt) wrote :

This really needs to be fixed as a high priority. After 5 years, suddenly my server could not boot anymore because of it and it's unclear what changed because I have been taking snapshots of the same root pool for years.

Not sure whether 22.04 can upgrade to GRUB 2.12 or the patches should be backported to 2.06:
https://git.savannah.gnu.org/cgit/grub.git/log/grub-core/fs/zfs/zfs.c

According to this comment 2.12 should work:
https://github.com/openzfs/zfs/issues/13873#issuecomment-1889885090

Meanwhile, if others are having issues, try the portable version of ZFSBootMenu, it saved me (and I also moved to ZBM permanently, so much better than GRUB):
https://docs.zfsbootmenu.org/en/latest/guides/general/portable.html

Danny (dannyp777) wrote :

I was/am using version 2.12~rc1-10ubuntu4 with Ubuntu Mantic 23.10 and it was still happening. (before I recreated bpool with compatibility flags)

ZFSBootMenu looks good.

My recommendation is for grub to do some basic checking of zfs version/feature/compatibility flags before starting, so that it at least gives a useful warning/error message about which zfs flag is problematic and/or what to do about it or why it might have happened.
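
Until grub does such a check itself, something similar can be approximated from user space. The sketch below is an assumption-laden illustration, not anything grub provides. It reads `zpool get -H all POOL` output on stdin and prints every feature that is enabled or active but missing from an allow-list file, one feature name per line with '#' comments, which is the format of /usr/share/zfs/compatibility.d/grub2. The function name `features_outside_compat` is made up. Note that an earlier comment in this thread found extensible_dataset to be listed in that file yet still problematic, so a stricter hand-written allow-list may be needed.

```sh
# features_outside_compat ALLOW_FILE < zpool-get-output
# Prints "feature (state)" for each enabled/active feature not in ALLOW_FILE.
features_outside_compat() {
    awk -F '\t' -v compat="$1" '
        BEGIN {
            # Load the allow-list, ignoring comments and blank lines.
            while ((getline line < compat) > 0) {
                sub(/#.*/, "", line)
                gsub(/[ \t]/, "", line)
                if (line != "") allowed[line] = 1
            }
        }
        $2 ~ /^feature@/ && ($3 == "enabled" || $3 == "active") {
            name = substr($2, 9)   # strip the "feature@" prefix
            if (!(name in allowed)) print name " (" $3 ")"
        }
    '
}
```

For example: `zpool get -H all bpool | features_outside_compat /usr/share/zfs/compatibility.d/grub2`. Empty output means every enabled or active feature is on the allow-list.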

cniry (pavelsieder) wrote :

I have the same problem, and it's not limited to upgrades: it's broken even on a fresh zfs installation done with the installer.

Steps to reproduce:
1) Install Ubuntu 23.10 with zfs.
2) Create a first snapshot, e.g. `zfs snap bpool@hello`.
3) The next reboot will fail.

Danny (dannyp777) wrote :

That would seem to indicate the Ubuntu Mantic installation process is creating the bpool with zfs flags that are incompatible with grub. I wonder if the problem is still present in Noble Numbat/24.04?

Are you able to check which zfs flags/features are set after the bpool is created? (use `zpool get all bpool` to get flags)

You will need to back up your /boot directory and recreate your bpool with the `-o compatibility=grub2,ubuntu-22.04` option, as described earlier in this thread.
