24.04 grub-pc cannot upgrade on mirrored software RAID root disk

Bug #2060695 reported by Chris Siebenmann
This bug affects 1 person
Affects              Status         Importance   Assigned to   Milestone
subiquity            Fix Released   Undecided    Dan Bungert
cloud-init (Ubuntu)  Triaged        Medium       Unassigned
grub2 (Ubuntu)       Invalid        Undecided    Unassigned
subiquity (Ubuntu)   Fix Released   High         Dan Bungert

Bug Description

I am testing the 24.04 pre-beta in a libvirt virtual machine with two /dev/vd* disks set up as a single mirrored software RAID device, /dev/md0, that is used for the root filesystem. Since this is a libvirt install, it is using BIOS booting, not UEFI (maybe someday libvirt will support snapshots of UEFI based VMs). When I attempt to install Ubuntu updates, the grub-pc install fails with:

grub-pc: Running grub-install ...
Installing for i386-pc platform.
grub-install: warning: File system `ext2' doesn't support embedding.
grub-install: warning: Embedding is not possible. GRUB can only be installed in this setup by using blocklists. However, blocklists are UNRELIABLE and their use is discouraged..
grub-install: error: diskfilter writes are not supported.
  grub-install failure for /dev/md0
You must correct your GRUB install devices before proceeding:

  DEBIAN_FRONTEND=dialog dpkg --configure grub-pc
  dpkg --configure -a
dpkg: error processing package grub-pc (--configure):
 installed grub-pc package post-installation script subprocess returned error exit status 1

'debconf-show' reports (changed) settings as:
* grub-efi/cloud_style_installation: false
* grub-pc/install_devices: /dev/disk/by-id/md-name-ubuntu-server:0
* grub-pc/install_devices_empty: false

The same mirrored root filesystem configuration works on 22.04 LTS.

Revision history for this message
Mate Kukri (mkukri) wrote :

Can you provide a bit more detail about the exact disk layout being used here?

Is the RAID on the bare disks, or are there partition tables containing the RAID?

Also is there a partition table inside the raid, or is it directly formatted as ext4?

Is LVM used anywhere?

Revision history for this message
Chris Siebenmann (cks) wrote :

The RAID is on partitions, but there is no LVM involved. The layout was set up through the 24.04 server installer with custom storage layout, selecting both disks as boot disks, and then using all of their space as a single partition for the software RAID. The software RAID itself is unpartitioned.

/proc/mdstat:
md0 : active raid1 vdb2[1] vda2[0]
      41906176 blocks super 1.2 [2/2] [UU]

/proc/partitions:
 253 0 41943040 vda
 253 1 1024 vda1
 253 2 41939968 vda2
 253 16 41943040 vdb
 253 17 1024 vdb1
 253 18 41939968 vdb2
   9 0 41906176 md0

/proc/self/mounts of the root filesystem:
/dev/md0 / ext4 rw,relatime 0 0

I get this test VM into a state with a post-install grub2 update because I typically work by going through the server installer once, snapshotting the result, and then working from the snapshot to test our post-install actions, which involve updating all packages. This grub-pc problem has happened before several times and previously I've made it go away by rebuilding the initial install using the current daily server ISO. This time I'm reporting a bug.

Revision history for this message
Mate Kukri (mkukri) wrote :

Did you set grub-pc/install_devices yourself (e.g. through package configuration), or does it come from the installer?

> * grub-pc/install_devices: /dev/disk/by-id/md-name-ubuntu-server:0

I'd expect grub-pc/install_devices to contain the raw disks themselves e.g. to be /dev/vda /dev/vdb and not the raid.
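
A quick way to check for this condition is to scan the 'debconf-show grub-pc' output for install devices that name an md array rather than a disk. The following is a minimal Python sketch, not part of any Ubuntu tooling. The function names are hypothetical, the "looks like md" test is a heuristic based only on the device names seen in this report, and it assumes multiple devices are stored as a comma-separated list.

```python
import os
import re


def parse_debconf_show(text: str) -> dict:
    """Parse 'debconf-show' style lines ("* key: value") into a dict."""
    settings = {}
    for line in text.splitlines():
        # Drop the leading "*" that marks a changed setting.
        line = line.strip().lstrip("*").strip()
        key, sep, value = line.partition(":")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def looks_like_md(path: str) -> bool:
    """Heuristic (an assumption): by-id 'md-...' links, mdN nodes, /dev/md/ names."""
    name = os.path.basename(path)
    return (
        name.startswith("md-")
        or re.fullmatch(r"md\d+", name) is not None
        or path.startswith("/dev/md/")
    )


def md_install_devices(debconf_text: str) -> list:
    """Return the grub-pc/install_devices entries that point at an md array."""
    value = parse_debconf_show(debconf_text).get("grub-pc/install_devices", "")
    # Assumption: several devices are stored as a comma-separated list.
    devices = [d.strip() for d in value.split(",") if d.strip()]
    return [d for d in devices if looks_like_md(d)]
```

Fed the three debconf-show lines from the bug description, md_install_devices() returns the single by-id md-name entry. A value of "/dev/vda, /dev/vdb" yields an empty list.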

Revision history for this message
Mate Kukri (mkukri) wrote :

Could you try dpkg-reconfigure grub-pc and selecting the disks themselves?

Revision history for this message
Chris Siebenmann (cks) wrote :

Also, here's sgdisk output (the two disks have identical output apart from names):
Disk /dev/vda: 83886080 sectors, 40.0 GiB
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): 08D25DA2-9B12-45EA-B7EE-0978D2780899
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 83886046
Partitions will be aligned on 2048-sector boundaries
Total free space is 4029 sectors (2.0 MiB)

Number Start (sector) End (sector) Size Code Name
   1 2048 4095 1024.0 KiB EF02
   2 4096 83884031 40.0 GiB 8300

Revision history for this message
Chris Siebenmann (cks) wrote :

Both 'dpkg-reconfigure grub-pc' then selecting the /dev/vd* disks and manually running 'grub-install /dev/vda' (and then /dev/vdb) do work.
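
For reference, the disks to select can be derived from /proc/mdstat by mapping each member partition of the array back to its parent disk. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and the partition-to-disk mapping is a simplification that covers names like vda2 or sdb1 and the nvme0n1p2 / mmcblk0p1 style, nothing more exotic.

```python
import re


def parent_disk(partition: str) -> str:
    """Map a partition name to its parent disk name (simplified heuristic)."""
    # nvme0n1p2 -> nvme0n1, mmcblk0p1 -> mmcblk0
    m = re.fullmatch(r"(nvme\d+n\d+|mmcblk\d+)p\d+", partition)
    if m:
        return m.group(1)
    # vda2 -> vda, sdb1 -> sdb
    return re.sub(r"\d+$", "", partition)


def member_disks(mdstat_text: str, array: str = "md0") -> list:
    """Return the sorted /dev paths of the disks backing the given md array."""
    for line in mdstat_text.splitlines():
        name, sep, rest = line.partition(" : ")
        if sep and name.strip() == array:
            # Members look like "vdb2[1]"; capture the name before the bracket.
            members = re.findall(r"(\S+)\[\d+\]", rest)
            return sorted({"/dev/" + parent_disk(m) for m in members})
    return []
```

With the /proc/mdstat contents quoted earlier in this report, member_disks() gives /dev/vda and /dev/vdb, the two devices that were selected by hand here.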

Revision history for this message
Julian Andres Klode (juliank) wrote :

Is it possible the 22.04 install was set up using the legacy dm-raid format? The legacy dm-raid format does not include a header, so it looks like a raw ext2 to grub, and grub can "embed" there (as it will see the ext2 on either disk at boot).

Anyway, reassigning to subiquity for triaging.

Changed in grub2 (Ubuntu):
status: New → Invalid
Revision history for this message
Chris Siebenmann (cks) wrote :

I didn't change grub-pc/install_devices, and on our 22.04 BIOS MBR + mirrored software RAID servers (of which we have a lot), it has the same value (or the same sort of value, naming the md device). A random 22.04 server install is also using 'super 1.2' for its root /dev/md0 device superblock format, which will have come from the installer since we don't change or customize that. We have some remaining 20.04 LTS servers as well with this same mirrored software RAID root, and they also have superblock 1.2 format and the same grub-pc/install_devices setting. I think it has been this way in Ubuntu server installs for a long time (for BIOS MBR; UEFI is slightly different in that it also sets grub-efi/install_devices to the UEFI partitions on the boot disks).

Revision history for this message
Chris Siebenmann (cks) wrote :

I think I know what is happening here. In Ubuntu 20.04 and 22.04, the grub-pc.postinst has a chunk of code that was designed to deal with bug #1889556 by skipping running grub-install on package updates. The initial commit comment by Steve Langasek says:

debian/postinst.in: Avoid calling grub-install on upgrade of the grub-pc package, since we cannot be certain that it will install to the correct disk and a grub-install failure
will render the system unbootable. LP: #1889556
(commit 3aabdc6fe0ab3b6e129fc5b64238c45cbfd0de47 I believe)

This code is, in its final form in 22.04:
        elif dpkg --compare-versions "$2" ge 2.04-1ubuntu26 && [ -z "$DEBCONF_RECONFIGURE" ]; then
          # Avoid the possibility of breaking grub on SRU update
          # due to ABI change
          :

2.04-1ubuntu26 was the initial grub2 version in 20.04. This version was not updated for 22.04 to the 22.04 base version, but the effect was the same (since they were all more recent than the 20.04 base version). If you force a 22.04 machine to explicitly reconfigure grub-pc with 'dpkg-reconfigure grub-pc', it will fail with the same error message as in 24.04 (until you select the real devices).
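
To make the effect of that guard concrete, here is a small Python sketch of the same test. It models only the quoted branch, not the rest of grub-pc.postinst. The function names are hypothetical, and the comparison is a simplified reimplementation of dpkg's version ordering (epoch, upstream version, revision, with '~' sorting first), written for illustration rather than as a replacement for 'dpkg --compare-versions'.

```python
DIGITS = "0123456789"


def _order(ch: str) -> int:
    """Sort weight of one non-digit character, following dpkg's ordering."""
    if ch == "~":
        return -1  # tilde sorts before everything, even end of string
    if ch.isascii() and ch.isalpha():
        return ord(ch)  # letters sort before other characters
    return ord(ch) + 256


def _compare_fragment(a: str, b: str) -> int:
    i = j = 0
    while i < len(a) or j < len(b):
        # Compare the non-digit runs character by character.
        while (i < len(a) and a[i] not in DIGITS) or (j < len(b) and b[j] not in DIGITS):
            wa = _order(a[i]) if i < len(a) and a[i] not in DIGITS else 0
            wb = _order(b[j]) if j < len(b) and b[j] not in DIGITS else 0
            if wa != wb:
                return -1 if wa < wb else 1
            i += 1
            j += 1
        # Compare the following digit runs numerically (empty run counts as 0).
        si, sj = i, j
        while i < len(a) and a[i] in DIGITS:
            i += 1
        while j < len(b) and b[j] in DIGITS:
            j += 1
        na, nb = int(a[si:i] or "0"), int(b[sj:j] or "0")
        if na != nb:
            return -1 if na < nb else 1
    return 0


def _split(version: str):
    """Split into (epoch, upstream, revision); missing parts default to 0 / ''."""
    epoch, sep, rest = version.partition(":")
    if not sep:
        epoch, rest = "0", version
    upstream, sep, revision = rest.rpartition("-")
    if not sep:
        upstream, revision = rest, ""
    return int(epoch), upstream, revision


def compare_versions(a: str, b: str) -> int:
    """Return -1, 0 or 1 as version a sorts before, equal to, or after b."""
    ea, ua, ra = _split(a)
    eb, ub, rb = _split(b)
    if ea != eb:
        return -1 if ea < eb else 1
    return _compare_fragment(ua, ub) or _compare_fragment(ra, rb)


def skips_grub_install(previous_version: str, debconf_reconfigure: str = "") -> bool:
    """Mirror of the quoted branch: skip on upgrade unless reconfiguring."""
    return compare_versions(previous_version, "2.04-1ubuntu26") >= 0 and not debconf_reconfigure
```

Under this model, an upgrade from any version at or after 2.04-1ubuntu26 skips grub-install, while the same upgrade with DEBCONF_RECONFIGURE set (as 'dpkg-reconfigure grub-pc' does) does not, which matches the behavior described above.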

The reason 24.04 fails here is that the 20.04/22.04 change to grub-pc.postinst wasn't carried forward to 24.04, the way it was for 22.04 (in commit 00e473f4e2b2e7e607b3aad58cb0c085b1f0561a I believe), so grub-pc always tries to run grub-install on package updates and fails here. I don't know if this is a grub-pc 24.04 problem by itself.

Revision history for this message
Mate Kukri (mkukri) wrote :

Yeah you are right, this seems to have been uncovered by the removal of that workaround (the issue it solved wasn't related to this, and was solved better in the meantime).

That seems to indicate that this was a different bug for a long time that was hidden by that workaround.

I'll take another look at what is putting the wrong value into install_devices.

Revision history for this message
Mate Kukri (mkukri) wrote :

@cks I think I have it down to the installer configuring wrong values for grub debconf install_devices.

Changed in subiquity:
importance: Undecided → High
Mate Kukri (mkukri)
tags: added: foundations-todo
Revision history for this message
Mate Kukri (mkukri) wrote :

It looks like the installer is actually doing the correct thing, but then upon first boot the debconf magically contains a different value.

Currently suspecting cloud-init

Changed in subiquity:
status: New → Invalid
importance: High → Undecided
Revision history for this message
Mate Kukri (mkukri) wrote :

Confirmed that this issue is definitely caused by cloud-init replacing grub debconf with an incorrect value upon first boot.

Mate Kukri (mkukri)
Changed in cloud-init (Ubuntu):
assignee: nobody → Mate Kukri (mkukri)
Revision history for this message
Mate Kukri (mkukri) wrote :

Patch to remove the cloud-init grub_dpkg module. This is no longer necessary:
- Curtin sets the debconf value itself
- cloud images now use the 'grub-{efi,pc}/cloud_style_installation' option which ignores `install_devices`

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "cloud-init-1-24.1.3-0ubuntu4.diff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
Dan Bungert (dbungert)
Changed in subiquity (Ubuntu):
milestone: none → ubuntu-24.04
importance: Undecided → High
status: New → Triaged
Revision history for this message
Dan Bungert (dbungert) wrote :

After discussion with Chad, we have decided that the low-risk plan for 24.04 is for subiquity to inform cloud-init that the grub_dpkg module should not be used. Will implement.

Changed in subiquity (Ubuntu):
assignee: nobody → Dan Bungert (dbungert)
Changed in subiquity:
status: Invalid → Triaged
assignee: nobody → Dan Bungert (dbungert)
Revision history for this message
Chad Smith (chad.smith) wrote :

@Mate thanks for the patch suggestion and triage work here.

While this approach will work for live-server/live-desktop and/or Ubuntu cloud images, which have downstream grub2 deb fixes for handling the 'new' grub-pc/cloud_style_installation debconf setting, are we missing other use cases that will fall over if cloud-init never attempts to detect the proper boot device on image launch?

Does this issue also affect upstream Debian, which may not set grub-pc/cloud_style_installation? I'm not seeing the comparable changes in upstream Debian grub2 that would take care of properly determining the boot device based on the debconf grub-pc/cloud_style_installation boolean.

This bug feels reminiscent of LP: #1993503, which affects live server (subiquity/curtin based) installs, which are performing that disk setup in the first place.

I'm concerned about complete removal of cloud-init's grub_dpkg module as a solution because it's a big hammer.
Without cloud-init's cc_grub_dpkg module, cloud-init may not find the right boot devices if grub2 doesn't support the grub-pc/cloud_style_installation boolean or if subiquity wasn't involved in the initial disk setup.

An additional concern is that image creation tools like packer currently rely on cloud-init's behavior to detect and correct debconf grub-pc/install_devices values in 'first boot' scenarios to ensure the boot device is found (see https://bugs.launchpad.net/cloud-init/+bug/1993503/comments/6). Dropping the module completely from cloud.cfg prevents any workaround in user-data or vendor-data to re-enable this module for some unsatisfied corner cases.

If we were to change anything in cloud-init's behavior related to grub2, maybe we limit this changeset to setting the default of the "grub_dpkg" config module[1] to "enabled: false" so it makes no changes by default. This would still permit users or platforms to provide "grub_dpkg: {enabled: true}" in either #cloud-config user-data or cloud vendor-data if the default behavior was insufficient.

[1] https://github.com/canonical/cloud-init/blob/24.1/cloudinit/config/cc_grub_dpkg.py#L148

Images derived from the live installer (subiquity/curtin), which performs the disk setup itself, could provide an /etc/cloud/cloud.cfg.d/ configuration snippet that disables cloud-init's cc_grub_dpkg configuration module. That would prevent this specific issue in the first place, because cloud-init would not attempt to use grub-probe to provide debconf selections to grub-pc when it encounters this config:

grub_dpkg:
  enabled: false
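
As an illustration of how an installer could drop such a snippet into an installed system, here is a minimal Python sketch. It is not subiquity's actual code: the function names are hypothetical, the file name is borrowed from the autoinstall example later in this thread, and the file body is assumed to be exactly the two YAML lines above.

```python
import tempfile
from pathlib import Path

# Assumptions: file name from the autoinstall example in this thread,
# body equal to the YAML fragment quoted above.
SNIPPET_NAME = "20-disable-cc-dpkg-grub.cfg"
SNIPPET_BODY = "grub_dpkg:\n  enabled: false\n"


def write_disable_snippet(target_root) -> Path:
    """Write the cloud.cfg.d snippet under target_root and return its path."""
    cfg_dir = Path(target_root) / "etc" / "cloud" / "cloud.cfg.d"
    cfg_dir.mkdir(parents=True, exist_ok=True)
    path = cfg_dir / SNIPPET_NAME
    path.write_text(SNIPPET_BODY)
    return path


def preview_snippet() -> str:
    """Write the snippet into a throwaway directory and return its contents."""
    with tempfile.TemporaryDirectory() as root:
        return write_disable_snippet(root).read_text()
```

Calling write_disable_snippet("/target") would correspond to the installer's view of the installed system; preview_snippet() only exercises the same code against a temporary directory.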

Revision history for this message
Dan Bungert (dbungert) wrote :
Changed in subiquity:
status: Triaged → In Progress
Changed in subiquity (Ubuntu):
status: Triaged → In Progress
Mate Kukri (mkukri)
Changed in cloud-init (Ubuntu):
assignee: Mate Kukri (mkukri) → nobody
Dan Bungert (dbungert)
Changed in subiquity:
status: In Progress → Fix Released
Changed in subiquity (Ubuntu):
status: In Progress → Fix Released
Changed in subiquity:
status: Fix Released → Fix Committed
Changed in subiquity (Ubuntu):
status: Fix Released → Fix Committed
Revision history for this message
Chad Smith (chad.smith) wrote :

Thanks, Dan, for the subiquity patch that places an /etc/cloud/cloud.cfg.d/ snippet into the target installed system. With this approach, customers on the live server/desktop install path who require cloud-init's default grub_dpkg behavior for some corner case can still turn that behavior back on by providing autoinstall user-data like the following:

#cloud-config
autoinstall:
 version: 1
 late-commands: [ rm -f /target/etc/cloud/cloud.cfg.d/20-disable-cc-dpkg-grub.cfg ]

- or -

#cloud-config
autoinstall:
 version: 1
 user-data:
   ...
   grub_dpkg:
    enabled: true

Revision history for this message
Chad Smith (chad.smith) wrote :

For the cloud-init element/task of this bug, before we change our current upstream grub_dpkg debconf set-selections behavior for boot device detection, we need to consider the following scenarios where the absence of cloud-init grub_dpkg behavior may be insufficient.

If Ubuntu primary use-cases determine we should have this cc_grub_dpkg debconf set-selections behavior off in the majority of our supported install paths, we'll need to make sure we file bugs against tooling we are aware of (packer) impacted by this change in behavior. We will also need to update docs on best practices for golden image creation that would provide a footnote on this behavior and remediation steps for any downstream deb-based distro that could be impacted by this change.

Changed in cloud-init (Ubuntu):
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Dan Bungert (dbungert) wrote :

We believe this issue has been resolved in Subiquity 24.04.1, which can be obtained on the Ubuntu 24.04 LTS ISO or as a snap refresh.

Changed in subiquity:
status: Fix Committed → Fix Released
Changed in subiquity (Ubuntu):
status: Fix Committed → Fix Released