grub-efi/install_devices becoming stale due to by-id/nvme-eui.* symlinks disappearing

Bug #2083176 reported by Marc Deslauriers
This bug affects 1 person
Affects Status Importance Assigned to Milestone
grub2 (Ubuntu)
Confirmed
High
Unassigned
linux (Ubuntu)
New
Undecided
Unassigned

Bug Description

A family member just sent me this dialog that popped up when they installed their updates today. I'm not sure how a regular user is supposed to be able to handle what is presented here. Do they check the box? What happens if they don't?

Heck, even I don't know what the proper action is here.

This dialog box needs to be removed and a safe default needs to be applied automatically during upgrades.

Revision history for this message
Marc Deslauriers (mdeslaur) wrote :
description: updated
Revision history for this message
Mate Kukri (mkukri) wrote :

Strictly speaking, this dialog box isn't normal or expected: the installer initially configures grub_{efi,pc}/install_devices, and after that this prompt isn't expected to pop up.

It usually shows up when the originally configured install device has become invalid due to some system configuration change, and strictly speaking, outside the simplest case of a single connected hard disk, there is no safe default that is appropriate.

Please see https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1940723 and its many duplicates.

Is it possible that something was changed about this installation / computer since it was originally installed that could have caused the install device to become invalid?

no longer affects: shim-signed (Ubuntu)
summary: - Technical dialog during upgrade
+ grub-efi install device being prompted on upgrade
Revision history for this message
Julian Andres Klode (juliank) wrote : Re: grub-efi install device being prompted on upgrade

In this particular instance, the only option is the partition already mounted at /boot/efi, so in that case it seems sensible to automatically use it.

Revision history for this message
Mate Kukri (mkukri) wrote :

Yeah, I think that fallback already happens if no install device was configured for grub-efi. What's likely going on here is that one was configured and then became invalid.

Either way, I agree that we should fill install_devices automatically in postinst, even when the configured value has gone stale, with the ESP mounted at /boot/efi, if that is the only one we can find.
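
A rough sketch of what that automatic fallback could look like, reusing the debconf helpers and the get_mounted_device / device_to_id helpers from the grub postinst (an assumption about the shape of the change, not an actual patch):

# Sketch only: if the configured install device is stale ($valid being the flag
# computed by the existing staleness check), fall back to the mounted ESP.
if [ "$valid" = 0 ]; then
  esp="$(get_mounted_device /boot/efi)"
  [ -n "$esp" ] && esp="$(device_to_id "$esp")"
  if [ -n "$esp" ]; then
    db_set grub-efi/install_devices "$esp"
    db_fset grub-efi/install_devices seen true
  fi
fi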

Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

> Is it possible that something was changed about this installation / computer since it was originally installed that could have caused the install device to become invalid?

There's nothing special about this device. It's just an HP laptop with a single disk that was installed by me in a default way using the regular installer, and probably upgraded from focal to jammy (I would have to look at the logs to jog my memory).

Revision history for this message
Julian Andres Klode (juliank) wrote : Re: grub-efi install device being prompted on upgrade, despite only /boot/efi being an option.

It's possible that there were two ESPs configured, and we must not silently downgrade to a single ESP, so we must count how many ESPs were previously configured.
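
For illustration, counting the previously configured devices before deciding whether the fallback is safe might look roughly like this (a sketch only; RET is the debconf answer, as in the postinst snippet quoted further down this bug):

# Sketch only: refuse to silently repair a multi-ESP configuration.
configured=0
for device in $RET; do
  configured=$((configured + 1))
done
if [ "$configured" -gt 1 ]; then
  echo "multiple ESPs were configured; prompt instead of silently downgrading"
fi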

summary: - grub-efi install device being prompted on upgrade
+ grub-efi install device being prompted on upgrade, despite only
+ /boot/efi being an option.
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in grub2 (Ubuntu):
status: New → Confirmed
Revision history for this message
Mate Kukri (mkukri) wrote (last edit ):

Is it possible that devices originally installed before the multiple-ESP support was introduced don't have grub-efi/install_devices configured at all?

This was way before my time, but since that key doesn't exist in Debian (I think Debian grub only has install_devices for grub-pc and EFI just goes by ESP mountpoint), I have a hunch that this might be the case.

Revision history for this message
Julian Andres Klode (juliank) wrote :

That's correct yes; if the option hasn't been set yet, and /boot/efi is mounted, it migrates /boot/efi into it:

# We either migrate /boot/efi over, or we check if we have invalid devices
if [ -z "$RET" ] && [ "$seen" != "true" ]; then
  echo "Trying to migrate /boot/efi into esp config"
  esp="$(get_mounted_device /boot/efi)"
  if [ "$esp" ]; then
    esp="$(device_to_id "$esp")"
  fi
  if [ "$esp" ]; then
    db_set grub-efi/install_devices "$esp"
    db_fset grub-efi/install_devices seen true
    RET="$esp"
  fi
else
  for device in $RET; do
    if [ ! -e "${device%,}" ]; then
      valid=0
      break
    fi
  done
fi
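
For illustration only (not part of the postinst): the staleness branch above trips exactly when a configured by-id path no longer resolves, which is what later turned out to be happening here.

# Illustrative check using the path reported later in this bug:
RET="/dev/disk/by-id/nvme-eui.ace42e00256632d42ee4ac0000000001-part1,"
for device in $RET; do
  if [ ! -e "${device%,}" ]; then
    echo "stale install device: ${device%,}"
  fi
done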

Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

The laptop is still waiting at this dialog in case there's some relevant information that would be useful for this bug.

Revision history for this message
Mate Kukri (mkukri) wrote :

Can you do a `debconf-show grub-pc` (or maybe it is `debconf-show grub-efi-amd64`)? It seems like the migration code should have handled this without prompting.

Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

Here's the output of debconf-show grub-pc.
debconf-show grub-efi-amd64 didn't return anything.
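
For reference, the relevant line in that output would look roughly like this (an illustrative excerpt only, not the actual attachment; the `*` just marks a question debconf considers seen, and the path matches the one discussed in the next comment):

$ debconf-show grub-pc
...
* grub-efi/install_devices: /dev/disk/by-id/nvme-eui.ace42e00256632d42ee4ac0000000001-part1
...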

Revision history for this message
Mate Kukri (mkukri) wrote :

Hmm, I think I see the problem: install_devices isn't empty, so the migration must have succeeded, but somehow the symlink /dev/disk/by-id/nvme-eui.ace42e00256632d42ee4ac0000000001-part1 that should point to your ESP doesn't exist?

Can you verify whether /dev/disk/by-id/nvme-eui.ace42e00256632d42ee4ac0000000001-part1 exists, or whether there is anything similar next to it in the same directory?

It might be that something is pulling the rug from under our symlinks :/
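
A quick way to check (hedged example, assuming the disk is nvme0n1; not tied to this exact machine):

ls -l /dev/disk/by-id/ | grep nvme      # list the NVMe by-id symlinks actually present
readlink -f /dev/disk/by-id/nvme-*      # show what each one resolves to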

Revision history for this message
Mate Kukri (mkukri) wrote :

Or, on the off chance, have you migrated this installation to a different disk with a different NVMe EUI than it was originally on?

Revision history for this message
Marc Deslauriers (mdeslaur) wrote :
Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

I didn't migrate the installation. It looks like I installed it with jammy. Here are the /var/log/installer contents if that helps any.

Revision history for this message
Mate Kukri (mkukri) wrote :

So it looks like the real cause is that the nvme-eui.* symlinks disappear for some reason.

This will need more time to investigate, but I've seen other issues with these exact symlinks disappearing, so we need to figure out the root cause.

summary: - grub-efi install device being prompted on upgrade, despite only
- /boot/efi being an option.
+ grub-efi/install_devices becoming stale due to by-id/nvme-eui.* symlinks
+ disappearing
Revision history for this message
Mate Kukri (mkukri) wrote :

I want to move to UUID-based ESP selection instead of this broken, fragile mess anyway, but figuring out the root cause of the symlinks going away would still be nice.

It seems like udev or the kernel are the possible culprits here.

Changed in grub2 (Ubuntu):
importance: Undecided → High
Revision history for this message
Chris Coulson (chrisccoulson) wrote :

What's the output of:

$ cat /sys/class/block/nvme0n1/wwid

That is what 60-persistent-storage.rules uses to generate those symlinks in the first place.
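
In other words, the by-id name is derived from that wwid value, so comparing the two shows whether the identifier the symlink was created from has changed (a rough check, assuming the disk is nvme0n1):

cat /sys/class/block/nvme0n1/wwid       # identifier udev derives the by-id name from
ls /dev/disk/by-id/ | grep '^nvme-'     # names actually present; a mismatch means the wwid changed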

Revision history for this message
Chris Coulson (chrisccoulson) wrote :
Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

/sys/class/block/nvme0n1/wwid is:

nvme.1c5c-465342334e3636383131343130334f3259-534b48796e69785f48464d35313247443348583031354e-00000001

Revision history for this message
Chris Coulson (chrisccoulson) wrote :

Bingo :)

I think we just need confirmation that the SSD's PCI vendor and device ID match the ones in the quirk to close the loop, and then we'll know the root cause.
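
One way to pull the PCI vendor and device ID out (a hedged example, filtering on the NVMe controller class string):

lspci -nn | grep -i 'non-volatile'
# expected form: "01:00.0 Non-Volatile memory controller [0108]: <model> [vendor:device]"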

Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

We have a winner!

01:00.0 Non-Volatile memory controller [0108]: SK hynix Gold P31 SSD [1c5c:174a]

Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

So now that we've identified the root cause, I have checked the box beside the disk that is displayed, clicked the Next button and am presented with a dialog with an unchecked box that says "Continue without installing grub". If I don't check that, I get a warning and I go back to the disk selection screen.

Am I supposed to check the "Continue without installing grub" box? Is there an impact to doing that? Do I need to manually fix something?

Revision history for this message
Mate Kukri (mkukri) wrote (last edit ):

It seems very weird that checking the disk isn't letting you install GRUB; selecting it should just make the reinstall work.

If just running grub-install after the package upgrade succeeds, the machine will likely boot fine, but it's not great that the postinst gets stuck in a loop when selecting the disk.
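
For anyone hitting the same loop, the manual recovery being suggested is roughly the following (assuming an amd64 EFI install with the ESP mounted at /boot/efi):

sudo grub-install    # on Ubuntu EFI installs this targets the ESP at /boot/efi by default
sudo update-grub     # regenerate grub.cfg; optional here, but harmless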

Revision history for this message
Marc Deslauriers (mdeslaur) wrote (last edit ):

grub-install worked, and the laptop rebooted successfully. Thanks!

Revision history for this message
Trent Lloyd (lathiat) wrote :

I looked into this a few months ago for slightly different reasons (juju/maas getting confused and not identifying a disk, due to differing kernels being used for install vs. boot). I can confirm that, at the time, I found the NVMe by-id symlinks change due to backporting of the NVME_QUIRK_BOGUS_NID quirk.

Unfortunately, backports of this quirk for various SSD models have regularly been landing in upstream linux -stable kernels. I ran out of time to follow up on this at the time, but this practice probably needs to be raised with the upstream kernel developers and possibly needs to stop, and/or some solution around the symlinks needs to happen. I didn't quite get as far as fully understanding why the bogus NID matters, what it breaks, or what the change fixes.

There are a couple of other open bugs related to this issue, e.g. where it also breaks on upgrade:
https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/2039108
https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1940723

In my juju/maas case this was happening with VirtIO SCSI devices too, not a real SSD, as those were also quirked. That may make for a way to reproduce the issue without one of the affected SSDs.

Possibly also related links I collected:
https://<email address hidden>/T/#madf46b0ae9d07405bad2e324cb782c477e7518b2:
https://bugs.launchpad.net/curtin/+bug/2015100
https://bugzilla.redhat.com/show_bug.cgi?id=2031810
https://bugzilla.kernel.org/show_bug.cgi?id=217981
https://www.truenas.com/community/threads/bluefin-to-cobia-rc1-drive-now-fails-with-duplicate-ids.113205/

Revision history for this message
Mate Kukri (mkukri) wrote :

> "Unfortunately backports of this quirk for random SSD models has been regularly done to linux -stable kernels upstream."

This problem is worse than that, because GRUB packaging assumes that these symlinks will stay valid across OS upgrades as well.

Getting rid of the reliance on these links will be among my first plans for the 25.04 cycle, but this seems like a case of "we will never break userspace, except when it's convenient".

Revision history for this message
Chris Coulson (chrisccoulson) wrote :

> I didn't quite get as far as understanding why the BOGUS NID matters and what that breaks, or what is fixed by the change, fully.

The issue is that the unique identifiers are meant to be unique between drives, but some specific SSDs do not report unique values, which makes them difficult to identify if you have more than one in the same machine. The quirk makes the kernel generate a unique value by including the serial number instead.

Revision history for this message
Mate Kukri (mkukri) wrote :

That is totally understandable, but if the absolute stability of these identifiers can't be guaranteed at the kernel level, we will simply have to use something else to permanently identify bootloader installation targets.
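
As a sketch of that direction (an assumption about the future approach, not the actual packaging change), the ESP can already be identified by its filesystem UUID rather than a by-id device symlink:

esp_uuid="$(findmnt -n -o UUID /boot/efi)"                 # e.g. a FAT UUID like A1B2-C3D4
esp_dev="$(readlink -f "/dev/disk/by-uuid/$esp_uuid")"     # resolves regardless of nvme-eui.* naming
echo "ESP $esp_dev identified by filesystem UUID $esp_uuid"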
