[Regression] breaks GRUB install on an nvme device

Bug #1891718 reported by dann frazier
24
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Efivar
Fix Released
Unknown
efivar (Ubuntu)
Fix Released
Undecided
dann frazier
Focal
Fix Released
Undecided
dann frazier
Groovy
Fix Released
Undecided
dann frazier

Bug Description

[Impact]
Grub fails to install on systems with nvme-subsys storage when installing focal, or upgrading from bionic to focal. As symptom of the latter is shown below:

┌───────────────────────┤ Configuring shim-signed ├────────────────────────┐
  │ │
  │ GRUB failed to install to the following devices: │
  │ │
  │ /dev/nvme0n1p1 │
  │ │
  │ Do you want to continue anyway? If you do, your computer may not start │
  │ up properly. │
  │ │
  │ Writing GRUB to boot device failed - continue? │
  │ │
  │ <Yes> <No>

I've hit this on 2 systems so far - a high-end Nvidia server and an older ARM Server.

[Test Case]
On a system with an EFI System Partition residing on an nvme-subsys block device, run grub-install:

$ sudo /usr/sbin/grub-install
Installing for x86_64-efi platform.
/usr/sbin/grub-install: warning: Internal error.
/usr/sbin/grub-install: error: failed to register the EFI boot entry: Operation not permitted.

Also, regression test on a system with a non-nvme-subsys NVMe device.

[Regression Potential]
There's a risk that a parsing bug will introduce a regression to other systems - most at risk are systems with NVMe block devices. Supposedly nvme-fabrics systems were not previously supported and, as a side-effect of this backport, will now be. However, it's possible - like was the case with bionic/nvme-subsys - that nvme-fabrics *used* to happen to work, and now will work differently or even break. Ideally we'd be able to test on such a system, but I don't know where to find that hardware.

This fix has been in groovy since just after beta (minus an innocuous debug statement that upstream requested during review), and the same patches apply cleanly to focal, which should help mitigate the risk by way of some real world exposure.

Changed in efivar:
status: Unknown → New
Jeff Lane  (bladernr)
tags: added: blocks-hwcert-server
dann frazier (dannf)
Changed in efivar (Ubuntu Groovy):
status: New → In Progress
assignee: nobody → dann frazier (dannf)
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package efivar - 37-5ubuntu1

---------------
efivar (37-5ubuntu1) groovy; urgency=medium

  * Add support for nvme-fabrics and nvme-subsystem devices. LP: #1891718.

 -- dann frazier <email address hidden> Sat, 03 Oct 2020 11:09:53 -0600

Changed in efivar (Ubuntu Groovy):
status: In Progress → Fix Released
dann frazier (dannf)
Changed in efivar (Ubuntu Focal):
assignee: nobody → dann frazier (dannf)
status: New → In Progress
dann frazier (dannf)
description: updated
description: updated
dann frazier (dannf)
description: updated
description: updated
Revision history for this message
Robie Basak (racb) wrote :

Hi dann,

It looks like there's quite a bit of refactoring going on in the patchset you backported. Would it be feasible to write a smaller fix just for the problem being fixed instead, or is this no practical?

From SRU policy:

"""We never assume that any change, no matter how obvious, is completely free of regression risk.

In line with this, the requirements for stable updates are not necessarily the same as those in the development release. When preparing future releases, one of our goals is to construct the most elegant and maintainable system possible, and this often involves fundamental improvements to the system's architecture, rearranging packages to avoid bundled copies of other software so that we only have to maintain it in one place, and so on. However, once we have completed a release, the priority is normally to minimise risk caused by changes not explicitly required to fix qualifying bugs, and this tends to be well-correlated with minimising the size of those changes. As such, the same bug may need to be fixed in different ways in stable and development releases.
"""

Revision history for this message
dann frazier (dannf) wrote :

Hi Robie,
  Thank you for the review. I don't believe a surgical fix is practical here. I did look to see if we could somehow back out the code that introduced the regression. That was introduced upstream between versions v35 & v36 in this commit:
  https://github.com/rhboot/efivar/commit/ff696a432bf92af1206e235a60ad28b58ff0ab92
That was a significant refactoring itself and taking a v37-based-package back to the "old way" didn't seem feasible. So I think we're stuck evolving the new code to deal with these devices, and I don't have an alternate solution to following what upstream has done. Here's some patch-by-patch justification/commentary:

= 0001-Fix-the-error-path-in-set_disk_and_part_name.patch =
TBH, this is one patch that could be dropped, and I could refactor later patches to leave the old behavior. It fixes an obvious bug - that is, even if set_disk_and_part_name() fails to parse the device, it will always return 0. The calling code in device_get() checks for non-zero before proceeding to dereference strings that set_disk_and_part_name() would only set if it *actually* succeeded. Since technically this bug is unrelated to what I'm fixing here, I'm open to dropping it if you prefer or, alternatively, I could file a new bug and update the patch comments to make it's justification appear less arbitrary.

= 0002-sysfs-parsers-make-all-the-sys-block-link-parsers-wo.patch =
While the upstream version of the patch I backported does change all parsers, I restricted the changes to the NVMe parser to reduce regression risk. This patch is part of the evolution to add support for nvme-subsys devices, and I think trying to avoid it would be riskier than including it.

= 0003-Try-even-harder-to-find-disk-device-symlinks-in-sysf.patch =
The refactoring here is introduced to change sysfs parsing behavior, where NVMe devices are the given example in the commit log/comments. I don't see a practical way of avoiding it.

= 0004-Handle-sys-devices-virtual-nvme-fabrics-nvme-subsyst.patch =
This adds the virtual/nvme-subsystem parsing code that is required to solve this issue.

= 0005-Fix-variable-sz-uninitialized-error.patch =
A bugfix for the code introduced above.

= 0006-Fix-parsing-for-nvme-subsystem-devices.patch =
Another bugfix that makes the nvme-subsystem code actually do what the existing comments assert it should do.

Revision history for this message
dann frazier (dannf) wrote :

For others hitting this issue - a workaround is to set nvme-core.multipath=N on the kernel command line.

Revision history for this message
dann frazier (dannf) wrote :

TLDR; I've convinced myself that the first patch can be safely dropped, so I've re-uploaded without it.

I attempted to demonstrate the problem I mentioned in Comment #3 that 0001-Fix-the-error-path-in-set_disk_and_part_name.patch addresses. Basically, what would happen if I passed in a device that set_disk_and_part_name() did not know about, causing it to incorrectly report no error while leaving dev->disk_name/dev->part_name uninitialized in the caller. Turns out, the answer is "nothing awful". I used the following test on a system where nvme0n1 was a nvme-subsys device, which focal's set_disk_and_part_name() would not recognize:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <efivar/efiboot-creator.h>

extern int efi_set_verbose(int verbosity, FILE *errlog);

int main() {
  int err;
  efi_set_verbose(1, stderr);
  err = efi_generate_file_device_path_from_esp(NULL, 0,
            "/dev/nvme0n1", 1,
            "\\EFI\\debian\\grub.efi",
            EFIBOOT_ABBREV_EDD10,
            0x80);
}

Even though neither name was set, the pointer value happens to be zeroed out:

linux.c:387 device_get(): dev->disk_name: (null)
linux.c:388 device_get(): dev->part_name: (null)

And everything between that and the next error check seems to handle a NULL pointer well, so we safely bail at the next check:

linux.c:392 device_get(): readlink of /sys/block/(null)/device failed

That's great if they *are* zero'd out - but is that guaranteed? Yes - the dev is allocated using calloc, which always sets allocated memory to 0.

dann frazier (dannf)
description: updated
Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello dann, or anyone else affected,

Accepted efivar into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/efivar/37-2ubuntu2.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in efivar (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-focal
Revision history for this message
dann frazier (dannf) wrote :

Verification:

ubuntu@blanka:~$ readlink /sys/dev/block/259\:3
../../devices/virtual/nvme-subsystem/nvme-subsys0/nvme0n1
ubuntu@blanka:~$ sudo apt install libefiboot1 libefivar1
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be upgraded:
  libefiboot1 libefivar1
2 upgraded, 0 newly installed, 0 to remove and 59 not upgraded.
Need to get 88.2 kB of archives.
After this operation, 9216 B of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal-proposed/main amd64 libefivar1 amd64 37-2ubuntu2.1 [47.8 kB]
Get:2 http://archive.ubuntu.com/ubuntu focal-proposed/main amd64 libefiboot1 amd64 37-2ubuntu2.1 [40.4 kB]
Fetched 88.2 kB in 1s (172 kB/s)
(Reading database ... 143571 files and directories currently installed.)
Preparing to unpack .../libefivar1_37-2ubuntu2.1_amd64.deb ...
Unpacking libefivar1:amd64 (37-2ubuntu2.1) over (37-2ubuntu2) ...
Preparing to unpack .../libefiboot1_37-2ubuntu2.1_amd64.deb ...
Unpacking libefiboot1:amd64 (37-2ubuntu2.1) over (37-2ubuntu2) ...
Setting up libefivar1:amd64 (37-2ubuntu2.1) ...
Setting up libefiboot1:amd64 (37-2ubuntu2.1) ...
Processing triggers for libc-bin (2.31-0ubuntu9.1) ...
ubuntu@blanka:~$ sudo grub-install
Installing for x86_64-efi platform.
Installation finished. No error reported.
ubuntu@blanka:~$

tags: added: verification-done verification-done-focal
removed: verification-needed verification-needed-focal
Revision history for this message
dann frazier (dannf) wrote :

Also regression tested on an EFI system w/ ESP on a non-nvme-subsys device (virtual machine w/ virtio disks):

dannf@focal:~$ readlink /sys/dev/block/252:2
../../devices/pci0000:00/0000:00:02.2/0000:03:00.0/virtio2/block/vda/vda2
dannf@focal:~$ sudo grub-install /dev/vda
Installing for i386-pc platform.
Installation finished. No error reported.

Revision history for this message
dann frazier (dannf) wrote :

And regression tested on an NVMe (but non-nvme-subsys) device on an ARM server:

ubuntu@d05-4:~$ mount | grep /boot/efi
/dev/nvme0n1p1 on /boot/efi type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)
ubuntu@d05-4:~$ ls -l /dev/nvme0n1p1
brw-rw---- 1 root disk 259, 1 Nov 3 23:40 /dev/nvme0n1p1
ubuntu@d05-4:~$ readlink /sys/dev/block/259\:1
../../devices/pci000d:30/000d:30:00.0/000d:31:00.0/nvme/nvme0/nvme0n1/nvme0n1p1
ubuntu@d05-4:~$ sudo grub-install
Installing for arm64-efi platform.
Installation finished. No error reported.
ubuntu@d05-4:~$ sudo apt install libefivar1 libefiboot1
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be upgraded:
  libefiboot1 libefivar1
2 upgraded, 0 newly installed, 0 to remove and 60 not upgraded.
Need to get 78.2 kB of archives.
After this operation, 8192 B of additional disk space will be used.
Get:1 http://ports.ubuntu.com/ubuntu-ports focal-proposed/main arm64 libefivar1 arm64 37-2ubuntu2.1 [42.5 kB]
Get:2 http://ports.ubuntu.com/ubuntu-ports focal-proposed/main arm64 libefiboot1 arm64 37-2ubuntu2.1 [35.7 kB]
Fetched 78.2 kB in 0s (233 kB/s)
(Reading database ... 77021 files and directories currently installed.)
Preparing to unpack .../libefivar1_37-2ubuntu2.1_arm64.deb ...
Unpacking libefivar1:arm64 (37-2ubuntu2.1) over (37-2ubuntu2) ...
Preparing to unpack .../libefiboot1_37-2ubuntu2.1_arm64.deb ...
Unpacking libefiboot1:arm64 (37-2ubuntu2.1) over (37-2ubuntu2) ...
Setting up libefivar1:arm64 (37-2ubuntu2.1) ...
Setting up libefiboot1:arm64 (37-2ubuntu2.1) ...
Processing triggers for libc-bin (2.31-0ubuntu9.1) ...
ubuntu@d05-4:~$ sudo grub-install
Installing for arm64-efi platform.
Installation finished. No error reported.

Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Thank you for testing all the cases! Releasing.

Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for efivar has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package efivar - 37-2ubuntu2.1

---------------
efivar (37-2ubuntu2.1) focal; urgency=medium

  * Add support for nvme-fabrics and nvme-subsystem devices. LP: #1891718.

 -- dann frazier <email address hidden> Fri, 30 Oct 2020 17:04:24 -0600

Changed in efivar (Ubuntu Focal):
status: Fix Committed → Fix Released
Changed in efivar:
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.