Ironic does not purge UEFI NVRAM entries unrelated to the OS being deployed

Bug #2041901 reported by Julia Kreger
This bug affects 1 person
Affects: Ironic
Status: Fix Released
Importance: Medium
Assigned to: Steve Baker

Bug Description

When Ironic deploys a node, it performs two actions, at least *typically*:

1) Removes any prior exact label matches in the UEFI NVRAM. I.e. if you deploy an OS with a default UEFI NVRAM payload of "Joe's Linux", then any existing record for "Joe's Linux" will be removed.

2) Injects *whatever* default record is in the disk image being deployed, *or* injects a record for any known default found in the list of known boot loaders carried in IPA.
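Step 1 comes down to comparing labels exactly. The following is a minimal Python sketch, assuming `efibootmgr -v` prints each entry as `BootXXXX* <label><TAB><device path>` (older efibootmgr releases use a tab there; other versions may format the line differently). `parse_entries`, `exact_label_matches` and `BootEntry` are hypothetical names for illustration, not IPA's API, and the sample output is made up.

```python
import re
from typing import List, NamedTuple

# Assumed entry line shape (label and device path separated by a tab):
#   Boot0003* Joe's Linux<TAB>HD(1,GPT,...)/File(\EFI\joe\shimx64.efi)
# "*" marks an active entry; BootCurrent/BootOrder/Timeout lines never match.
ENTRY_RE = re.compile(r"^Boot([0-9A-Fa-f]{4})\*?\s+([^\t]*)\t(.*)$")


class BootEntry(NamedTuple):
    number: str  # four hex digits, e.g. "0003"
    label: str   # label shown in the firmware boot menu
    path: str    # device path text as printed by `efibootmgr -v`


def parse_entries(output: str) -> List[BootEntry]:
    """Turn `efibootmgr -v` output into BootEntry records."""
    entries = []
    for line in output.splitlines():
        m = ENTRY_RE.match(line)
        if m:
            entries.append(BootEntry(m.group(1), m.group(2).strip(), m.group(3)))
    return entries


def exact_label_matches(entries: List[BootEntry], label: str) -> List[str]:
    """Boot numbers of records whose label equals `label` exactly (step 1)."""
    return [e.number for e in entries if e.label == label]


SAMPLE = (
    "BootCurrent: 0001\n"
    "BootOrder: 0001,0003,0004,0005\n"
    "Boot0001* Network\tPciRoot(0x0)/Pci(0x3,0x0)/MAC(525400123456,1)\n"
    "Boot0003* Joe's Linux\tHD(1,GPT,1234,0x800,0x100000)/File(\\EFI\\joe\\shimx64.efi)\n"
    "Boot0004* Joe's Linux\tHD(1,GPT,1234,0x800,0x100000)/File(\\EFI\\joe\\shimx64.efi)\n"
    "Boot0005* Joe\tHD(1,GPT,1234,0x800,0x100000)/File(\\EFI\\joe\\shimx64.efi)\n"
)

print(exact_label_matches(parse_entries(SAMPLE), "Joe's Linux"))
```

With this sample only `0003` and `0004` are selected. The truncated `Joe` record in `Boot0005`, of the same kind as the "Red" record described below, survives an exact-label deletion, which is one way the pollution builds up.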

This is further compounded because some bootloaders, whenever they run, auto-inject their default record if it is not already configured. For example, Shim injects the default loader CSV's record into UEFI NVRAM for all future boot operations.

This sometimes results in additional records labeled something like "Red" when "Red Hat Enterprise Linux" is on the disk.

This is *further* complicated by the need to *always* delete first, then add records.
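The delete-first ordering can be made explicit by planning every removal before the single creation. This is a sketch only: `plan_nvram_update` is a hypothetical helper, the disk, partition and loader values are assumptions, and the function just builds `efibootmgr` argument lists (`-b`/`-B` to delete a boot number, `-c -d -p -L -l` to create an entry) without executing anything.

```python
from typing import List


def plan_nvram_update(stale: List[str], disk: str, partition: int,
                      label: str, loader: str) -> List[List[str]]:
    """Build efibootmgr argv lists: every deletion first, then one creation.

    Nothing is executed here; a caller could pass each list to subprocess.run.
    """
    # -b selects the boot number, -B deletes it.
    commands = [["efibootmgr", "-b", num, "-B"] for num in stale]
    # The new record is only created once all stale records are gone.
    commands.append([
        "efibootmgr", "-c",    # create a new boot entry
        "-d", disk,            # disk holding the EFI system partition
        "-p", str(partition),  # partition number of the ESP on that disk
        "-L", label,           # label shown in the firmware boot menu
        "-l", loader,          # loader path on the ESP
    ])
    return commands


for argv in plan_nvram_update(["0003", "0004"], "/dev/sda", 1,
                              "Joe's Linux", "\\EFI\\joe\\shimx64.efi"):
    print(" ".join(argv))
```

Returning a plan rather than running commands keeps the ordering easy to inspect and test; a real implementation would also need to handle failures part-way through the deletions.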

And so we quickly reach a state where NVRAM records could become polluted with pointless entries which are not the default for the next boot operation. A further challenging aspect is that these were not really an issue back in the days of BIOS booting, but as UEFI has taken over as the dominant pattern, we've not really taken on this issue.

Why this is a bug:

1) You *can* run out of space in the UEFI NVRAM Table.
2) Auto-adding entries, for example if an operator chooses to run a grub bootloader, can create confusion as well.

Proposed path forward:

I propose we do two things:

1) Create a "UEFI NVRAM Cleaning routine" in IPA.
2) Add a low numbered default clean/deploy step. The reason for cleaning is just to ensure the outdated records get removed. Consensus from the PTG was likely "just remove anything that was on a hard disk" (the HD record indicator). The reason for also running it as an early deploy step is to remove records before adding new ones, so as not to upset system firmware, and also to cover the case where a deployer runs without cleaning enabled and somehow ends up booting to the wrong OS, or attempts to manually intervene at first boot of the new OS.

Challenges:

With aspects like persistent booting, we might need to have a way to navigate that. We can also just treat that as a bug if ever reported, because persistent booting is also handled differently in vendor BMCs and firmware.

Note: This *does not* address nodes which are already in a "bad state" with the UEFI NVRAM, where they cannot be booted into a ramdisk. For that, we likely need a vendor-passthru or management interface clean step that lets BMCs signal removal of records. The unfortunate reality is that the BMC may not honor BMC-driven updates of this field, or the values may be in abstracted formats which we don't understand, so the overall challenge there is even harder, and puts more emphasis on fixing it well before we end up in such a situation.

Steve Baker (steve-stevebaker) wrote :

I would like to look into doing this

Changed in ironic:
assignee: nobody → Steve Baker (steve-stevebaker)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ironic-python-agent (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/ironic-python-agent/+/899774

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/ironic-python-agent/+/899775

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/ironic-python-agent/+/900739

Dmitry Tantsur (divius)
Changed in ironic:
status: New → Triaged
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ironic-python-agent (master)

Reviewed: https://review.opendev.org/c/openstack/ironic-python-agent/+/899774
Committed: https://opendev.org/openstack/ironic-python-agent/commit/26be55f763d7c6739082025e47d6a8439c2e33ef
Submitter: "Zuul (22348)"
Branch: master

commit 26be55f763d7c6739082025e47d6a8439c2e33ef
Author: Steve Baker <email address hidden>
Date: Wed Nov 1 15:15:34 2023 +1300

    Test coverage for efi_utils.get_boot_record

    A step will be developed to delete all EFI entries of type HD. As part
    of this get_boot_record will need to parse more of the output of
    `efibootmgr -v`.

    This change asserts the existing behaviour of get_boot_record, and the
    test can evolve with the changes in get_boot_record.

    Related-Bug: #2041901
    Change-Id: I0c5ac4adc1044c528c27a4eaf580c619ceef47e0

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/ironic-python-agent/+/899775
Committed: https://opendev.org/openstack/ironic-python-agent/commit/352df0bc54c56f6a1ea2e3efc7c3db4b882d9f03
Submitter: "Zuul (22348)"
Branch: master

commit 352df0bc54c56f6a1ea2e3efc7c3db4b882d9f03
Author: Steve Baker <email address hidden>
Date: Mon Nov 6 15:29:13 2023 +1300

    Parse efibootmgr type and details

    This change improves the regex to match an exact entry name, and to also
    match the entry type from a set of recognised types.
    The boot entry details start from the recognised type onwards.

    This can be used by a step which deletes all entries of type 'HW' and
    UsbClass.

    Related-Bug: #2041901
    Change-Id: I5d879f724efc2919b541fd3fef0f931df67ff9c7
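The regex approach described in this commit can be sketched as follows: escape the label so it matches literally, then require whitespace followed by a recognised entry type, with the details captured from the type onwards. This is a hedged illustration, not IPA's `get_boot_record`; `boot_record_regex`, `find_boot_record` and the contents of `RECOGNISED_TYPES` are assumptions, as is the `efibootmgr -v` line format in the sample.

```python
import re
from typing import Optional, Tuple

# Assumed subset of device path node types that can start an entry's details;
# the set IPA actually recognises may differ.
RECOGNISED_TYPES = ("HD", "PciRoot", "VenHw", "FvVol", "UsbClass")


def boot_record_regex(label: str) -> "re.Pattern[str]":
    """Match one `efibootmgr -v` line whose label is exactly `label`."""
    types = "|".join(RECOGNISED_TYPES)
    return re.compile(
        r"^Boot(?P<number>[0-9A-Fa-f]{4})\*?\s+"
        + re.escape(label)  # literal label, no regex metacharacters
        # Requiring whitespace then a recognised type right after the label
        # stops "Red" from matching a "Red Hat Enterprise Linux" line.
        + r"\s+(?P<details>(?P<type>" + types + r")\(.*)$"
    )


def find_boot_record(output: str, label: str) -> Optional[Tuple[str, str, str]]:
    """Return (boot number, entry type, details) for the first exact match."""
    pattern = boot_record_regex(label)
    for line in output.splitlines():
        m = pattern.match(line)
        if m:
            return m.group("number"), m.group("type"), m.group("details")
    return None


SAMPLE = (
    "Boot0005* Red\tHD(1,GPT,1234,0x800,0x100000)/File(\\EFI\\redhat\\shimx64.efi)\n"
    "Boot0006* Red Hat Enterprise Linux\tHD(1,GPT,1234,0x800,0x100000)/File(\\EFI\\redhat\\shimx64.efi)\n"
)

print(find_boot_record(SAMPLE, "Red"))
print(find_boot_record(SAMPLE, "Red Hat Enterprise Linux"))
```

Anchoring on a known type after the label is what turns a prefix search into an exact-name match here; an unrecognised type would simply produce no match.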

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic-python-agent (master)
Changed in ironic:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ironic-python-agent (master)

Change abandoned by "Steve Baker <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/ironic-python-agent/+/900739
Reason: It looks like we're good with using regex to match entries to delete, so no need to fully parse the path

OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic-python-agent (master)

Reviewed: https://review.opendev.org/c/openstack/ironic-python-agent/+/914563
Committed: https://opendev.org/openstack/ironic-python-agent/commit/215fecd4470e868e1bac9737417e166a7e10fb64
Submitter: "Zuul (22348)"
Branch: master

commit 215fecd4470e868e1bac9737417e166a7e10fb64
Author: Steve Baker <email address hidden>
Date: Thu Apr 4 10:42:55 2024 +1300

    Step to clean UEFI NVRAM entries

    Adds a deploy step ``clean_uefi_nvram`` to remove unrequired extra UEFI
    NVRAM boot entries. By default any entry matching ``HD`` as the root
    device, or with a ``shim`` or ``grub`` efi file in the path will be
    deleted, ensuring that disk based boot entries are removed before the
    new entry is created for the written image. The ``match_patterns``
    parameter allows a list of regular expressions to be passed, where a
    case insensitive search in the device path will result in that entry
    being deleted.

    Closes-Bug: #2041901
    Change-Id: I3559dc800fcdfb0322286eba30ce47041419b0c6
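The selection logic the commit message describes (delete entries with `HD` as the root device or a `shim`/`grub` EFI file in the path, or entries matching caller-supplied `match_patterns` via a case-insensitive search of the device path) can be sketched like this. The default pattern strings, the `entries_to_delete` name and the assumed `efibootmgr -v` line format are illustrative assumptions, not the code that merged into IPA.

```python
import re
from typing import Iterable, List, Optional

# Illustrative defaults modelled on the commit message ("HD" as the root
# device, or a shim/grub EFI file in the path); IPA's patterns may differ.
DEFAULT_PATTERNS = (r"^HD\(", r"shim.*\.efi", r"grub.*\.efi")

# Assumed entry line shape: BootXXXX[*] <label><TAB><device path>
ENTRY_RE = re.compile(r"^Boot([0-9A-Fa-f]{4})\*?\s+[^\t]*\t(.*)$")


def entries_to_delete(output: str,
                      match_patterns: Optional[Iterable[str]] = None) -> List[str]:
    """Boot numbers whose device path matches any pattern (case-insensitive)."""
    if match_patterns is None:
        match_patterns = DEFAULT_PATTERNS
    patterns = [re.compile(p, re.IGNORECASE) for p in match_patterns]
    doomed = []
    for line in output.splitlines():
        m = ENTRY_RE.match(line)
        if not m:
            continue
        number, path = m.groups()
        # search(), not match(): a pattern may hit anywhere in the path.
        if any(p.search(path) for p in patterns):
            doomed.append(number)
    return doomed


SAMPLE = (
    "BootOrder: 0001,0002,0003,0004\n"
    "Boot0001* Network\tPciRoot(0x0)/Pci(0x3,0x0)/MAC(525400123456,1)\n"
    "Boot0002* Red\tHD(1,GPT,1234,0x800,0x100000)/File(\\EFI\\redhat\\shimx64.efi)\n"
    "Boot0003* ubuntu\tPciRoot(0x0)/Pci(0x1f,0x2)/Sata(0,65535,0)"
    "/HD(1,GPT,5678,0x800,0x100000)/File(\\EFI\\ubuntu\\grubx64.efi)\n"
    "Boot0004* UEFI Shell\tFvVol(7cb8bdc9-f8eb-4f34-aaea-3ee4af6516a1)"
    "/FvFile(7c04a583-9e3e-4f1c-ad65-e05268d0b4d1)\n"
)

print(entries_to_delete(SAMPLE))
print(entries_to_delete(SAMPLE, [r"mac\("]))
```

With the defaults, the disk-based `0002` and `0003` entries are selected while the network and firmware-volume entries are kept. Passing an explicit empty list selects nothing, which is why the sketch tests for `None` rather than falsiness.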

Changed in ironic:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ironic-python-agent 9.12.0

This issue was fixed in the openstack/ironic-python-agent 9.12.0 release.
