Ironic does not purge UEFI NVRAM entries unrelated to the OS being deployed
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ironic | Fix Released | Medium | Steve Baker |
Bug Description
When Ironic deploys a node, it performs two actions, at least *typically*:
1) Removes any prior exact label matches in the UEFI NVRAM. I.e. if you deploy an OS with a default UEFI NVRAM payload of "Joe's Linux", then any existing record for "Joe's Linux" will be removed.
2) Injects *whatever* the default record is in the disk image being deployed, *or* injects a record for any known default found in the list of known boot loaders carried in IPA.
This is further compounded because some bootloaders, whenever they run, auto-inject their default record if it is not already configured. Example: Shim injecting the record from the default loader CSV into UEFI NVRAM for all future boot operations.
This sometimes results in additional records with labels like "Red", when "Red Hat Enterprise Linux" is what is on the disk.
This is *further* complicated by the need to *always* delete first, then add records.
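To illustrate why exact label matching leaves stale records behind, here is a minimal sketch. It is not IPA's code: the `entries` list, the boot numbers, and the `redeploy` helper are all hypothetical and only model the "delete exact matches, then add" behaviour described above.

```python
# Hypothetical snapshot of NVRAM boot entries as (boot number, label) pairs.
entries = [
    ("0000", "Red Hat Enterprise Linux"),
    ("0001", "Red"),            # truncated label left by an earlier boot
    ("0002", "Joe's Linux"),    # record from a previously deployed OS
]


def redeploy(entries, label, next_free="0003"):
    """Drop exact label matches, then append a fresh record for label."""
    kept = [(num, lbl) for num, lbl in entries if lbl != label]
    return kept + [(next_free, label)]


after = redeploy(entries, "Red Hat Enterprise Linux")
for num, lbl in after:
    print(num, lbl)
```

Only the record whose label matched exactly is replaced. The "Red" and "Joe's Linux" records survive the deploy, which is the accumulation this bug is about.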
And so we quickly reach a state where NVRAM can become polluted with pointless entries which are not the default for the next boot operation. A further challenging aspect is that this was not really an issue back in the days of BIOS booting, but as UEFI has taken over as the dominant pattern, we've not really taken on this issue.
Why this is a bug:
1) You *can* run out of space in the UEFI NVRAM Table.
2) Auto-adding entries, for example if an operator chooses to run a grub bootloader, can create confusion as well.
Proposed path forward:
I propose we do two things:
1) Create a "UEFI NVRAM Cleaning routine" in IPA.
2) Add a low-numbered default clean/deploy step. The reason for the clean step is simply to ensure outdated records get removed. Consensus from the PTG was likely "just remove anything that was on a hard disk" (the HD record indicator). The reason for also running it as an early deploy step is to remove old records before adding new ones, so as not to upset system firmware, and to cover the case where a deployer runs without cleaning enabled and somehow ends up booting to the wrong OS, or attempts to manually intervene at first boot of the new OS.
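A rough sketch of what the "remove anything that was on a hard disk" selection could look like follows. This is not the implemented fix. It assumes `efibootmgr -v` style output in which the label and the device path are separated by a tab; `SAMPLE` is made-up output, and `hard_disk_entries` and `delete_commands` are hypothetical helper names. The commands are only built here, not executed.

```python
import re

# Matches entry lines such as "Boot0000* Label<TAB>HD(...)/File(...)".
ENTRY_RE = re.compile(r"^Boot([0-9A-Fa-f]{4})\*?\s+(.*)$")
# An HD(...) node at the start of the device path or after a "/" separator.
HD_NODE_RE = re.compile(r"(^|/)HD\(")

# Made-up example output, for illustration only.
SAMPLE = "\n".join([
    "BootCurrent: 0002",
    "BootOrder: 0002,0000,0001",
    "Boot0000* Red Hat Enterprise Linux\t"
    "HD(1,GPT,11111111-2222-3333-4444-555555555555,0x800,0x64000)"
    "/File(\\EFI\\redhat\\shimx64.efi)",
    "Boot0001* Red\t"
    "HD(1,GPT,11111111-2222-3333-4444-555555555555,0x800,0x64000)"
    "/File(\\EFI\\redhat\\shimx64.efi)",
    "Boot0002* UEFI: PXE IPv4\t"
    "PciRoot(0x0)/Pci(0x3,0x0)/MAC(525400123456,1)/IPv4(0.0.0.0)",
])


def hard_disk_entries(text):
    """Return (boot number, label) for entries whose path has an HD node."""
    found = []
    for line in text.splitlines():
        match = ENTRY_RE.match(line)
        if not match:
            continue  # skips BootCurrent, BootOrder and similar lines
        label, _, path = match.group(2).partition("\t")
        if HD_NODE_RE.search(path):
            found.append((match.group(1), label.strip()))
    return found


def delete_commands(text):
    """Build one 'efibootmgr -b NUM -B' command per hard disk entry."""
    return [["efibootmgr", "-b", num, "-B"]
            for num, _ in hard_disk_entries(text)]


for cmd in delete_commands(SAMPLE):
    print(" ".join(cmd))
```

In this sample the PXE entry is left alone because its device path has no `HD(` node, while both disk-backed records are selected for removal regardless of their labels.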
Challenges:
With aspects like persistent booting, we might need a way to navigate around the records it relies on. We can also just treat that as a bug if it is ever reported, because persistent booting is also handled differently across vendor BMCs and firmware.
Note: This *does not* address nodes which are already in a "bad state" with the UEFI NVRAM, where they cannot be booted into a ramdisk. For that, we likely need a vendor-passthru or management interface clean step so that the BMC can be signalled to remove records. The unfortunate reality is that the BMC may not honor BMC-driven updates of this field, or the values may be in abstracted formats which we don't understand. The overall challenge there is even harder, which puts more emphasis on fixing this well before we end up in such a situation.
Changed in ironic:
status: New → Triaged
importance: Undecided → Medium
I would like to look into doing this