Installed image can be missing necessary boot file

Bug #1853906 reported by Rod Smith
This bug affects 2 people
Affects  Status      Importance  Assigned to  Milestone
MAAS     Incomplete  Undecided   Unassigned
grub     New         Undecided   Unassigned
shim     New         Undecided   Unassigned

Bug Description

Sometimes, an installation via MAAS will fail, with the following displayed on the node's screen:

Booting local disk...
Failed to open \efi\boot\grubx64.efi - Not Found
Failed to load image \efi\boot\grubx64.efi: Not Found
start_image() returned Not Found

The boot process stops here. Bypassing network booting to boot from the hard disk succeeds. It turns out that the EFI System Partition's (ESP's) \efi\boot\grubx64.efi file (/boot/efi/EFI/BOOT/grubx64.efi) is indeed missing; SOMETHING is trying to load that file and failing. Copying the grubx64.efi and grub.cfg files from the ESP's \efi\ubuntu directory (/boot/efi/EFI/ubuntu in Ubuntu) to the ESP's \efi\boot enables the server to boot.
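
The manual workaround described above can be sketched as a small Python helper. This is only an illustration, assuming the ESP is mounted at /boot/efi as on Ubuntu; the function name `copy_grub_to_fallback` is hypothetical, and on a real ESP it would need to run as root. It copies only the two files named above and never overwrites an existing file.

```python
import shutil
from pathlib import Path

# The two files the report says were copied from \efi\ubuntu to \efi\boot.
FILES_TO_COPY = ("grubx64.efi", "grub.cfg")


def copy_grub_to_fallback(esp_root="/boot/efi"):
    """Copy GRUB and its config into the ESP's fallback directory.

    esp_root is assumed to be the ESP mount point (an assumption; adjust
    for your system). Returns the names of the files actually copied.
    """
    src_dir = Path(esp_root) / "EFI" / "ubuntu"
    dst_dir = Path(esp_root) / "EFI" / "BOOT"
    dst_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for name in FILES_TO_COPY:
        src = src_dir / name
        dst = dst_dir / name
        # Skip files that are absent in the source or already present in
        # the destination, so the helper never overwrites anything.
        if src.is_file() and not dst.exists():
            shutil.copyfile(src, dst)
            copied.append(name)
    return copied
```

Calling `copy_grub_to_fallback()` a second time returns an empty list, since both files are then already in place.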

The server in question is meitner, a Supermicro 5018R-WR. I'm trying to deploy Ubuntu 19.10 on it; I haven't yet checked to see if the same problem occurs when deploying other versions of Ubuntu. Other servers boot just fine in the absence of this file; I don't know why this is a problem for just this one server.

$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===============================-======================================-============-==================================================
ii maas 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all "Metal as a Service" is a physical cloud and IPAM
ii maas-cert-server 0.4.4-0ppa1~git3ac1382~ubuntu18.04.1 all Ubuntu certification support files for MAAS server
ii maas-cli 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all MAAS client and command-line interface
un maas-cluster-controller <none> <none> (no description available)
ii maas-common 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all MAAS server common files
ii maas-dhcp 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all MAAS DHCP server
un maas-dns <none> <none> (no description available)
ii maas-proxy 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all MAAS Caching Proxy
ii maas-rack-controller 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all Rack Controller for MAAS
ii maas-region-api 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all Region controller API service for MAAS
ii maas-region-controller 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all Region Controller for MAAS
un maas-region-controller-min <none> <none> (no description available)
un python-django-maas <none> <none> (no description available)
un python-maas-client <none> <none> (no description available)
un python-maas-provisioningserver <none> <none> (no description available)
ii python3-django-maas 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all MAAS server provisioning libraries (Python 3)

Revision history for this message
Rod Smith (rodsmith) wrote :
Revision history for this message
Rod Smith (rodsmith) wrote :

This problem occurs when attempting to deploy Ubuntu 18.04, too.

It appears that the ESP's \efi\boot\bootx64.efi is Shim. There's also a copy of fbx64.efi in this directory, but it looks like this Shim isn't configured to look for it, so in the absence of grubx64.efi in this directory, the boot process hangs when Shim is launched. At least, that's my hypothesis.

Revision history for this message
Lee Trager (ltrager) wrote :

Machines managed by MAAS are always configured to boot off of the network. When an image is deployed, MAAS sends GRUB, which was loaded over the network, a configuration file[1] that searches for the local boot loader to chain load. As per 13.3.1.3 of the UEFI spec[2], MAAS first tries the default location for a local boot loader, \EFI\BOOT\BOOTX64.EFI. If that is not found or fails to load, a number of known vendor directories are searched and attempted. If everything fails, GRUB exits so the firmware tries the next boot device.
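
The search order described in this comment can be modelled roughly as follows. This is a sketch in Python, not the actual GRUB configuration from [1]: the `VENDOR_LOADERS` list and the `pick_local_loader` name are hypothetical, and `can_load` is a stand-in for "the file exists and chain loading it succeeded".

```python
# First choice per the comment above: the default (fallback) loader path.
DEFAULT_LOADER = r"\EFI\BOOT\BOOTX64.EFI"

# Hypothetical vendor fallbacks for illustration only; the real list lives
# in the MAAS template referenced as [1] and is not reproduced here.
VENDOR_LOADERS = [
    r"\EFI\ubuntu\shimx64.efi",
    r"\EFI\ubuntu\grubx64.efi",
]


def pick_local_loader(can_load, vendor_loaders=VENDOR_LOADERS):
    """Return the first loader path for which can_load(path) is true.

    None means every candidate failed, at which point GRUB would exit and
    the firmware would move on to the next boot device.
    """
    for path in [DEFAULT_LOADER, *vendor_loaders]:
        if can_load(path):
            return path
    return None
```

For example, `pick_local_loader(lambda p: p.endswith("shimx64.efi"))` skips the default path and returns the Shim entry under \EFI\ubuntu. The sketch assumes a failed attempt returns control to the loop; it does not model a case where the boot process stops after BOOTX64.EFI fails.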

If I keep a close eye on the UEFI machine booting I can see that \EFI\BOOT\BOOTX64.EFI is being loaded but fails to chain load \EFI\BOOT\GRUBX64.EFI. On UEFI QEMU as well as our CI machines GRUB continues, finds /EFI/ubuntu/shimx64.efi and booting succeeds.

Do you have secure boot enabled? If so can you try deploying with it disabled? It may be causing the boot process to lock when BOOTX64.EFI fails to load.

I'm adding GRUB and shim, as \EFI\BOOT\BOOTX64.EFI should chain load to the locally installed GRUB. I'm also not sure why GRUB isn't trying the other alternatives it's given.

[1] https://git.launchpad.net/maas/tree/src/provisioningserver/templates/uefi/config.local.amd64.template
[2] https://uefi.org/sites/default/files/resources/UEFI_Spec_2_8_final.pdf

Revision history for this message
Rod Smith (rodsmith) wrote :

Secure Boot is NOT enabled on the affected machine.

So far, I've seen this on only this one server. Other machines, including others deployed from the same MAAS server, deploy and boot fine.

Revision history for this message
Rod Smith (rodsmith) wrote :

I've discovered that adjusting some firmware options caused the system to deploy normally. Specifically, I adjusted the settings of some PCI devices to favor EFI-mode vs. BIOS-mode device firmware. The warnings noted in my initial bug report still appear on the screen for a few seconds, but the computer now moves past them.

I still think that the fallback (EFI\BOOT\bootx64.efi) boot path should either work (which it does not without a matching EFI\BOOT\grubx64.efi binary) or it should be completely absent. Thus, I believe this is still a bug, albeit one that's being worked around by my modified firmware settings.

Revision history for this message
Rod Smith (rodsmith) wrote :

To elaborate on the preceding, the options in question are under Advanced -> PCIe/PCI/PnP Configuration in the firmware setup screen. I toggled several from "Legacy" to "EFI." I'm attaching a screen shot. The firmware is an AMI Aptio variety.

Revision history for this message
Peter Wianecki (peterw71) wrote :

I get exactly the same issue with an HPE DL380 Gen10, and I can't find the right UEFI settings to work around it.
After the 2nd reboot the ESP is not there.
If I choose to boot from disk, it boots fine.
Annoyingly, if I make any change in UEFI and leave it set to PXE boot, it boots fine.
Very frustrating. I think I tried 20 different UEFI configs, but I still can't work around this issue.
I can change the boot order to disk, but then I will lose MAAS control over the machine.

Revision history for this message
Peter Wianecki (peterw71) wrote :

I was not using the latest MAAS available.
After updating from 2.4.2 to 2.6.2, the issue is no longer present.

Revision history for this message
Rod Smith (rodsmith) wrote :

Note that the original bug report was under MAAS 2.6.1, which I believe was the latest non-development release at the time.

Revision history for this message
Björn Tillenius (bjornt) wrote :

Given the comments, it sounds like we need to know more details about what's going on on the node itself. Do we have a system on which we can reproduce this?

Changed in maas:
status: New → Incomplete
Revision history for this message
Zhanglei Mao (zhanglei-mao) wrote :

I got the same issue on a Lenovo SR665 server with MAAS 2.8.2 when deploying Ubuntu 20.04. UEFI Secure Boot was disabled on the server.

In the ipmiconsole output I did not see any attempt to boot/load BOOTX64.EFI. If I run 'cp ubuntu/* BOOT', the above error disappears.

Below are the directories and files on the EFI partition as deployed by MAAS.
/boot/efi/
/boot/efi/EFI
/boot/efi/EFI/ubuntu
/boot/efi/EFI/ubuntu/grubx64.efi
/boot/efi/EFI/ubuntu/shimx64.efi
/boot/efi/EFI/ubuntu/mmx64.efi
/boot/efi/EFI/ubuntu/BOOTX64.CSV
/boot/efi/EFI/ubuntu/grub.cfg
/boot/efi/EFI/BOOT
/boot/efi/EFI/BOOT/BOOTX64.EFI
/boot/efi/EFI/BOOT/fbx64.efi
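
The condition visible in this listing (a BOOTX64.EFI in the fallback directory with no grubx64.efi beside it) can be checked with a short script. This is a rough sketch; the function name `fallback_missing_grub` is hypothetical, and it assumes the ESP is mounted at /boot/efi.

```python
from pathlib import PurePosixPath


def fallback_missing_grub(paths, esp_root="/boot/efi"):
    """Return True if EFI/BOOT holds BOOTX64.EFI but no grubx64.efi.

    paths is an iterable of paths on the ESP. Names are compared
    case-insensitively because the ESP uses a FAT filesystem.
    """
    boot_dir = (PurePosixPath(esp_root) / "EFI" / "BOOT").as_posix().lower()
    names = {
        PurePosixPath(p).name.lower()
        for p in paths
        if PurePosixPath(p).parent.as_posix().lower() == boot_dir
    }
    return "bootx64.efi" in names and "grubx64.efi" not in names


# The listing from the comment above.
listing = [
    "/boot/efi/",
    "/boot/efi/EFI",
    "/boot/efi/EFI/ubuntu",
    "/boot/efi/EFI/ubuntu/grubx64.efi",
    "/boot/efi/EFI/ubuntu/shimx64.efi",
    "/boot/efi/EFI/ubuntu/mmx64.efi",
    "/boot/efi/EFI/ubuntu/BOOTX64.CSV",
    "/boot/efi/EFI/ubuntu/grub.cfg",
    "/boot/efi/EFI/BOOT",
    "/boot/efi/EFI/BOOT/BOOTX64.EFI",
    "/boot/efi/EFI/BOOT/fbx64.efi",
]
print(fallback_missing_grub(listing))  # prints True
```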

Revision history for this message
Alberto Donato (ack) wrote :

Could you please attach the rsyslog log from maas for that machine when this is happening?

Alberto Donato (ack)
Changed in maas:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
Rod Smith (rodsmith) wrote :

The original bug report includes the entire /var/log/maas directory tree, including the rsyslog file for the failing node, meitner.

That said, I've just tried reproducing the problem, and have failed. Meitner is now deploying fine, even when I reconfigure the node's firmware options, as described in comments #5 and #6. I suspect something may have changed in Shim; or perhaps it was something quirky in the machine's boot order that has since changed. (Unfortunately, I didn't record detailed boot order information for the node.)

Revision history for this message
Björn Tillenius (bjornt) wrote :

Zhanglei, can you reproduce this with MAAS 3.0?

Changed in maas:
status: Incomplete → New
status: New → Incomplete