Deployment fails if server's EFI variable storage is full

Bug #1724989 reported by Rod Smith
Affects  Status   Importance  Assigned to  Milestone
MAAS     Invalid  Undecided   Unassigned
curtin   Expired  Undecided   Unassigned

Bug Description

Deployment can fail if a server's EFI variable storage is full. Unfortunately, I lack most relevant logs, since the error went away while I was investigating it, and a subsequent deployment worked; however, I'm pretty confident of the cause: When calling efibootmgr to add a local-disk boot variable and/or set the boot variable order, efibootmgr returned an error condition, which caused the deployment to fail. I don't recall the exact message, but in a deployment, there was a message to the effect that a call to efibootmgr had failed, which appeared to trigger the deployment failure. In my experiments, I booted an Ubuntu Artful desktop image and tried running "efibootmgr -o {a sensible boot order}", which returned:

could not set BootOrder: No space left on device

This error refers to an out-of-space condition on the system's NVRAM, blocking a change in the BootOrder variable. On a subsequent boot, the system deployed correctly. Perhaps a normal garbage collection by the EFI fixed it, or perhaps a change I made to the firmware settings cleared the problem. In either event, I lost the exact MAAS installation logs.
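There is no portable way to query how much NVRAM is free, but the contents of efivarfs give a rough idea of how much variable storage is in use. The helper below is a diagnostic sketch only: the function name and default path are illustrative, and the true capacity and garbage-collection behaviour are firmware-specific.

```python
import os

def efivar_usage_bytes(path="/sys/firmware/efi/efivars"):
    """Sum the sizes of files in efivarfs as a rough proxy for how much
    EFI variable storage is in use. Returns None on non-EFI systems."""
    try:
        names = os.listdir(path)
    except OSError:
        return None  # efivarfs not mounted (e.g. BIOS boot, container)
    total = 0
    for name in names:
        try:
            total += os.path.getsize(os.path.join(path, name))
        except OSError:
            pass  # a variable can disappear between listdir and stat
    return total
```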

Failing the installation upon a failure of the "efibootmgr -o" command is an unnecessarily strict condition, IMHO, since if the system booted to the MAAS installer, we know that PXE-booting works. Adding a boot entry for the local disk and adjusting the boot order to boot from the network is done so that the system can continue to boot if the MAAS server goes down; but if these operations fail, it seems to me that it's better to reboot and (if the system comes up) call the installation a success -- but ideally to flag the system with a warning that the boot order may be set incorrectly or that the system might fail to boot if the MAAS server goes down, depending on which efibootmgr call failed.
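The warn-and-continue behaviour suggested above could look like the following. This is not what curtin does today, just a minimal sketch; the helper name `set_boot_order` is hypothetical, and the boot-order string is whatever the installer would normally pass to `efibootmgr -o`.

```python
import subprocess

def set_boot_order(order):
    """Run `efibootmgr -o <order>`, downgrading a failure (for example a
    full NVRAM) to a warning instead of aborting the installation.
    Returns True on success, False if the boot order could not be set."""
    try:
        subprocess.run(["efibootmgr", "-o", order],
                       check=True, capture_output=True, text=True)
    except (subprocess.CalledProcessError, OSError) as exc:
        print("WARNING: could not set BootOrder (%s); the machine may not "
              "boot automatically if the MAAS server is unavailable." % exc)
        return False
    return True
```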

I'm attaching the /var/log/maas directory tree from the server. The node that experienced the problem is oil-prunus. Here's the MAAS package version information:

$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===============================-====================================-============-==================================================
ii maas 2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all "Metal as a Service" is a physical cloud and IPAM
ii maas-cert-server 0.2.30-0~76~ubuntu16.04.1 all Ubuntu certification support files for MAAS server
ii maas-cli 2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all MAAS client and command-line interface
un maas-cluster-controller <none> <none> (no description available)
ii maas-common 2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all MAAS server common files
ii maas-dhcp 2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all MAAS DHCP server
ii maas-dns 2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all MAAS DNS server
ii maas-proxy 2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all MAAS Caching Proxy
ii maas-rack-controller 2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all Rack Controller for MAAS
ii maas-region-api 2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all Region controller API service for MAAS
ii maas-region-controller 2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all Region Controller for MAAS
un maas-region-controller-min <none> <none> (no description available)
un python-django-maas <none> <none> (no description available)
un python-maas-client <none> <none> (no description available)
un python-maas-provisioningserver <none> <none> (no description available)
ii python3-django-maas 2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all MAAS server provisioning libraries (Python 3)

Revision history for this message
Rod Smith (rodsmith) wrote :
Andres Rodriguez (andreserl) wrote :

Hey Rod, the installation log would be helpful, but without it there's not much we can do, I'm afraid!

Changed in maas:
status: New → Incomplete
milestone: none → 2.3.x
Rod Smith (rodsmith) wrote :

I understand. We're running regression tests, so there's a chance this bug will appear on another server, in which case I'll be sure to grab the installation log.

Ryan Harper (raharper) wrote :

Is there a command that we can use to trigger the "clean-up"? What recourse does curtin have if it attempts to handle the failure?

Changed in curtin:
status: New → Incomplete
Blake Rouse (blake-rouse) wrote :

I think catching the failure and continuing would be bad as well. There is an expectation that the system will boot when MAAS is down; that expectation would go away if the call failed and we silently skipped it.

It seems that if we can determine the actual cause, and it was not enough space in the NVRAM, we might need to remove a boot entry or something to make space for the new entry.
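Removing a stale entry to make space could be sketched as below. The helpers `list_boot_entries` and `delete_boot_entry` are hypothetical; the parsing assumes efibootmgr's usual `BootXXXX* label` output lines, and deletion (`efibootmgr -b NNNN -B`) is destructive, so any real implementation would need to be sure an entry is actually stale first.

```python
import re
import subprocess

# efibootmgr prints entries as lines like "Boot0000* ubuntu".
ENTRY_RE = re.compile(r"^Boot([0-9A-Fa-f]{4})\*?\s+(.*)$", re.MULTILINE)

def list_boot_entries():
    """Return (number, label) pairs parsed from `efibootmgr` output,
    or an empty list if the tool is unavailable or fails."""
    try:
        proc = subprocess.run(["efibootmgr"], capture_output=True, text=True)
    except OSError:
        return []  # efibootmgr not installed, or not an EFI system
    return ENTRY_RE.findall(proc.stdout)

def delete_boot_entry(number):
    """Delete BootNNNN via `efibootmgr -b NNNN -B`. Destructive: verify
    the entry is stale (e.g. with `efibootmgr -v`) before calling this."""
    subprocess.run(["efibootmgr", "-b", number, "-B"], check=True)
```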

Rod Smith (rodsmith) wrote :

AFAIK, there's no way for an OS to reliably trigger garbage collection in the firmware. Even if there were, that action would likely occur only after a reboot of the server. Thus, there's very little that MAAS can do to clear the error; the best would be to reboot the server and HOPE that it performs garbage collection.

I understand that continuing in the case of this error is POTENTIALLY bad in the future, but failing the deployment is DEFINITELY bad in the present. I guess it boils down to what type of failure is worse. For my own purposes, I'd rather have the node boot now, even if it might fail in the future should the MAAS server go down. If the node were mission-critical hardware for a business, though, I might prefer to debug the problem now rather than risk a failure later. (OTOH, having either a MAAS server or whatever the node would be as a single point of failure sounds like a bad design.) Hence my suggestion that MAAS allow the node to fully deploy but present a warning of some type -- but I don't know if MAAS is really set up to handle this type of warning. With a deployed node and a warning, the administrator could log into the node to investigate further.

Launchpad Janitor (janitor) wrote :

[Expired for curtin because there has been no activity for 60 days.]

Changed in curtin:
status: Incomplete → Expired
Andres Rodriguez (andreserl) wrote :

Hi!

**This is an automated message**

We believe this may no longer be an issue in the latest MAAS release. Due to the age of the original bug report, we are marking it as Invalid. If you believe this bug is still valid against the latest release of MAAS, or if you are still interested in it, please re-open this bug report.

Thanks

Changed in maas:
status: Incomplete → Invalid