Servers set to boot from disk after MAAS installation

Bug #1789650 reported by Rod Smith
This bug affects 4 people
Affects   Status          Importance   Assigned to   Milestone
MAAS      Fix Released    Undecided    Unassigned
curtin    Fix Committed   High         Ryan Harper

Bug Description

This bug appears to be a regression of bug #1311827 and/or bug #1642298, but it seems to affect only some servers. Upon deployment via MAAS, the server is set to boot from the "ubuntu" EFI boot order entry -- that is, it boots directly from disk, bypassing the PXE-boot option. Immediately after deployment, the boot order looks like this:

$ sudo efibootmgr
BootCurrent: 0000
BootOrder: 0000,0002,0001,0003,0004,0005,0006,0007,0008
Boot0000* ubuntu
Boot0001* Hard Disk 0
Boot0002* PXE Network
Boot0003 Enter Setup
Boot0004 Boot Devices
Boot0005 Boot Manager
Boot0006 Setup
Boot0007 Diagnostics
Boot0008 Firmware Log

This bug is affecting two Lenovo servers (drapion and jolteon) when deployed with either Ubuntu 18.04 or 16.04. These servers seemed OK in tests in late April. Other servers we've recently tested are NOT affected. Changes to the boot order after installation do "stick," and it's possible to entirely delete the "ubuntu" entry without it being re-created, so I doubt that the firmware is creating the entry or making it the default; it appears that the bug is somewhere in Ubuntu. I'm reporting this against curtin, but it could be that GRUB or something else is the true cause.

Ryan Harper (raharper) wrote : Re: [Bug 1789650] [NEW] Servers set to boot from disk after MAAS installation


Can you provide the curtin install.log and config from the
installation of the affected servers?

Rod Smith (rodsmith) wrote :

I assume you mean the /root/curtin-install.log file, which I'm attaching. If you mean another file, please clarify.

Ryan Harper (raharper) wrote : Re: [Bug 1789650] Re: Servers set to boot from disk after MAAS installation

On Wed, Aug 29, 2018 at 9:40 AM Rod Smith <email address hidden> wrote:
>
> I assume you mean the /root/curtin-install.log file, which I'm
> attaching. If you mean another file, please clarify.

That's the file; however debugging output is not enabled. Could you
enable that and get the install log?

maas <user> maas set-config name=curtin_verbose value=True

Also, if you could attach the efibootmgr -v output (with the loader paths)
from before the install, I'll take that, look at the code that reorders
boot entries, and see if we can sort out whether it's a curtin bug.

The reordering code hasn't changed since 2017-05-11:

https://git.launchpad.net/curtin/tree/curtin/commands/curthooks.py#n219

>
> ** Attachment added: "/root/curtin-install.log file from one affected server (jolteon)"
> https://bugs.launchpad.net/curtin/+bug/1789650/+attachment/5182207/+files/curtin-install.log

Rod Smith (rodsmith) wrote :

I'm attaching a tarball with both the curtin-install-cfg.yaml and curtin-install.log files after a fresh installation, with the debugging feature you wanted activated.

Here's the output of "sudo efibootmgr -v" before the installation:

$ sudo efibootmgr -v
BootCurrent: 0002
BootOrder: 0002,0001,0003,0004,0005,0006,0007,0008
Boot0001* Hard Disk 0 FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0002* PXE Network FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0003 Enter Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)
Boot0004 Boot Devices FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS....h
Boot0005 Boot Manager FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0006 Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0007 Diagnostics FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0008 Firmware Log FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....

Note that installing requires either changing the boot order or deleting the "ubuntu" entry. I opted for the latter in this case. I can re-order the boot entries and try again, if you prefer. (I've done both in my testing, and both have the same result.)

Here's the same command's output immediately after the installation:

$ sudo efibootmgr -v
BootCurrent: 0000
BootOrder: 0000,0002,0001,0003,0004,0005,0006,0007,0008
Boot0000* ubuntu HD(1,GPT,7fb492e7-7aac-4444-896f-13532e2de1f8,0x800,0x100000)/File(\EFI\ubuntu\shimx64.efi)
Boot0001* Hard Disk 0 FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0002* PXE Network FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0003 Enter Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)
Boot0004 Boot Devices FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS....h
Boot0005 Boot Manager FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0006 Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0007 Diagnostics FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0008 Firmware Log FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
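
For comparing the before and after listings, a minimal Python sketch of pulling BootCurrent, BootOrder, and the entry names out of efibootmgr output may help. `parse_efibootmgr` is a hypothetical helper written for this illustration (it is not curtin's own parser), and the sample text is the non-verbose post-install listing from the bug description. With `-v` output, the device path would end up as part of the entry name.

```python
import re


def parse_efibootmgr(text):
    """Parse efibootmgr output into the current entry, boot order, and names.

    Hypothetical helper for illustration only; not curtin's implementation.
    """
    result = {"current": None, "order": [], "entries": {}}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("BootCurrent:"):
            result["current"] = line.split(":", 1)[1].strip()
        elif line.startswith("BootOrder:"):
            order = line.split(":", 1)[1].strip()
            result["order"] = [num for num in order.split(",") if num]
        else:
            # Entries look like "Boot0002* PXE Network" (the "*" marks active).
            match = re.match(r"^Boot([0-9A-Fa-f]{4})(\*?)\s+(.*)$", line)
            if match:
                result["entries"][match.group(1)] = match.group(3)
    return result


AFTER_INSTALL = """\
BootCurrent: 0000
BootOrder: 0000,0002,0001,0003,0004,0005,0006,0007,0008
Boot0000* ubuntu
Boot0001* Hard Disk 0
Boot0002* PXE Network
"""

parsed = parse_efibootmgr(AFTER_INSTALL)
first = parsed["order"][0]
print(first, parsed["entries"][first])
```

On the affected server this reports the "ubuntu" entry as first in the boot order, which is the symptom described in this report.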

Rod Smith (rodsmith) wrote :

Ryan, have you looked at the debugging output? Do you have any clues about what's going on?

Ryan Harper (raharper) wrote :

On Fri, Sep 7, 2018 at 9:30 PM Rod Smith <email address hidden> wrote:
>
> Ryan, have you looked at the debugging output? Do you have any clues
> about what's going on?

Looking at the before and after output, I believe this is expected behavior.
It stems from a change made to handle the case where MAAS is offline:
previously, no node could boot in that situation, because every node defaulted
to PXE booting from a busted MAAS service.

Here's the relevant commit and details w.r.t the behavior change.

commit cb1ef09beddb6c4559c131d2606d9b6b70c4ca7f
Merge: bae772c 5cffafa
Author: Blake Rouse <email address hidden>
Date: Fri May 26 13:27:22 2017 -0500

    Clear and re-order UEFI boot methods during UEFI grub installation.

    Previously when installing Ubuntu using curtin it was default to pass
    '--no-nvram' to the grub-install. This branch reverts that has passing
    '--no-nvram' will prevent the system from booting if MAAS is down because
    Ubuntu not be a loader in the EFI system for the system to fallback on.
    When updating the nvram grub places Ubuntu before the currently booted
    method, which prevents the ability for the machine to boot from the
    network anymore. This branch reorders to boot order of the EFI system to
    place the currently booted method before all others, but Ubuntu is still
    second in the list so if MAAS is down the machine will still boot from
    its local disk correctly.

    Another issue is that older EFI loaders would be present in the EFI firmware
    even through curtin deletes and re-creates the entire EFI partition. This
    removes only those loaders before grub-install is performed to make sure
    that only the relevant loads for the current state of the system are present.

    Fixes: LP:#1680917, LP:#1686669

    LP: #1680917, #1686669
    bzr-revno: 503

If you want to always keep PXE as the boot option after install then I think
you'll need MAAS to send:

grub:
  update_nvram: false

The default value, if nothing is sent, is True, which is why curtin is
reordering the boot entries.

Setting this value to false will prevent curtin from making any boot
order changes at all.
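
To make the default concrete, here is a small Python sketch, assuming the install config arrives as a nested dict with an optional "grub" section. `should_update_nvram` is a hypothetical name used only for this illustration and is not taken from curtin's source.

```python
def should_update_nvram(config):
    """Return the effective update_nvram setting, defaulting to True.

    Hypothetical helper: assumes the option lives under a "grub" key, as in
    the YAML snippet above, and that a missing key means True.
    """
    grub_cfg = config.get("grub") or {}
    return bool(grub_cfg.get("update_nvram", True))


print(should_update_nvram({}))
print(should_update_nvram({"grub": {"update_nvram": False}}))
```

Under these assumptions, a config with no grub section still results in NVRAM updates and reordering; only an explicit false turns them off.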


Changed in curtin:
importance: Undecided → High
status: New → Incomplete

Rod Smith (rodsmith) wrote :

After MAAS deploys a node, the node *SHOULD* be set to PXE-boot. This is, in fact, what happens on most computers (or does recently; the behavior of booting from disk was a bug several months ago). In recent tests (for 16.04.5 regression testing), I've seen this misbehavior of having the disk-boot option first in the boot list only on two Lenovo computers, so it looks like there's some new interaction going on.

Ryan Harper (raharper) wrote :

Could you collect the same info (efibootmgr -v output before/after and the curtin-* from /root) from machines that work as you expect for further comparison?

Andres Rodriguez (andreserl) wrote :

Actually, this would seem like a regression in curtin. As per the comment #6 from Blake's branch:

" This branch reorders to boot order of the EFI system to place the currently booted method before all others, but Ubuntu is still second in the list so if MAAS is down the machine will still boot from its local disk correctly."

From the description & changes in [3] (which is a fix for [1]), what would have happened is:

a. The --no-nvram flag is no longer passed with the grub command. This causes the nvram to be updated (as expected).
b. After this, curtin (with update_nvram: True) would re-organize the boot order to ensure that PXE is first ("the current boot method"), and then set the disk (or the entry that grub created) as the second in the boot order as a fallback, in case the machine cannot boot off PXE.
c. The update_nvram defaults to True, which would do (a) and (b) above. You can see line 100 of the diff in [3], where update_nvram: True calls uefi_reorder_loaders which (as per the docstring) "Reorders the UEFI BootOrder to place BootCurrent first.". Since BootCurrent should've been PXE (because the machine PXE'd off the network), this would be set first.

NOTE that these changes were coupled with [2], which tells the grub package itself not to update nvram on subsequent grub updates (e.g. when doing a sudo apt-get dist-upgrade), but I think this is irrelevant for this issue.

So based on the above and [3], this would indeed seem like a regression in curtin if, when update_nvram is True, it is not trying to re-organize the boot order to set PXE first and the boot entry that grub created (since we run it without --no-nvram) second.

That said, I looked at [4] and it doesn't seem that this has actually changed in curtin (see lines between 387 and 411), unless curtin is now always defaulting to update_nvram: false instead of true. Just looking at [4], it seems to me that 'update_nvram: True' is still doing what is expected:

1. To run with --update-nvram (update_nvram: True)
2. To re-order the bootorder to ensure PXE is first (update_nvram: True)

Lastly, I wonder if this is actually an issue with the firmware? @Rod, when was this firmware last upgraded? What's the 'BIOS Boot Type' set to in the power params for this machine (auto, legacy, or efi)? I think what's requested in #8 would potentially highlight the differences.

[1]: https://bugs.launchpad.net/maas/+bug/1680917
[2]: https://bugs.launchpad.net/maas/+bug/1642298
[3]: https://code.launchpad.net/~blake-rouse/curtin/uefi-clear-reorder/+merge/323875
[4]: https://git.launchpad.net/curtin/tree/curtin/commands/curthooks.py?id=f5ea2d4d4d714d2cd93c4435fad298860df4d711

Andres Rodriguez (andreserl) wrote :

@Rod, so to clarify I would like to know:

1. How is the boot order configured in the BIOS of the machine? Is it configured to boot from disk first?
2. In the power params of MAAS, what's the BIOS Boot type option set to (auto, efi, legacy)?

Andres Rodriguez (andreserl) wrote :

And when was the firmware of this system last updated?

Ryan Harper (raharper) wrote :

No change to curtin here; it will use update-nvram, remove old loaders, and reorder.

Looking at the output of the failed system I can see:

after grub-install efiboot settings
+ efibootmgr
BootOrder: 0000,0002,0001,0003,0004,0005,0006,0007,0008
Boot0000* ubuntu
Boot0001* Hard Disk 0
Boot0002* PXE Network
Boot0003 Enter Setup
Boot0004 Boot Devices
Boot0005 Boot Manager
Boot0006 Setup
Boot0007 Diagnostics
Boot0008 Firmware Log

Here we've added the ubuntu entry at 0000, and we know we booted via 0002 (PXE)
since this is MAAS. What we would expect to see is curtin doing a reorder, and here
it is in the log:

Running command ['unshare', '--fork', '--pid', '--', 'chroot', '/tmp/tmpggk7qppj/target', 'efibootmgr', '-v'] with allowed return codes [0] (capture=True)

Now, the reorder code logs if it is *not* reordering, and we do not see that in the output, so it must have attempted to reorder (as expected).

In the reorder code block we have:

    efi_output = util.get_efibootmgr(target)
    currently_booted = efi_output.get('current', None)
    boot_order = efi_output.get('order', [])
    if currently_booted:
        if currently_booted in boot_order:
            boot_order.remove(currently_booted)
        boot_order = [currently_booted] + boot_order
        new_boot_order = ','.join(boot_order)
        LOG.debug(
            "Setting currently booted %s as the first "
            "UEFI loader.", currently_booted)
        LOG.debug(
            "New UEFI boot order: %s", new_boot_order)
        with util.ChrootableTarget(target) as in_chroot:
            in_chroot.subp(['efibootmgr', '-o', new_boot_order])

The only path out of here that doesn't log is the one where efi_output['current'] is not set.
Since we don't see any additional calls to efibootmgr with -o to set the order,
efibootmgr must have returned output that didn't have 'BootCurrent' set.

The efibootmgr -v output from Rod certainly shows BootCurrent, so it remains somewhat mysterious why we attempted to reorder but didn't find a BootCurrent value in the efibootmgr -v output right after a grub install.
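
The silent path can be reproduced in isolation. The following is a standalone Python sketch of the same reordering logic with the logging and the efibootmgr call stripped out; `compute_new_boot_order` is a hypothetical name for this illustration, and the input dicts mimic the 'current' and 'order' keys used in the excerpt above.

```python
def compute_new_boot_order(efi_output):
    """Return a BootOrder string with BootCurrent first.

    Returns None when BootCurrent is unknown, which corresponds to the
    branch that neither logs nor calls efibootmgr -o.
    """
    currently_booted = efi_output.get("current")
    boot_order = list(efi_output.get("order", []))
    if not currently_booted:
        return None
    if currently_booted in boot_order:
        boot_order.remove(currently_booted)
    return ",".join([currently_booted] + boot_order)


# BootOrder as reported on the affected server right after grub-install.
order = ["0000", "0002", "0001", "0003", "0004", "0005", "0006", "0007", "0008"]

with_current = compute_new_boot_order({"current": "0002", "order": order})
without_current = compute_new_boot_order({"order": order})
print(with_current)
print(without_current)
```

With BootCurrent present (0002, the PXE entry), PXE moves to the front and "ubuntu" becomes second. Without it, nothing is computed and the order that grub-install left behind stays in place, which matches what the log suggests.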

Rod Smith (rodsmith) wrote :

Ryan and Andres, here are some things you've requested....

The attachment is the /root/curtin-install* files from a server that works as expected. (That server is jehan, a Quanta D52B-1U, FWIW.) Here are the before and after "efibootmgr -v" outputs from jehan:

ubuntu@jehan:~$ sudo efibootmgr -v
BootCurrent: 0006
Timeout: 5 seconds
BootOrder: 0006,0008,0007,0005,0009,000A,000B,000C,0000,0003
Boot0000* ubuntu HD(1,GPT,489c1060-e2fd-4f57-b2e4-beb55dec5764,0x800,0x100000)/File(\EFI\UBUNTU\SHIMX64.EFI)
Boot0003 UEFI: Built-in EFI Shell VenMedia(5023b95c-db26-429b-a648-bd47664c8012)..BO
Boot0005* UEFI: Slot5 Port0 HTTP IPv4 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(a81e84f296c5,0)/IPv4(0.0.0.00.0.0.0,0,0)/Uri()..BO
Boot0006* UEFI: Slot5 Port0 PXE IPv4 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(a81e84f296c5,0)/IPv4(0.0.0.00.0.0.0,0,0)..BO
Boot0007* UEFI: Slot5 Port1 HTTP IPv4 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(a81e84f296c6,0)/IPv4(0.0.0.00.0.0.0,0,0)/Uri()..BO
Boot0008* UEFI: Slot5 Port1 PXE IPv4 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(a81e84f296c6,0)/IPv4(0.0.0.00.0.0.0,0,0)..BO
Boot0009* UEFI: Slot5 Port0 HTTP IPv6 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(a81e84f296c5,0)/IPv6([::]:<->[::]:,0,0)/Uri()..BO
Boot000A* UEFI: Slot5 Port0 PXE IPv6 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(a81e84f296c5,0)/IPv6([::]:<->[::]:,0,0)..BO
Boot000B* UEFI: Slot5 Port1 HTTP IPv6 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(a81e84f296c6,0)/IPv6([::]:<->[::]:,0,0)/Uri()..BO
Boot000C* UEFI: Slot5 Port1 PXE IPv6 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(a81e84f296c6,0)/IPv6([::]:<->[::]:,0,0)..BO
MirroredPercentageAbove4G: 0.00
MirrorMemoryBelow4GB: false

After re-deploying:

ubuntu@jehan:~$ sudo efibootmgr -v
BootCurrent: 0006
Timeout: 5 seconds
BootOrder: 0006,0008,0007,0005,0009,000A,000B,000C,0000,0003
Boot0000* ubuntu HD(1,GPT,6cdb926d-5f7b-4f83-a88d-d9a65ff43d3b,0x800,0x100000)/File(\EFI\UBUNTU\SHIMX64.EFI)
Boot0003 UEFI: Built-in EFI Shell VenMedia(5023b95c-db26-429b-a648-bd47664c8012)..BO
Boot0005* UEFI: Slot5 Port0 HTTP IPv4 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(a81e84f296c5,0)/IPv4(0.0.0.00.0.0.0,0,0)/Uri()..BO
Boot0006* UEFI: Slot5 Port0 PXE IPv4 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(a81e84f296c5,0)/IPv4(0.0.0.00.0.0.0,0,0)..BO
Boot0007* UEFI: Slot5 Port1 HTTP IPv4 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(a81e84f296c6,0)/IPv4(0.0.0.00.0.0.0,0,0)/Uri()..BO
Boot0008* UEFI: Slot5 Port1 PXE IPv4 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(a81e84f296c6,0)/IPv4(0.0.0.00.0.0.0,0,0)..BO
B...

Rod Smith (rodsmith) wrote :

Oh, I missed one question: The servers are all set to "EFI Boot" under "Power Configuration."

Ryan Harper (raharper) wrote :

Rod,

Thanks for the successful run. Looking at the log I can see where
curtin does a reorder of the boot entries and clearly shows a call to
efibootmgr with the -o option.

Running command ['unshare', '--fork', '--pid', '--', 'chroot',
'/tmp/tmpbfj5kvuo/target', 'efibootmgr', '-v'] with allowed return
codes [0] (capture=True)
Running command ['udevadm', 'settle'] with allowed return codes [0]
(capture=False)
Running command ['umount', '/tmp/tmpbfj5kvuo/target/sys'] with allowed
return codes [0] (capture=False)
Running command ['umount', '/tmp/tmpbfj5kvuo/target/proc'] with
allowed return codes [0] (capture=False)
Running command ['umount', '/tmp/tmpbfj5kvuo/target/dev'] with allowed
return codes [0] (capture=False)
Setting currently booted 0006 as the first UEFI loader.
New UEFI boot order: 0006,0000,0008,0007,0005,0009,000A,000B,000C,0003
Running command ['mount', '--bind', '/dev',
'/tmp/tmpbfj5kvuo/target/dev'] with allowed return codes [0]
(capture=False)
Running command ['mount', '--bind', '/proc',
'/tmp/tmpbfj5kvuo/target/proc'] with allowed return codes [0]
(capture=False)
Running command ['mount', '--bind', '/sys',
'/tmp/tmpbfj5kvuo/target/sys'] with allowed return codes [0]
(capture=False)
Running command ['unshare', '--fork', '--pid', '--', 'chroot',
'/tmp/tmpbfj5kvuo/target', 'efibootmgr', '-o',
'0006,0000,0008,0007,0005,0009,000A,000B,000C,0003'] with allowed
return codes [0] (capture=False)
BootCurrent: 0006
Timeout: 5 seconds
BootOrder: 0006,0000,0008,0007,0005,0009,000A,000B,000C,0003
Boot0000* ubuntu
Boot0003 UEFI: Built-in EFI Shell
Boot0005* UEFI: Slot5 Port0 HTTP IPv4 Intel(R) 82599 10 Gigabit Dual
Port Network Connection
Boot0006* UEFI: Slot5 Port0 PXE IPv4 Intel(R) 82599 10 Gigabit Dual
Port Network Connection
Boot0007* UEFI: Slot5 Port1 HTTP IPv4 Intel(R) 82599 10 Gigabit Dual
Port Network Connection
Boot0008* UEFI: Slot5 Port1 PXE IPv4 Intel(R) 82599 10 Gigabit Dual
Port Network Connection
Boot0009* UEFI: Slot5 Port0 HTTP IPv6 Intel(R) 82599 10 Gigabit Dual
Port Network Connection
Boot000A* UEFI: Slot5 Port0 PXE IPv6 Intel(R) 82599 10 Gigabit Dual
Port Network Connection
Boot000B* UEFI: Slot5 Port1 HTTP IPv6 Intel(R) 82599 10 Gigabit Dual
Port Network Connection
Boot000C* UEFI: Slot5 Port1 PXE IPv6 Intel(R) 82599 10 Gigabit Dual
Port Network Connection

So the remaining question is why, on the failing system, efibootmgr does
*not* show BootCurrent in its output immediately after we install grub.

That seems to be the core issue. Once you've booted into these
systems, efibootmgr -v does show BootCurrent; however, while we're
booted into the ephemeral environment and chrooted into the target
OS, the efibootmgr command doesn't seem to return BootCurrent.

Do you have any insight into how efibootmgr determines what the
BootCurrent value should be?


Ryan Harper (raharper) wrote :

Looks like this comes from reading:

/sys/firmware/efi/efivars/BootCurrent-<UUID>

I'm testing a late_command that dumps both the efibootmgr -v output and
a hexdump of the sysfs path to see if that shows the missing BootCurrent entry.
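
For reading such a hexdump, a short Python sketch of decoding the variable may be useful. It relies on the efivarfs file layout (a 4-byte attributes field followed by the variable data) and on BootCurrent being a 16-bit little-endian value. `decode_boot_current` is a hypothetical helper, and the sample bytes are made up for illustration rather than taken from the affected server.

```python
import struct


def decode_boot_current(raw):
    """Decode the raw bytes of an efivarfs BootCurrent-<UUID> file.

    Layout: 4 bytes of variable attributes, then a little-endian uint16
    holding the number of the boot entry that was used.
    """
    if len(raw) < 6:
        raise ValueError("expected at least 6 bytes, got %d" % len(raw))
    attributes, entry = struct.unpack_from("<IH", raw, 0)
    return attributes, "%04X" % entry


# Made-up sample: attributes 0x00000006, boot entry 0x0002.
sample = bytes([0x06, 0x00, 0x00, 0x00, 0x02, 0x00])
attrs, current = decode_boot_current(sample)
print(attrs, current)
```

On a live system the input would be the contents of the BootCurrent file opened in binary mode, provided efivarfs is mounted.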

Rod Smith (rodsmith) wrote :

I've never dug into the efibootmgr source code, so I don't know offhand where it's getting the BootCurrent variable, but /sys/firmware/efi/efivars/BootCurrent-8be4df61-93ca-11d2-aa0d-00e098032b8c is a plausible source. FWIW, on the deployed problem machine, /sys/firmware/efi/efivars/BootCurrent-8be4df61-93ca-11d2-aa0d-00e098032b8c does exist and contains plausible data. If you've got a custom deployment config you want me to run, I can do that; or I can give you access to our MAAS server and the trouble systems.

Ryan Harper (raharper) wrote :

On Thu, Sep 13, 2018 at 3:50 PM Rod Smith <email address hidden> wrote:
>
> I've never dug into the efibootmgr source code, so I don't know offhand
> where it's getting the BootCurrent variable, but
> /sys/firmware/efi/efivars/BootCurrent-8be4df61-93ca-11d2-aa0d-
> 00e098032b8c is a plausible source. FWIW, on the deployed problem
> machine, /sys/firmware/efi/efivars/BootCurrent-8be4df61-93ca-11d2-aa0d-
> 00e098032b8c does exist and contains plausible data. If you've got a
> custom deployment config you want me to run, I can do that; or I can
> give you access to our MAAS server and the trouble systems.

This config should dump efibootmgr -v output and what's in BootCurrent
right after we've completed the install but before we reboot.

_hexdump_bootcurrent:
 - &hexdump |
   ls -al /sys/firmware/efi
   bcurrent=$(ls /sys/firmware/efi/efivars/BootCurrent*)
   [ -e "${bcurrent}" ] && hexdump $bcurrent

late_commands:
  01_bootcurrent: ['curtin', 'in-target', '--', 'efibootmgr', '-v']
  02_hexdump: ['curtin', 'in-target', '--', 'sh', '-c', *hexdump]

This will show up in the node logs output.

Rod Smith (rodsmith) wrote :

I'm afraid the node is failing to deploy with those changes to /etc/maas/preseeds/curtin_userdata (I assume that's where you wanted them):

        Running command ['unshare', '--fork', '--pid', '--', 'chroot', '/tmp/tmpzfw22u7e/target', 'sh', '-c', 'ls -al /sys/firmware/efi\nbcurrent=$(ls /sys/firmware/efi/efivars/BootCurrent*)\n[ -e "${bcurrent}" ] && hexdump $bcurrent\n'] with allowed return codes [0] (capture=False)
        total 0
        drwxr-xr-x 5 root root 0 Sep 13 23:10 .
        drwxr-xr-x 6 root root 0 Sep 13 23:08 ..
        -r--r--r-- 1 root root 4096 Sep 13 23:10 config_table
        dr-xr-xr-x 2 root root 0 Sep 13 23:08 efivars
        -r--r--r-- 1 root root 4096 Sep 13 23:10 fw_platform_size
        -r--r--r-- 1 root root 4096 Sep 13 23:10 fw_vendor
        -r--r--r-- 1 root root 4096 Sep 13 23:10 runtime
        drwxr-xr-x 9 root root 0 Sep 13 23:10 runtime-map
        -r-------- 1 root root 4096 Sep 13 23:09 systab
        drwxr-xr-x 70 root root 0 Sep 13 23:10 vars
        ls: cannot access '/sys/firmware/efi/efivars/BootCurrent*': No such file or directory
        Running command ['udevadm', 'settle'] with allowed return codes [0] (capture=False)
        Running command ['umount', '/tmp/tmpzfw22u7e/target/sys'] with allowed return codes [0] (capture=False)
        Running command ['umount', '/tmp/tmpzfw22u7e/target/proc'] with allowed return codes [0] (capture=False)
        Running command ['umount', '/tmp/tmpzfw22u7e/target/dev'] with allowed return codes [0] (capture=False)
        finish: cmd-install/stage-late/02_hexdump/cmd-in-target: FAIL: curtin command in-target
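
One plausible contributor to the FAIL status, offered as a reading of the log rather than something confirmed in this thread: the script ends with a `[ -e ... ] && ...` line, so when the file is missing the test is false and the shell exits non-zero, while the command is run with allowed return codes [0]. The Python sketch below reproduces only that shell behavior with a deliberately nonexistent path; the `if` variant is one possible way to keep the exit status at 0.

```python
import subprocess

# Same shape as the late command: the last line is "[ -e ... ] && ...".
# The path is deliberately nonexistent so the glob never matches.
FAILING = (
    'bcurrent=$(ls /nonexistent/efivars/BootCurrent* 2>/dev/null)\n'
    '[ -e "${bcurrent}" ] && echo "found ${bcurrent}"\n'
)

# Variant using "if": its exit status is 0 when the test is false.
TOLERANT = (
    'bcurrent=$(ls /nonexistent/efivars/BootCurrent* 2>/dev/null)\n'
    'if [ -e "${bcurrent}" ]; then echo "found ${bcurrent}"; fi\n'
)

failing_rc = subprocess.run(["sh", "-c", FAILING]).returncode
tolerant_rc = subprocess.run(["sh", "-c", TOLERANT]).returncode
print(failing_rc, tolerant_rc)
```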

Ryan Harper (raharper) wrote :


Yuck, I was seeing the same thing in my VM, but I was sure it was an
issue with the VM.

I cannot fathom why that sys path is not accessible. Let me look more
into my VM and see what's going on.

Ryan Harper (raharper) wrote :


Well, it turns out that /sys/firmware/efi/efivars is a *mount* point
which should be automatically mounted on UEFI systems. Something is
fishy here.

If it's not mounted, one can run:

mount -t efivarfs efivarfs /sys/firmware/efi/efivars

Hrm, it looks like there are two vars paths:

/sys/firmware/efi/vars (part of the kernel, not a separate mount)
and
/sys/firmware/efi/efivars (special mount)

It seems that efibootmgr could show different values depending on which
path it is taking.
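
A minimal sketch of checking whether efivarfs is actually mounted, by scanning the text of /proc/mounts for an efivarfs entry at the expected path. The function name efivarfs_mounted is made up for illustration and is not part of curtin or efibootmgr.

```python
def efivarfs_mounted(mounts_text, path="/sys/firmware/efi/efivars"):
    """Return True if an efivarfs filesystem is mounted at `path`.

    `mounts_text` is the content of /proc/mounts, passed in as a string
    so the check can be exercised without touching the real system.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[1] == path and fields[2] == "efivarfs":
            return True
    return False
```

On a live system the input would be obtained with open("/proc/mounts").read(); a False result corresponds to the case where the mount command above is needed.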

Revision history for this message
Ryan Harper (raharper) wrote :

Here's an updated late_command to deploy with.

_hexdump_bootcurrent:
 - &hexdump |
   grep efi /proc/mounts
   mountpoint /sys/firmware/efi/efivars
   echo "checking /sys/firmware/efi/vars/"
   ls -al /sys/firmware/efi/vars/
   bcurrent=$(ls /sys/firmware/efi/efivars/BootCurrent*/data)
   [ -e "${bcurrent}" ] && hexdump $bcurrent
   echo "efibootmgr output before mounting efivars (uses vars)"
   efibootmgr -v
   echo "mounting efivars"
   mount -o defaults -t efivarfs efivarfs /sys/firmware/efi/efivars
   ls -al /sys/firmware/efi/efivars/
   echo "efibootmgr output after mounting efivars"
   efibootmgr -v
   bcurrent=$(ls /sys/firmware/efi/efivars/BootCurrent*)
   [ -e "${bcurrent}" ] && hexdump $bcurrent
   umount /sys/firmware/efi/efivars

late_commands:
  01_efivars: ['grep', 'efi', '/proc/mounts']
  02_efimnt: ['mountpoint', '/sys/firmware/efi/efivars']
  03_hexdump: ['curtin', 'in-target', '--', 'sh', '-c', *hexdump]

This runs fine on my VM now so it will be interesting to see what the
BootCurrent values show here.
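
For reading the hexdump, here is a small sketch of decoding a BootCurrent file as exposed by efivarfs. It assumes the efivarfs file layout (a 4-byte little-endian attributes field followed by the variable data) and that BootCurrent is a 16-bit little-endian value; the helper name decode_boot_current is hypothetical.

```python
import struct


def decode_boot_current(raw):
    """Decode the bytes of an efivarfs BootCurrent-<GUID> file.

    Layout assumed: 4 bytes of attributes (little-endian uint32),
    then the 16-bit little-endian boot number.
    Returns the boot number as four hex digits, as efibootmgr prints it.
    """
    if len(raw) < 6:
        raise ValueError("expected at least 6 bytes, got %d" % len(raw))
    _attrs, boot_num = struct.unpack_from("<IH", raw)
    return "%04X" % boot_num
```

The returned string can be compared directly against the Boot#### numbers in the BootOrder list.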

One possible change to curtin here is that we may need to start bind
mounting /sys/firmware/efi/efivars when we run commands in-target.
The debug output from this should help us understand what's going on.

I did observe that without efivars mounted, the new ubuntu entry added
by the grub install was only visible via /sys/firmware/efi/vars, and
that if I mounted efivars and then ran efibootmgr, it wouldn't
*show* the ubuntu entry; so it seems possible for these different
paths to be out of sync, which may explain the error.


Revision history for this message
Rod Smith (rodsmith) wrote :

Here are the install files using your modified MAAS preseed.

Revision history for this message
Ryan Harper (raharper) wrote :

Rod,

Thanks for running that.

I don't know why, but there is no BootCurrent entry available during
the install. This is going to prevent curtin from ensuring that what we
booted from is the first entry.

I suspect that there's something firmware-related behind BootCurrent
not being around if you've PXE booted; you've shown that if you boot
from a local disk, efibootmgr shows a BootCurrent entry, but
during the PXE/ephemeral boot it's not present in the EFI variables
available.

I don't think there is anything else that curtin can do here.

On Fri, Sep 14, 2018 at 12:16 PM Rod Smith <email address hidden> wrote:
>
> Here are the install files using your modified MAAS preseed.
>
> ** Attachment added: "curtin-install.tgz"
> https://bugs.launchpad.net/curtin/+bug/1789650/+attachment/5188885/+files/curtin-install.tgz

Revision history for this message
Ryan Harper (raharper) wrote :

Related:
   https://bugzilla.redhat.com/show_bug.cgi?id=1031876


Revision history for this message
Rod Smith (rodsmith) wrote :

It can't be a simple matter of BootCurrent not existing when PXE-booted, since after adjusting BootOrder manually and rebooting, it is present:

ubuntu@oil-jolteon:~$ sudo efibootmgr -v
BootCurrent: 0002
BootOrder: 0002,0000,0001,0003,0004,0005,0006,0007,0008
Boot0000* ubuntu HD(1,GPT,2f2ac784-ce90-471b-b036-e2776ee5bdd3,0x800,0x100000)/File(\EFI\ubuntu\shimx64.efi)
Boot0001* Hard Disk 0 FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0002* PXE Network FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0003 Enter Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)
Boot0004 Boot Devices FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS....h
Boot0005 Boot Manager FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0006 Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0007 Diagnostics FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0008 Firmware Log FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....

Might GRUB on the installation image be passing a parameter that's different from what the installed image passes? Or maybe there's a subtle timing issue that's triggering a race condition in the firmware? This is just wild speculation on my part, of course.

Revision history for this message
Ryan Harper (raharper) wrote :

If you can ssh in during deployment, you can see whether BootCurrent is available.

I don't know what happens in the firmware when we write UEFI settings during
grub install either. Can we try with different kernels? Like xenial GA?


Revision history for this message
Rod Smith (rodsmith) wrote :

During deployment:

$ sudo efibootmgr -v
sudo: efibootmgr: command not found
$ ls /sys/class/firmware/
timeout

After installing efibootmgr:

$ sudo efibootmgr -v
BootCurrent: 0002
BootOrder: 0002,0000,0001,0003,0004,0005,0006,0007,0008
Boot0001* Hard Disk 0 FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0002* PXE Network FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0003 Enter Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)
Boot0004 Boot Devices FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS....h
Boot0005 Boot Manager FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0006 Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0007 Diagnostics FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0008 Firmware Log FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....

So it seems that efibootmgr is able to extract BootCurrent, but I don't know what's going on with the /sys/class/firmware directory.

I'll try some deployments with other kernels next....

Revision history for this message
Rod Smith (rodsmith) wrote :

That was weird. After the previous attempt, the system looked OK when it was fully deployed -- BootOrder was set correctly. I therefore tried replicating the login while deploying, and this time the /sys/class/firmware directory looked more normal, but both it and efibootmgr (once installed) showed no BootCurrent variable:

$ sudo efibootmgr -v
BootOrder: 0000,0002,0001,0003,0004,0005,0006,0007,0008
Boot0000* ubuntu HD(1,GPT,3867d6c3-0241-43b0-a31e-a519c8271305,0x800,0x100000)/File(\EFI\ubuntu\shimx64.efi)
Boot0001* Hard Disk 0 FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0002* PXE Network FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0003 Enter Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)
Boot0004 Boot Devices FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS....h
Boot0005 Boot Manager FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0006 Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0007 Diagnostics FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0008 Firmware Log FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
ubuntu@oil-jolteon:~$ ls /sys/firmware/efi/efivars/BootC*
ls: cannot access '/sys/firmware/efi/efivars/BootC*': Invalid argument

So there's some inconsistency. Maybe one in X boots works OK...?
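
To compare runs like the ones above mechanically, a rough sketch of parsing efibootmgr output follows. The function name parse_efibootmgr and the returned dict layout are made up for illustration. In -v output the "name" field also carries the device path, since the whole remainder of the line is kept. A missing BootCurrent line is reported as None.

```python
import re

# Matches lines such as "Boot0002* PXE Network" or "Boot0003 Enter Setup".
ENTRY_RE = re.compile(r"^Boot([0-9A-Fa-f]{4})(\*?)\s+(.*)$")


def parse_efibootmgr(output):
    """Parse `efibootmgr` text output into a dict.

    Keys: 'current' (str or None), 'order' (list of boot numbers),
    'entries' (boot number -> {'active': bool, 'name': str}).
    """
    state = {"current": None, "order": [], "entries": {}}
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("BootCurrent:"):
            state["current"] = line.split(":", 1)[1].strip()
        elif line.startswith("BootOrder:"):
            order = line.split(":", 1)[1].strip()
            state["order"] = [num for num in order.split(",") if num]
        else:
            match = ENTRY_RE.match(line)
            if match:
                num, star, name = match.groups()
                state["entries"][num] = {"active": star == "*", "name": name}
    return state
```

Checking state["current"] is None would flag the cold-boot style output shown above, where BootOrder is present but BootCurrent is not.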

Revision history for this message
Rod Smith (rodsmith) wrote :

My tests of different kernels and Ubuntu versions yielded no clues; everything back to 14.04 with its GA kernel appears to be affected; however...

I think I've figured out why the BootCurrent variable is not appearing during deployments:

If these Lenovos boot cold, BootCurrent is missing; if they warm boot, BootCurrent is present. This is reproducible post-installation by shutting down and powering up the server vs. rebooting the server. BootCurrent can then be checked using "sudo efibootmgr" or checking /sys/firmware/efi/efivars. A MAAS deployment, of course, involves a cold boot, so BootCurrent is missing during deployment; but after deployment, the system does a warm reboot into the deployed OS, so BootCurrent appears again.

Once the system is deployed, booting cold and then resetting the server via the BMC a few seconds after starting the server can cause the BootCurrent variable to appear. (I used the web UI and "Power Actions->Restart the Server Immediately" from the main screen.) This causes a reset in POST, which seems to be enough to get the BootCurrent variable to appear. This process, however, is NOT enough to cause a successful deployment with the correct boot order set. Entering the firmware setup utility via F1 after beginning a deployment and manually resetting the server via its setup utility, however, is effective; when this is done, the server deploys and sets the boot order correctly.

One more comment: It's possible, but not certain, that the Kontron MSP804x - MSP804x server is affected by this same bug, since it's failed the certification test that looks for a network boot. The full certification run on this server can be found here:

https://certification.canonical.com/hardware/201809-26486/submission/133138/

The failed test is here:

https://certification.canonical.com/hardware/201809-26486/submission/133138/test/67646/result/9957736/

Overall, this looks like either a firmware bug or a kernel bug (maybe a race condition in building the EFI variables list...?). The fact that a reset via the BMC causes the BootCurrent variable to appear post-deployment but (presumably) not during deployment is peculiar, though.

Is there a way to force the Linux kernel to rebuild its EFI variables list? If so, that might be worth trying as a workaround. If it's a kernel bug, then obviously fixing it is the best solution. If it's a firmware bug, then getting it fixed would also be the best solution, but that's likely to take a while, and the fix might never make it to some affected servers.

Rod Smith (rodsmith)
Changed in curtin:
status: Incomplete → Confirmed
Revision history for this message
Jeff Lane  (bladernr) wrote :

This is still an ongoing problem. Currently, it is preventing me from running regression testing on some systems because, it appears, I must first log into each and delete all but the PXE option from the EFI boot menu before I can get them to re-deploy from MAAS.

So for example, on one machine, deployments repeatedly failed and I discovered the machine was actually booting into the existing installation rather than PXE booting (just as Rod described 2 years ago now).

So I logged into that system and deleted all EFI boot options except for the network card, using efibootmgr.

Then after re-deploying, you can see that it once again has all the previous EFI boot options AND is configured to boot from "ubuntu" not from PXE.

ubuntu@persianlime:~$ efibootmgr
BootCurrent: 0002
Timeout: 0 seconds
BootOrder: 0002,0003,0000,0001
Boot0000* TEAC DVD-ROM DV28SV
Boot0001* EFI Fixed Disk Boot Device 1
Boot0002* ubuntu
Boot0003* Broadcom NetXtreme II Gigabit Ethernet (BCM5716C)
ubuntu@persianlime:~$ [ 494.875862] print_req_error: I/O error, dev loop0, sector 0

Entry 0002 (ubuntu) did not exist prior to deployment.

Revision history for this message
Jeff Lane  (bladernr) wrote :

Hrmmm... that said, after messing about in the EFI boot settings a bit, I may have fixed it by forcing it to boot via PXE as it should have been all along... Need to try this on some other machines to verify it's more likely a misconfiguration of my hardware.

ubuntu@persianlime:~$ efibootmgr
BootCurrent: 0003
Timeout: 0 seconds
BootOrder: 0003,0002,0000,0001
Boot0000 TEAC DVD-ROM DV28SV
Boot0001* EFI Fixed Disk Boot Device 1
Boot0002* ubuntu
Boot0003* Broadcom NetXtreme II Gigabit Ethernet (BCM5716C)

Revision history for this message
Ryan Harper (raharper) wrote :

@Jeff

The output you provided looks exactly like what we've seen before, where BootCurrent is not always set. Have you looked into what Rod suggested (kernel or firmware bug)?

" Overall, this looks like either a firmware bug or a kernel bug (maybe a race condition in building the EFI variables list...?). The fact that a reset via the BMC causes the BootCurrent variable to appear post-deployment but (presumably) not during deployment is peculiar, though."

It's not clear to me what curtin should do if BootCurrent is not set.

You could try a maas deployment with:

grub:
  reorder_uefi: false

This has curtin skip attempting to put BootCurrent to the front of the BootOrder list.

Changed in curtin:
status: Confirmed → Incomplete
Revision history for this message
Jeff Lane  (bladernr) wrote :

Well, no joy. I went into the EFI boot settings in firmware and re-ordered all the EFI boot options so that the Network device was first.

That allowed me to re-deploy; however, once that deployment was finished, the next one failed because that deployment put the ubuntu item, once again, at the beginning of the EFI boot table.

/-----------------------------------------------------------------------------\
| * Integrated SAS: ubuntu |
| * Embedded NIC 1: Broadcom NetXtreme II Gigabit Ethernet (BCM5716C) |
| * Embedded SATA Port E Optical: TEAC DVD-ROM DV-28SW |
| * Integrated SAS: EFI Fixed Disk Boot Device 1 |
| +/- to move up/down | <ENTER> to accept |
\-----------------------------------------------------------------------------/

As an aside, I had also de-selected all boot devices except for Embedded NIC 1. And after the deployment that did work, I checked again and they're all back on again:
* Integrated SAS: ubuntu [X]
* Embedded NIC 1: Broadcom NetXtreme II Gigabit [X]
Ethernet (BCM5716C)
* Embedded SATA Port E Optical: TEAC DVD-ROM DV-28SW [ ]
* Integrated SAS: EFI Fixed Disk Boot Device 1 [X]

So there's likely a firmware component too but needless to say, it's caused several re-deployments to fail. The only option I have now is to turn EFI off on these machines.

Revision history for this message
Rod Smith (rodsmith) wrote :

@rharper,

Where is the GRUB option you mentioned[1] supposed to be entered? I'm assuming it's somewhere on the MAAS server, since once the node is deployed, the disk is already set as the default boot device.

[1]
grub:
  reorder_uefi: false

Revision history for this message
Ryan Harper (raharper) wrote :

@Rod

I think you'll need to modify the "curtin preseed"; I'm not familiar with how to do this on MAAS.

https://maas.io/docs/custom-node-setup-preseed

Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

Adding "MAAS" as an impacted project, as it appears there are questions about the curtin preseed that MAAS uses.

Could someone from the MAAS team help to answer Ryan's question?

Revision history for this message
Alberto Donato (ack) wrote :

Ryan's link is the correct one.

You should be able to either add to the default /etc/maas/preseeds/curtin or add a custom one, following the naming convention described in that doc.

Revision history for this message
Rod Smith (rodsmith) wrote :

I've tried adding Ryan Harper's suggested workaround[1] to the MAAS curtin preseed file (/etc/maas/preseeds/curtin), to no effect; the node (I was using drapion for testing) still deploys with the system set to boot from hard disk (via the "ubuntu" entry in the EFI boot order). I also tried putting those lines in /etc/maas/preseeds/curtin_userdata, which we modify in our certification MAAS server for other purposes. (We do not create a "grub" entry by default, so there's no conflict over this detail.)

Please review my comments in posts #29 and #30, above. The affected servers don't produce a BootCurrent variable under some circumstances, which seems to be throwing off the algorithms that curtin uses. This is almost certainly a bug in some firmware implementations, but getting multiple vendors to fix their firmware is a nightmare, so if a workaround in curtin or MAAS is possible, such a workaround should be pursued.

grub:
  reorder_uefi: false

Revision history for this message
Ryan Harper (raharper) wrote :

@Rod

do you have the curtin install logs from your reorder_uefi: false run?

> Please review my comments in posts #29 and #30, above. The affected servers don't produce a BootCurrent variable under some circumstances, which seems to be throwing off the algorithms that curtin uses.

The curtin algorithm is:

if grubcfg.get('reorder_uefi', True):
   if efibootmgr has BootCurrent set:
       if BootCurrent in BootOrder:
            remove BootCurrent from BootOrder
       BootOrder = [BootCurrent] + BootOrder
       efibootmgr -o BootOrder

1) If BootCurrent is not present, we don't do any reordering in curtin
2) If you disable reordering (like you did), we don't do any reordering
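
The pseudocode and the two cases above can be sketched as a small pure function. This is an illustration only, not curtin's actual code; the name plan_boot_order is hypothetical, and a return value of None stands for "leave BootOrder untouched".

```python
def plan_boot_order(boot_current, boot_order, reorder_uefi=True):
    """Return the list that would be passed to `efibootmgr -o`, or None.

    boot_current: boot number string such as "0002", or None if the
                  firmware did not expose BootCurrent.
    boot_order:   list of boot number strings in current BootOrder.
    """
    # Case 2: reordering disabled in the grub config.
    if not reorder_uefi:
        return None
    # Case 1: no BootCurrent, so there is nothing to move to the front.
    if not boot_current:
        return None
    # Drop BootCurrent from wherever it is, then put it first.
    rest = [num for num in boot_order if num != boot_current]
    return [boot_current] + rest
```

With BootCurrent missing the function returns None, which is why a BootOrder already changed by grub-install is left as it is.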

I think I suggested disabling reordering as I was confused about what the
expected result was; let's refresh my memory and clarify what is expected.

There was a time when MAAS *always* PXE booted (even after an install), and the PXE config sent to the node would instead use a BootNext directive to hand off from PXE to the on-disk loader/grub.

Is that still the case?

It was with this MAAS mode in mind that uefi_reorder_loaders() was written (i.e. no matter how you booted, curtin would place the currently booted item at the head of the list; PXE-booted systems stay PXE booted, non-PXE-booted systems stay non-PXE-booted).

It may be the case that via BMC or other firmware modes a PXE boot does not affect the BootOrder and *also* does not display BootCurrent; in which case, after an install curtin will not have modified *any* BootOrder values, and the node will of course boot in the order it previously did.

In the case that the firmware is reordering things under us, it's not clear how curtin can know which item (other than the currently booted one) to place at the front. If BootCurrent is not available, how can curtin know what item to place at the beginning of the BootOrder list?

Revision history for this message
Rod Smith (rodsmith) wrote :

@Ryan,

Is it the /var/log/maas/rsyslog/{nodename}/{date}/messages file you want? If so, I've appended it. Note that it includes several deployments from today. The final one is definitely with the GRUB config lines in /etc/maas/preseeds/curtin. If it's another file, can you please tell me where to find it?

I don't know offhand precisely how MAAS's GRUB handles the hand-off to the local GRUB, I'm afraid. I know it's been changed over the years, though.

From what state does the algorithm you posted begin? As I understand it (I may be wrong), installing the GRUB package adds the "ubuntu" boot item to the boot list, making it the first in the BootOrder. Thus, if the algorithm you specified runs after that, then the effect of a missing BootCurrent variable would be that the GRUB-modified BootOrder that boots from disk would not be changed. Several possible fixes/workarounds occur to me, one of which would require changes to GRUB, or at least its packaging:

- Change the algorithm to ensure that the BootOrder item corresponding
  to "ubuntu" is NOT first -- just blindly demote it by one if it's
  first in the list, on the assumption that the system booted via
  the first BootOrder item. This could be done conditional on
  BootCurrent not existing.
- Try to extract a PXE-boot option from the Boot#### list and push
  it to the top. It's likely to be tricky to identify this item,
  since it's not named in a fully standardized way, in my experience;
  and some systems have multiple PXE-boot entries, so booting from
  the wrong one would be inappropriate.
- Combining the two above, the "ubuntu" entry could be pushed below
  all identifiable PXE-boot entries.
- Change GRUB packaging so that it can be told to add the GRUB entry
  to the second position rather than the first one. (OTOH, I don't
  think that efibootmgr permits this, so this may not be workable.)

Overall, the first of those seems to be the most sensible one, but I admit even it has problems, since the system might have failed over and booted from something other than the first of the BootOrder items.
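
The first and third options in the list above can be sketched roughly as follows. Both function names are hypothetical, and the set of PXE boot numbers is assumed to be supplied by the caller, since (as noted) identifying PXE entries reliably is the hard part.

```python
def demote_if_first(boot_order, entry):
    """Option 1: if `entry` is first in BootOrder, move it down one slot."""
    order = list(boot_order)
    if len(order) > 1 and order[0] == entry:
        order[0], order[1] = order[1], order[0]
    return order


def push_below_pxe(boot_order, entry, pxe_entries):
    """Option 3: place `entry` right after the last known PXE entry.

    The order is returned unchanged when `entry` is absent, when no PXE
    entry is in the list, or when `entry` already sits below all of them.
    """
    order = list(boot_order)
    if entry not in order:
        return order
    pxe_positions = [i for i, num in enumerate(order) if num in pxe_entries]
    if not pxe_positions or order.index(entry) > pxe_positions[-1]:
        return order
    last_pxe = order[pxe_positions[-1]]
    order.remove(entry)
    order.insert(order.index(last_pxe) + 1, entry)
    return order
```

For the boot order from the original report (0000,0002,0001,... with 0000 being "ubuntu" and 0002 being "PXE Network"), both functions would put 0002 ahead of 0000.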

Revision history for this message
Ryan Harper (raharper) wrote :

> @Ryan,
>
> Is it the /var/log/maas/rsyslog/{nodename}/{date}/messages file you want? If
> so, I've appended it. Note that it includes several deployments from today.
> The final one is definitely with the GRUB config lines in
> /etc/maas/preseeds/curtin. If it's another file, can you please tell me
> where to find it?

Thanks, that should be good enough. Generally, you can fetch the logs via
the CLI as described here:

https://discourse.maas.io/t/getting-curtin-debug-logs/169

>
> I don't know offhand precisely how MAAS's GRUB handles the hand-off to the
> local GRUB, I'm afraid. I know it's been changed over the years, though.

OK. Me neither; but I suspect we can work on that after sorting out how
to keep PXE boot enabled after a curtin install when BootCurrent is missing.

>
> From what state does the algorithm you posted begin? As I understand it (I

The algorithm runs after the following:

- Node is powered on and PXE boots
- Ubuntu boots over networking and starts the curtin install
- Curtin completes the early, partitioning, network and extract phases;
  the requested OS has been installed to the target devices.
- Curtin starts the curthooks phase, which is used to configure the target
  OS
- The last step in curthooks calls the setup_grub() method, which finds the
  target block devices, chroots into the target and installs grub
- If the grub install is successful, curtin will then run the logic
  specified.

> may be wrong), installing the GRUB package adds the "ubuntu" boot
> item to the boot list, making it the first in the BootOrder. Thus, if the

Interesting. I know we create a new entry via the grub-install command, where
we specify the bootloader-id value. I was unaware of where in the order
it might place the new entry.

> algorithm you specified runs after that, then the effect of a missing
> BootCurrent variable would be that the GRUB-modified BootOrder that boots
> from disk would not be changed. Several possible fixes/workarounds occur to
> me, one of which would require changes to GRUB, or at least its packaging

Yes; It definitely runs after we've run grub-install and added a new entry.

>
> - Change the algorithm to ensure that the BootOrder item corresponding
> to "ubuntu" is NOT first -- just blindly demote it by one if it's
> first in the list, on the assumption that the system booted via
> the first BootOrder item. This could be done conditional on
> BootCurrent not existing.
> - Try to extract a PXE-boot option from the Boot#### list and push
> it to the top. It's likely to be tricky to identify this item,
> since it's not named in a fully standardized way, in my experience;
> and some systems have multiple PXE-boot entries, so booting from
> the wrong one would be inappropriate.
> - Combining the two above, the "ubuntu" entry could be pushed below
> all identifiable PXE-boot entries.

I think understanding how the MAAS PXE config works will matter here.
I can work up a curtin change which moves it past any PXE entries;
that's worth a test at least.

The other thought I had was if we could find out which entry we've booted
from when BootCurrent is not set ... do you have any info on what we could
probe or look for?


Revision history for this message
Rod Smith (rodsmith) wrote :

AFAIK, BootCurrent is the only way to know how the system booted. OTOH, I suppose that, in theory, you could have MAAS's GRUB write a file somewhere that indicates a MAAS/PXE boot. You could then look for that file (checking its time stamp to be sure it was recent) -- but if BootCurrent is missing for GRUB in the EFI environment as well as for the kernel, that wouldn't help you figure out which of the Boot#### entries was used.

Revision history for this message
Ryan Harper (raharper) wrote :

I've got a branch of curtin if you can test:

https://launchpad.net/~raharper/+archive/ubuntu/lp1789650

Waiting for binaries to publish and I'll copy over to Focal, Bionic and Xenial for testing.

Ryan Harper (raharper)
Changed in curtin:
assignee: nobody → Ryan Harper (raharper)
status: Incomplete → In Progress
Revision history for this message
Rod Smith (rodsmith) wrote :

Ryan,

I've tried your curtin update. I installed it on our MAAS server; I assume that's all that's needed. If not, please tell me what more I need to do.

Unfortunately, it didn't completely fix the problem. If I deployed drapion when it had no existing "ubuntu" entry, it worked; the newly-deployed system came up configured to PXE-boot. When I re-deployed the system, though, it comes up configured to boot from the "ubuntu" disk entry. I tried this several times and the results were reproducible. Thus, in a regular production environment, an affected system could be deployed once (fresh out of the box), then successfully deployed a second time; but subsequent deployments would fail unless the "ubuntu" entry was erased or some other manual intervention taken.

Revision history for this message
Ryan Harper (raharper) wrote :

Thanks Rod,

Install is all that is needed.

Can you capture the logs of the successful install and the unsuccessful second install?

We'll need to work on the fallback algorithm. To improve it, please capture logs that show the boot order/menu before and after the install; that will help us make something more robust.

Revision history for this message
Rod Smith (rodsmith) wrote :

Ryan, I'm attaching what I can, which is the output from the "maas {profile} machine get-curtin-config {system_id}" command and the log file I provided earlier. The "maas {profile} node-results read system_id={system_id} name=/tmp/curtin-logs.tar | jq -r .[0].data | base64 -d > curtin-logs.tar" command from https://discourse.maas.io/t/getting-curtin-debug-logs/169 did not work; that produced 3-byte files. (Yes, I changed {profile} and {system_id} appropriately.) I suspect the documentation needs to be updated; or perhaps it's TOO up-to-date (we're running MAAS 2.6.2).

In the drapion-messages file, check the final two deployments. The first of those two completed successfully, with the system set to PXE-boot at the end; but the second resulted in the server set to boot from the "ubuntu" disk item.

Revision history for this message
Rod Smith (rodsmith) wrote :
Revision history for this message
Rod Smith (rodsmith) wrote :
Revision history for this message
Ryan Harper (raharper) wrote :

Rod,

Thanks for the logs. I've added some additional logging, as I really want to see the BootOrder before and after the install to compare the lists properly.

I've also updated the logic. If we don't have a BootCurrent, we compare the BootOrder from before the grub install with the one after it; if the new list is the same size (or smaller), we'll use the order from before the grub-install.

If the new boot order list has more entries, then we'll place all new entries after the original order.
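A rough sketch of the fallback described above, for readers following along. This is not the code in the PPA build; the function name fallback_boot_order is hypothetical, and both arguments are assumed to be plain lists of Boot#### numbers taken from BootOrder before and after grub-install.

```python
def fallback_boot_order(before, after):
    """Pick a BootOrder when BootCurrent is unavailable.

    before: BootOrder captured prior to grub-install.
    after:  BootOrder captured once grub-install has run.
    """
    if len(after) <= len(before):
        # Nothing was added: restore the order seen before the install.
        return list(before)
    # Entries were added: keep the original order, append the new ones.
    new_entries = [num for num in after if num not in before]
    return list(before) + new_entries
```

Note that with this logic a newly created entry lands at the end of the list, while an entry that already existed keeps whatever position it had before the install, including first place.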

Let's see how this goes.

curtin_20.1-245-gb71fe868

is in the PPA; will take some time to publish and then I'll copy over to Focal/Bionic/Xenial

Jeff Lane  (bladernr)
tags: added: hwcert-server
Revision history for this message
Rod Smith (rodsmith) wrote :

I've run a new series of tests with the latest version. Here are the results:

- When installing with no ubuntu entry, a new ubuntu entry
  is created at the END of the boot list.
- When an existing ubuntu entry is first in the boot list,
  the ubuntu entry remains (or is re-created in) first
  place.
- When an existing ubuntu entry is second in the boot list
  (after the PXE-boot entry), ubuntu remains second in
  the boot list.
- When an existing ubuntu entry is last in the boot list,
  ubuntu remains last in the boot list.

To test the second condition, I wiped the partition table, thus causing a failure of the existing ubuntu entry; the firmware failed over to PXE-booting.

I'm attaching the /var/log/maas/rsyslog/oil-drapion/2020-07-27/messages file, which should hold logs of all these deployments. (Note that I tested most conditions twice, once with and once without wiping the disk partition table with "sudo sgdisk -Z /dev/sda". I saw no effect from this practice, except that installing with the ubuntu entry first without wiping the partition table is impossible -- or it would require some other workaround, like doing a one-time manual PXE boot, which I did not test.)

Revision history for this message
Ryan Harper (raharper) wrote :

Rod,

Thanks for testing.

Looking at your results (I'll label them, in order, 1 through 4), it appears to me that
1, 3, and 4 were successful in that the installed entry was not placed at the front of the list.

Is that correct?

For (2), where Ubuntu is already first in the list and the install just updates this entry (not creating or removing it), curtin doesn't appear to reorder things in a way that keeps the system booting from PXE.

Is that correct?

For (2), if we're in fallback mode and we don't detect a new entry (because an existing Ubuntu (or CentOS) entry is already present), then instead of using the previous order we could do any number of things: swap the first and second options, or push the first to the end of the list. Looking at your system in scenario (2), just moving the Ubuntu entry somewhere other than the first position would work. In general, I'm not sure this will always be correct; that is, the next item in the boot order may not be PXE.

It's still better than nothing in the fallback scenario; I'll update the branch to detect this scenario and swap the installed slot back one entry.
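As a sketch of the "swap the installed slot back one entry" idea (hypothetical helper name, not the branch's actual code), assuming the boot order is a list of Boot#### numbers and the installed entry's number is known:

```python
def demote_installed_entry(order, installed):
    """If the installed entry is first in the boot order, swap it with the
    entry that follows it; otherwise return the order unchanged."""
    order = list(order)
    if len(order) > 1 and order[0] == installed:
        order[0], order[1] = order[1], order[0]
    return order
```

For the boot order from the bug description, demote_installed_entry(["0000", "0002", "0001"], "0000") moves "PXE Network" (0002) to the front. The caveat stands: nothing guarantees that the entry swapped into first place is a PXE entry.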

> I saw no effect from this practice, except that installing with the ubuntu entry first without wiping the partition table is impossible -- or it would require some other workaround, like doing a one-time manual PXE boot, which I did not test.)

I don't quite follow the "installing with the ubuntu entry first without wiping the partition table is impossible" part. It sounds like you wanted to preserve the existing partition structure and just unpack the target OS on top of the disk? Curtin *can* support that, but MAAS does not enable this functionality; it seems unrelated to the fix. Please let me know if you think something isn't working as expected.

Revision history for this message
Ryan Harper (raharper) wrote :

I've pushed curtin - 20.1-246-g738de054-0ubuntu1 to the PPA, I'll copy the groovy build into the other releases when it's been published.

I've pushed the two changes we're testing into my WIP branch here:

https://code.launchpad.net/~raharper/curtin/+git/curtin/+merge/387981

Revision history for this message
Rod Smith (rodsmith) wrote :

Ryan,

Yes, #1, #3, and #4 were successful from the perspective of forcing a PXE boot; however, putting the ubuntu entry at the end of the boot list likely breaks the reason for creating the ubuntu entry at all (namely, enabling the node to boot if the MAAS server is down), since chances are that an entry to boot to the firmware setup utility, EFI shell, or something else would prevent an automated boot when the ubuntu entry is last in the list.

Given lack of a BootCurrent variable, I think that pushing the ubuntu entry to the second position is the easiest compromise. A more complex approach would be to move that entry beyond at least the first PXE-boot item in the Boot#### entries. This would be trickier to program and would require identifying PXE-boot items, which might not be 100% reliable. FWIW, there's code to identify PXE-boot items in the efi-pxeboot script in Checkbox. (See https://code.launchpad.net/~checkbox-dev/plainbox-provider-checkbox/+git/plainbox-provider-checkbox/+ref/master; and specifically, lines 99-100 of https://git.launchpad.net/plainbox-provider-checkbox/tree/bin/efi-pxeboot.py. The code looks for the strings "Network", "PXE", "NIC", "Ethernet", "IP4", or "IP6" in the description field. So far and AFAIK, those strings have correctly identified every network-boot option we've encountered in certification, but there's no guarantee that the next server released won't use something else.) Even if you did this, moving "ubuntu" after the first PXE-boot entry might not work, because the system might boot from a later one; and moving it after the last one might not work, because there might be some intervening non-functional entry (boot to firmware setup, etc.). Maybe keep moving the ubuntu entry down until you hit a non-PXE entry (or the end of the list)? I don't think there's a perfect solution without BootCurrent -- but moving it to the second entry, or beyond the first network-boot option if you think it's worth writing the extra code, would be better than what we've got now.
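To make the "beyond the first network-boot option" idea concrete, here is a small sketch. It is not the Checkbox code itself; it only reuses the list of strings quoted above, and the case-sensitive substring match, the function names, and the fallback to second position are all assumptions for illustration.

```python
# Strings quoted above from the Checkbox efi-pxeboot script.
NETWORK_HINTS = ("Network", "PXE", "NIC", "Ethernet", "IP4", "IP6")

def is_network_entry(description):
    """Guess whether a Boot#### description names a network-boot option.
    Case-sensitive substring matching is an assumption of this sketch."""
    return any(hint in description for hint in NETWORK_HINTS)

def move_after_first_network(order, entries, installed):
    """Place `installed` right after the first network-boot entry.

    order:     list of boot numbers (BootOrder).
    entries:   dict mapping boot number -> description.
    installed: boot number of the freshly installed OS entry.
    """
    rest = [num for num in order if num != installed]
    for index, num in enumerate(rest):
        if is_network_entry(entries.get(num, "")):
            return rest[:index + 1] + [installed] + rest[index + 1:]
    # No recognizable network entry: fall back to the second position.
    return rest[:1] + [installed] + rest[1:]
```

With the entries from the bug description, an order of 0000,0002,0001 with "ubuntu" at 0000 becomes 0002,0000,0001. This does nothing about the case where the system actually boots from a later PXE entry.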

> I don't quite follow the "installing with the ubuntu entry first without wiping the partition table is impossible"

If the ubuntu entry is first in the BootOrder list, and if you try to redeploy the server, then that will fail, since the server will boot the existing Ubuntu installation; that's why this is a problem. Wiping the partition table is my go-to easy way around this, since when that's done and the server is rebooted, the ubuntu entry will fail and the server will try to boot what's next (PXE-boot, if it was correctly configured for MAAS).

> It sounds like you wanted to preserve the existing partition structure and just unpack the target OS on top of the disk?

No, it's just that this is the way things end up being configured with the current state of affairs. I wanted to feed your test code that condition for completeness, but of course that would work only if the partition table had been wiped (or some other workaround done) for the initial installation boot. Once a fix or workaround is in place and existing nodes redeployed, this condition won't often be encountered.

Revision history for this message
Ryan Harper (raharper) wrote :

> Yes, #1, #3, and #4 were successful from the perspective of forcing a PXE boot; however, putting the ubuntu entry at the end of the boot list likely breaks the reason for creating the ubuntu entry at all (namely, enabling the node to boot if the MAAS server is down), since chances are that an entry to boot to the firmware setup utility, EFI shell, or something else would prevent an automated boot when the ubuntu entry is last in the list.

MAAS being down fallback is a good point; I'd not thought of that.

> Given lack of a BootCurrent variable, I think that pushing the ubuntu entry to the second position is the easiest compromise.

Yes

> A more complex approach would be to move that entry beyond at least the first PXE-boot item in the Boot#### entries. This would be trickier to program and would require identifying PXE-boot items, which might not be 100% reliable. FWIW, there's code to identify PXE-boot items in the efi-pxeboot script in Checkbox. (See https://code.launchpad.net/~checkbox-dev/plainbox-provider-checkbox/+git/plainbox-provider-checkbox/+ref/master; and specifically, lines 99-100 of https://git.launchpad.net/plainbox-provider-checkbox/tree/bin/efi-pxeboot.py. The code looks for the strings "Network", "PXE", "NIC", "Ethernet", "IP4", or "IP6" in the description field. So far and AFAIK, those strings have correctly identified every network-boot option we've encountered in certification, but there's no guarantee that the next server released won't use something else.)

That seems reasonable enough; again this is down a path that doesn't
currently boot; so we can only make it better.

> Even if you did this, moving "ubuntu" after the first PXE-boot entry might not work, because the system might boot from a later one; and moving it after the last one might not work, because there might be some intervening non-functional entry (boot to firmware setup, etc.). Maybe keep moving the ubuntu entry down until you hit a non-PXE entry (or the end of the list)? I don't think there's a perfect solution without BootCurrent -- but moving it to the second entry, or beyond the first network-boot option if you think it's worth writing the extra code, would be better than what we've got now.

Right; I'm happy to iterate on this until we're working reliably on
the machines you've got that demonstrate the failure.

As for the potential boot failure of a Network/PXE entry before
getting to Ubuntu (in the no-MAAS scenario), I don't think there's
much curtin can do about that, since we've no way of knowing which
of those entries work and which do not.

I think it's reasonable to move Ubuntu after the first network
entry; the logic being that any Network entry past the first one
is not likely to be the one used to PXE boot a machine, as waiting
for more than one PXE failure before successfully booting the next
is likely a misconfiguration rather than a designed fallback
scenario.

Not sure if it's worth it, but we could make it configurable whether
curtin prepends only the first PXE/Net entry or all of them.
Then the curtin config could be tweaked on a per-machine
basis.

So, in summary I think we have this:
If no BootCurrent, curtin will reorder the menu like thi...

Revision history for this message
Rod Smith (rodsmith) wrote :

Yes, the procedure you've outlined seems sensible.

That said and FWIW, I've seen servers that boot reliably via the second (or later) PXE-boot entry, but not via the first. This might happen because the first one's network device is unplugged, for instance. As you say, though, that's arguably a misconfiguration. It's not even a possible issue with drapion, which I'm using for my testing in this bug, either; drapion has a single PXE-boot entry. (IIRC, there's a separate firmware setting to control the order of network devices to try when PXE-booting, but from efibootmgr's point of view, there's just one PXE-boot device. Other computers create multiple PXE-boot entries, though.)

Revision history for this message
Greg McNutt (gcmcnutt) wrote :

Not sure about where or how far to push ubuntu down the list, especially when a system has 2 or more potential bootable network devices.

I have run into the same thing with MaaS setup.
- 2.8.1 MaaS
- Supermicro in UEFI mode
- two networks connected (in a LAG)

The problem is that after a deploy, the first network is brought ahead of 'ubuntu' so the boot order is UEFI0,ubuntu,UEFI1,UEFI2,UEFI3

And since MaaS can set either 0 or 1 to be PXE boot, on the systems where network 1 is set for PXE then no chance to boot from network -- the local storage is ahead.

Perhaps our workaround is to suppress the ubuntu(disk) entry completely?

Revision history for this message
Ryan Harper (raharper) wrote :

On Tue, Aug 4, 2020 at 8:50 PM Greg McNutt <email address hidden>
wrote:

> Not sure about where or how far to push ubuntu down the list, especially
> when a system has 2 or more potential bootable network devices.
>
> I have run into the same thing with MaaS setup.
> - 2.8.1 MaaS
> - Supermicro in UEFI mode
> - two networks connected (in a LAG)
>
> The problem is that after a deploy, the first network is brought ahead
> of 'ubuntu' so the boot order is UEFI0,ubuntu,UEFI1,UEFI2,UEFI3
>
> And since MaaS can set either 0 or 1 to be PXE boot, on the systems
> where network 1 is set for PXE then no chance to boot from network --
> the local storage is ahead.
>
> Perhaps our workaround is to suppress the ubuntu(disk) entry completely?
>

I think the most robust solution is to somehow confirm which entry MAAS
uses to PXE boot, and supply this entry (BootNumber and the path) to
curtin; then we can ensure that the same entry is first, and the target
distro entry is second.

That's going to require a bit more work on the MAAS side.

In the meantime, we'll make a reasonable effort to keep systems lacking
a BootCurrent entry bootable for MAAS.

Revision history for this message
Ryan Harper (raharper) wrote :

Rod,

Sorry for the delay; I've updated the PPA with a new curtin build which places the network boot entries first, then the installed target, and then the remaining entries. I believe this should handle the known cases we have.
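The resulting order can be sketched roughly as follows. This is an illustration under assumptions, not the committed curtin code: the helper names are hypothetical, network entries are recognized with the Checkbox-style strings mentioned earlier in this thread, and the entry descriptions in the usage note are made up.

```python
NETWORK_HINTS = ("Network", "PXE", "NIC", "Ethernet", "IP4", "IP6")

def looks_like_network(description):
    # Assumption: case-sensitive substring match on the description.
    return any(hint in description for hint in NETWORK_HINTS)

def network_first_order(order, entries, installed):
    """Return network entries, then the installed target, then the rest,
    preserving the relative order within each group."""
    network = [num for num in order
               if num != installed and looks_like_network(entries.get(num, ""))]
    remaining = [num for num in order
                 if num != installed and num not in network]
    return network + [installed] + remaining
```

With hypothetical entries 0001 "UEFI: PXE IP4 NIC1", 0000 "ubuntu", 0002 "UEFI: PXE IP4 NIC2", and 0003 "Enter Setup", in that order, the result is 0001,0002,0000,0003, so a machine that PXE-boots from its second NIC still reaches the network before the disk.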

Shortly I'll copy over the debs into the other releases once the groovy build is published.

Revision history for this message
Rod Smith (rodsmith) wrote :

Ryan,

This one looks good. All my test deployments resulted in the "PXE Network" setting being #1 in the boot order, followed by "ubuntu" at #2.

Revision history for this message
Ryan Harper (raharper) wrote :

On Wed, Aug 19, 2020 at 12:50 PM Rod Smith <email address hidden>
wrote:

> Ryan,
>
> This one looks good. All my test deployments resulted in the "PXE
> Network" setting being #1 in the boot order, followed by "ubuntu" at #2.
>

\o/

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

This bug is fixed with commit 7a48737e to curtin on branch master.
To view that commit see the following URL:
https://git.launchpad.net/curtin/commit/?id=7a48737e

Changed in curtin:
status: In Progress → Fix Committed
Revision history for this message
David van der Spek (vanderspek-david) wrote :

The bug I just filed (bug #1894217) seems closely related to this one. The deployment fails because of an error raised while trying to set the EFI boot order.

Revision history for this message
Ryan Harper (raharper) wrote :

Thanks for the pointer David. I've commented there; that bug is different, but related to curtin's manipulation of efi entries.

Revision history for this message
Alberto Donato (ack) wrote :

MAAS' PPA ppa:maas/2.8 has an updated curtin with the fix.

Changed in maas:
status: New → Fix Released