Servers set to boot from disk after MAAS installation

Bug #1789650 reported by Rod Smith on 2018-08-29
This bug affects 3 people
Affects: curtin
Importance: High
Assigned to: Unassigned

Bug Description

This bug appears to be a regression of bug #1311827 and/or bug #1642298, but it seems to affect only some servers. Upon deployment via MAAS, the server is set to boot from the "ubuntu" EFI boot order entry -- that is, it boots directly from disk, bypassing the PXE-boot option. Immediately after deployment, the boot order looks like this:

$ sudo efibootmgr
BootCurrent: 0000
BootOrder: 0000,0002,0001,0003,0004,0005,0006,0007,0008
Boot0000* ubuntu
Boot0001* Hard Disk 0
Boot0002* PXE Network
Boot0003 Enter Setup
Boot0004 Boot Devices
Boot0005 Boot Manager
Boot0006 Setup
Boot0007 Diagnostics
Boot0008 Firmware Log

This bug is affecting two Lenovo servers (drapion and jolteon), when deployed with either Ubuntu 18.04 or 16.04. These servers seemed OK in tests in late April. Other servers we've recently tested are NOT affected. Changes to the boot order after installation do "stick," and it's possible to entirely delete the "ubuntu" entry without it being re-created, so I doubt that the firmware is creating the entry or making it the default; it appears that the bug is somewhere in Ubuntu. I'm reporting this against curtin, but GRUB or something else could be the true cause.
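
For anyone who wants to check a batch of freshly deployed nodes for this symptom, here is a minimal sketch that parses plain `efibootmgr` output and reports which entry sits first in BootOrder. The function names (`parse_efibootmgr`, `first_boot_entry_name`) are made up for illustration and are not part of curtin or MAAS; the sample text is the post-deployment output quoted above.

```python
import re

# Post-deployment output from the affected server, as quoted in the report.
SAMPLE = """\
BootCurrent: 0000
BootOrder: 0000,0002,0001,0003,0004,0005,0006,0007,0008
Boot0000* ubuntu
Boot0001* Hard Disk 0
Boot0002* PXE Network
Boot0003 Enter Setup
"""


def parse_efibootmgr(text):
    """Parse plain `efibootmgr` output (hypothetical helper, not curtin's)."""
    result = {"current": None, "order": [], "entries": {}}
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"^BootCurrent:\s*([0-9A-Fa-f]{4})$", line)
        if m:
            result["current"] = m.group(1)
            continue
        m = re.match(r"^BootOrder:\s*(\S+)$", line)
        if m:
            result["order"] = m.group(1).split(",")
            continue
        # "Boot0000* ubuntu": the '*' marks an active entry.
        m = re.match(r"^Boot([0-9A-Fa-f]{4})(\*?)\s+(.*)$", line)
        if m:
            result["entries"][m.group(1)] = {
                "active": m.group(2) == "*",
                "name": m.group(3),
            }
    return result


def first_boot_entry_name(text):
    """Return the name of the first BootOrder entry, or None if unknown."""
    parsed = parse_efibootmgr(text)
    if not parsed["order"]:
        return None
    first = parsed["order"][0]
    return parsed["entries"].get(first, {}).get("name")
```

On the sample above, the first entry is "ubuntu" rather than "PXE Network", which is the misbehavior being reported.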

On Wed, Aug 29, 2018 at 9:01 AM Rod Smith <email address hidden> wrote:
>
> Public bug reported:

Can you provide the curtin install.log and config from the
installation of the affected servers?

Rod Smith (rodsmith) wrote :

I assume you mean the /root/curtin-install.log file, which I'm attaching. If you mean another file, please clarify.

On Wed, Aug 29, 2018 at 9:40 AM Rod Smith <email address hidden> wrote:
>
> I assume you mean the /root/curtin-install.log file, which I'm
> attaching. If you mean another file, please clarify.

That's the file; however debugging output is not enabled. Could you
enable that and get the install log?

maas <user> maas set-config name=curtin_verbose value=True

Before the install, could you also attach the efibootmgr -v output, which
includes the loader paths?
I'll take that, look at the code which reorders boot entries, and see if
we can sort out whether it's a curtin bug.

The reordering code hasn't changed since 2017-05-11:

https://git.launchpad.net/curtin/tree/curtin/commands/curthooks.py#n219

>
> ** Attachment added: "/root/curtin-install.log file from one affected server (jolteon)"
> https://bugs.launchpad.net/curtin/+bug/1789650/+attachment/5182207/+files/curtin-install.log
>
> --
> You received this bug notification because you are subscribed to curtin.
> Matching subscriptions: curtin-bugs-all
> https://bugs.launchpad.net/bugs/1789650
>
> Title:
> Servers set to boot from disk after MAAS installation
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/curtin/+bug/1789650/+subscriptions

Rod Smith (rodsmith) wrote :

I'm attaching a tarball with both the curtin-install-cfg.yaml and curtin-install.log files after a fresh installation, with the debugging feature you wanted activated.

Here's the output of "sudo efibootmgr -v" before the installation:

$ sudo efibootmgr -v
BootCurrent: 0002
BootOrder: 0002,0001,0003,0004,0005,0006,0007,0008
Boot0001* Hard Disk 0 FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0002* PXE Network FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0003 Enter Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)
Boot0004 Boot Devices FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS....h
Boot0005 Boot Manager FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0006 Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0007 Diagnostics FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0008 Firmware Log FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....

Note that installing requires either changing the boot order or deleting the "ubuntu" entry. I opted for the latter in this case. I can re-order the boot entries and try again, if you prefer. (I've done both in my testing, and both have the same result.)

Here's the same command's output immediately after the installation:

$ sudo efibootmgr -v
BootCurrent: 0000
BootOrder: 0000,0002,0001,0003,0004,0005,0006,0007,0008
Boot0000* ubuntu HD(1,GPT,7fb492e7-7aac-4444-896f-13532e2de1f8,0x800,0x100000)/File(\EFI\ubuntu\shimx64.efi)
Boot0001* Hard Disk 0 FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0002* PXE Network FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0003 Enter Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)
Boot0004 Boot Devices FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS....h
Boot0005 Boot Manager FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0006 Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0007 Diagnostics FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0008 Firmware Log FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
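
A quick way to compare the two listings is to diff their BootOrder lines. The sketch below is only an illustration (the helper name `new_boot_entries` is hypothetical); it uses the BootOrder values from the two outputs above.

```python
def new_boot_entries(before_order, after_order):
    """Return entries present in after_order but not in before_order,
    keeping the position order of after_order."""
    known = set(before_order)
    return [entry for entry in after_order if entry not in known]


# BootOrder values copied from the before/after outputs in this comment.
BEFORE = "0002,0001,0003,0004,0005,0006,0007,0008".split(",")
AFTER = "0000,0002,0001,0003,0004,0005,0006,0007,0008".split(",")
```

With these values the only new entry is 0000 (the "ubuntu" loader), and it lands ahead of 0002 (PXE Network) instead of behind it.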

Rod Smith (rodsmith) wrote :

Ryan, have you looked at the debugging output? Do you have any clues about what's going on?

Ryan Harper (raharper) wrote :

Comparing the before and after output, I believe this is expected behavior.
It stems from a change made to handle the case where MAAS is offline: with
every node defaulting to PXE boot, none of them could boot while the MAAS
service was down.

Here's the relevant commit, with details on the behavior change.

commit cb1ef09beddb6c4559c131d2606d9b6b70c4ca7f
Merge: bae772c 5cffafa
Author: Blake Rouse <email address hidden>
Date: Fri May 26 13:27:22 2017 -0500

    Clear and re-order UEFI boot methods during UEFI grub installation.

    Previously when installing Ubuntu using curtin it was default to pass
    '--no-nvram' to the grub-install. This branch reverts that has passing
    '--no-nvram' will prevent the system from booting if MAAS is down because
    Ubuntu not be a loader in the EFI system for the system to fallback on.
    When updating the nvram grub places Ubuntu before the currently booted
    method, which prevents the ability for the machine to boot from the
    network anymore. This branch reorders to boot order of the EFI system to
    place the currently booted method before all others, but Ubuntu is still
    second in the list so if MAAS is down the machine will still boot from
    its local disk correctly.

    Another issue is that older EFI loaders would be present in the EFI firmware
    even through curtin deletes and re-creates the entire EFI partition. This
    removes only those loaders before grub-install is performed to make sure
    that only the relevant loads for the current state of the system are
    present.

    Fixes: LP:#1680917, LP:#1686669

    LP: #1680917, #1686669
    bzr-revno: 503
    bzr-revno: 503

If you want to always keep PXE as the boot option after install then I think
you'll need MAAS to send:

grub:
  update_nvram: false

The default value if nothing is sent is True which is why curtin is
doing reordering.

Setting this value to false will prevent curtin from making any boot
order changes at all.
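
To make the default explicit, here is a tiny sketch of how such a setting could be read from an already-parsed config. It assumes the config has been loaded into a Python dict with the `grub` / `update_nvram` keys shown above; the function name `should_update_nvram` is hypothetical and is not curtin's actual code.

```python
def should_update_nvram(curtin_cfg):
    """Return the effective update_nvram setting.

    Assumes curtin_cfg is the parsed config dict; per the comment above,
    the value defaults to True when nothing is sent.
    """
    grub_cfg = curtin_cfg.get("grub") or {}
    return bool(grub_cfg.get("update_nvram", True))
```

An empty config therefore behaves the same as `update_nvram: true`, and only an explicit `false` turns the NVRAM update and reordering off.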

Changed in curtin:
importance: Undecided → High
status: New → Incomplete
Rod Smith (rodsmith) wrote :

After MAAS deploys a node, the node *SHOULD* be set to PXE-boot. This is, in fact, what happens on most computers (or does recently; the behavior of booting from disk was a bug several months ago). In recent tests (for 16.04.5 regression testing), I've seen this misbehavior of having the disk-boot option first in the boot list only on two Lenovo computers, so it looks like there's some new interaction going on.

Ryan Harper (raharper) wrote :

Could you collect the same info (efibootmgr -v output before/after and the curtin-* from /root) from machines that work as you expect for further comparison?

Andres Rodriguez (andreserl) wrote :

Actually, this would seem like a regression in curtin. As per the comment #6 from Blake's branch:

" This branch reorders to boot order of the EFI system to place the currently booted method before all others, but Ubuntu is still second in the list so if MAAS is down the machine will still boot from its local disk correctly."

From the description & changes in [3] (which is a fix for [1]), what would have happened is:

a. The --no-nvram flag is no longer passed with the grub command. This causes the nvram to be updated (as expected).
b. After this, curtin (with update_nvram: True) would re-organize the boot order to ensure that PXE is first ("the current boot method"), and then set the disk (or the entry that grub created) as the second in the boot order as a fallback, in case the machine cannot boot off PXE.
c. The update_nvram defaults to True, which would do (a) and (b) above. You can see line 100 of the diff in [3], where update_nvram: True calls uefi_reorder_loaders which (as per the docstring) "Reorders the UEFI BootOrder to place BootCurrent first.". Since BootCurrent should've been PXE (because the machine PXE'd off the network), this would be set first.

NOTE that these changes were coupled with [2], which tells the grub package itself not to update nvram on subsequent grub updates (e.g. when doing a sudo apt-get dist-upgrade), but I think this is irrelevant for this issue.

So, based on the above and [3], this would indeed seem to be a regression in curtin if, when update_nvram is True, it does not re-organize the boot order to put PXE first and the entry that grub created (since we run it without --no-nvram) second.

That said, I looked at [4] and it doesn't seem that this has actually changed in curtin (see lines between 387 and 411), unless curtin is now always defaulting to update_nvram: false, instead of true. Just looking at [4], it seems to me that 'update_nvram: True' is still doing what is expected:

1. To run with --update-nvram (update_nvram: True)
2. To re-order the bootorder to ensure PXE is first (update_nvram: True)

Lastly, I wonder if this is actually an issue with the firmware? @Rod, when was this firmware last upgraded? What is the 'BIOS Boot Type' set to in the power params for this machine (auto, legacy, or efi)? I think what's requested in #8 would potentially highlight the differences.

[1]: https://bugs.launchpad.net/maas/+bug/1680917
[2]: https://bugs.launchpad.net/maas/+bug/1642298
[3]: https://code.launchpad.net/~blake-rouse/curtin/uefi-clear-reorder/+merge/323875
[4]: https://git.launchpad.net/curtin/tree/curtin/commands/curthooks.py?id=f5ea2d4d4d714d2cd93c4435fad298860df4d711

Andres Rodriguez (andreserl) wrote :

@Rod, so to clarify I would like to know:

1. how's the boot order configured in the BIOS of the machine? is it configured to boot from disk first?
2. In the power params of MAAS, what's the BIOS Boot type option set to? (auto, efi, legacy?).

Andres Rodriguez (andreserl) wrote :

and when was the firmware of this system last updated?

Ryan Harper (raharper) wrote :

No change to curtin here, it will use update-nvram, remove old loaders and reorder.

Looking at the output of the failed system I can see:

after grub-install efiboot settings
+ efibootmgr
BootOrder: 0000,0002,0001,0003,0004,0005,0006,0007,0008
Boot0000* ubuntu
Boot0001* Hard Disk 0
Boot0002* PXE Network
Boot0003 Enter Setup
Boot0004 Boot Devices
Boot0005 Boot Manager
Boot0006 Setup
Boot0007 Diagnostics
Boot0008 Firmware Log

Here we've added the ubuntu entry at 0000; and we know we booted via 0002 (PXE)
since this is maas. What we would expect to see is curtin do a reorder and here
it is in the log:

Running command ['unshare', '--fork', '--pid', '--', 'chroot', '/tmp/tmpggk7qppj/target', 'efibootmgr', '-v'] with allowed return codes [0] (capture=True)

Now, the reorder code logs if it is *not* reordering, and we do not see that in the output, so it must have attempted to reorder (as expected).

In the reorder code block we have:

    efi_output = util.get_efibootmgr(target)
    currently_booted = efi_output.get('current', None)
    boot_order = efi_output.get('order', [])
    if currently_booted:
        if currently_booted in boot_order:
            boot_order.remove(currently_booted)
        boot_order = [currently_booted] + boot_order
        new_boot_order = ','.join(boot_order)
        LOG.debug(
            "Setting currently booted %s as the first "
            "UEFI loader.", currently_booted)
        LOG.debug(
            "New UEFI boot order: %s", new_boot_order)
        with util.ChrootableTarget(target) as in_chroot:
            in_chroot.subp(['efibootmgr', '-o', new_boot_order])

The only path out of here that doesn't log is the one where efi_output['current'] is not set.
Since we don't see any additional call to efibootmgr with -o to set the order,
it must have returned output that didn't have 'BootCurrent' set.

The efibootmgr -v output from Rod certainly shows BootCurrent, so it remains somewhat mysterious why we attempted to reorder but didn't find a BootCurrent value in the
efibootmgr -v output right after the grub install.
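
The silent path can be illustrated with a side-effect-free version of the reorder logic. This is a sketch for discussion only, not curtin's code: `reorder_boot_order` is a made-up name, and it takes the parsed values directly instead of calling efibootmgr.

```python
def reorder_boot_order(current, order):
    """Return the boot order with `current` moved to the front.

    Returns None when `current` is missing, mirroring the branch in the
    quoted code that neither logs nor calls `efibootmgr -o`.
    """
    if not current:
        return None
    rest = [entry for entry in order if entry != current]
    return [current] + rest


# BootOrder on the failing system right after grub-install (from the log).
FAILING_ORDER = "0000,0002,0001,0003,0004,0005,0006,0007,0008".split(",")
```

With BootCurrent reported as 0002 this would give 0002,0000,0001,..., which is the expected result; with no BootCurrent at all it returns None and the order created by grub-install is left untouched, matching what the failing log shows.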

Rod Smith (rodsmith) wrote :

Ryan and Andres, here are some things you've requested....

The attachment is the /root/curtin-install* files from a server that works as expected. (That server is jehan, a Quanta D52B-1U, FWIW.) Here are the before and after "efibootmgr -v" outputs from jehan:

ubuntu@jehan:~$ sudo efibootmgr -v
BootCurrent: 0006
Timeout: 5 seconds
BootOrder: 0006,0008,0007,0005,0009,000A,000B,000C,0000,0003
Boot0000* ubuntu HD(1,GPT,489c1060-e2fd-4f57-b2e4-beb55dec5764,0x800,0x100000)/File(\EFI\UBUNTU\SHIMX64.EFI)
Boot0003 UEFI: Built-in EFI Shell VenMedia(5023b95c-db26-429b-a648-bd47664c8012)..BO
Boot0005* UEFI: Slot5 Port0 HTTP IPv4 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(a81e84f296c5,0)/IPv4(0.0.0.00.0.0.0,0,0)/Uri()..BO
Boot0006* UEFI: Slot5 Port0 PXE IPv4 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(a81e84f296c5,0)/IPv4(0.0.0.00.0.0.0,0,0)..BO
Boot0007* UEFI: Slot5 Port1 HTTP IPv4 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(a81e84f296c6,0)/IPv4(0.0.0.00.0.0.0,0,0)/Uri()..BO
Boot0008* UEFI: Slot5 Port1 PXE IPv4 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(a81e84f296c6,0)/IPv4(0.0.0.00.0.0.0,0,0)..BO
Boot0009* UEFI: Slot5 Port0 HTTP IPv6 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(a81e84f296c5,0)/IPv6([::]:<->[::]:,0,0)/Uri()..BO
Boot000A* UEFI: Slot5 Port0 PXE IPv6 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(a81e84f296c5,0)/IPv6([::]:<->[::]:,0,0)..BO
Boot000B* UEFI: Slot5 Port1 HTTP IPv6 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(a81e84f296c6,0)/IPv6([::]:<->[::]:,0,0)/Uri()..BO
Boot000C* UEFI: Slot5 Port1 PXE IPv6 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(a81e84f296c6,0)/IPv6([::]:<->[::]:,0,0)..BO
MirroredPercentageAbove4G: 0.00
MirrorMemoryBelow4GB: false

After re-deploying:

ubuntu@jehan:~$ sudo efibootmgr -v
BootCurrent: 0006
Timeout: 5 seconds
BootOrder: 0006,0008,0007,0005,0009,000A,000B,000C,0000,0003
Boot0000* ubuntu HD(1,GPT,6cdb926d-5f7b-4f83-a88d-d9a65ff43d3b,0x800,0x100000)/File(\EFI\UBUNTU\SHIMX64.EFI)
Boot0003 UEFI: Built-in EFI Shell VenMedia(5023b95c-db26-429b-a648-bd47664c8012)..BO
Boot0005* UEFI: Slot5 Port0 HTTP IPv4 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(a81e84f296c5,0)/IPv4(0.0.0.00.0.0.0,0,0)/Uri()..BO
Boot0006* UEFI: Slot5 Port0 PXE IPv4 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(a81e84f296c5,0)/IPv4(0.0.0.00.0.0.0,0,0)..BO
Boot0007* UEFI: Slot5 Port1 HTTP IPv4 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(a81e84f296c6,0)/IPv4(0.0.0.00.0.0.0,0,0)/Uri()..BO
Boot0008* UEFI: Slot5 Port1 PXE IPv4 Intel(R) 82599 10 Gigabit Dual Port Network Connection PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(a81e84f296c6,0)/IPv4(0.0.0.00.0.0.0,0,0)..BO
B...

Rod Smith (rodsmith) wrote :

Oh, I missed one question: The servers are all set to "EFI Boot" under "Power Configuration."

Ryan Harper (raharper) wrote :

Rod,

Thanks for the successful run. Looking at the log I can see where
curtin does a reorder of the boot entries and clearly shows a call to
efibootmgr with the -o option.

Running command ['unshare', '--fork', '--pid', '--', 'chroot',
'/tmp/tmpbfj5kvuo/target', 'efibootmgr', '-v'] with allowed return
codes [0] (capture=True)
Running command ['udevadm', 'settle'] with allowed return codes [0]
(capture=False)
Running command ['umount', '/tmp/tmpbfj5kvuo/target/sys'] with allowed
return codes [0] (capture=False)
Running command ['umount', '/tmp/tmpbfj5kvuo/target/proc'] with
allowed return codes [0] (capture=False)
Running command ['umount', '/tmp/tmpbfj5kvuo/target/dev'] with allowed
return codes [0] (capture=False)
Setting currently booted 0006 as the first UEFI loader.
New UEFI boot order: 0006,0000,0008,0007,0005,0009,000A,000B,000C,0003
Running command ['mount', '--bind', '/dev',
'/tmp/tmpbfj5kvuo/target/dev'] with allowed return codes [0]
(capture=False)
Running command ['mount', '--bind', '/proc',
'/tmp/tmpbfj5kvuo/target/proc'] with allowed return codes [0]
(capture=False)
Running command ['mount', '--bind', '/sys',
'/tmp/tmpbfj5kvuo/target/sys'] with allowed return codes [0]
(capture=False)
Running command ['unshare', '--fork', '--pid', '--', 'chroot',
'/tmp/tmpbfj5kvuo/target', 'efibootmgr', '-o',
'0006,0000,0008,0007,0005,0009,000A,000B,000C,0003'] with allowed
return codes [0] (capture=False)
BootCurrent: 0006
Timeout: 5 seconds
BootOrder: 0006,0000,0008,0007,0005,0009,000A,000B,000C,0003
Boot0000* ubuntu
Boot0003 UEFI: Built-in EFI Shell
Boot0005* UEFI: Slot5 Port0 HTTP IPv4 Intel(R) 82599 10 Gigabit Dual
Port Network Connection
Boot0006* UEFI: Slot5 Port0 PXE IPv4 Intel(R) 82599 10 Gigabit Dual
Port Network Connection
Boot0007* UEFI: Slot5 Port1 HTTP IPv4 Intel(R) 82599 10 Gigabit Dual
Port Network Connection
Boot0008* UEFI: Slot5 Port1 PXE IPv4 Intel(R) 82599 10 Gigabit Dual
Port Network Connection
Boot0009* UEFI: Slot5 Port0 HTTP IPv6 Intel(R) 82599 10 Gigabit Dual
Port Network Connection
Boot000A* UEFI: Slot5 Port0 PXE IPv6 Intel(R) 82599 10 Gigabit Dual
Port Network Connection
Boot000B* UEFI: Slot5 Port1 HTTP IPv6 Intel(R) 82599 10 Gigabit Dual
Port Network Connection
Boot000C* UEFI: Slot5 Port1 PXE IPv6 Intel(R) 82599 10 Gigabit Dual
Port Network Connection

So the remaining question now is why the failing system does *not*
show BootCurrent in the output immediately after we install grub.

That seems to be the core issue. Once you've booted into these
systems, efibootmgr -v does show BootCurrent; however, while we're
booted into the ephemeral environment and we chroot into the target
OS, the efibootmgr command doesn't seem to return BootCurrent.

Do you have any insight into how efibootmgr determines what the
BootCurrent value should be?

Ryan Harper (raharper) wrote :

Looks like this comes from reading:

/sys/firmware/efi/efivars/BootCurrent-<UUID>

I'm testing a late_command that dumps both the efibootmgr -v output and
a hexdump of the sysfs path, to see if that shows the missing
BootCurrent entry.
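
For reading that hexdump: a file under efivarfs starts with a 4-byte attributes word, followed by the variable data, and BootCurrent itself is a 16-bit little-endian value. A small sketch of decoding it follows; the helper names `decode_bootcurrent` and `read_bootcurrent` are hypothetical, and whether efibootmgr reads the variable exactly this way is an assumption.

```python
import glob
import os
import struct


def decode_bootcurrent(raw):
    """Decode the raw contents of an efivarfs BootCurrent-<GUID> file.

    Layout: 4 bytes of variable attributes, then a little-endian uint16.
    """
    if len(raw) < 6:
        raise ValueError("BootCurrent data too short: %d bytes" % len(raw))
    _attributes, value = struct.unpack_from("<IH", raw, 0)
    return "%04X" % value


def read_bootcurrent(efivars_dir="/sys/firmware/efi/efivars"):
    """Return BootCurrent as a 4-digit hex string, or None if the
    variable file is not visible (e.g. efivarfs is not mounted)."""
    matches = glob.glob(os.path.join(efivars_dir, "BootCurrent-*"))
    if not matches:
        return None
    with open(matches[0], "rb") as handle:
        return decode_bootcurrent(handle.read())
```

A None result from `read_bootcurrent` inside the chroot would point at the variable file not being visible there, rather than at the firmware failing to set it.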

Rod Smith (rodsmith) wrote :

I've never dug into the efibootmgr source code, so I don't know offhand where it's getting the BootCurrent variable, but /sys/firmware/efi/efivars/BootCurrent-8be4df61-93ca-11d2-aa0d-00e098032b8c is a plausible source. FWIW, on the deployed problem machine, /sys/firmware/efi/efivars/BootCurrent-8be4df61-93ca-11d2-aa0d-00e098032b8c does exist and contains plausible data. If you've got a custom deployment config you want me to run, I can do that; or I can give you access to our MAAS server and the trouble systems.

Ryan Harper (raharper) wrote :

This config should dump efibootmgr -v output and what's in BootCurrent
right after we've completed the install but before we reboot.

_hexdump_bootcurrent:
 - &hexdump |
   ls -al /sys/firmware/efi
   bcurrent=$(ls /sys/firmware/efi/efivars/BootCurrent*)
   [ -e "${bcurrent}" ] && hexdump $bcurrent

late_commands:
  01_bootcurrent: ['curtin', 'in-target', '--', 'efibootmgr', '-v']
  02_hexdump: ['curtin', 'in-target', '--', 'sh', '-c', *hexdump]

This will show up in the node logs output.

Rod Smith (rodsmith) wrote :

I'm afraid the node is failing to deploy with those changes to /etc/maas/preseeds/curtin_userdata (I assume that's where you wanted them):

        Running command ['unshare', '--fork', '--pid', '--', 'chroot', '/tmp/tmpzfw22u7e/target', 'sh', '-c', 'ls -al /sys/firmware/efi\nbcurrent=$(ls /sys/firmware/efi/efivars/BootCurrent*)\n[ -e "${bcurrent}" ] && hexdump $bcurrent\n'] with allowed return codes [0] (capture=False)
        total 0
        drwxr-xr-x 5 root root 0 Sep 13 23:10 .
        drwxr-xr-x 6 root root 0 Sep 13 23:08 ..
        -r--r--r-- 1 root root 4096 Sep 13 23:10 config_table
        dr-xr-xr-x 2 root root 0 Sep 13 23:08 efivars
        -r--r--r-- 1 root root 4096 Sep 13 23:10 fw_platform_size
        -r--r--r-- 1 root root 4096 Sep 13 23:10 fw_vendor
        -r--r--r-- 1 root root 4096 Sep 13 23:10 runtime
        drwxr-xr-x 9 root root 0 Sep 13 23:10 runtime-map
        -r-------- 1 root root 4096 Sep 13 23:09 systab
        drwxr-xr-x 70 root root 0 Sep 13 23:10 vars
        ls: cannot access '/sys/firmware/efi/efivars/BootCurrent*': No such file or directory
        Running command ['udevadm', 'settle'] with allowed return codes [0] (capture=False)
        Running command ['umount', '/tmp/tmpzfw22u7e/target/sys'] with allowed return codes [0] (capture=False)
        Running command ['umount', '/tmp/tmpzfw22u7e/target/proc'] with allowed return codes [0] (capture=False)
        Running command ['umount', '/tmp/tmpzfw22u7e/target/dev'] with allowed return codes [0] (capture=False)
        finish: cmd-install/stage-late/02_hexdump/cmd-in-target: FAIL: curtin command in-target

Ryan Harper (raharper) wrote :

Yuck, I was seeing the same thing in my VM, but I was sure it was an
issue with the VM.

I cannot fathom why that sys path is not accessible. Let me look more
into my VM and see what's going on.

Ryan Harper (raharper) wrote :

Well, it turns out that /sys/firmware/efi/efivars is a *mount* point
which should be automatically mounted on UEFI systems. Something is
fishy here.

If it's not mounted, one can run:

mount -t efivarfs efivarfs /sys/firmware/efi/efivars

Hrm, it looks like there are two vars paths:

/sys/firmware/efi/vars (part of the kernel, not a separate mount)
and
/sys/firmware/efi/efivars (special mount)

It seems that efibootmgr could show different values depending on
which path it takes.
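
A minimal Python sketch of the check described above: it reports whether efivarfs is mounted at /sys/firmware/efi/efivars and whether the older /sys/firmware/efi/vars directory exists. The function names are hypothetical and this is not curtin code; it only assumes the usual /proc/mounts field layout (device, mount point, filesystem type, ...).

```python
from pathlib import Path

EFIVARS = "/sys/firmware/efi/efivars"   # efivarfs mount point
LEGACY_VARS = "/sys/firmware/efi/vars"  # older sysfs interface


def efivarfs_mounted(mounts_text: str, mountpoint: str = EFIVARS) -> bool:
    """Return True if an efivarfs filesystem is mounted at mountpoint."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fstype, options, ...
        if len(fields) >= 3 and fields[1] == mountpoint and fields[2] == "efivarfs":
            return True
    return False


def describe_efi_interfaces() -> dict:
    """Report which EFI variable interfaces look usable on this machine."""
    mounts = Path("/proc/mounts")
    mounts_text = mounts.read_text() if mounts.exists() else ""
    return {
        "efivarfs_mounted": efivarfs_mounted(mounts_text),
        "legacy_vars_present": Path(LEGACY_VARS).is_dir(),
    }
```

Running describe_efi_interfaces() both in the ephemeral environment and inside the target chroot would show whether the two see the same interfaces.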

Ryan Harper (raharper) wrote :

Here's an updated late_command to deploy with.

_hexdump_bootcurrent:
 - &hexdump |
   grep efi /proc/mounts
   mountpoint /sys/firmware/efi/efivars
   echo "checking /sys/firmware/efi/vars/"
   ls -al /sys/firmware/efi/vars/
   bcurrent=$(ls /sys/firmware/efi/vars/BootCurrent*/data)
   [ -e "${bcurrent}" ] && hexdump $bcurrent
   echo "efibootmgr output before mounting efivars (uses vars)"
   efibootmgr -v
   echo "mounting efivars"
   mount -o defaults -t efivarfs efivarfs /sys/firmware/efi/efivars
   ls -al /sys/firmware/efi/efivars/
   echo "efibootmgr output after mounting efivars"
   efibootmgr -v
   bcurrent=$(ls /sys/firmware/efi/efivars/BootCurrent*)
   [ -e "${bcurrent}" ] && hexdump $bcurrent
   umount /sys/firmware/efi/efivars

late_commands:
  01_efivars: ['grep', 'efi', '/proc/mounts']
  02_efimnt: ['mountpoint', '/sys/firmware/efi/efivars']
  03_hexdump: ['curtin', 'in-target', '--', 'sh', '-c', *hexdump]

This runs fine on my VM now so it will be interesting to see what the
BootCurrent values show here.

One possible change to curtin here is that we may need to start bind
mounting /sys/firmware/efi/efivars when we run commands in-target.
The debug output from this should help us understand what's going on.

I did observe that without efivars mounted, the new ubuntu entry that
the grub install adds was only visible via /sys/firmware/efi/vars,
and that if I then mounted efivars and ran efibootmgr, it wouldn't
*show* the ubuntu entry; so it seems possible for these two paths to
be out of sync, which may explain the error.
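
One way to test the out-of-sync theory is to compare the variable names each interface exposes. The sketch below is illustrative only (the helper names are hypothetical); it assumes both interfaces name variables as Name-GUID, as directories under vars and as files under efivars.

```python
import os


def list_names(path: str) -> set:
    """Return the entry names under path, or an empty set if unreadable."""
    try:
        return set(os.listdir(path))
    except OSError:
        return set()


def compare_interfaces(vars_names: set, efivars_names: set) -> dict:
    """Show which variable names appear in only one of the two interfaces."""
    return {
        "only_in_vars": sorted(vars_names - efivars_names),
        "only_in_efivars": sorted(efivars_names - vars_names),
    }


if __name__ == "__main__":
    report = compare_interfaces(
        list_names("/sys/firmware/efi/vars"),
        list_names("/sys/firmware/efi/efivars"),
    )
    print(report)
```

A Boot entry listed under only_in_vars after grub-install would match the behaviour described above.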

Rod Smith (rodsmith) wrote :

Here are the install files using your modified MAAS preseed.

Ryan Harper (raharper) wrote :

Rod,

Thanks for running that.

I don't know why, but there is no BootCurrent entry available during
the install. This is going to prevent curtin from ensuring that what
we booted from is the first entry.

I suspect that something in the firmware omits BootCurrent when the
system has PXE booted; you've shown that if you boot from a local
disk, efibootmgr shows a BootCurrent entry, but during the
PXE/ephemeral boot it's not present in the available EFI variables.

I don't think there is anything else that curtin can do here.
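
To illustrate why a missing BootCurrent blocks the fix, here is a minimal sketch of the reordering idea: move the entry we booted from to the front of BootOrder. This is not curtin's actual implementation and the function name is hypothetical; entries are the four-digit hex strings that efibootmgr prints.

```python
from typing import List, Optional


def order_with_current_first(boot_order: List[str],
                             boot_current: Optional[str]) -> List[str]:
    """Return boot_order with boot_current moved to the front.

    If boot_current is None (the variable is missing), nothing can be
    promoted, so the order is returned unchanged.
    """
    if boot_current is None:
        return list(boot_order)
    rest = [entry for entry in boot_order if entry != boot_current]
    # boot_current is placed first even if it was absent from boot_order.
    return [boot_current] + rest
```

With the order reported in this bug (0000 first) and BootCurrent 0002, the PXE entry would move to the front; with no BootCurrent, the freshly added ubuntu entry stays first.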

On Fri, Sep 14, 2018 at 12:16 PM Rod Smith <email address hidden> wrote:
>
> Here are the install files using your modified MAAS preseed.
>
> ** Attachment added: "curtin-install.tgz"
> https://bugs.launchpad.net/curtin/+bug/1789650/+attachment/5188885/+files/curtin-install.tgz

Ryan Harper (raharper) wrote :

Related:
   https://bugzilla.redhat.com/show_bug.cgi?id=1031876


Rod Smith (rodsmith) wrote :

It can't be a simple matter of BootCurrent not existing when PXE-booted, since after adjusting BootOrder manually and rebooting, it is present:

ubuntu@oil-jolteon:~$ sudo efibootmgr -v
BootCurrent: 0002
BootOrder: 0002,0000,0001,0003,0004,0005,0006,0007,0008
Boot0000* ubuntu HD(1,GPT,2f2ac784-ce90-471b-b036-e2776ee5bdd3,0x800,0x100000)/File(\EFI\ubuntu\shimx64.efi)
Boot0001* Hard Disk 0 FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0002* PXE Network FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0003 Enter Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)
Boot0004 Boot Devices FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS....h
Boot0005 Boot Manager FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0006 Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0007 Diagnostics FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0008 Firmware Log FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....

Might GRUB on the installation image be passing a parameter that's different from what the installed image passes? Or maybe there's a subtle timing issue that's triggering a race condition in the firmware? This is just wild speculation on my part, of course.
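
For comparing runs like the ones in this report, a small parser for efibootmgr output can make the presence or absence of BootCurrent explicit. This is a sketch with a hypothetical function name; it assumes the plain (non -v) output layout shown in the bug description, and with -v the device path simply ends up in the name field.

```python
import re

# Matches lines such as "Boot0000* ubuntu" or "Boot0003 Enter Setup".
ENTRY_RE = re.compile(r"^Boot([0-9A-Fa-f]{4})(\*?)\s+(.*)$")


def parse_efibootmgr(text: str) -> dict:
    """Parse efibootmgr output into current entry, order, and entries."""
    parsed = {"current": None, "order": [], "entries": {}}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("BootCurrent:"):
            parsed["current"] = line.split(":", 1)[1].strip()
        elif line.startswith("BootOrder:"):
            value = line.split(":", 1)[1].strip()
            parsed["order"] = [num for num in value.split(",") if num]
        else:
            match = ENTRY_RE.match(line)
            if match:
                number, star, name = match.groups()
                # A "*" after the number marks the entry as active.
                parsed["entries"][number] = {"active": star == "*", "name": name}
    return parsed
```

A result whose "current" key is None corresponds to the case where efibootmgr prints no BootCurrent line at all.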

Ryan Harper (raharper) wrote :

If you can ssh in during deployment, you can see if BootCurrent is available.

I don’t know what happens in the firmware when we write UEFI settings during
grub install either. Can we try with different kernels? Like Xenial GA?

Rod Smith (rodsmith) wrote :

During deployment:

$ sudo efibootmgr -v
sudo: efibootmgr: command not found
$ ls /sys/class/firmware/
timeout

After installing efibootmgr:

$ sudo efibootmgr -v
BootCurrent: 0002
BootOrder: 0002,0000,0001,0003,0004,0005,0006,0007,0008
Boot0001* Hard Disk 0 FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0002* PXE Network FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0003 Enter Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)
Boot0004 Boot Devices FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS....h
Boot0005 Boot Manager FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0006 Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0007 Diagnostics FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0008 Firmware Log FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....

So it seems that efibootmgr is able to extract BootCurrent, but I don't know what's going on with the /sys/class/firmware directory.

I'll try some deployments with other kernels next....

Rod Smith (rodsmith) wrote :

That was weird. After the previous attempt, the system looked OK when it was fully deployed -- BootOrder was set correctly. I therefore tried replicating the login while deploying, and this time the /sys/class/firmware directory looked more normal, but both it and efibootmgr (once installed) showed no BootCurrent variable:

$ sudo efibootmgr -v
BootOrder: 0000,0002,0001,0003,0004,0005,0006,0007,0008
Boot0000* ubuntu HD(1,GPT,3867d6c3-0241-43b0-a31e-a519c8271305,0x800,0x100000)/File(\EFI\ubuntu\shimx64.efi)
Boot0001* Hard Disk 0 FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0002* PXE Network FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0003 Enter Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)
Boot0004 Boot Devices FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS....h
Boot0005 Boot Manager FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0006 Setup FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
Boot0007 Diagnostics FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(1d76b9fa-87f2-4ecf-ab6a-985c4a958d6a)IbmS.....
Boot0008 Firmware Log FvVol(cdbb7b35-6833-4ed6-9ab2-57d2acddf6f0)/FvFile(3cc3fdbd-9658-47c1-b672-6263c1c7e403)IbmS.....
ubuntu@oil-jolteon:~$ ls /sys/firmware/efi/efivars/BootC*
ls: cannot access '/sys/firmware/efi/efivars/BootC*': Invalid argument

So there's some inconsistency. Maybe one in X boots works OK...?

Rod Smith (rodsmith) wrote :

My tests of different kernels and Ubuntu versions yielded no clues; everything back to 14.04 with its GA kernel appears to be affected; however...

I think I've figured out why the BootCurrent variable is not appearing during deployments:

If these Lenovos boot cold, BootCurrent is missing; if they warm boot, BootCurrent is present. This is reproducible post-installation by shutting down and powering up the server vs. rebooting the server. BootCurrent can then be checked using "sudo efibootmgr" or checking /sys/firmware/efi/efivars. A MAAS deployment, of course, involves a cold boot, so BootCurrent is missing during deployment; but after deployment, the system does a warm reboot into the deployed OS, so BootCurrent appears again.
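
The cold-boot versus warm-boot check can also be scripted without efibootmgr by reading the variable directly from efivarfs. The sketch below uses hypothetical helper names and assumes efivarfs is mounted; it relies on the efivarfs file layout (a 4-byte attribute word followed by the variable data) and on BootCurrent being a 16-bit little-endian value under the EFI global variable GUID.

```python
import struct
from pathlib import Path
from typing import Optional

EFI_GLOBAL_GUID = "8be4df61-93ca-11d2-aa0d-00e098032b8c"
BOOTCURRENT = Path("/sys/firmware/efi/efivars") / f"BootCurrent-{EFI_GLOBAL_GUID}"


def decode_bootcurrent(raw: bytes) -> str:
    """Decode an efivarfs BootCurrent file into a string like '0002'."""
    if len(raw) < 6:
        raise ValueError("BootCurrent data too short")
    # Skip the 4-byte attributes word, then read a little-endian uint16.
    (number,) = struct.unpack_from("<H", raw, 4)
    return f"{number:04X}"


def read_bootcurrent(path=BOOTCURRENT) -> Optional[str]:
    """Return the BootCurrent entry number, or None if it is unavailable."""
    try:
        return decode_bootcurrent(Path(path).read_bytes())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print("BootCurrent:", read_bootcurrent())
```

On an affected server this would be expected to print None after a cold boot and an entry number after a warm reboot.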

Once the system is deployed, booting cold and then resetting the server via the BMC a few seconds after starting the server can cause the BootCurrent variable to appear. (I used the web UI and "Power Actions->Restart the Server Immediately" from the main screen.) This causes a reset in POST, which seems to be enough to get the BootCurrent variable to appear. This process, however, is NOT enough to cause a successful deployment with the correct boot order set. Entering the firmware setup utility via F1 after beginning a deployment and manually resetting the server via its setup utility, however, is effective; when this is done, the server deploys and sets the boot order correctly.

One more comment: It's possible, but not certain, that the Kontron MSP804x - MSP804x server is affected by this same bug, since it's failed the certification test that looks for a network boot. The full certification run on this server can be found here:

https://certification.canonical.com/hardware/201809-26486/submission/133138/

The failed test is here:

https://certification.canonical.com/hardware/201809-26486/submission/133138/test/67646/result/9957736/

Overall, this looks like either a firmware bug or a kernel bug (maybe a race condition in building the EFI variables list...?). The fact that a reset via the BMC causes the BootCurrent variable to appear post-deployment but (presumably) not during deployment is peculiar, though.

Is there a way to force the Linux kernel to rebuild its EFI variables list? If so, that might be worth trying as a workaround. If it's a kernel bug, then obviously fixing it is the best solution. If it's a firmware bug, then getting it fixed would also be the best solution, but that's likely to take a while, and the fix might never make it to some affected servers.

Rod Smith (rodsmith) on 2019-03-22
Changed in curtin:
status: Incomplete → Confirmed