merlin boards set to boot from disk after MAAS deploy

Bug #1799835 reported by dann frazier
Affects: curtin
Status: Expired
Importance: Low
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

curtin 18.1-632-gd879ca0-0ubuntu1~ubuntu16.04.1

We have some AMI X-Gene 2 "merlin" boards in CI using MAAS, which use UEFI (Tianocore) firmware. These systems used to work with MAAS but were out of commission (no pun intended) for a while due to a kernel bug. When that was resolved and we brought them online, we found that MAAS was no longer working reliably. Turns out that after an initial deployment, these boards were no longer PXE booting, but instead booting from the previous "ubuntu" boot entry.

I believe the root cause is the following: when these systems have both EFI boot entries for PXE and ubuntu, firmware sets "BootCurrent" to the "ubuntu" entry, even if we really booted from PXE[*]. What I gleaned from LP: #1789650 is that curtin will rejuggle the boot entries so that the "BootCurrent" entry is first, followed by "ubuntu". Due to this seemingly clear firmware bug, that will cause the PXE entry to get buried. I'm assuming that when these systems worked before, curtin was still calling grub-install w/ --no-nvram, so no ubuntu entry was created.

I don't believe there will ever be a firmware fix for these systems, so we'd probably need some workaround in curtin to proceed. One thought that comes to mind is to revert to the old --no-nvram parameter if we found that BootCurrent points to an on-disk entry. We'll lose the ability to boot when the MAAS server is down, but that seems like a fair trade-off.

[*] I'm not sure if firmware is incorrectly setting BootCurrent - it could just not be setting it at all. In my testing, ubuntu is always entry "0000" and BootCurrent is always "0000".
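
To make the failure mode concrete, here is a minimal sketch of the reordering described above. It is not curtin's actual code: the function name `move_current_first`, the entry table, and the assumption that grub-install leaves "ubuntu" first in BootOrder are all illustrative. It only shows why a wrong BootCurrent buries the PXE entry.

```python
def move_current_first(boot_order, boot_current):
    """Return a new BootOrder with boot_current first and the rest unchanged.

    Hypothetical helper; it models the behavior described in the report,
    not curtin's real implementation.
    """
    rest = [num for num in boot_order if num != boot_current]
    return [boot_current] + rest


# Illustrative entry table mirroring the report: 0000 is "ubuntu", 0001 is PXE.
entries = {"0000": "ubuntu", "0001": "PXE network boot"}

# Assumption: after grub-install the new "ubuntu" entry sits first in BootOrder.
order_after_install = ["0000", "0001"]

# Firmware that correctly reports the PXE entry as BootCurrent.
correct = move_current_first(order_after_install, "0001")

# Firmware that always reports 0000, as observed on the merlin boards.
buggy = move_current_first(order_after_install, "0000")

print([entries[num] for num in correct])
print([entries[num] for num in buggy])
```

With a truthful BootCurrent the PXE entry ends up first; with BootCurrent stuck at "0000" the on-disk "ubuntu" entry stays first, which matches the symptom reported here.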

dann frazier (dannf) wrote :

dann frazier (dannf) wrote :

Is there a way to revert to the old behavior (update_nvram=false) using a preseed? If so, what would that look like?

Ryan Harper (raharper) wrote :

Can you try with:

grub:
    update_nvram: False


dann frazier (dannf) wrote :

How is that passed in? I tried:

ubuntu@maas:/etc/maas/preseeds$ cat curtin_userdata_ubuntu_arm64_generic_xenial_rnssh7
grub:
    update_nvram: False

But that didn't seem to make a difference.

Ryan Harper (raharper) wrote :

Do you have a boot log from that? It would be nice to see whether that
config got passed but didn't get you what you wanted, or whether it
didn't get passed through at all.

dann frazier (dannf) wrote :

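Attached (preseed-console.log, https://bugs.launchpad.net/bugs/1799835/+attachment/5205616/+files/preseed-console.log). It does appear to have gotten passed in: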
 'debconf_selections': {'grub2': 'grub2 grub2/update_nvram boolean false'

But I still ended up w/ an 'ubuntu' entry.

Ryan Harper (raharper) wrote :

No, I don't think it did:

[   17.576356] cloud-init[1244]: 2018-10-25 19:23:48,396 - __init__.py[WARNING]: Unhandled non-multipart (text/x-not-multipart) userdata: 'b'grub:'...'

The Merged Config doesn't have it either, and the debconf set_selection
comes from MAAS.

dann frazier (dannf) wrote :

Gotcha. Any idea what I did wrong?

Rod Smith (rodsmith) wrote :

I'm far from 100% positive, but it's conceivable this is a duplicate of bug #1789650. The cause there is cold vs. warm booting -- the BootCurrent variable is missing when booting cold, but appears when booting warm. This results in the installer setting the BootOrder variable to the on-disk "ubuntu" entry first during an installation (which begins with a cold boot); but after the post-install warm reboot, the BootCurrent variable reappears.

So, I recommend Dann check to see if BootCurrent is present after cold vs. warm boots of the hardware.
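
One way to run that check is to read the variable straight from efivarfs after each kind of boot. The sketch below assumes the standard Linux efivarfs layout, where each file holds a 4-byte attribute field followed by the variable data, and where BootCurrent lives under the EFI global variable GUID. The function names are illustrative, not part of curtin or MAAS.

```python
import struct
from pathlib import Path

# EFI global variable GUID, as used in efivarfs file names.
EFI_GLOBAL_GUID = "8be4df61-93ca-11d2-aa0d-00e098032b8c"
BOOT_CURRENT = Path("/sys/firmware/efi/efivars") / f"BootCurrent-{EFI_GLOBAL_GUID}"


def parse_boot_current(raw):
    """Decode an efivarfs BootCurrent blob into an entry number such as '0000'.

    The first 4 bytes are the variable attributes; the value itself is a
    little-endian 16-bit integer.
    """
    if len(raw) < 6:
        raise ValueError("expected 4 attribute bytes plus a 2-byte value")
    (value,) = struct.unpack_from("<H", raw, 4)
    return f"{value:04X}"


def read_boot_current(path=BOOT_CURRENT):
    """Return the BootCurrent entry number, or None if the variable is absent."""
    try:
        return parse_boot_current(Path(path).read_bytes())
    except FileNotFoundError:
        return None
```

Calling `read_boot_current()` once after a cold boot and once after a warm reboot shows whether the variable is missing in one case (the bug #1789650 pattern) or present but pointing at the wrong entry.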

dann frazier (dannf) wrote :

From my reading, the symptoms are similar to bug 1789650, but there's a distinct difference: in this case the BootCurrent variable *is* present; it's just lying and always saying it booted from "0000".

Rod Smith (rodsmith) wrote :

Are you certain it's lying, though? Unless you watch the boot process on a console, it can be hard to tell how it ACTUALLY booted. It took me a long time to track down the cause in the case of bug #1789650, but it's easy enough to test -- just check for the presence of BootCurrent when cold booting vs. warm booting the machine. That said, a bug in which the machine improperly sets BootCurrent is also certainly a plausible EFI bug.

dann frazier (dannf) wrote :

Yep, see the console log in comment #1.

dann frazier (dannf) wrote :

@raharper - any idea what I did wrong re: comment #7 ?

Ryan Harper (raharper) wrote :

Dann,

I don't; the preseed yaml editing magic is out of my area of expertise. Pinging someone in #maas would likely help here.

Andres Rodriguez (andreserl) wrote :

For clarification, MAAS already sends the following:

debconf_selections:
  grub2: grub2 grub2/update_nvram boolean false

This, however, applies to the *installed* system: it tells debconf that if the grub2 Debian package is upgraded at any later date, it must not update the NVRAM. In other words, it only affects post-installation upgrades.

That said, passing the following:

grub:
   update_nvram: False

is different from the above because this is done during the *installation* process.

dann frazier (dannf) wrote :

fyi, I managed to get this to work after chatting w/ Andres. I ended up appending the following to /etc/maas/preseeds/curtin_userdata:

{{if 'lp1799835-workaround' in node.tag_names()}}
# https://bugs.launchpad.net/bugs/1799835
grub:
  update_nvram: False
{{endif}}

And then tagging the impacted system lp1799835-workaround.

Now the question is: can curtin do this automatically, say, if it determines that BootCurrent is not a network device?

Ryan Harper (raharper) wrote :

Hi Dann,

Sorry for not getting back sooner. Curtin doesn't look much at the efi menu output w.r.t understanding the entries; it merely re-orders entries.

I hesitate to bake that information into curtin itself as MAAS and the end-user are more in control over what entries are present.

Taking curtin out of the picture, how would someone know that a boot entry in efibootmgr output is emphatically *network* or not?

Changed in curtin:
importance: Undecided → Low
status: New → Incomplete

Rod Smith (rodsmith) wrote :

Concerning how to determine if a boot entry is or is not a network entry, I know of no way to tell that's guaranteed to be 100% reliable; however, there are certain strings that tend to occur in network boot entries' descriptions but not in the descriptions of non-network entries. These strings are "Network," "PXE," "NIC," "Ethernet," "IP4," and "IP6." For server certification testing, we have a test script, efi-pxeboot, that searches for these strings as indications that the system PXE-booted. (The script also searches for strings that indicate disk boots, like "ubuntu" and "Hard Drive.")

Of course, there's no guarantee that somebody won't create a disk-based boot entry with "Ethernet" in the string; or a manufacturer might set the EFI to PXE-boot using a description that's not explicitly coded in the test (HTTP and future boot methods might do this, for instance), so the efi-pxeboot script isn't guaranteed to be 100% accurate. In practice, though, it has seemed to be quite reliable in use.

Whether such a test would be reliable enough for something more critical than a certification test is not for me to decide; I simply thought I'd point out the existing code. If anybody cares to check it out, it's part of the plainbox-provider-checkbox project:

https://code.launchpad.net/plainbox-provider-checkbox

Specifically:

https://git.launchpad.net/plainbox-provider-checkbox/tree/bin/efi-pxeboot
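
As a rough illustration of that heuristic (not the efi-pxeboot script itself), the sketch below classifies a boot entry description using only the strings listed above and then applies the rule proposed earlier in this bug: skip the NVRAM update when BootCurrent points at a disk entry. The function names, the case-sensitive substring matching, and the sample descriptions are assumptions made for this example.

```python
# Strings taken from the comment above; matching is case-sensitive by assumption.
NETWORK_HINTS = ("Network", "PXE", "NIC", "Ethernet", "IP4", "IP6")
DISK_HINTS = ("ubuntu", "Hard Drive")


def classify_entry(description):
    """Guess whether a boot entry description is 'network', 'disk', or 'unknown'."""
    if any(hint in description for hint in NETWORK_HINTS):
        return "network"
    if any(hint in description for hint in DISK_HINTS):
        return "disk"
    return "unknown"


def should_skip_nvram_update(boot_current, entries):
    """Return True when BootCurrent names an entry that looks like a disk boot.

    `entries` maps entry numbers (e.g. '0000') to descriptions. An absent or
    unrecognized entry returns False so the default behavior is kept.
    """
    description = entries.get(boot_current)
    if description is None:
        return False
    return classify_entry(description) == "disk"


# Hypothetical entry table resembling the merlin boards in this report.
sample = {"0000": "ubuntu", "0001": "UEFI PXE IP4 Ethernet"}
print(should_skip_nvram_update("0000", sample))
print(should_skip_nvram_update("0001", sample))
```

Because the match is purely textual, it inherits the caveats above: a disk entry described with "Ethernet" is misread as a network entry, and an HTTP boot entry with an unfamiliar description falls through to "unknown".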

Launchpad Janitor (janitor) wrote :

[Expired for curtin because there has been no activity for 60 days.]

Changed in curtin:
status: Incomplete → Expired