merlin boards set to boot from disk after MAAS deploy

Bug #1799835 reported by dann frazier on 2018-10-24
This bug affects 1 person
Affects: curtin
Importance: Undecided
Assigned to: Unassigned

Bug Description

curtin 18.1-632-gd879ca0-0ubuntu1~ubuntu16.04.1

We have some AMI X-Gene 2 "merlin" boards in CI using MAAS, which use UEFI (Tianocore) firmware. These systems used to work with MAAS but were out of commission (no pun intended) for a while due to a kernel bug. When that was resolved and we brought them online, we found that MAAS was no longer working reliably. Turns out that after an initial deployment, these boards were no longer PXE booting, but instead booting from the previous "ubuntu" boot entry.

I believe the root cause is the following: when these systems have both EFI boot entries for PXE and ubuntu, firmware sets "BootCurrent" to the "ubuntu" entry, even if we really booted from PXE[*]. What I gleaned from LP: #1789650 is that curtin will rejuggle the boot entries so that the "BootCurrent" entry is first, followed by "ubuntu". Due to this seemingly clear firmware bug, that will cause the PXE entry to get buried. I'm assuming that when these systems worked before, curtin was still calling grub-install w/ --no-nvram, so no ubuntu entry was created.
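To make the suspected failure mode concrete, here is a minimal sketch of the reordering as I understand it from the description above. This is not curtin's actual code; the function name `reorder` and the entry numbers are made up for illustration, and the rule (BootCurrent first, then the new "ubuntu" entry, then everything else) is an assumption taken from the paragraph above.

```python
def reorder(boot_order, boot_current, new_entry):
    """Hypothetical model: BootCurrent first, then the new entry, then the rest."""
    head = []
    for num in (boot_current, new_entry):
        if num is not None and num not in head:
            head.append(num)
    return head + [n for n in boot_order if n not in head]


# Well-behaved firmware: the machine PXE booted, so BootCurrent is the PXE
# entry (0001 here) and the network entry stays first.
healthy = reorder(["0001", "0000"], boot_current="0001", new_entry="0000")

# Merlin-like firmware: BootCurrent claims 0000 ("ubuntu") even after a PXE
# boot, so the PXE entry ends up behind the on-disk entry.
merlin = reorder(["0001", "0000"], boot_current="0000", new_entry="0000")

print(healthy)
print(merlin)
```

With the misreported BootCurrent, the on-disk entry ends up ahead of PXE, which matches the symptom described here.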

I don't believe there will ever be a firmware fix for these systems, so we'd probably need some workaround in curtin to proceed. One thought that comes to mind is to revert to the old --no-nvram parameter if we find that BootCurrent points to an on-disk entry. We'll lose the ability to boot when the MAAS server is down, but that seems like a fair trade-off.

[*] I'm not sure if firmware is incorrectly setting BootCurrent - it could just not be setting it at all. In my testing, ubuntu is always entry "0000" and BootCurrent is always "0000".
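The workaround idea above (fall back to --no-nvram when BootCurrent points to an on-disk entry) could be prototyped roughly as follows. This is only a sketch under stated assumptions: the sample text is a simplified, made-up imitation of `efibootmgr -v` output, the helper names are hypothetical, and treating "MAC(", "IPv4(" or "IPv6(" in the device path as "this is a network entry" is a heuristic, not something curtin is known to do.

```python
import re

# Simplified, made-up sample in the style of `efibootmgr -v` output.
SAMPLE = """\
BootCurrent: 0000
BootOrder: 0000,0001
Boot0000* ubuntu  HD(1,GPT,01234567-89ab-cdef-0123-456789abcdef)/File(EFI/ubuntu/grubaa64.efi)
Boot0001* PXE  MAC(001122334455,1)/IPv4(0.0.0.0,0,0)
"""

ENTRY_RE = re.compile(r"^Boot([0-9A-Fa-f]{4})\*?\s+(.*)$")
NETWORK_MARKERS = ("MAC(", "IPv4(", "IPv6(")  # heuristic, an assumption


def parse(text):
    """Return (BootCurrent or None, {entry number: description})."""
    current = None
    entries = {}
    for line in text.splitlines():
        if line.startswith("BootCurrent:"):
            current = line.split(":", 1)[1].strip()
            continue
        match = ENTRY_RE.match(line)
        if match:
            entries[match.group(1)] = match.group(2)
    return current, entries


def should_update_nvram(text):
    """False when BootCurrent points at something that is not a network entry."""
    current, entries = parse(text)
    if current is None or current not in entries:
        # Nothing to go on; keep the default behaviour (assumption).
        return True
    return any(marker in entries[current] for marker in NETWORK_MARKERS)


print(should_update_nvram(SAMPLE))
```

On a merlin board this would skip the NVRAM update during a MAAS deploy, at the cost mentioned above: no on-disk boot entry to fall back to when the MAAS server is down.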

dann frazier (dannf) wrote :

Is there a way to revert to the old behavior (update_nvram=false) using a preseed? If so, what would that look like?

Ryan Harper (raharper) wrote :

Can you try with:

grub:
    update_nvram: False


dann frazier (dannf) wrote :

On Thu, Oct 25, 2018 at 1:05 PM Ryan Harper <email address hidden> wrote:
>
> Can you try with:
>
> grub:
>     update_nvram: False

How is that passed in? I tried:

ubuntu@maas:/etc/maas/preseeds$ cat curtin_userdata_ubuntu_arm64_generic_xenial_rnssh7
grub:
    update_nvram: False

But that didn't seem to make a difference.

Ryan Harper (raharper) wrote :

Do you have a boot log from that? It would be nice to see whether that
config got passed but didn't do what you wanted, or whether it didn't get
passed through at all.

dann frazier (dannf) wrote :

On Thu, Oct 25, 2018 at 2:00 PM Ryan Harper <email address hidden> wrote:
>
> Do you have a boot log from that? it would be nice to see if that
> config got passed but didn't get you what you wanted vs. it didn't get
> passed through?

Attached. It does appear to have gotten passed in:
 'debconf_selections': {'grub2': 'grub2 grub2/update_nvram boolean false'

But I still ended up w/ an 'ubuntu' entry.

Ryan Harper (raharper) wrote :

No, I don't think it did:

[   17.576356] cloud-init[1244]: 2018-10-25 19:23:48,396 - __init__.py[WARNING]: Unhandled non-multipart (text/x-not-multipart) userdata: 'b'grub:'...'

The Merged Config doesn't have it either, and the debconf set_selection
comes from MAAS.
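For anyone repeating this check on their own console log, a small sketch of the first half of it: look for the cloud-init warning that shows the preseed text was handed over as raw user data instead of being merged into the curtin config. The function name is hypothetical, and the sample line is the one from the log in this report.

```python
WARNING_MARKER = "Unhandled non-multipart"

SAMPLE_LOG = (
    "[   17.576356] cloud-init[1244]: 2018-10-25 19:23:48,396 - "
    "__init__.py[WARNING]: Unhandled non-multipart (text/x-not-multipart) "
    "userdata: 'b'grub:'...'\n"
)


def unhandled_userdata_lines(log_text):
    """Return the log lines in which cloud-init rejected the user data."""
    return [line.strip() for line in log_text.splitlines() if WARNING_MARKER in line]


for line in unhandled_userdata_lines(SAMPLE_LOG):
    print(line)
```

If this finds anything, the `grub:` stanza never reached curtin, whatever the debconf selections in the log say.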
On Thu, Oct 25, 2018 at 2:41 PM dann frazier <email address hidden> wrote:
> ** Attachment added: "preseed-console.log"
> https://bugs.launchpad.net/bugs/1799835/+attachment/5205616/+files/preseed-console.log

dann frazier (dannf) wrote :

On Thu, Oct 25, 2018 at 2:50 PM Ryan Harper <email address hidden> wrote:
>
> No, I don't think it did:
>
> [   17.576356] cloud-init[1244]: 2018-10-25 19:23:48,396 - __init__.py[WARNING]: Unhandled non-multipart (text/x-not-multipart) userdata: 'b'grub:'...'
>
> and the Merged Config doesn't have it and the debconf set_selection
> comes from maas.

Gotcha. Any idea what I did wrong?

Rod Smith (rodsmith) wrote :

I'm far from 100% positive, but it's conceivable this is a duplicate of bug #1789650. The cause there is cold vs. warm booting -- the BootCurrent variable is missing when booting cold, but appears when booting warm. This results in the installer setting the BootOrder variable to the on-disk "ubuntu" entry first during an installation (which begins with a cold boot); but after the post-install warm reboot, the BootCurrent variable reappears.

So, I recommend Dann check to see if BootCurrent is present after cold vs. warm boots of the hardware.
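A quick way to run that check from a booted system, sketched in Python. It assumes efivarfs is mounted at /sys/firmware/efi/efivars and relies on two things from the UEFI/efivarfs layout: the EFI global variable GUID, and the fact that each efivarfs file starts with a 4-byte attributes field followed by the variable data (a little-endian 16-bit value for BootCurrent). The function name is made up.

```python
import struct
from pathlib import Path

EFI_GLOBAL_GUID = "8be4df61-93ca-11d2-aa0d-00e098032b8c"


def read_boot_current(efivars_dir="/sys/firmware/efi/efivars"):
    """Return BootCurrent as a 4-digit hex string, or None if it is absent."""
    path = Path(efivars_dir) / f"BootCurrent-{EFI_GLOBAL_GUID}"
    try:
        raw = path.read_bytes()
    except FileNotFoundError:
        return None
    if len(raw) < 6:
        # Need 4 bytes of attributes plus a 2-byte value.
        return None
    (value,) = struct.unpack_from("<H", raw, 4)
    return f"{value:04X}"


# Usage: call read_boot_current() once after a cold boot and once after a
# warm reboot, and compare the two results.
```

A result of None after a cold boot but a value after a warm reboot would point at bug #1789650; a value that is present but never changes would point at the behaviour described in this report.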

dann frazier (dannf) wrote :

From my reading the symptoms are similar to bug 1789650, but there's a distinct difference. In this case the BootCurrent variable *is* present, it's just lying and always saying it booted from "0000".

Rod Smith (rodsmith) wrote :

Are you certain it's lying, though? Unless you watch the boot process on a console, it can be hard to tell how it ACTUALLY booted. It took me a long time to track down the cause in the case of bug #1789650, but it's easy enough to test -- just check for the presence of BootCurrent when cold booting vs. warm booting the machine. That said, a bug in which the machine improperly sets BootCurrent is also certainly a plausible EFI bug.

dann frazier (dannf) wrote :

On Fri, Oct 26, 2018 at 11:05 AM Rod Smith <email address hidden> wrote:
>
> Are you certain it's lying, though? Unless you watch the boot process on
> a console, it can be hard to tell how it ACTUALLY booted.

Yep, see the console log in comment #1.

dann frazier (dannf) wrote :

@raharper - any idea what I did wrong re: comment #7 ?

Ryan Harper (raharper) wrote :

Dann,

I don't; the preseed yaml editing magic is out of my area of expertise. Pinging someone in #maas would likely help here.

Andres Rodriguez (andreserl) wrote :

For clarification, MAAS already sends the following:

debconf_selections:
  grub2: grub2 grub2/update_nvram boolean false

This, however, applies to the *installed* system: it tells debconf that if the grub2 Debian package is upgraded at any later date, it should not update the NVRAM. In other words, it only affects post-installation upgrades.

That said, passing the following:

grub:
   update_nvram: False

is different from the above because this is done during the *installation* process.
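To illustrate the distinction, a rough sketch with the two configs as plain Python dicts. The lookup function is hypothetical and is not how curtin reads its config; the default of True is an assumption based on the behaviour described in this report (the NVRAM gets updated unless told otherwise).

```python
# What MAAS already sends: only affects later grub2 package upgrades.
maas_default = {
    "debconf_selections": {"grub2": "grub2 grub2/update_nvram boolean false"},
}

# Same config plus the install-time setting discussed in this thread.
with_override = {**maas_default, "grub": {"update_nvram": False}}


def install_time_update_nvram(cfg, default=True):
    """Hypothetical lookup: only the grub section matters at install time."""
    return bool(cfg.get("grub", {}).get("update_nvram", default))


print(install_time_update_nvram(maas_default))
print(install_time_update_nvram(with_override))
```

The debconf selection alone leaves the install-time behaviour unchanged, which is why an "ubuntu" entry still showed up in the earlier test.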

dann frazier (dannf) wrote :

fyi, I managed to get this to work after chatting w/ Andres. I ended up appending the following to /etc/maas/preseeds/curtin_userdata:

{{if 'lp1799835-workaround' in node.tag_names()}}
# https://bugs.launchpad.net/bugs/1799835
grub:
  update_nvram: False
{{endif}}

And then tagging the impacted system lp1799835-workaround.
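For readers not familiar with the preseed template syntax, the conditional above amounts to roughly the following, restated in plain Python. This is not MAAS's template engine, and the function name is made up; only the tag name, the comment line and the grub stanza come from the snippet above.

```python
WORKAROUND_TAG = "lp1799835-workaround"
GRUB_STANZA = "grub:\n  update_nvram: False\n"


def render_extra_config(tag_names):
    """Return the extra curtin config for a node, or '' if it is not tagged."""
    if WORKAROUND_TAG in tag_names:
        return "# https://bugs.launchpad.net/bugs/1799835\n" + GRUB_STANZA
    return ""


print(render_extra_config(["arm64", "lp1799835-workaround"]))
print(repr(render_extra_config(["arm64"])))
```

Only tagged nodes get the install-time override; everything else keeps the stock behaviour.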

Now, the question is: can curtin do this automatically, say, if it determines that BootCurrent does not point to a network device?
