Failure to PXE-boot from secondary NIC

Bug #1437024 reported by Rod Smith on 2015-03-26
This bug affects 5 people
Affects          Importance  Assigned to
MAAS             Wishlist    Unassigned
grub2 (Ubuntu)   High        Unassigned
  Trusty         High        Mathieu Trudel-Lapierre
  Xenial         High        Mathieu Trudel-Lapierre

Bug Description

On a Lenovo x3550 M5 with the standard 4-port 1Gbps NICs and a secondary (plug-in) 2-port 10Gbps NIC, an attempt to PXE-boot from the secondary NIC in UEFI mode causes a boot failure with a "grub>" prompt on the display.

This server arrived for certification with both NICs enabled. I suspect, but am not positive, that the choice of boot NIC is semi-random, although it favors the secondary NIC. The system PXE-boots fine from the built-in ports (generally the first one, eth0), and when booting from eth0 is forced via the UEFI boot menu, enlistment, commissioning, deployment, and post-deployment booting all work fine.

When the system PXE-boots from the secondary NIC, though, the normal UEFI PXE-boot messages appear on the screen, followed by the aforementioned "grub>" prompt. This obviously prevents normal operation of the server with MAAS. It appears from the logs (attached) that when using the failed NIC, the MAAS server doesn't receive follow-on requests from GRUB.

As a workaround, the secondary NICs can be configured to not support PXE-booting in the firmware setup utility; this enables normal deployment via MAAS.

MAAS version information:

$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=====================================================-===================================================-============-===============================================================================
ii maas 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS server all-in-one metapackage
ii maas-cli 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS command line API tool
ii maas-cluster-controller 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS server cluster controller
ii maas-common 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS server common files
ii maas-dhcp 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS DHCP server
ii maas-dns 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS DNS server
ii maas-proxy 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS Caching Proxy
ii maas-region-controller 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS server complete region controller
ii maas-region-controller-min 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS Server minimum region controller
ii python-django-maas 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS server Django web framework
ii python-maas-client 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS python API client
ii python-maas-provisioningserver 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS server provisioning libraries

Rod Smith (rodsmith) wrote :
tags: added: uefi
Changed in maas:
milestone: none → next
Blake Rouse (blake-rouse) wrote :

This seems to be very similar to bug 1437353, so I'm targeting it to grub2 as well.

tags: added: grubnet hwe
Changed in maas:
status: New → Triaged
importance: Undecided → High
Blake Rouse (blake-rouse) wrote :

This one might be more related to how we handle ipv6 on the tftp server. With this stack trace here:

http://pastebin.ubuntu.com/10689385/

Would it be possible to gain access to the MAAS and machine to debug?

Rod Smith (rodsmith) wrote :

Yes, we can give you access, but that may have to wait until next week. (I'm running certification tests on it now, and they'll go on until sometime mid-afternoon.) If you'll want to change the MAAS configuration, we'll probably want to move it from maaster to landmaas.

Blake Rouse (blake-rouse) wrote :

Yes, I will need to modify MAAS to debug the issue.

Next week works fine for me, as I am also stuck on doing other stuff this week.

no longer affects: grub2 (Ubuntu)
tags: added: ipv6
removed: grubnet
summary: - Failure to PXE-boot from secondary NIC
+ Unable to UEFI boot machine over IPv6
Changed in maas:
milestone: next → 1.8.0
importance: High → Critical

The user reports that he has successfully worked around the issue by using a BIOS boot mode.

Rod Smith (rodsmith) wrote :

The network on which I encountered this bug is not explicitly configured for IPv6, although ifconfig shows the network interfaces have fe80:: IPv6 addresses. Thus, if IPv6 is involved, it's a misconfiguration or IPv6 is being used by one of the tools (MAAS, TFTP, etc.) when it shouldn't be.

Rod Smith (rodsmith) wrote :

The NIC in the affected machine in Lexington has the following driver and firmware details:

$ ethtool -i eth4
driver: bnx2x
version: 1.78.17-0
firmware-version: bc 7.10.4 phy 1.34
bus-info: 0000:06:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

The following has been reported on another off-site system showing the same symptoms:

$ ethtool -i eth2
driver: bnx2x
version: 1.78.17-0
firmware-version: bc 7.8.79
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
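
For comparing the two NICs side by side, a minimal Python sketch along the following lines could help. It assumes the `ethtool -i` output has been saved as text; the function names `parse_ethtool_info` and `differing_fields` are hypothetical helpers, not part of ethtool or MAAS, and the sample strings are abbreviated from the two listings above.

```python
def parse_ethtool_info(text):
    """Turn `ethtool -i` output into a dict of field name -> value."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as the PCI
        # address "0000:06:00.0" stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def differing_fields(a, b):
    """Return the sorted field names whose values differ between two dicts."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# Abbreviated copies of the two listings quoted in this comment.
LEXINGTON = """\
driver: bnx2x
version: 1.78.17-0
firmware-version: bc 7.10.4 phy 1.34
bus-info: 0000:06:00.0
"""

OFFSITE = """\
driver: bnx2x
version: 1.78.17-0
firmware-version: bc 7.8.79
bus-info: 0000:01:00.0
"""

if __name__ == "__main__":
    first = parse_ethtool_info(LEXINGTON)
    second = parse_ethtool_info(OFFSITE)
    for field in differing_fields(first, second):
        print(field, "|", first.get(field), "|", second.get(field))
```

On these two samples only the firmware version and the bus address differ; the driver and driver version are identical.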

Changed in maas:
milestone: 1.8.0 → next
Changed in maas:
importance: Critical → Wishlist
Rod Smith (rodsmith) wrote :

The server's firmware is set up to use IPv4; IPv6 is explicitly disabled. Thus, if there's an IPv6 component to this bug, it's something in the MAAS software stack that's causing GRUB to attempt to use IPv6 inappropriately. Note that the failure occurs in GRUB; the server is able to pull down the GRUB binary from the TFTP server, but GRUB hangs at a "grub>" prompt.

Given that the failure occurs only on a secondary NIC, my suspicion is that GRUB is trying to use a network interface other than the one from which it was itself loaded. Perhaps when this fails it tries IPv6 over the same interface...?

summary: - Unable to UEFI boot machine over IPv6
+ Failure to PXE-boot from secondary NIC

Quanta reported a problem provisioning over the onboard Intel 82599 10G card. The symptom is very similar: the boot stopped at a "grub>" prompt. Please see the attached picture taken at boot.

Blake Rouse (blake-rouse) wrote :

I think this is an issue of GRUB not working with the network card; please try a newer version of GRUB to see if it works. If you install MAAS on wily, it will pull GRUB from wily. Let's see if that fixes this issue. We rely heavily on GRUB to do the correct thing in this case.

Hi Blake,

Do I have to have MAAS on wily in order to get the newer grub? It's the grub in the new 15.10 cloud images that matters, right?

Quanta confirmed that with 15.10, MAAS can provision the server over the 10G link.

Sam Lee (samlee) wrote :

I had the same symptom with the Broadcom 10G on PCI Slot 1, and resolved it by updating the NIC firmware.

Ante Karamatić (ivoks) wrote :

Facing the same issue with X540-AT2.

When I PXE-boot a VM, it boots just fine. On that same MAAS installation (tried 2.1.3 and 2.2b2), the physical NIC on the same machine where the VM booted fails to PXE-boot.

tcpdump shows enormous ARP traffic:

08:31:14.649459 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649482 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649493 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649504 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649516 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649528 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649551 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649562 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649573 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649597 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649607 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649619 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649642 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649653 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649664 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649675 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649687 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649709 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649720 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649732 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649744 2c:60:0c...
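
To put a number on a flood like this one, a short Python sketch could tally the ARP requests in a saved capture. It assumes the text format tcpdump printed above; `ARP_REQUEST`, `to_seconds`, and `summarize` are hypothetical names introduced for illustration. Lines that do not match the pattern, such as a truncated one, are skipped.

```python
import re
from collections import Counter

# Matches the tcpdump text format shown above (link-level headers enabled).
ARP_REQUEST = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d+) (?P<src>[0-9a-f:]{17}) > \S+ "
    r"ethertype ARP \(0x0806\), length \d+: "
    r"Request who-has (?P<target>[^,\s]+) tell (?P<sender>[^,\s]+),"
)


def to_seconds(ts):
    """Convert an HH:MM:SS.ffffff timestamp to seconds since midnight."""
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def summarize(lines):
    """Count ARP requests per (source MAC, sender IP, target IP).

    Returns the counts and the time span in seconds between the first
    and last matching line.
    """
    counts = Counter()
    first = last = None
    for line in lines:
        match = ARP_REQUEST.match(line.strip())
        if match is None:
            continue  # not an ARP request, or a truncated line
        counts[(match["src"], match["sender"], match["target"])] += 1
        stamp = to_seconds(match["ts"])
        if first is None:
            first = stamp
        last = stamp
    span = last - first if first is not None else 0.0
    return counts, span


# First three lines of the capture above, plus the truncated last line.
SAMPLE = """\
08:31:14.649459 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649482 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649493 2c:60:0c:f9:41:f0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.16.7.2 tell 172.16.7.101, length 46
08:31:14.649744 2c:60:0c...
"""

if __name__ == "__main__":
    counts, span = summarize(SAMPLE.splitlines())
    for (src, sender, target), n in counts.most_common():
        print(f"{n} requests from {src} ({sender}) for {target} in {span:.6f}s")
```

In the excerpt above every request comes from the same MAC and asks for the same address, and successive requests arrive within tens of microseconds of each other.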


Ante Karamatić (ivoks) wrote :

I forgot to mention - this is also Quanta.

Andres Rodriguez (andreserl) wrote :

I've talked to ivoks over IRC, and he confirmed that this is a firmware issue rather than a MAAS issue.

He also mentioned that grub can address the issue and there's a patch available for upstream.

As such, I've added 'grub2' as an affected package and am marking this bug as 'Invalid' in MAAS.

Changed in maas:
status: Triaged → Invalid
milestone: next → none
Rod Smith (rodsmith) wrote :

Is there any progress on this? I'm encountering similar problems on at least two more systems now (moltres and ostwald, which are Quanta S910 X31E and Supermicro SYS-6018R-WTR servers, respectively). I've managed to find a working firmware setup to get the Quanta booting reliably, but the Supermicro has eluded me thus far.

The Quanta has just two NICs, both of which are 1 Gbps units. To get it to boot, I need to tell it to attempt an IPv6 boot before the IPv4 boot. The IPv6 boot fails, but the system then attempts an IPv4 boot, which succeeds. If the IPv4 boot is attempted prior to an IPv6 boot, it fails with a "grub>" prompt; however, if the system is configured to attempt an IPv6 boot between IPv4 attempts on two NICs, and if the user types "exit" at the "grub>" prompt, the system will attempt the IPv6 boot, fail, and then try IPv4 on the second NIC, which succeeds.

I'm attaching a tarball with an excerpt from the MAAS rackd.log file, a video capture from the machine's remote KVM, and a Wireshark capture file illustrating this sequence. From this output, it appears as if GRUB is not requesting its configuration file.

Rod Smith (rodsmith) wrote :

Another update on this:

I updated jolteon (the Lenovo x3550 M5 that was the original source of this bug report) with the latest firmware I could find (v.2.21, build TBE126Q, dated 2016-11-18) and reproduced the original bug. It appears to be unchanged, with one twist: At one point, the system was booting normally, so long as MAAS configured the system to use DHCP for ALL its Ethernet ports. After some changes to the system settings, though, it stopped working again even with those settings. I suspect that the PXE and MAAS DHCP-provided IP addresses synced up briefly, but then went out of sync again.

Also, the boot sequence on a failure looks something like this (taken from jolteon, the Lenovo x3550):

>>Start PXE over IPv4.
  Station IP address is 10.1.10.128

  Server IP address is 10.1.10.2
  NBP filename is bootx64.efi
  NBP filesize is 1169992 Bytes
 Downloading NBP file...

  Succeed to download NBP file.

 Downloading NBP file...

  Succeed to download NBP file.
Fetching Netboot Image
error: couldn't send network packet.

                             GNU GRUB version 2.02~beta2-36ubuntu3.12

   Minimal BASH-like line editing is supported. For the first word, TAB lists possible
   command completions. Anywhere else TAB lists possible device or file completions.

grub>

Note the "couldn't send network packet" message. This message isn't always visible (perhaps it flashes by too quickly to notice?). This is consistent with the Wireshark output, which shows no packets received by the MAAS server.
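
When console output has been captured as text (for example, from a serial-over-LAN log), the failure signature can be checked mechanically. The following Python sketch is only an illustration: `classify_boot_console` is a hypothetical helper, and the strings it looks for are taken verbatim from the console output above. Because the send error is not always visible, the "grub>" prompt after a successful NBP download is treated as the main indicator.

```python
def classify_boot_console(text):
    """Report which markers of the failed boot appear in captured console text."""
    lines = [line.strip() for line in text.splitlines()]
    return {
        # The firmware fetched the network boot program successfully.
        "nbp_downloaded": any(l == "Succeed to download NBP file." for l in lines),
        # GRUB reported that it could not transmit (may be missing from a capture).
        "send_error": any("couldn't send network packet" in l for l in lines),
        # GRUB fell back to its interactive prompt instead of loading a config.
        "grub_prompt": any(l.startswith("grub>") for l in lines),
    }


# Abbreviated copy of the console output quoted above.
SAMPLE = """\
>>Start PXE over IPv4.
  Station IP address is 10.1.10.128
  NBP filename is bootx64.efi
  Succeed to download NBP file.
Fetching Netboot Image
error: couldn't send network packet.
grub>
"""

if __name__ == "__main__":
    result = classify_boot_console(SAMPLE)
    if result["nbp_downloaded"] and result["grub_prompt"]:
        print("Matches this bug's symptom; send error seen:", result["send_error"])
```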

I've managed to get the Supermicro server to boot reliably. The trick for it is to configure it to PXE-boot from one of its 10 Gbps NICs rather than from its 1 Gbps NIC. The procedure for configuring the Lenovo to boot reliably is documented at https://wiki.canonical.com/CDO/HardwareCertification/LenovoxM5. This is tedious because of the Lenovo's complex firmware setup utility with many interacting settings.

You could try https://launchpad.net/~cyphermox/+archive/ubuntu/efi/+sourcepub/8195249/+listing-archive-extra; it's a build of grub2 as it is in xenial, plus the two needed patches to cover this weird firmware. I'll prepare proper uploads for this tomorrow.

Rod Smith (rodsmith) wrote :

Mathieu, how would I install this? As I understand it, I'd need to replace /var/lib/maas/boot-resources/current/grubx64.efi on the MAAS server with a GRUB built for network booting, but it's unclear to me in which of those packages a suitable file exists.

Actually, never mind that. The files would not be signed correctly to work with Secure Boot (at least, not without more fiddling).

Please use http://archive.ubuntu.com/ubuntu/dists/artful/main/uefi/grub2-amd64/2.02~beta3-4ubuntu6/grubnetx64.efi.signed instead. Using that file from artful to replace your grubx64.efi should be sufficient to test. It already includes the fix.
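
For testing, the swap could be done with a small Python sketch like the one below, which keeps a copy of the existing binary so the change can be reverted. It assumes the artful grubnetx64.efi.signed file has already been downloaded to the MAAS server; `swap_bootloader` and the ".orig" suffix are hypothetical choices, and the target path is the one mentioned earlier in this thread. Whether MAAS later overwrites the file during an image sync is not covered here.

```python
import shutil
from pathlib import Path


def swap_bootloader(new_binary, target, backup_suffix=".orig"):
    """Replace `target` with `new_binary`, keeping a one-time backup.

    Returns the backup path. An existing backup is never overwritten,
    so the original file survives repeated runs.
    """
    new_binary = Path(new_binary)
    target = Path(target)
    backup = target.with_name(target.name + backup_suffix)
    if target.exists() and not backup.exists():
        shutil.copy2(target, backup)
    shutil.copy2(new_binary, target)
    return backup


# Example call (run as root on the MAAS server; paths are assumptions
# based on this thread):
#
#   swap_bootloader(
#       "grubnetx64.efi.signed",
#       "/var/lib/maas/boot-resources/current/grubx64.efi",
#   )
```

To revert, copy the ".orig" file back over grubx64.efi.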

Rod Smith (rodsmith) wrote :

Mathieu, I've tested on all three affected servers in our possession (jolteon, ostwald, and moltres), and the grubnetx64.efi.signed binary works on all of them (delivered via the MAAS server, of course). Thanks for this fix!

I've also tested on three systems that did not have this problem (wildorange, brennan, and kzanol [the latter two are on my home network]), and they had no problems with the new binary, either.

FWIW, Secure Boot was irrelevant for this test; because of bug #1711203, I've disabled Secure Boot on all of these systems for the time being. I tried enabling Secure Boot on a couple of them, and they failed as in bug #1711203, so this version does NOT address that bug.

Jeff Lane (bladernr) wrote :

@Mathieu, is this a grub-package-only problem that can be SRU'd relatively easily, or is it something more complicated that will really require respins (and probably not be released until 16.04.4)?

Jeff Lane (bladernr) wrote :

@cyphermox, any update on this? I believe we have run into this in the field as well with a customer lab doing certification testing for 16.04.3.

Jeff Lane (bladernr) wrote :

This is a three-year-old bug that is still causing issues when booting certain systems in MAAS. My understanding is that this is fixable/fixed in GRUB 2 upstream (per Andres's comments), so what do we need to do to fix this in 14.04 and 16.04? I presume 18.04 won't be affected... is that true?

Hopefully setting this to High will grab someone's attention.

Changed in grub2 (Ubuntu):
importance: Undecided → High

This is already fixed in bionic / artful (given that it's landed there as of grub2 2.02~beta3-4ubuntu6).

What is missing is landing the fixes for Xenial and possibly Trusty?

I can prepare the SRU for Trusty right now, but the one for Xenial will be temporarily blocked by a shim update which just made it to -proposed.

Changed in grub2 (Ubuntu):
status: New → Fix Released
Changed in grub2 (Ubuntu Trusty):
importance: Undecided → High
Changed in grub2 (Ubuntu Xenial):
importance: Undecided → High
Changed in grub2 (Ubuntu Trusty):
assignee: nobody → Mathieu Trudel-Lapierre (cyphermox)
Changed in grub2 (Ubuntu Xenial):
assignee: nobody → Mathieu Trudel-Lapierre (cyphermox)
Changed in grub2 (Ubuntu Trusty):
status: New → Triaged
Changed in grub2 (Ubuntu Xenial):
status: New → Triaged
Alec Duroy (acduroy) wrote :

Hi,
I was also affected by this issue when trying to PXE-boot in UEFI mode from an add-on i350 1 Gb adapter. Upgrading the MAAS server to the 18.04 (bionic) release did not resolve the symptom. I was able to work around it by attempting the IPv6 boot before the IPv4 boot, the same workaround Rod Smith described in comment #21 on this bug (#1437024). The details are filed under Bug #1787637. Thanks
