UEFI network boot hangs at grub for adapter 82599ES 10-Gigabit SFI/SFP+
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| MAAS | | High | Unassigned | |
| maas-images | | High | Unassigned | |
| python-tx-tftp | Invalid | Undecided | Unassigned | |
| grub2 (Ubuntu) | | High | Mathieu Trudel-Lapierre | |
| grub2 (Ubuntu Trusty) | | High | Mathieu Trudel-Lapierre | |
| grub2 (Ubuntu Xenial) | | High | Mathieu Trudel-Lapierre | |
| grub2 (Ubuntu Yakkety) | | High | Mathieu Trudel-Lapierre | |
| grub2-signed (Ubuntu) | | Undecided | Unassigned | |
| grub2-signed (Ubuntu Trusty) | | Undecided | Unassigned | |
| grub2-signed (Ubuntu Xenial) | | Medium | Mathieu Trudel-Lapierre | |
| grub2-signed (Ubuntu Yakkety) | | Undecided | Unassigned | |
Bug Description
[Impact]
MAAS commissioning may fail when deploying Xenial images or using grubx64.efi from Xenial due to hardware particularities of some Intel 82599-based network cards. Network cards from other manufacturers may be affected as well. The main failure mode appears to be an infinite re-send of some packets because of an unexpected response from the network hardware.
[Test case]
[Regression potential]
As this affects networking in EFI mode, any failure to netboot using EFI should be considered a possible regression. Systems may fail to receive data from the network boot server and terminate the process with a timeout. Other possible failure scenarios are receiving incomplete data over the network, or data corruption.
----
I am using MAAS to commission and install machines. When I attempt to commission a machine with a "82599ES 10-Gigabit SFI/SFP+" network adapter the following happens:
1) TFTP Request — bootx64.efi
2) TFTP Request — /grubx64.efi
3) Console hangs at grub prompt
If I go into the BIOS and force the adapter above into legacy mode, the machine is able to network boot and run through the commission process.
1) TFTP Request — ubuntu/
2) TFTP Request — ubuntu/
3) TFTP Request — ifcpu64.c32
4) PXE Request — power off
5) TFTP Request — pxelinux.
6) TFTP Request — pxelinux.
7) TFTP Request — pxelinux.0
Also, if I disconnect the cable from the adapter above and connect a cable to the integrated "I210 Gigabit" adapter, which is configured for UEFI mode, the machine is able to network boot grubx64.efi and run through the commission process.
~$ dpkg -l '*maas*'|cat
Desired=
| Status=
|/ Err?=(none)
||/ Name Version Architecture Description
+++-===
ii maas 1.7.2+bzr3355-
ii maas-cli 1.7.2+bzr3355-
ii maas-cluster-
ii maas-common 1.7.2+bzr3355-
ii maas-dhcp 1.7.2+bzr3355-
ii maas-dns 1.7.2+bzr3355-
ii maas-proxy 1.7.2+bzr3355-
ii maas-region-
ii maas-region-
ii python-django-maas 1.7.2+bzr3355-
ii python-maas-client 1.7.2+bzr3355-
ii python-
~$
| Matt Dirba (5qxm) wrote : | #1 |
| Blake Rouse (blake-rouse) wrote : | #2 |
| Changed in maas: | |
| status: | New → Triaged |
| importance: | Undecided → High |
| milestone: | none → next |
| tags: | added: hwe |
| tags: | added: grubnet uefi |
| Blake Rouse (blake-rouse) wrote : | #3 |
This bug actually seems to be related more to how MAAS handles the TFTP request. It doesn't really seem to be a grub issue now that I look more at the logs. Going to remove grub2 from this bug, as the MAAS TFTP server needs to be fixed.
I am seeing two different errors in the log.
http://
http://
| no longer affects: | grub2 (Ubuntu) |
| tags: |
added: ipv6 removed: grubnet |
| Andres Rodriguez (andreserl) wrote : | #4 |
Matt,
Are you using IPv6 with UEFI to PXE boot?
| Matt Dirba (5qxm) wrote : | #5 |
I am using IPv4 with UEFI. To rule out a firmware issue on the motherboard, I have tested on servers from two manufacturers with the same network card: a ThinkServer previously and a Dell just now. I get stuck at the grub prompt and do not see new error messages in pserv.log. I believe those were generated when I was pulling down various files using a TFTP client to debug this problem. The issue I see is that from the grub prompt I do not appear to have network connectivity. Here is what I did on the ThinkServer yesterday.
grub> net_bootp
error: couldn't send network packet.
grub> set
?=0
cmdpath=
color_highlight
color_normal=
feature_200_final=y
feature_
feature_
feature_
feature_
feature_
feature_
feature_ntldr=y
feature_
feature_
grub_cpu=x86_64
grub_platform=efi
lang=
locale_dir=
net_default_
net_default_
net_default_
net_default_
net_efinet1_
net_efinet1_
net_efinet1_
net_efinet1_
net_efinet1_
pager=
prefix=
pxe_default_
root=tftp,
secondary_
| Matt Dirba (5qxm) wrote : | #6 |
Here is the pserv.log from today's attempt. Please note I have pulled out repetitive log messages and have applied a patch from https:/
| Matt Dirba (5qxm) wrote : | #7 |
I captured a tcpdump for three cases.
1) Len10GEFI - Lenovo with latest firmware and "82599ES 10-Gigabit SFI/SFP+" network adapter booting in EFI mode.
2) Len1GEFI - Lenovo with latest firmware and the integrated "I210 Gigabit" network adapter booting in EFI mode.
3) Dell10GEFI - Dell with "82599ES 10-Gigabit SFI/SFP+" network adapter booting in EFI mode.
In all three cases the initial UDP request is aborted and then restarted successfully.
1) Len10GEFI
21:07:17.781021 IP 10.112.96.14.1430 > 10.112.96.5.69: 41 RRQ "bootx64.efi" octet tsize 0 blksize 1468
0x0000: 4500 0045 b353 0000 4011 f261 0a70 600e E..E.S..@..a.p`.
0x0010: 0a70 6005 0596 0045 0031 098f 0001 626f .p`....E.1....bo
0x0020: 6f74 7836 342e 6566 6900 6f63 7465 7400 otx64.efi.octet.
0x0030: 7473 697a 6500 3000 626c 6b73 697a 6500 tsize.0.blksize.
0x0040: 3134 3638 00 1468.
21:07:17.793695 IP 10.112.96.5.51921 > 10.112.96.14.1430: UDP, length 29
0x0000: 4500 0039 1b37 4000 4011 4a8a 0a70 6005 E..9.7@.@.J..p`.
0x0010: 0a70 600e cad1 0596 0025 d529 0006 7473 .p`......%.)..ts
0x0020: 697a 6500 3133 3535 3733 3600 626c 6b73 ize.1355736.blks
0x0030: 697a 6500 3134 3030 00 ize.1400.
21:07:17.793749 IP 10.112.96.14.1430 > 10.112.96.5.51921: UDP, length 30
0x0000: 4500 003a b354 0000 4011 f26b 0a70 600e E..:.T..@..k.p`.
0x0010: 0a70 6005 0596 cad1 0026 e222 0005 0008 .p`......&."....
0x0020: 5573 6572 2061 626f 7274 6564 2074 6865 User.aborted.the
0x0030: 2074 7261 6e73 6665 7200 .transfer.
21:07:18.024758 IP 10.112.96.14.1431 > 10.112.96.5.69: 33 RRQ "bootx64.efi" octet blksize 1468
0x0000: 4500 003d b355 0000 4011 f267 0a70 600e E..=.U..@..g.p`.
0x0010: 0a70 6005 0597 0045 0029 7c8c 0001 626f .p`....E.)|...bo
0x0020: 6f74 7836 342e 6566 6900 6f63 7465 7400 otx64.efi.octet.
0x0030: 626c 6b73 697a 6500 3134 3638 00 blksize.1468.
21:07:18.036508 IP 10.112.96.5.40131 > 10.112.96.14.1431: UDP, length 15
0x0000: 4500 002b 1b4a 4000 4011 4a85 0a70 6005 E..+.J@.@.J..p`.
0x0010: 0a70 600e 9cc3 0597 0017 d51b 0006 626c .p`...........bl
0x0020: 6b73 697a 6500 3134 3030 00 ksize.1400.
2) Len1GEFI
21:33:34.175617 IP 10.112.98.48.1958 > 10.112.96.5.69: 41 RRQ "bootx64.efi" octet tsize 0 blksize 1468 ...
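For anyone decoding the Len10GEFI capture by hand, here is a minimal Python sketch (not part of the original report) that parses the TFTP payloads copied from the hex dump above. The names `decode_tftp` and `PAYLOADS` are hypothetical. Each payload is the UDP data of one packet, that is, everything after the 20-byte IP header and the 8-byte UDP header. The decoded sequence suggests that the client first asks for the file size with `tsize`, aborts with error code 8 once the server answers, and then re-requests the file without `tsize`.

```python
import struct

# TFTP opcodes from RFC 1350, plus OACK from RFC 2347.
OPCODES = {1: "RRQ", 2: "WRQ", 3: "DATA", 4: "ACK", 5: "ERROR", 6: "OACK"}

# UDP payloads copied from the Len10GEFI hex dump, in capture order.
PAYLOADS = [bytes.fromhex(h) for h in (
    "0001 626f 6f74 7836 342e 6566 6900 6f63 7465 7400"
    " 7473 697a 6500 3000 626c 6b73 697a 6500 3134 3638 00",
    "0006 7473 697a 6500 3133 3535 3733 3600 626c 6b73"
    " 697a 6500 3134 3030 00",
    "0005 0008 5573 6572 2061 626f 7274 6564 2074 6865"
    " 2074 7261 6e73 6665 7200",
    "0001 626f 6f74 7836 342e 6566 6900 6f63 7465 7400"
    " 626c 6b73 697a 6500 3134 3638 00",
    "0006 626c 6b73 697a 6500 3134 3030 00",
)]


def _strings(body):
    """Split a run of NUL-terminated ASCII strings into a list."""
    return [field.decode("ascii") for field in body.split(b"\x00")[:-1]]


def _options(fields):
    """Pair up alternating option names and values."""
    return dict(zip(fields[0::2], fields[1::2]))


def decode_tftp(payload):
    """Decode the request, option-ack and error packets seen in the capture."""
    (opcode,) = struct.unpack("!H", payload[:2])
    name = OPCODES.get(opcode, "UNKNOWN")
    body = payload[2:]
    if name in ("RRQ", "WRQ"):
        filename, mode, *rest = _strings(body)
        return {"op": name, "file": filename, "mode": mode,
                "options": _options(rest)}
    if name == "OACK":
        return {"op": name, "options": _options(_strings(body))}
    if name == "ERROR":
        (code,) = struct.unpack("!H", body[:2])
        return {"op": name, "code": code, "message": _strings(body[2:])[0]}
    return {"op": name}


for payload in PAYLOADS:
    print(decode_tftp(payload))
```

The same function can be pointed at the payloads of the other two captures to compare them with this one.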
| Matt Dirba (5qxm) wrote : | #8 |
I believe the ARP storm is caused by grub. Can you move this to their queue? As I stated previously, I have moved past the tftp errors found in pserv.log and I am currently attempting to build and test on a new version of grub with some of the UEFI fixes that have been added to grub.
| Matt Dirba (5qxm) wrote : | #9 |
I checked out the latest version of grub and built an image using grub-mknetdir; the ARP storm comes from the efinet driver. The code is stuck in the following loop in net/drivers/
while (1)
  {
    txbuf = NULL;
    st = efi_call_3 (net->get_status, net, 0, &txbuf);
    if (st != GRUB_EFI_SUCCESS)
      return grub_error (GRUB_ERR_IO,
    if (txbuf == dev->txbuf)
      {
        break;
      }
    if (txbuf)
      {
        st = efi_call_7 (net->transmit, net, 0, dev->last_pkt_size,
        if (st != GRUB_EFI_SUCCESS)
      }
    if (limit_time < grub_get_time_ms ())
      return grub_error (GRUB_ERR_TIMEOUT,
  }
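To illustrate why a loop of this shape can flood the wire, here is a small Python model of its control flow. It is an illustration only, not grub code: `wait_for_tx_complete`, `max_polls`, `OUR_BUF` and the two fake firmware functions are hypothetical names, and `max_polls` stands in for the `limit_time` check. The model assumes, as the [Impact] section suggests, that the firmware reports a transmit buffer other than the one that was submitted.

```python
def wait_for_tx_complete(get_status, transmit, our_txbuf, max_polls):
    """Model of the quoted loop; returns (outcome, number_of_retransmits)."""
    retransmits = 0
    for _ in range(max_polls):            # stands in for the limit_time check
        txbuf = get_status()              # buffer the firmware reports as recycled
        if txbuf == our_txbuf:
            return "sent", retransmits    # firmware confirmed our buffer: done
        if txbuf is not None:
            transmit(our_txbuf)           # unexpected buffer: send the packet again
            retransmits += 1
    return "timeout", retransmits


OUR_BUF = 0x1000  # hypothetical buffer address


def well_behaved_firmware():
    """Reports nothing for two polls, then the buffer that was submitted."""
    answers = iter([None, None, OUR_BUF])
    return lambda: next(answers)


def buggy_firmware():
    """Always reports some other non-NULL buffer address."""
    return lambda: 0x2000


sent_frames = []
good = wait_for_tx_complete(well_behaved_firmware(), sent_frames.append,
                            OUR_BUF, max_polls=100)
bad = wait_for_tx_complete(buggy_firmware(), sent_frames.append,
                           OUR_BUF, max_polls=100)
print(good)              # ('sent', 0)
print(bad)               # ('timeout', 100)
print(len(sent_frames))  # 100
```

In this model every poll against the misbehaving firmware puts one more copy of the same frame on the wire until the timeout expires, which would match a storm of identical ARP packets.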
| Matt Dirba (5qxm) wrote : | #10 |
The first get status call (efi_call_
| Matt Dirba (5qxm) wrote : | #11 |
Final conclusion. I did not have the latest firmware for the adapter. After updating the firmware I can UEFI boot the system.
| Changed in maas: | |
| status: | Triaged → Invalid |
| Changed in grub2 (Ubuntu): | |
| status: | New → Invalid |
| Changed in python-tx-tftp: | |
| status: | New → Invalid |
| meilei007 (meilei007) wrote : | #12 |
Hi Matt,
Can you supply the latest firmware version? I am hitting this bug but am not sure whether the server has the latest firmware.
Thanks,
Andy
| Matt Dirba (5qxm) wrote : | #13 |
~/APPS/
Intel(R) Ethernet Flash Firmware Utility
BootUtil version 1.5.54.1
Copyright (C) 2003-2015 Intel Corporation
Flash firmware on port 1
UEFIx64 v4.7.02
> Press ENTER key to continue, 'q' to exit:q
Port Network Address Location Series WOL Flash Firmware Version
==== =============== ======== ======= === =======
1 90E2BA5224E8 3:00.0 10GbE N/A UEFI 4.7.02
2 90E2BA5224E9 3:00.1 10GbE N/A UEFI 4.7.02
| meilei007 (meilei007) wrote : | #14 |
Thanks, Matt. The Lenovo agent has upgraded the UEFI and the issue is gone.
We're seeing this (or something very similar) on an Intel X540 10GbE Dual port Mezzanine adaptor (Intel 82599 Controller) in a Dell server.
I'll follow up with firmware revision numbers when I have them. It'd be helpful if others who've hit this provided them as far as they can.
Does this patch look like it would fix/work around it? http://
I'm having a go at applying it to the Xenial version of grub2, in particular to build grubnetx64.efi. We'd be interested in the results if anyone else is able to test this.
| Ante Karamatić (ivoks) wrote : | #16 |
@Russell I'd test it. I'm having this problem with latest packages in Xenial.
| Ante Karamatić (ivoks) wrote : | #17 |
I've replaced grubx64.efi from 16.04 with the one from 17.04 and the problem went away. This means that the problem is indeed in grub, or that grub in 16.04 lacks the workarounds needed for faulty firmware.
| Changed in grub2 (Ubuntu): | |
| status: | Invalid → Confirmed |
| importance: | Undecided → High |
| Andres Rodriguez (andreserl) wrote : | #18 |
I'm gonna add the maas-images project to this, so that when fixes get SRU'd we can just make sure new images get published.
| Changed in grub2 (Ubuntu): | |
| status: | Confirmed → Triaged |
| status: | Triaged → Confirmed |
| Changed in maas-images: | |
| status: | New → Triaged |
| importance: | Undecided → High |
We should be able to SRU this patch to the relevant releases.
| Changed in grub2 (Ubuntu): | |
| assignee: | nobody → Mathieu Trudel-Lapierre (cyphermox) |
| status: | Confirmed → In Progress |
| status: | In Progress → Fix Released |
| Changed in grub2 (Ubuntu Xenial): | |
| status: | New → In Progress |
| importance: | Undecided → High |
| Changed in grub2 (Ubuntu Trusty): | |
| importance: | Undecided → High |
| Changed in grub2 (Ubuntu Yakkety): | |
| importance: | Undecided → High |
| Changed in grub2 (Ubuntu Trusty): | |
| assignee: | nobody → Mathieu Trudel-Lapierre (cyphermox) |
| Changed in grub2 (Ubuntu Xenial): | |
| assignee: | nobody → Mathieu Trudel-Lapierre (cyphermox) |
| Changed in grub2 (Ubuntu Yakkety): | |
| assignee: | nobody → Mathieu Trudel-Lapierre (cyphermox) |
| Changed in grub2 (Ubuntu): | |
| milestone: | none → ubuntu-17.05 |
| Changed in grub2-signed (Ubuntu Xenial): | |
| assignee: | nobody → Mathieu Trudel-Lapierre (cyphermox) |
| importance: | Undecided → Medium |
| status: | New → In Progress |
| description: | updated |


Seems that grubnetx64.efi is hanging with that interface: it should next request the grub/grub.cfg file, but that never occurs. Feels like it's a grubnetx64.efi issue, so targeting that as well.