UEFI network boot hangs at grub for adapter 82599ES 10-Gigabit SFI/SFP+

Bug #1437353 reported by Matt Dirba
This bug affects 8 people
Affects                 Status        Importance  Assigned to              Milestone
MAAS                    Invalid       High        Unassigned
maas-images             Fix Released  High        Unassigned
python-tx-tftp          Invalid       Undecided   Unassigned
grub2 (Ubuntu)          Fix Released  High        Mathieu Trudel-Lapierre
  Trusty                Confirmed     High        Mathieu Trudel-Lapierre
  Xenial                Fix Released  High        Mathieu Trudel-Lapierre
  Yakkety               Won't Fix     High        Mathieu Trudel-Lapierre
grub2-signed (Ubuntu)   Confirmed     Undecided   Unassigned
  Trusty                Confirmed     Undecided   Unassigned
  Xenial                Fix Released  Medium      Mathieu Trudel-Lapierre
  Yakkety               Won't Fix     Undecided   Unassigned

Bug Description

[Impact]
MAAS commissioning may fail when deploying Xenial images or using grubx64.efi from Xenial due to hardware particularities of some Intel 82599-based network cards. Adapters from other manufacturers may be affected as well. The main failure mode appears to be an infinite re-send of some packets because of an unexpected response from the network hardware.

[Test case]
1) Attempt to netboot on a system with a "82599ES 10-Gigabit SFI/SFP+" network adapter in UEFI mode.
2) Validate that netbooting happens correctly, passing control over to the kernel as configured in grub.cfg.

3) Validate that netbooting another system, not using an Intel 82599 adapter, behaves normally when booting in UEFI mode.

4) Validate that netbooting another system, not using an Intel 82599 adapter, behaves normally when booting in LEGACY mode.

[Regression potential]
As this affects networking in EFI mode, any failure to netboot using EFI should be considered a possible regression. Systems may fail to receive data from the network boot server and terminate the process with a timeout. Other possible failure scenarios are receiving incomplete data over the network, or data corruption.

----

I am using MAAS to commission and install machines. When I attempt to commission a machine with a "82599ES 10-Gigabit SFI/SFP+" network adapter the following happens:
1) TFTP Request — bootx64.efi
2) TFTP Request — /grubx64.efi
3) Console hangs at grub prompt

If I go into the BIOS and force the adapter above into legacy mode, then the machine is able to network boot and run through the commission process.
1) TFTP Request — ubuntu/amd64/generic/trusty/release/boot-initrd
2) TFTP Request — ubuntu/amd64/generic/trusty/release/boot-kernel
3) TFTP Request — ifcpu64.c32
4) PXE Request — power off
5) TFTP Request — pxelinux.cfg/01-90-e2-ba-52-23-78
6) TFTP Request — pxelinux.cfg/71e3f102-bd8b-11e4-b634-3c18a001c80a
7) TFTP Request — pxelinux.0

Also, if I disconnect the cable from the adapter above and connect a cable to the integrated "I210 Gigabit" adapter, which is configured for UEFI mode, the machine is able to network boot grubx64.efi and run through the commission process.

~$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=====================================-==================================-============-===============================================================================
ii maas 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS server all-in-one metapackage
ii maas-cli 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS command line API tool
ii maas-cluster-controller 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS server cluster controller
ii maas-common 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS server common files
ii maas-dhcp 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS DHCP server
ii maas-dns 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS DNS server
ii maas-proxy 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS Caching Proxy
ii maas-region-controller 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS server complete region controller
ii maas-region-controller-min 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS Server minimum region controller
ii python-django-maas 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS server Django web framework
ii python-maas-client 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS python API client
ii python-maas-provisioningserver 1.7.2+bzr3355-0ubuntu1~trusty1 all MAAS server provisioning libraries
~$

Matt Dirba (5qxm) wrote :
Blake Rouse (blake-rouse) wrote :

Seems that grubnetx64.efi is hanging with that interface: it should next request the grub/grub.cfg file, but that never occurs. Feels like it's a grubnetx64.efi issue, so I'm targeting that as well.

Changed in maas:
status: New → Triaged
importance: Undecided → High
milestone: none → next
tags: added: hwe
tags: added: grubnet uefi
Blake Rouse (blake-rouse) wrote :

This bug actually seems to be related more to how MAAS handles the TFTP request. It doesn't really seem to be a grub issue now that I look more at the logs. Going to remove grub2 from this bug, as the MAAS TFTP server needs to be fixed.

I am seeing two different errors in the log.

http://paste.ubuntu.com/10689430/
http://paste.ubuntu.com/10689433/

no longer affects: grub2 (Ubuntu)
tags: added: ipv6
removed: grubnet
Andres Rodriguez (andreserl) wrote :

Matt,

Are you using IPv6 with UEFI to PXE boot?

Matt Dirba (5qxm) wrote :

I am using IPv4 with UEFI. To rule out a firmware issue on the motherboard, I have tested on servers from two manufacturers with the same network card: a ThinkServer previously and a Dell just now. I get stuck at the grub prompt and do not see new error messages in pserv.log. I believe those were generated when I was pulling down various files with a tftp client to debug this problem. The issue I see is that I do not appear to have network connectivity from the grub prompt. Here is what I did on the ThinkServer yesterday:
grub> net_bootp
error: couldn't send network packet.

grub> set
?=0
cmdpath=(tftp,10.112.96.5)
color_highlight=black/light-gray
color_normal=light-gray/black
feature_200_final=y
feature_all_video_module=y
feature_chainloader_bpb=y
feature_default_font_path=y
feature_menuentry_id=y
feature_menuentry_options=y
feature_nativedisk_cmd=y
feature_ntldr=y
feature_platform_search_hint=y
feature_timeout_style=y
grub_cpu=x86_64
grub_platform=efi
lang=
locale_dir=
net_default_interface=efinet1
net_default_ip=10.112.96.14
net_default_mac=90:e2:ba:80:59:0c
net_default_server=10.112.96.5
net_efinet1_boot_file=bootx64.efi
net_efinet1_domain=ast.arm.com
net_efinet1_hostname=1p2620v3-3
net_efinet1_ip=10.112.96.14
net_efinet1_mac=90:e2:ba:80:59:0c
pager=
prefix=(tftp,10.112.96.5)/grub
pxe_default_server=10.112.96.5
root=tftp,10.112.96.5
secondary_locale_dir=

Matt Dirba (5qxm) wrote :

Here is the pserv.log from today's attempt. Please note I have pulled out repetitive log messages and have applied a patch from https://github.com/shylent/python-tx-tftp/pull/20/files in order to attempt to handle the tsize request.

http://paste.ubuntu.com/10689786/

Matt Dirba (5qxm) wrote :

I captured a tcpdump for three cases.
1) Len10GEFI - Lenovo with latest firmware and "82599ES 10-Gigabit SFI/SFP+" network adapter booting in EFI mode.
2) Len1GEFI - Lenovo with latest firmware and the integrated "I210 Gigabit" network adapter booting in EFI mode.
3) Dell10GEFI - Dell with "82599ES 10-Gigabit SFI/SFP+" network adapter booting in EFI mode.

In all three cases, the initial UDP request is aborted and then restarted successfully.
1) Len10GEFI
21:07:17.781021 IP 10.112.96.14.1430 > 10.112.96.5.69: 41 RRQ "bootx64.efi" octet tsize 0 blksize 1468
        0x0000: 4500 0045 b353 0000 4011 f261 0a70 600e E..E.S..@..a.p`.
        0x0010: 0a70 6005 0596 0045 0031 098f 0001 626f .p`....E.1....bo
        0x0020: 6f74 7836 342e 6566 6900 6f63 7465 7400 otx64.efi.octet.
        0x0030: 7473 697a 6500 3000 626c 6b73 697a 6500 tsize.0.blksize.
        0x0040: 3134 3638 00 1468.
21:07:17.793695 IP 10.112.96.5.51921 > 10.112.96.14.1430: UDP, length 29
        0x0000: 4500 0039 1b37 4000 4011 4a8a 0a70 6005 E..9.7@.@.J..p`.
        0x0010: 0a70 600e cad1 0596 0025 d529 0006 7473 .p`......%.)..ts
        0x0020: 697a 6500 3133 3535 3733 3600 626c 6b73 ize.1355736.blks
        0x0030: 697a 6500 3134 3030 00 ize.1400.
21:07:17.793749 IP 10.112.96.14.1430 > 10.112.96.5.51921: UDP, length 30
        0x0000: 4500 003a b354 0000 4011 f26b 0a70 600e E..:.T..@..k.p`.
        0x0010: 0a70 6005 0596 cad1 0026 e222 0005 0008 .p`......&."....
        0x0020: 5573 6572 2061 626f 7274 6564 2074 6865 User.aborted.the
        0x0030: 2074 7261 6e73 6665 7200 .transfer.
21:07:18.024758 IP 10.112.96.14.1431 > 10.112.96.5.69: 33 RRQ "bootx64.efi" octet blksize 1468
        0x0000: 4500 003d b355 0000 4011 f267 0a70 600e E..=.U..@..g.p`.
        0x0010: 0a70 6005 0597 0045 0029 7c8c 0001 626f .p`....E.)|...bo
        0x0020: 6f74 7836 342e 6566 6900 6f63 7465 7400 otx64.efi.octet.
        0x0030: 626c 6b73 697a 6500 3134 3638 00 blksize.1468.
21:07:18.036508 IP 10.112.96.5.40131 > 10.112.96.14.1431: UDP, length 15
        0x0000: 4500 002b 1b4a 4000 4011 4a85 0a70 6005 E..+.J@.@.J..p`.
        0x0010: 0a70 600e 9cc3 0597 0017 d51b 0006 626c .p`...........bl
        0x0020: 6b73 697a 6500 3134 3030 00 ksize.1400.

2) Len1GEFI
21:33:34.175617 IP 10.112.98.48.1958 > 10.112.96.5.69: 41 RRQ "bootx64.efi" octet tsize 0 blksize 1468 ...
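
For readers who want to decode the UDP payloads in the capture above by hand, here is a minimal Python sketch. The payload bytes are copied from the Len10GEFI hex dumps (everything after the 20-byte IP header and 8-byte UDP header). The function names are illustrative only and are not part of MAAS, python-tx-tftp, or grub. Opcode numbers follow RFC 1350, plus OACK (6) from RFC 2347.

```python
import struct

# TFTP opcodes: 1-5 from RFC 1350, 6 (option acknowledgment) from RFC 2347.
OPCODES = {1: "RRQ", 2: "WRQ", 3: "DATA", 4: "ACK", 5: "ERROR", 6: "OACK"}

# UDP payloads taken from the Len10GEFI capture in this comment.
RRQ = bytes.fromhex(
    "0001 626f 6f74 7836 342e 6566 6900 6f63 7465 7400"
    "7473 697a 6500 3000 626c 6b73 697a 6500 3134 3638 00"
)
OACK = bytes.fromhex(
    "0006 7473 697a 6500 3133 3535 3733 3600"
    "626c 6b73 697a 6500 3134 3030 00"
)
ERROR = bytes.fromhex(
    "0005 0008 5573 6572 2061 626f 7274 6564 2074 6865"
    "2074 7261 6e73 6665 7200"
)


def split_cstrings(data):
    """Split a run of NUL-terminated ASCII strings into a list of str."""
    parts = data.split(b"\x00")
    if parts and parts[-1] == b"":
        parts.pop()  # drop the empty piece after the final terminator
    return [p.decode("ascii") for p in parts]


def decode_tftp(payload):
    """Decode the TFTP packet types that appear in the capture."""
    (opcode,) = struct.unpack("!H", payload[:2])
    name = OPCODES.get(opcode, "unknown(%d)" % opcode)
    body = payload[2:]
    if name in ("RRQ", "WRQ"):
        fields = split_cstrings(body)
        opts = fields[2:]  # option name/value pairs follow filename and mode
        return {"op": name, "file": fields[0], "mode": fields[1],
                "options": dict(zip(opts[0::2], opts[1::2]))}
    if name == "OACK":
        opts = split_cstrings(body)
        return {"op": name, "options": dict(zip(opts[0::2], opts[1::2]))}
    if name == "ERROR":
        (code,) = struct.unpack("!H", body[:2])
        return {"op": name, "code": code,
                "message": split_cstrings(body[2:])[0]}
    return {"op": name}


if __name__ == "__main__":
    for packet in (RRQ, OACK, ERROR):
        print(decode_tftp(packet))
```

Decoded this way, the first exchange is an RRQ with `tsize 0`, an OACK carrying the file size, and an ERROR with code 8, which RFC 2347 reserves for ending a transfer during option negotiation. The client then re-requests the file without `tsize`, which matches the "aborted and restarted" pattern described above.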

Matt Dirba (5qxm) wrote :

I believe the ARP storm is caused by grub. Can you move this to their queue? As I stated previously, I have moved past the tftp errors found in pserv.log and I am currently attempting to build and test on a new version of grub with some of the UEFI fixes that have been added to grub.

Matt Dirba (5qxm) wrote :

I checked out the latest version of grub and built an image using grub-mknetdir; the ARP storm comes from the efinet driver. The code is stuck in the following loop in net/drivers/efi/efinet.c, lines 43 through 66. efi_call_3 always returns a txbuf that does not match dev->txbuf, so efi_call_7 is called repeatedly until limit_time has been exceeded. Any thoughts from the grub team on what the problem could be, or on what I should do to continue debugging?

   while (1)
      {
        txbuf = NULL;
        st = efi_call_3 (net->get_status, net, 0, &txbuf);
        if (st != GRUB_EFI_SUCCESS)
          return grub_error (GRUB_ERR_IO,
                             N_("couldn't send network packet"));
        if (txbuf == dev->txbuf)
          {
            dev->txbusy = 0;
            break;
          }
        if (txbuf)
          {
            st = efi_call_7 (net->transmit, net, 0, dev->last_pkt_size,
                             dev->txbuf, NULL, NULL, NULL);
            if (st != GRUB_EFI_SUCCESS)
              return grub_error (GRUB_ERR_IO,
                                 N_("couldn't send network packet"));
          }
        if (limit_time < grub_get_time_ms ())
          return grub_error (GRUB_ERR_TIMEOUT,
                             N_("couldn't send network packet"));
      }

Matt Dirba (5qxm) wrote :

The first get status call (efi_call_3(net->get_status, ...)) returns GRUB_EFI_SUCCESS and sets txbuf to 0. The second time this function is called it enters the while loop and the get status call (efi_call_3(net->get_status, ...)) returns GRUB_EFI_SUCCESS and sets txbuf to 1 which does not match the address of dev->txbuf.
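
To make the failure mode easier to see, here is a small Python model of the quoted loop. It is only a sketch: `FakeSNP` is a hypothetical stand-in for the firmware's network protocol, `max_polls` stands in for `limit_time`, and nothing here reproduces the actual grub patch. It shows why a firmware that keeps reporting a non-NULL buffer that is not ours causes one retransmission per poll until the timeout.

```python
import itertools


class FakeSNP:
    """Hypothetical model of the firmware side; not a real EFI binding."""

    def __init__(self, reported_txbufs):
        # Sequence of "recycled transmit buffer" values get_status will report.
        self._reported = iter(reported_txbufs)
        self.transmit_calls = 0

    def get_status(self):
        return next(self._reported)

    def transmit(self, buf):
        # Every call puts the same pending packet on the wire again.
        self.transmit_calls += 1


def wait_for_tx(snp, our_txbuf, max_polls):
    """Mirror the quoted loop: poll until the firmware returns our buffer."""
    for _ in range(max_polls):  # stands in for the limit_time check
        txbuf = snp.get_status()
        if txbuf == our_txbuf:
            return "completed"
        if txbuf:  # non-NULL but not our buffer: the loop retransmits
            snp.transmit(our_txbuf)
    return "timeout"


if __name__ == "__main__":
    OUR_BUF = 0x1000  # arbitrary illustrative address

    good = FakeSNP([0, 0, OUR_BUF])
    print(wait_for_tx(good, OUR_BUF, 50), good.transmit_calls)

    # Firmware that always reports 1, as observed in this comment.
    buggy = FakeSNP(itertools.repeat(1))
    print(wait_for_tx(buggy, OUR_BUF, 50), buggy.transmit_calls)
```

In the buggy case every poll triggers another transmit of the same packet. If that pending packet is an ARP request, this would look like the ARP storm seen in the captures.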

Matt Dirba (5qxm) wrote :

Final conclusion. I did not have the latest firmware for the adapter. After updating the firmware I can UEFI boot the system.

Changed in maas:
status: Triaged → Invalid
Changed in grub2 (Ubuntu):
status: New → Invalid
Changed in python-tx-tftp:
status: New → Invalid
meilei007 (meilei007) wrote :

Hi Matt,
    Can you tell me the latest firmware version? I am hitting this bug but am not sure whether the server has the latest firmware.

Thanks,
Andy

Matt Dirba (5qxm) wrote :

https://downloadcenter.intel.com/downloads/eula/19186/Intel-Ethernet-Connections-Boot-Utility-Preboot-images-and-EFI-Drivers?httpDown=http%3A%2F%2Fdownloadmirror.intel.com%2F19186%2Feng%2FPreboot.tar.gz

~/APPS/BootUtil/Linux_x64$ sudo ./bootutil64e -IV -ALL

Intel(R) Ethernet Flash Firmware Utility
BootUtil version 1.5.54.1
Copyright (C) 2003-2015 Intel Corporation

Flash firmware on port 1
UEFIx64 v4.7.02

> Press ENTER key to continue, 'q' to exit:q

Port Network Address Location Series WOL Flash Firmware Version
==== =============== ======== ======= === ============================= =======
  1 90E2BA5224E8 3:00.0 10GbE N/A UEFI 4.7.02
  2 90E2BA5224E9 3:00.1 10GbE N/A UEFI 4.7.02

meilei007 (meilei007) wrote :

Thanks, Matt. The Lenovo agent has upgraded the UEFI and the issue is gone.

Russell Jones (russell-jones-oxphys) wrote :

We're seeing this (or something very similar) on an Intel X540 10GbE Dual port Mezzanine adaptor (Intel 82599 Controller) in a Dell server.

I'll follow up with firmware revision numbers when I have them. It'd be helpful if others who've hit this provided them as far as they can.

Does this patch look like it would fix/work around it? http://git.savannah.gnu.org/cgit/grub.git/commit/?id=4fe8e6d4a1279b1840171d8e797d911cd8443333

I'm having a go at applying it to the Xenial version of grub2, in particular to build grubnetx64.efi. We'd be interested in the results if anyone else is able to test this.

Ante Karamatić (ivoks) wrote :

@Russell I'd test it. I'm having this problem with latest packages in Xenial.

Ante Karamatić (ivoks) wrote :

I've replaced grubx64.efi from 16.04 with the one from 17.04 and the problem went away. This means that either the problem is indeed in grub, or grub in 16.04 lacks the workarounds needed for faulty firmware.

Changed in grub2 (Ubuntu):
status: Invalid → Confirmed
importance: Undecided → High
Andres Rodriguez (andreserl) wrote :

I'm gonna add the maas-images project to this, so that when fixes get SRU'd we can just make sure new images get published.

Changed in grub2 (Ubuntu):
status: Confirmed → Triaged
status: Triaged → Confirmed
Changed in maas-images:
status: New → Triaged
importance: Undecided → High
Mathieu Trudel-Lapierre (cyphermox) wrote :

We should be able to SRU this patch to the relevant releases.

Changed in grub2 (Ubuntu):
assignee: nobody → Mathieu Trudel-Lapierre (cyphermox)
status: Confirmed → In Progress
status: In Progress → Fix Released
Changed in grub2 (Ubuntu Xenial):
status: New → In Progress
importance: Undecided → High
Changed in grub2 (Ubuntu Trusty):
importance: Undecided → High
Changed in grub2 (Ubuntu Yakkety):
importance: Undecided → High
Changed in grub2 (Ubuntu Trusty):
assignee: nobody → Mathieu Trudel-Lapierre (cyphermox)
Changed in grub2 (Ubuntu Xenial):
assignee: nobody → Mathieu Trudel-Lapierre (cyphermox)
Changed in grub2 (Ubuntu Yakkety):
assignee: nobody → Mathieu Trudel-Lapierre (cyphermox)
Changed in grub2 (Ubuntu):
milestone: none → ubuntu-17.05
Changed in grub2-signed (Ubuntu Xenial):
assignee: nobody → Mathieu Trudel-Lapierre (cyphermox)
importance: Undecided → Medium
status: New → In Progress
description: updated
Nobuto Murata (nobuto)
tags: added: cpe-onsite
description: updated
Steve Langasek (vorlon)
Changed in grub2 (Ubuntu Yakkety):
status: New → Won't Fix
Changed in grub2-signed (Ubuntu Yakkety):
status: New → Won't Fix
description: updated
Steve Langasek (vorlon) wrote : Please test proposed package

Hello Matt, or anyone else affected,

Accepted grub2 into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/grub2/2.02~beta2-36ubuntu3.18 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in grub2 (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-xenial
Steve Langasek (vorlon) wrote :

Hello Matt, or anyone else affected,

Accepted grub2-signed into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/grub2-signed/1.66.18 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in grub2-signed (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: id-5ab2aac1fcfcb094be6eb2e1
Mathieu Trudel-Lapierre (cyphermox) wrote :

@Matt,
@Russell,

Is the package in xenial-proposed something you could help verify? I do not have access to hardware that would exhibit this issue; it would help greatly in providing the fix to have someone with this hardware follow the steps in https://wiki.ubuntu.com/QATeam/PerformingSRUVerification (or comment #21 here) to validate that the fix works and doesn't adversely affect systems.

Matt Dirba (5qxm) wrote :

I have already flashed all of my Intel NICs with the latest firmware, so I am unable to verify this.

Ubuntu Foundations Team Bug Bot (crichton) wrote : [grub2/xenial] possible regression found

As a part of the Stable Release Updates quality process a search for Launchpad bug reports using the version of grub2 from xenial-proposed was performed and bug 1759877 was found. Please investigate this bug report to ensure that a regression will not be created by this SRU. In the event that this is not a regression remove the "verification-failed-xenial" tag from this bug report and add the tag "bot-stop-nagging" to bug 1759877 (not this bug). Thanks!

tags: added: verification-failed-xenial
Mathieu Trudel-Lapierre (cyphermox) wrote :

The bug identified was not a regression from the SRU. We still need testing on this stable update.

tags: removed: verification-failed-xenial
Mathieu Trudel-Lapierre (cyphermox) wrote :

Is anybody able to help verify this?

Andres Rodriguez (andreserl) wrote :

The MAAS CI has built new images using -proposed, which includes the bootloaders (and hence, grub from -proposed):

{
 "content_id": "com.ubuntu.maas:daily:1:bootloader-download",
 "datatype": "image-downloads",
 "format": "products:1.0",
 "products": {
  "com.ubuntu.maas.daily:1:grub-efi-signed:uefi:amd64": {
   "arch": "amd64",
   "arches": "amd64",
   "bootloader-type": "uefi",
   "label": "daily",
   "os": "grub-efi-signed",
   "versions": {
    "20180424.0": {
     "items": {
      "grub2-signed": {
       "ftype": "archive.tar.xz",
       "path": "bootloaders/uefi/amd64/20180424.0/grub2-signed.tar.xz",
       "sha256": "c36c148eba15eda8af4e300af1d382a13595acac41db521d86f40c6ae789b57b",
       "size": 284308,
       "src_package": "grub2-signed",
       "src_release": "xenial",
       "src_version": "1.66.18+2.02~beta2-36ubuntu3.18"
      },
      "shim-signed": {
       "ftype": "archive.tar.xz",
       "path": "bootloaders/uefi/amd64/20180424.0/shim-signed.tar.xz",
       "sha256": "22e8518eaa8e5a55ec188976e8b6e01da797df65a6143bcb741a7cf432d30c28",
       "size": 308604,
       "src_package": "shim-signed",
       "src_release": "xenial",
       "src_version": "1.33.1~16.04.1+13-0ubuntu2"
      }
     }
    }

Our CI completed successfully with this.
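
For anyone who wants to check which bootloader versions a stream like the one above carries, a rough Python sketch follows. The JSON here is a trimmed copy of the excerpt; the closing braces are an assumption, since the excerpt in this comment is cut off, and `bootloader_versions` is an illustrative helper, not a MAAS or simplestreams API.

```python
import json

# Trimmed copy of the excerpt above. Only the fields used below are kept,
# and the closing braces are assumed because the quoted excerpt is truncated.
STREAM_EXCERPT = """
{
 "products": {
  "com.ubuntu.maas.daily:1:grub-efi-signed:uefi:amd64": {
   "versions": {
    "20180424.0": {
     "items": {
      "grub2-signed": {
       "src_release": "xenial",
       "src_version": "1.66.18+2.02~beta2-36ubuntu3.18"
      },
      "shim-signed": {
       "src_release": "xenial",
       "src_version": "1.33.1~16.04.1+13-0ubuntu2"
      }
     }
    }
   }
  }
 }
}
"""


def bootloader_versions(stream):
    """Map (stream version, item name) to the item's src_version."""
    found = {}
    for product in stream["products"].values():
        for version, data in product["versions"].items():
            for name, item in data["items"].items():
                found[(version, name)] = item["src_version"]
    return found


if __name__ == "__main__":
    versions = bootloader_versions(json.loads(STREAM_EXCERPT))
    for key, src_version in sorted(versions.items()):
        print(key, src_version)
```

The `src_version` for `grub2-signed` ends in `2.02~beta2-36ubuntu3.18`, which is the grub2 build from -proposed mentioned in this comment.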

tags: added: verification-done-xenial
removed: verification-needed-xenial
tags: added: verification-done
removed: verification-needed
KingJ (kj-kingj) wrote :

I'm seeing this with an HP NC523SFP 10G NIC and MAAS 2.4.0~beta2 on Ubuntu 18.04. If I UEFI boot from the NIC, it eventually brings up a grub shell. Taking a packet capture confirmed the same ARP storm behaviour when attempting to boot: thousands of packets from the server that is being commissioned requesting the MAC of the MAAS server. Running net_bootp from the grub prompt results in the "error: couldn't send network packet." error, and running net_nslookup for a domain results in an endless ARP storm!

Unfortunately, the last firmware release for this NIC was a few years ago and I'm already running the latest version. Reading the comments here, it sounds like my only hope is a fix within grub2; however, that should already be live. The version of grub booted is grub2 2.02~beta2-36ubuntu3.18, which according to the comments here should fix it.

I've also tried the 2.02-2ubuntu8 version of grubnetx64.efi, by grabbing it from http://archive.ubuntu.com/ubuntu/dists/bionic/main/uefi/grub2-amd64/2.02-2ubuntu8/grubnetx64.efi.signed and replacing /var/lib/maas/boot-resources/current/bootloader/uefi/amd64/grubx64.efi with it, but with the same result.

Russell Jones (russell-jones-oxphys) wrote :

@kj-kingj

I applied the patch myself

$ cd /home/$USER/software
$ mkdir grub2 && cd grub2
$ apt-get source grub-efi-amd64
(add patch to grub2-2.02~beta2/debian/patches)
# apt-get install dh-autoreconf help2man gcc-4.7-multilib xfonts-unifont libusb-dev libdevmapper-dev libsdl1.2-dev xorriso qemu-system libfuse-dev libxen-dev
$ cd /home/$USER/software/grub2/grub2-2.02~beta2
$ fakeroot dpkg-buildpackage

That was on a trusty install, IIRC. The patch was already in Xenial. I was building grubnetx64.efi, i.e. for UEFI PXE booting.

Chris Gregan (cgregan) wrote :

Field High SLA now requires that an estimated date for a fix is listed in the comments. Please provide this estimate for the open tasks.

Łukasz Zemczak (sil2100) wrote :

From the SRU team's perspective, seeing the comments here, it's really hard to see if the fix has been verified or not. The verification comment #27 does not make it clear if the test was performed on affected hardware or not. Was it? If yes, please clarify. Also, the test case in the description lists 4 different test cases that should be performed on the system - it's not clear if those have been done or not. 'Our CI' is not necessarily what is needed to verify the fix.

Also, comment #28 seems to indicate that the fix does NOT work. Can anyone confirm?

We cannot release a package update from -proposed into -updates without a solid indication that the bug is fixed and regression-free.

Mathieu Trudel-Lapierre (cyphermox) wrote :

Well, the CI part confirms that there is no regression, but there is as yet no indication that the issue is fixed aside from the cases where firmware was updated (but then, it's not the SRU).

There's still a need to verify the fix positively on affected hardware.

Now, KingJ's comment says that testing has been done with the version of grub in bionic, and with the version of grub being SRUed to xenial, and neither worked. I think that qualifies as verification-failed, and we'll need to have another look at grub's behavior. At this point, there is no obvious patch missing, and we'll need to debug using packet captures to try and make sense of it.

tags: added: verification-failed-xenial
removed: verification-done verification-done-xenial
Mathieu Trudel-Lapierre (cyphermox) wrote : Re: [Bug 1437353] Re: UEFI network boot hangs at grub for adapter 82599ES 10-Gigabit SFI/SFP+

I've been doing more testing on this, after finding a system with a
10GE NIC that seems affected. With 2.02~beta2-36ubuntu3.17 it's
unhappy, but with 2.02~beta2-36ubuntu3.18 it looks like things are
working just fine.

For now, I'm putting this back to verification-needed until I can
finish the testing to be sure.

tags: added: verification-needed-xenial
removed: verification-failed-xenial
Mathieu Trudel-Lapierre (cyphermox) wrote :

I have positively verified that an affected system (which has a 82599ES 10-Gigabit SFI/SFP+) exhibits the ARP storm behavior when booting with MAAS using the grubnetx64.efi binary in xenial(-updates), leading to stopping in the grub prompt; and with the grubnetx64.efi binary in xenial-proposed no ARP storm is noticeable, and the system boots normally as expected.

The bad binaries are any from 2.02~beta2-36ubuntu3.17 and prior. The binary installed by MAAS, or the one I would most commonly expect to see on an affected setup, comes from xenial-updates (grub2 2.02~beta2-36ubuntu3.17) and has a sha256sum of:

b164561b4f42223b6d37e00f613adc32c22e5377c0fb6a6615e101c625d9b9cb

And the valid binary from xenial-proposed (until this SRU is released to xenial-updates) comes from grub2 2.02~beta2-36ubuntu3.18 and has a sha256sum of:

a92ed9943c6569a999b9b437e7ca07ccac7d30a4df1a18cdd68b406e5d45013c

If running 'sha256sum /var/lib/maas/boot-resources/current/bootloader/uefi/amd64/grubx64.efi' (or using the path appropriate to non-MAAS netboot setups) yields the same value as above (a92ed9943c6569a999b9b437e7ca07ccac7d30a4df1a18cdd68b406e5d45013c), then you are running the patched version of grub2. If you are still noticing issues, then you would be seeing a different bug, one that should be reported separately.
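
The same check can be scripted. The following Python sketch hashes a file and compares the result against the two sums quoted in this comment; `classify` and the labels are illustrative names, and a hash that matches neither sum simply means the binary is some other build.

```python
import hashlib

# Sums quoted in this comment; the labels are descriptive only.
KNOWN_SUMS = {
    "b164561b4f42223b6d37e00f613adc32c22e5377c0fb6a6615e101c625d9b9cb":
        "affected (grub2 2.02~beta2-36ubuntu3.17, xenial-updates)",
    "a92ed9943c6569a999b9b437e7ca07ccac7d30a4df1a18cdd68b406e5d45013c":
        "patched (grub2 2.02~beta2-36ubuntu3.18)",
}

# Path used by MAAS according to this comment; adjust for other setups.
DEFAULT_PATH = ("/var/lib/maas/boot-resources/current/"
                "bootloader/uefi/amd64/grubx64.efi")


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify(path):
    """Label a binary by comparing its hash with the sums quoted above."""
    return KNOWN_SUMS.get(sha256_of(path), "unknown build")


if __name__ == "__main__":
    import sys
    target = sys.argv[1] if len(sys.argv) > 1 else DEFAULT_PATH
    print(target, "->", classify(target))
```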

Marking this verification-done.

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package grub2 - 2.02~beta2-36ubuntu3.18

---------------
grub2 (2.02~beta2-36ubuntu3.18) xenial; urgency=medium

  * debian/patches/efinet_check_imm_completion.patch: check for immediate
    completion when sending data to the net device buffer. This is a required
    commit for the patch below.
  * debian/patches/efinet_handle_buggy_get_status.patch: correctly handle the
    output of get_status() for EFI net devices on buggy firmware.
    (LP: #1437353)

 -- Mathieu Trudel-Lapierre <email address hidden> Mon, 19 Mar 2018 16:11:06 -0400

Changed in grub2 (Ubuntu Xenial):
status: Fix Committed → Fix Released
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for grub2 has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package grub2-signed - 1.66.18

---------------
grub2-signed (1.66.18) xenial; urgency=medium

  * Rebuild against grub2 2.02~beta2-36ubuntu3.18. (LP: #1437353)

 -- Mathieu Trudel-Lapierre <email address hidden> Tue, 20 Mar 2018 10:27:27 -0400

Changed in grub2-signed (Ubuntu Xenial):
status: Fix Committed → Fix Released
Łukasz Zemczak (sil2100) wrote :

I consider this as tested as well. It's less of a problem when the fix simply does not work for some, as in the worst case we can just re-open the bug (or get another one filed) if not all use-cases are handled. As long as the fix works for certain users and does not introduce regressions, for higher-priority bugfixes that seems to be enough.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in grub2 (Ubuntu Trusty):
status: New → Confirmed
Changed in grub2-signed (Ubuntu Trusty):
status: New → Confirmed
Changed in grub2-signed (Ubuntu):
status: New → Confirmed
Jeff Lane  (bladernr) wrote :

So this also affects Bionic... will this fix land in 18.04.1 (or sooner via SRU?)

Changed in maas-images:
status: Triaged → Fix Released
acd (alecd-smc) wrote :

Hi,
I was also affected by this issue when PXE booting in UEFI mode from an add-on i350 1Gb adapter. Upgrading the MAAS server to the 18.04 (Bionic) release did not resolve the symptom. However, I found a workaround: attempting the IPv6 boot first, before IPv4. This is the same workaround Rod Smith mentioned in comment #21 on Bug #1437024. The details are filed under Bug #1787637. Thanks

Additional info:
i350 FW released for AOC is v1.63
MAAS server grub version = 2.02-2ubuntu8.2
Node deployed successfully due to workaround = 2.02-beta2-36ubuntu3.18

Christian Sarrasin (sxc731) wrote :

+1 with i350 NICs (many of them so this is consistently reproducible).

We worked around it as follows (on the MAAS server):

cd /tmp
wget http://archive.ubuntu.com/ubuntu/dists/xenial-updates/main/uefi/grub2-amd64/2.02~beta2-36ubuntu3.18/grubnetx64.efi.signed
cp -p grubnetx64.efi.signed /var/lib/maas/boot-resources/current/bootloader/uefi/amd64/grubx64.efi

It may or may not be necessary to reboot your MAAS server after doing this.

`strings` tells me that the broken version, which was in /var/lib/maas/boot-resources/current/bootloader/uefi/amd64/ before we clobbered it, was version 2.02-2ubuntu8.2, dated Aug 7.

Benjamin Winston (dolemite) wrote :

My organization has been struggling with this issue for ~6 months, and the fix Christian describes in comment #44 resolved this problem for us. I used the bionic-updates instead of xenial-updates, and enterprise wide, the grub rescue console does NOT display anymore when loading the PXE menu. Monday morning, I will be the HERO at the office! All the thanks to this thread and the dedicated team who resolved this bug.

Changed in maas:
milestone: next → none