pxe boot on arm64 stopped working

Bug #2044549 reported by Marian Gasparovic
This bug affects 1 person
Affects: grub2 (Ubuntu)
Status: Incomplete
Importance: High
Assigned to: Mate Kukri

Bug Description

We have a lab with eight arm64 machines that are deployed from MAAS (also running on arm64).
Two weeks ago we noticed that commissioning in MAAS no longer works.
We can see that MAAS sends the two files (bootaa64.efi and grubaa64.efi) as requested, and then the machine drops to the GRUB prompt.
@alexsander-souza from the MAAS team had a look and suspects a GRUB issue: the net_bootp command returns "error: couldn't send network packet".
Sometimes a machine can deploy again after a power cycle, but not always.
Any ideas on how to debug this?

Tags: cdo-qa
Marian Gasparovic (marosg) wrote :

It reports grub version 2.06

Alexsander de Souza (alexsander-souza) wrote :

additional info:

- the machine has an Intel I210-T1 NIC
- when the boot process doesn't work, the UEFI firmware requests the two GRUB files (bootaa64.efi and grubaa64.efi), and I can confirm MAAS serves them correctly and promptly. On the machine console we can see that GRUB starts and, after a pause, drops to the prompt. No request for configuration files reaches the MAAS controller.
- when the process works, I can see requests for multiple configuration files just after MAAS sends the '.efi' files to the machine
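
The two cases above can be told apart mechanically from the order of files a machine requested. The following is only a rough sketch: it assumes the request log has already been reduced to an ordered list of file names for one boot attempt, and the config file name in the usage lines is a hypothetical placeholder, not something taken from the MAAS logs.

```python
EFI_FILES = {"bootaa64.efi", "grubaa64.efi"}


def classify_boot(requested):
    """Classify one boot attempt from the ordered list of requested paths."""
    # Compare base names only, so directory prefixes do not matter.
    names = [path.rsplit("/", 1)[-1] for path in requested]
    if not EFI_FILES.issubset(names):
        return "firmware did not fetch both .efi files"
    # Anything requested after the last .efi file came from GRUB itself.
    last_efi = max(i for i, name in enumerate(names) if name in EFI_FILES)
    followups = names[last_efi + 1:]
    if followups:
        return "grub fetched config"
    return "grub stalled after .efi download"


# Hypothetical request sequences for illustration only.
working = classify_boot(["bootaa64.efi", "grubaa64.efi", "grub/grub.cfg"])
failing = classify_boot(["bootaa64.efi", "grubaa64.efi"])
print(working)
print(failing)
```

A "stalled" result corresponds to the failing case described above: both binaries were served, but GRUB never got a packet out to ask for its configuration.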

Marian Gasparovic (marosg) wrote :

Last working state was on Sep 20th. Failure was noticed on Nov 8th.

Dimitri John Ledkov (xnox) wrote :

Can you provide version details of the boot artifacts used? Which stream? Which version?

Marian Gasparovic (marosg) wrote :

It is the stable stream; this is what Alexsander retrieved from the DB:

grub2-signed.tar.xz - {"src_package": "grub2-signed", "src_release": "jammy", "src_version": "1.187.6+2.06-2ubuntu14.4"}

shim-signed.tar.xz - {"src_package": "shim-signed", "src_release": "jammy", "src_version": "1.51.3+15.7-0ubuntu1"}
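
For readers comparing these strings with the version names used later in the thread (such as ubuntu14.2), here is a small sketch that splits each src_version at the first '+'. It assumes the part after the '+' is the version of the underlying unsigned package that the signed package was built from; that reading is an interpretation of the version string, not something stated in the metadata itself.

```python
import json

# Metadata strings copied from the comment above.
artifacts = {
    "grub2-signed.tar.xz": '{"src_package": "grub2-signed", "src_release": "jammy", "src_version": "1.187.6+2.06-2ubuntu14.4"}',
    "shim-signed.tar.xz": '{"src_package": "shim-signed", "src_release": "jammy", "src_version": "1.51.3+15.7-0ubuntu1"}',
}


def split_signed_version(src_version):
    """Split '<signed version>+<underlying version>' at the first '+'.

    Assumption: the text after the '+' is the version of the package
    the signed artifact was built from.
    """
    signed, _, underlying = src_version.partition("+")
    return signed, underlying


versions = {}
for raw in artifacts.values():
    meta = json.loads(raw)
    versions[meta["src_package"]] = split_signed_version(meta["src_version"])
    print(meta["src_package"], *versions[meta["src_package"]])
```

Under that assumption, the GRUB in use here would be 2.06-2ubuntu14.4.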

Julian Andres Klode (juliank) wrote :

So the last upload only touched NTFS and the device tree fixup protocol; it shouldn't cause a regression here.

The one before that was prepared in February and released on Sep 29, and includes additional memory management changes:

https://bugs.launchpad.net/ubuntu/+source/grub2-unsigned/+bug/2004643

We initially heard from another vendor that they noticed some platforms failing to boot, but there have been neither updates from them nor any complaints during the more than 3 months it spent in proposed, so I don't know.

In any case, the next steps for you would be to see if ubuntu14.2 works or not.

If it's broken, iteratively unapply the last three patches until it works to find the culprit.

If ubuntu14.2 is not broken, this is not a grub regression but a change in your lab that is outside of our control.
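
The "unapply until it works" step can be sketched as a small loop. This is only an illustration of the procedure: the patch names are hypothetical placeholders, and boots_ok stands in for whatever manual step rebuilds GRUB with the given patches and checks whether a lab machine PXE-boots.

```python
def find_culprit(patches, boots_ok):
    """Unapply patches newest-first until the build boots.

    patches:  patch names, oldest first (hypothetical placeholders).
    boots_ok: callable that takes the list of still-applied patches and
              returns True if a GRUB built from them PXE-boots.
    Returns the patch whose removal fixed the boot, or None.
    """
    applied = list(patches)
    if boots_ok(applied):
        return None  # the full build already works, nothing to find
    while applied:
        removed = applied.pop()  # unapply the most recent patch
        if boots_ok(applied):
            return removed  # boot recovered, so this patch is the suspect
    return None  # still broken with every listed patch removed


# Stand-in for a real rebuild-and-boot test: pretend "patch-2" breaks booting.
def fake_boots_ok(applied):
    return "patch-2" not in applied


culprit = find_culprit(["patch-1", "patch-2", "patch-3"], fake_boots_ok)
print(culprit)
```

A None result with the failure still present would point away from those patches, which matches the last case above.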

If you have further questions do not hesitate to reach out in the ~uefi channel on Mattermost.

Mate Kukri (mkukri) wrote :

This might be a duplicate of 2043084

UPDATE: it looks like it is not a duplicate of the above

Mate Kukri (mkukri)
Changed in grub2 (Ubuntu):
assignee: nobody → Mate Kukri (mkukri)
Mate Kukri (mkukri)
tags: added: foundations-todo
Changed in grub2 (Ubuntu):
importance: Undecided → High
Mate Kukri (mkukri)
Changed in grub2 (Ubuntu):
status: New → Incomplete
Mate Kukri (mkukri) wrote :

A few bits of information about this:
- The cause of the regression turned out to be the growth in GRUB binary size rather than any code or compiler change. Padding the last pre-regression binary to the same size as the first regressed one reproduces the same failures. (Padding to a much larger binary size seems to avoid this.)
- After considerable debugging effort, it was found that the failure happens because the firmware interface used by GRUB's efinet driver locks up and stops transmitting packets. (This manifests as transmit buffers never being "recycled"; adding new buffers to the queue eventually fills it up and locks the driver into permanently returning EFI_NOT_READY until platform reset.)
- This is either a UNDI driver / firmware bug on the target machines, or an issue in GRUB's usage of the EFI Simple Network Protocol that has always existed (I consider this rather unlikely).
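
To make the lockup described above concrete, here is a toy model of a transmit queue whose buffers are either recycled or not. It is not the real EFI interface: the class, the method names and the queue depth of 4 are all illustrative assumptions. Only the overall behavior (buffers never recycled, queue fills, every later transmit reports EFI_NOT_READY) follows the description.

```python
EFI_SUCCESS = "EFI_SUCCESS"
EFI_NOT_READY = "EFI_NOT_READY"


class ToyNic:
    """Toy transmit queue with buffer recycling (illustrative, not the EFI API)."""

    def __init__(self, queue_depth=4, firmware_healthy=True):
        self.queue_depth = queue_depth
        self.firmware_healthy = firmware_healthy
        self.in_flight = []  # buffers handed to the firmware, not yet recycled

    def transmit(self, buf):
        if len(self.in_flight) >= self.queue_depth:
            return EFI_NOT_READY  # no free slot in the transmit queue
        self.in_flight.append(buf)
        return EFI_SUCCESS

    def get_status(self):
        """Return one recycled buffer, or None if nothing has completed."""
        if self.firmware_healthy and self.in_flight:
            return self.in_flight.pop(0)
        return None


def send_packets(nic, count):
    results = []
    for i in range(count):
        nic.get_status()  # try to reclaim one finished buffer first
        results.append(nic.transmit(f"pkt{i}"))
    return results


healthy = send_packets(ToyNic(firmware_healthy=True), 8)
stuck = send_packets(ToyNic(firmware_healthy=False), 8)
print(healthy)
print(stuck)
```

In the stuck case the first few sends are accepted and every later one fails, which is consistent with GRUB starting normally and then reporting that it could not send a network packet.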

Unfortunately, there is no realistic code change in stable release GRUBs that could fix this (unless a rather unlikely existing GRUB bug is identified).

Proposed workarounds were:
- Padding GRUB binaries to a larger size, which experimentally seems to avoid locking up the network card (though this wasn't proven).
- Using the UEFI provided TFTP stack for TFTP netbooting in future GRUB releases.
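
For anyone wanting to repeat the size experiment, below is a minimal sketch of one way to pad a copy of a file to a chosen size by appending zero bytes. This is an assumption about method, since the thread does not say how the binaries were padded. Note that appending bytes to a signed EFI binary would be expected to invalidate its Secure Boot signature, so this is only suitable for lab testing. The file names and sizes in the demo are made up.

```python
import os
import shutil
import tempfile


def pad_copy(src, dst, target_size):
    """Copy src to dst and append zero bytes until dst is target_size bytes."""
    current = os.path.getsize(src)
    if current > target_size:
        raise ValueError(f"{src} is already {current} bytes, more than {target_size}")
    shutil.copyfile(src, dst)
    with open(dst, "ab") as f:
        f.write(b"\0" * (target_size - current))
    return os.path.getsize(dst)


# Demo on a throwaway file rather than a real GRUB binary.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.efi")
    dst = os.path.join(tmp, "demo-padded.efi")
    with open(src, "wb") as f:
        f.write(b"x" * 1000)
    padded_size = pad_copy(src, dst, 4096)
    print(padded_size)
```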

Mate Kukri (mkukri)
tags: removed: foundations-todo