Timeout downloading initrd

Bug #1900773 reported by dann frazier on 2020-10-20
This bug affects 1 person
Affects                    Status  Importance  Assigned to   Milestone
Release Notes for Ubuntu           Undecided   Unassigned
grub2 (Ubuntu)             Status tracked in Hirsute
  Xenial                           Undecided   dann frazier
  Bionic                           Undecided   dann frazier
  Focal                            Undecided   dann frazier
  Groovy                           Undecided   dann frazier
  Hirsute                          Undecided   dann frazier

Bug Description

[Impact]
GRUB times out when downloading large files w/ tftp. This notably breaks subiquity-based PXE installs, which feature a large initrd. (Observed on several arm64 platforms, though the symptom is not arch-specific.)

[Test Case]
Simple test case using an x86 UEFI VM:
Place a kernel/ramdisk on a tftp server. Inflate the initrd or kernel to 87M, e.g.:

dd if=/dev/zero of=initrd.img bs=1M count=87
dd if=initrd.img.orig of=initrd.img conv=notrunc
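
As a rough illustration of what the two dd commands above do, the following Python sketch writes a copy of a file that is zero-padded at the end up to a target size, keeping the original bytes at the start. The function name pad_to_size is hypothetical, and the sketch assumes the original file is no larger than the target size.

```python
import os
import shutil


def pad_to_size(src, dst, size):
    """Write a copy of src to dst, zero-padded at the end to `size` bytes.

    Roughly mirrors:
        dd if=/dev/zero of=dst bs=1M count=N
        dd if=src of=dst conv=notrunc
    Hypothetical helper; assumes src is no larger than `size`.
    """
    if os.path.getsize(src) > size:
        raise ValueError("source is already larger than the target size")
    with open(dst, "wb") as out:
        # Extend the empty file to the target length. On most systems the
        # new area reads back as zeros.
        out.truncate(size)
        # Overwrite the start of the padded file with the original data,
        # leaving the trailing padding in place.
        out.seek(0)
        with open(src, "rb") as inp:
            shutil.copyfileobj(inp, out)
```

For the test case this would be called as pad_to_size("initrd.img.orig", "initrd.img", 87 * 1024 * 1024).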

Success looks like:
Shell> fs0:
FS0:\> \efi\grubnetx64.efi
grub> net_dhcp efinet0
grub> linux (tftp,192.168.122.1)/vmlinuz.orig console=ttyS0,115200n8
grub> initrd (tftp,192.168.122.1)/initrd.img
grub>

Failure looks like:

grub> net_dhcp efinet0
grub> linux (tftp,192.168.122.1)/vmlinuz.orig console=ttyS0,115200n8
grub> initrd (tftp,192.168.122.1)/initrd.img
!!!! X64 Exception Type - 06(#UD - Invalid Opcode) CPU Apic ID - 00000000 !!!!
RIP - 0000000000099080, CS - 0000000000000038, RFLAGS - 0000000000010286
RAX - 000000007DC2FF00, RCX - 000000004FF99013, RDX - 000000007BF4CCF4
RBX - 000000007BE43FC0, RSP - 000000007FF25AE8, RBP - 000000007BE3C2A0
RSI - 000000000000000B, RDI - 000000007BE3C340
R8 - 000000007DC21168, R9 - 000000007DC1D4AE, R10 - 0000000000000067
R11 - 0000000000000002, R12 - 000000007BE3CCA0, R13 - 000000007BE3C260
R14 - 0000000000020004, R15 - 000000007DC1A613
DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030
GS - 0000000000000030, SS - 0000000000000030
CR0 - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FC01000
CR4 - 0000000000000668, CR8 - 0000000000000000
DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
GDTR - 000000007F9EE698 0000000000000047, LDTR - 0000000000000000
IDTR - 000000007F4B2018 0000000000000FFF, TR - 0000000000000000
FXSAVE_STATE - 000000007FF25740
!!!! Can't find image information. !!!!

This was originally discovered on a Cavium ThunderX CRB system using subiquity from the groovy arm64 ISO. Failure there looks like:

                         GNU GRUB version 2.04

 ┌────────────────────────────────────────────────────────────────────────────┐
 │*Ubuntu Server                                                              │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 └────────────────────────────────────────────────────────────────────────────┘

      Use the ↑ and ↓ keys to select which entry is highlighted.
      Press enter to boot the selected OS, `e' to edit the commands
      before booting or `c' for a command-line.

error: timeout reading `initrd'.

Press any key to continue...

[Fix]
https://git.savannah.gnu.org/cgit/grub.git/commit/?id=a6838bbc6726ad624bd2b94991f690b8e9d23c69

[Where problems could occur]
The fix is to the tftp command, so problems would likely appear in the tftp stack, possibly due to inconsistencies between tftp server implementations.

Ubuntu QA Website (ubuntuqa) wrote :

This bug has been reported on the Ubuntu ISO testing tracker.

A list of all reports related to this bug can be found here:
http://iso.qa.ubuntu.com/qatracker/reports/bugs/1900773

tags: added: iso-testing
Steve Langasek (vorlon) wrote :

Just to confirm, are you using the initrd as extracted from the .iso?

affects: ubuntu-cdimage → livecd-rootfs (Ubuntu)
Steve Langasek (vorlon) wrote :

And is there something intrinsic to this hardware that leads to the timeout? Or would this perhaps work if the machine had a faster network link to the tftp server?

On Tue, Oct 20, 2020 at 5:30 PM Steve Langasek
<email address hidden> wrote:
>
> Just to confirm, are you using the initrd as extracted from the .iso?

I am.

dann frazier (dannf) wrote :

On Tue, Oct 20, 2020 at 5:30 PM Steve Langasek
<email address hidden> wrote:
>
> And is there something intrinsic to this hardware that leads to the
> timeout?

I wonder if there might be a firmware bug. Here's what I see on the wire:

52120 56.958016 10.229.50.135 10.229.50.84 TFTP 78 Read Request, File: initrd, Transfer type: octet, blksize=1024, tsize=0
52121 56.960144 10.229.50.84 10.229.50.135 TFTP 72 Option Acknowledgement, blksize=1024, tsize=90771321
<---snip--->
183218 65.602857 10.229.50.84 10.229.50.135 TFTP 1070 Data Packet, Block: 65535
183219 65.603129 10.229.50.135 10.229.50.84 TFTP 60 Acknowledgement, Block: 65535
<---snip--->
229458 68.561417 10.229.50.135 10.229.50.84 TFTP 60 Acknowledgement, Block: 88643
229459 68.561466 10.229.50.84 10.229.50.135 TFTP 935 Data Packet, Block: 88644 (last)
229460 68.561519 10.229.50.135 10.229.50.84 TFTP 60 Acknowledgement, Block: 88644
229462 68.962413 10.229.50.135 10.229.50.84 TFTP 60 Acknowledgement, Block: 65535

It looks like the entire initrd was successfully transferred. The
client ACK'd the last block but then, for some reason, it comes back
about half a second later and re-ACKs a block it had already ACK'd.
And that block being number 65535 is *interesting*.

> Or would this perhaps work if the machine had a faster network
> link to the tftp server?

Perhaps - but these systems are in the same physical location, slowest
link between is 1Gbps.
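
The numbers in the capture above are consistent with a complete transfer. Here is a small sketch of the arithmetic, assuming the length column shows Ethernet frame sizes and that each data packet carries 46 bytes of headers (14 Ethernet + 20 IPv4 + 8 UDP + 4 TFTP). That header breakdown is an assumption, not something stated in the capture, and the helper name transfer_layout is hypothetical.

```python
TSIZE = 90771321   # tsize from the Option Acknowledgement
BLKSIZE = 1024     # blksize from the same packet
# Assumed per-packet overhead: Ethernet + IPv4 + UDP + TFTP data header.
HEADERS = 14 + 20 + 8 + 4


def transfer_layout(tsize, blksize):
    """Return (number of data blocks, payload bytes in the last block)."""
    full_blocks, remainder = divmod(tsize, blksize)
    # TFTP ends a transfer with a block shorter than blksize, so even a
    # file that is an exact multiple of blksize gets one extra, empty block.
    return full_blocks + 1, remainder


blocks, last_payload = transfer_layout(TSIZE, BLKSIZE)
print(blocks, last_payload)                       # 88644 889
print(BLKSIZE + HEADERS, last_payload + HEADERS)  # 1070 935
```

Under those assumptions the last block number (88644) and the two frame lengths (1070 for a full block, 935 for the final one) match what is seen on the wire.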

I hit this issue on Hisilicon d06 when PXE booting groovy subiquity. The system boots without the initrd and hangs because there is no rootfs.

error: timeout reading `/casper/initrd'.

Press any key to continue...

On Wed, Oct 21, 2020, 03:20 Ike Panhc <email address hidden> wrote:

> I hit this issue on Hisilicon d06 when PXE groovy subiquity. System
> boots without initrd and hang on no rootfs.
>
>
> error: timeout reading `/casper/initrd'.
>
> Press any key to continue...
>

Interesting, does recompressing with lzma suffice as a workaround for d06
also?

90771321 initrd
44620048 initrd.lzma

Yes. After recompressing with `lzma -9`, I can finish the installation.
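
One plausible reading of why recompression helps, sketched below: with the 1024-byte block size seen in the capture earlier in this report, the lzma-compressed file needs fewer than 65535 blocks while the original needs more. The helper name blocks_needed is hypothetical, and the block size is assumed to be the same on this system.

```python
BLKSIZE = 1024       # block size GRUB requested in the earlier capture (assumed here too)
MAX_BLOCK = 0xFFFF   # largest value a 16-bit TFTP block number can hold


def blocks_needed(size, blksize=BLKSIZE):
    """Data blocks a TFTP transfer of `size` bytes uses (hypothetical helper)."""
    # The final block is always shorter than blksize, hence the + 1.
    return size // blksize + 1


for name, size in [("initrd", 90771321), ("initrd.lzma", 44620048)]:
    n = blocks_needed(size)
    verdict = "exceeds 16-bit range" if n > MAX_BLOCK else "fits"
    print(name, n, verdict)
```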

dann frazier (dannf) wrote :

Since technically GRUB is requesting the initrd file, I tried to rule it out as a possible cause by seeing if the initial payload that UEFI downloads directly would time out if it was the same size. I did this by padding my grubnetaa64.efi binary to be the same size as the installer initrd.

 mv grubnetaa64.efi grubnetaa64.efi.orig
 cp casper/initrd grubnetaa64.efi
 dd if=grubnetaa64.efi.orig of=grubnetaa64.efi conv=notrunc

This did *not* time out, so unfortunately we can't rule out GRUB as a factor.

Ike Panhc (ikepanhc) wrote :

More information: I cannot reproduce this with grubnetaa64.efi from focal.

This is the focal grub I downloaded, which does not reproduce the issue:

http://ports.ubuntu.com/ubuntu-ports/dists/focal/main/uefi/grub2-arm64/current/grubnetaa64.efi.signed

This is the groovy grub, which does reproduce the issue:

http://ports.ubuntu.com/ubuntu-ports/dists/groovy/main/uefi/grub2-arm64/current/grubnetaa64.efi.signed

Taihsiang Ho (taihsiangho) wrote :

My apologies. Comment #9 (https://bugs.launchpad.net/ubuntu/+source/livecd-rootfs/+bug/1900773/comments/9) used the grubnetaa64.efi from focal: http://ports.ubuntu.com/ubuntu-ports/dists/focal/main/uefi/grub2-arm64/current/grubnetaa64.efi.signed

When d05 (d05-4) uses the groovy grub, it reproduces this issue as well.

So I would say the test result on d05 is the same as on d06 by @Ike:
    - grub from focal ---> does not reproduce this issue
    - grub from groovy --> reproduces this issue

My next steps will be:
    - hide comment #9 so it does not confuse people
    - also try the initrd.lzma workaround

Taihsiang Ho (taihsiangho) wrote :

Repacking initrd as initrd.lz is a working workaround on d05. By using initrd.lz, I could not reproduce this issue on d05-4.

dann frazier (dannf) on 2020-10-22
summary: - thunderx CRB systems tftp timeout downloading initrd
+ ARM servers timeout downloading initrd

The Groovy RC 20201022 daily image reproduces the issue on d05 (d05-1) and d06 (kreiken). The lz workaround still works on both platforms.

dann frazier (dannf) wrote :

Turns out this is a GRUB issue. It was fixed upstream in the following commit, which cleanly cherry-picks to groovy. I prepared a test build in ppa:dannf/test and confirmed it resolves the issue.

commit a6838bbc6726ad624bd2b94991f690b8e9d23c69
Author: Javier Martinez Canillas <email address hidden>
Date: Thu Sep 10 17:17:57 2020 +0200

    tftp: Roll-over block counter to prevent data packets timeouts

    Commit 781b3e5efc3 (tftp: Do not use priority queue) caused a regression
    when fetching files over TFTP whose size is bigger than 65535 * block size.

      grub> linux /images/pxeboot/vmlinuz
      grub> echo $?
      0
      grub> initrd /images/pxeboot/initrd.img
      error: timeout reading '/images/pxeboot/initrd.img'.
      grub> echo $?
      28
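
To illustrate the failure mode the commit message describes, here is a minimal simulation. It is not GRUB's code. It assumes the sender numbers blocks from 1 in a 16-bit field that wraps from 65535 to 0, and that the receiver only accepts the block it expects next. The function name blocks_accepted is hypothetical.

```python
def blocks_accepted(total_blocks, wrap):
    """Count the data blocks a receiver accepts before it stops matching.

    Illustrative model only: the wire block number is 16 bits wide and is
    assumed to wrap from 65535 to 0. With wrap=False the receiver's own
    counter keeps growing past 65535; with wrap=True it rolls over too.
    """
    expected = 1
    accepted = 0
    for n in range(1, total_blocks + 1):
        wire_block = n & 0xFFFF        # value carried in the 16-bit field
        if wire_block != expected:
            break                      # never matches again: read times out
        accepted += 1
        expected += 1
        if wrap:
            expected &= 0xFFFF         # roll over together with the sender
    return accepted


print(blocks_accepted(88644, wrap=False))  # 65535
print(blocks_accepted(88644, wrap=True))   # 88644
```

In this model, a receiver whose counter does not roll over stalls right after block 65535, which is the block number that stood out in the packet capture earlier in this report.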

Changed in livecd-rootfs (Ubuntu):
status: New → Invalid
Changed in grub2 (Ubuntu):
status: New → Confirmed
dann frazier (dannf) wrote :

I am able to reproduce w/ focal as well. The above patch is tagged as "Fixes: 781b3e5efc3 (tftp: Do not use priority queue)", which is in focal (d/p/0090-tftp-Do-not-use-priority-queue.patch). However, it wasn't added until 2.04-1ubuntu26.1. My guess is that it was not reproducible in Comment #12 because that test may have used the original GA image from 2.04-1ubuntu26[1] instead of the latest[2]. @Tai can you confirm?

[1] http://ports.ubuntu.com/ubuntu-ports/dists/focal/main/uefi/grub2-arm64/current/grubnetaa64.efi
[2] http://ports.ubuntu.com/ubuntu-ports/dists/focal-updates/main/uefi/grub2-arm64/current/grubnetaa64.efi

Changed in grub2 (Ubuntu Focal):
status: New → Confirmed
Changed in livecd-rootfs (Ubuntu Focal):
status: New → Invalid
dann frazier (dannf) on 2020-11-12
summary: - ARM servers timeout downloading initrd
+ Timeout downloading initrd
description: updated
Changed in grub2 (Ubuntu Hirsute):
assignee: nobody → dann frazier (dannf)
status: Confirmed → In Progress
dann frazier (dannf) on 2020-11-13
description: updated

Hello dann, or anyone else affected,

Accepted grub2 into groovy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/grub2/2.04-1ubuntu35.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-groovy to verification-done-groovy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-groovy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in grub2 (Ubuntu Groovy):
status: Confirmed → Fix Committed
tags: added: verification-needed verification-needed-groovy
Changed in grub2 (Ubuntu Focal):
status: Confirmed → Fix Committed
tags: added: verification-needed-focal
Łukasz Zemczak (sil2100) wrote :

Hello dann, or anyone else affected,

Accepted grub2 into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/grub2/2.04-1ubuntu26.7 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package grub2 - 2.04-1ubuntu36

---------------
grub2 (2.04-1ubuntu36) hirsute; urgency=medium

  * Avoid "EFI stub: FIRMWARE BUG" message when booting >= 5.7 kernels
    on arm64 by setting the image base address before jumping to the
    PE/COFF entry point LP: #1900774
  * Fix tftp timeouts when fetch large files. LP: #1900773

 -- dann frazier <email address hidden> Wed, 11 Nov 2020 07:17:49 -0700

Changed in grub2 (Ubuntu Hirsute):
status: In Progress → Fix Released
Łukasz Zemczak (sil2100) wrote :

Hello dann, or anyone else affected,

Accepted grub2 into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/grub2/2.02-2ubuntu8.20 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in grub2 (Ubuntu Bionic):
status: New → Fix Committed
tags: added: verification-needed-bionic
Łukasz Zemczak (sil2100) wrote :

Hello dann, or anyone else affected,

Accepted grub2 into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/grub2/2.02~beta2-36ubuntu3.29 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in grub2 (Ubuntu Xenial):
status: New → Fix Committed
tags: added: verification-needed-xenial

All autopkgtests for the newly accepted grub2 (2.04-1ubuntu26.7) for focal have finished running.
The following regressions have been reported in tests triggered by the package:

ubuntu-image/1.10+20.04ubuntu1 (arm64)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/focal/update_excuses.html#grub2

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

All autopkgtests for the newly accepted grub2 (2.04-1ubuntu35.1) for groovy have finished running.
The following regressions have been reported in tests triggered by the package:

grubzfs-testsuite/unknown (amd64)
ubuntu-image/unknown (amd64)
ubiquity/unknown (amd64)
zsys/unknown (amd64)
grml2usb/unknown (amd64)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/groovy/update_excuses.html#grub2

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

dann frazier (dannf) on 2020-11-18
Changed in grub2 (Ubuntu Groovy):
assignee: nobody → dann frazier (dannf)
Changed in grub2 (Ubuntu Focal):
assignee: nobody → dann frazier (dannf)
Changed in grub2 (Ubuntu Bionic):
assignee: nobody → dann frazier (dannf)
Changed in grub2 (Ubuntu Xenial):
assignee: nobody → dann frazier (dannf)
dann frazier (dannf) wrote :

= groovy verification =
                   GNU GRUB GNU GRUB version 2.04buntu3.29

   Minimal BASH-like line editing is supported. For the first word, TAB
   lists possible command completions. Anywhere else TAB lists possible
   device or file completions.

grub> net_dhcp efinet0
grub> linux (tftp,192.168.122.1)/vmlinuz.orig console=ttyS0,115200n8
grub> initrd (tftp,192.168.122.1)/initrd.img
grub>

= focal verification =
                             GNU GRUB version 2.04

   Minimal BASH-like line editing is supported. For the first word, TAB
   lists possible command completions. Anywhere else TAB lists possible
   device or file completions.

grub> net_dhcp efinet0
grub> linux (tftp,192.168.122.1)/vmlinuz.orig console=ttyS0,115200n8
grub> initrd (tftp,192.168.122.1)/initrd.img
grub>

= bionic verification =
                             GNU GRUB version 2.02

   Minimal BASH-like line editing is supported. For the first word, TAB
   lists possible command completions. Anywhere else TAB lists possible
   device or file completions.

grub> net_add_addr test efinet0 192.168.122.86
grub> linux (tftp,192.168.122.1)/vmlinuz.orig console=ttyS0,115200n8
grub> initrd (tftp,192.168.122.1)/initrd.img
grub>

= xenial verification =
                   GNU GRUB version 2.02~beta2-36ubuntu3.29

   Minimal BASH-like line editing is supported. For the first word, TAB
   lists possible command completions. Anywhere else TAB lists possible
   device or file completions.

grub> net_add_addr test efinet0 192.168.122.86
grub> linux (tftp,192.168.122.1)/vmlinuz.orig console=ttyS0,115200n8
grub> initrd (tftp,192.168.122.1)/initrd.img
grub>

tags: added: verification-done verification-done-focal verification-done-groovy verification-done-xenial
removed: verification-needed verification-needed-focal verification-needed-groovy verification-needed-xenial
Mathew Hodson (mhodson) on 2020-11-23
no longer affects: livecd-rootfs (Ubuntu)
no longer affects: livecd-rootfs (Ubuntu Focal)
no longer affects: livecd-rootfs (Ubuntu Groovy)
no longer affects: livecd-rootfs (Ubuntu Hirsute)
Łukasz Zemczak (sil2100) wrote :

The listed grub versions in the output look weird, but I trust that the right packages from -proposed have been used for validation.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package grub2 - 2.04-1ubuntu35.1

---------------
grub2 (2.04-1ubuntu35.1) groovy; urgency=medium

  * Avoid "EFI stub: FIRMWARE BUG" message when booting >= 5.7 kernels
    on arm64 by setting the image base address before jumping to the
    PE/COFF entry point LP: #1900774
  * Fix tftp timeouts when fetching large files. LP: #1900773

 -- dann frazier <email address hidden> Thu, 12 Nov 2020 16:08:57 -0700

Changed in grub2 (Ubuntu Groovy):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for grub2 has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package grub2 - 2.04-1ubuntu26.7

---------------
grub2 (2.04-1ubuntu26.7) focal; urgency=medium

  * Avoid "EFI stub: FIRMWARE BUG" message when booting >= 5.7 kernels
    on arm64 by setting the image base address before jumping to the
    PE/COFF entry point LP: #1900774
  * Fix tftp timeouts when fetching large files. LP: #1900773

 -- dann frazier <email address hidden> Thu, 12 Nov 2020 16:15:13 -0700

Changed in grub2 (Ubuntu Focal):
status: Fix Committed → Fix Released
tags: added: verification-done-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package grub2 - 2.02-2ubuntu8.20

---------------
grub2 (2.02-2ubuntu8.20) bionic; urgency=medium

  * Avoid "EFI stub: FIRMWARE BUG" message when booting >= 5.7 kernels
    on arm64 by setting the image base address before jumping to the
    PE/COFF entry point LP: #1900774
  * Fix tftp timeouts when fetching large files. LP: #1900773

 -- dann frazier <email address hidden> Fri, 13 Nov 2020 17:40:19 -0700

Changed in grub2 (Ubuntu Bionic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package grub2 - 2.02~beta2-36ubuntu3.29

---------------
grub2 (2.02~beta2-36ubuntu3.29) xenial; urgency=medium

  * Avoid "EFI stub: FIRMWARE BUG" message when booting >= 5.7 kernels
    on arm64 by setting the image base address before jumping to the
    PE/COFF entry point LP: #1900774
  * Fix tftp timeouts when fetching large files. LP: #1900773

 -- dann frazier <email address hidden> Fri, 13 Nov 2020 18:03:44 -0700

Changed in grub2 (Ubuntu Xenial):
status: Fix Committed → Fix Released