initrdfail can result in resuming with different initrd images and hanging resume

Bug #1929860 reported by Francis Ginther
This bug affects 1 person
Affects         Status        Importance  Assigned to          Milestone
grub2 (Ubuntu)  Fix Released  Critical    Unassigned
Bionic          In Progress   Undecided   Julian Andres Klode
Focal           Fix Released  Undecided   Unassigned

Bug Description

[Impact]
Ubuntu Focal (and newer releases) on AWS will normally boot without an initrd image (just the microcode.cpio). There is a fallback mechanism to reboot with the full initrd image when the boot fails to complete. The grub environment variable "initrdfail" is used to track when a boot failed and to switch between the optimized initrd-less boot path and the full initrd path.

On a normal successful boot, the "initrdfail" variable is cleared by grub-initrd-fallback.service. However, this doesn't happen when resuming from hibernation. As a result, the initrd fallback will get triggered on the second hibernation / resume cycle despite the original boot using only the microcode.cpio. This switch in initrd images leads to the second resume hanging.
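
To see whether the fallback is currently armed, one can inspect the grub environment. The following is a minimal sketch, not part of grub2: the helper name grubenv_has is hypothetical, and it assumes the "name=value" line format printed by `grub-editenv - list`.

```shell
#!/bin/sh
# Hypothetical helper (not shipped by grub2): reads "name=value" lines on
# stdin, as printed by `grub-editenv - list`, and succeeds if the variable
# named in $1 is present.
grubenv_has() {
    grep -q "^$1="
}

# Example use on a real system (commented out because it needs grub-editenv):
#   grub-editenv - list | grubenv_has initrdfail && echo "initrdfail is set"
```

If initrdfail is still reported as set after a resume, the next boot would be expected to take the full initrd path described above.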

We've been able to successfully avoid this issue by adding the following to the ec2-hibinit-agent resume handler:

/usr/bin/grub-editenv - unset initrdfail
/usr/bin/grub-editenv - unset recordfail

(Note: clearing recordfail may not be necessary; we will need to try again without it.)

This bug was filed against grub2 as it appears to own initrdfail.

[Test plan]
TBD w/ CPC

[Regression potential]
The services are changed to type oneshot and are now wanted by both the multi-user and sleep targets. We may have missed other places where they should run, or they may record the wrong state on resume.

affects: ec2-hibinit-agent (Ubuntu) → grub2 (Ubuntu)
tags: added: rls-ii-incomings
tags: added: fr-1421
Changed in grub2 (Ubuntu):
status: New → In Progress
importance: Undecided → Critical
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package grub2 - 2.04-1ubuntu46

---------------
grub2 (2.04-1ubuntu46) impish; urgency=medium

  * debian/grub-common.service: change type to oneshot, add wantedby
    sleep.target, after sleep.target. The service will now start after
    resume from hibernation. LP: #1929860
  * grub-initrd-fallback.service: add wantedby sleep.target, after
    sleep.target. The service will now start after resume from
    hibernation. LP: #1929860
  * cherrypick upstream fix to make armhf efi boot work. LP: #1788940
  * debian/rules: disable LTO. LP: #1922005
  * grub-initrd-fallback.service, debian/grub-common.service: only start
    units when booted with grub. Use presence of /boot/grub/grub.cfg as
    proxy. LP: #1925507
  * tests: patch qemu command to use ide-hd instead of the removed
    ide-drive.

 -- Dimitri John Ledkov <email address hidden> Fri, 16 Jul 2021 14:01:31 +0100

Changed in grub2 (Ubuntu):
status: In Progress → Fix Released
Changed in grub2 (Ubuntu Focal):
status: New → Triaged
Changed in grub2 (Ubuntu Focal):
milestone: none → ubuntu-20.04.3
description: updated
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Francis, or anyone else affected,

Accepted grub2 into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/grub2/2.04-1ubuntu26.13 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in grub2 (Ubuntu Focal):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-focal
Francis Ginther (fginther) wrote :

@juliank

This does not appear to be a complete fix for the issue seen on AWS t2.nano. I see failures in about 5 - 10% of cases. There are two different symptoms:

 * grub-initrd-fallback.service sometimes runs right before the system hibernates (it runs after the hibernation request was sent to the VM and before it fully hibernates).
 * grub-initrd-fallback.service sometimes runs after resume, but checking the status of `initrdfail` a few minutes later indicates it is still set (not sure if it was ever cleared).

Both of these were determined by checking the last active timestamp reported by 'systemctl status grub-initrd-fallback.service' and comparing this with timestamps generated by the hibernation test. If either situation occurs, the next hibernation/resume will fail.
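
The timestamp comparison described above can be sketched as follows. This is an illustration only, not the actual test code: the helper name ran_after_resume and the RESUME_EPOCH variable are hypothetical, and reading the ExecMainExitTimestamp property is just one possible way to obtain the service's last run time.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the service's last run (epoch seconds, $1)
# happened at or after the resume time (epoch seconds, $2).
ran_after_resume() {
    [ "$1" -ge "$2" ]
}

# Example use on a real system (commented out; assumes GNU date and that
# RESUME_EPOCH was recorded by the hibernation test):
#   ts=$(systemctl show -p ExecMainExitTimestamp --value grub-initrd-fallback.service)
#   ran_after_resume "$(date -d "$ts" +%s)" "$RESUME_EPOCH" || echo "service last ran before resume"
```

A run that predates the resume corresponds to the first symptom; a run after the resume with initrdfail still set corresponds to the second.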

I'm trying to collect more information.

Łukasz Zemczak (sil2100) wrote :

Thank you Francis for checking up on this! In this case we'll probably not block our usual 20.04.3 release on this. Would it be fine if this gets fixed after the regular flavors of 20.04.3 are released?

Francis Ginther (fginther) wrote :

@sil2100

Following up on our chat conversation. I actually think this grub2 package update could be released. It greatly improves the situation and I think it is doing the right thing. But I think there may be other external factors in play that result in these corner cases.

I can give a verification done.

Brian Murray (brian-murray) wrote :

I tested the grub2 update on a virtual machine and was able to reboot and login without any issues.

bdmurray@clean-focal-amd64:~$ dpkg -l | grep ii.*grub
ii grub-common 2.04-1ubuntu26.13 amd64 GRand Unified Bootloader (common files)
ii grub-gfxpayload-lists 0.7 amd64 GRUB gfxpayload blacklist
ii grub-pc 2.04-1ubuntu26.13 amd64 GRand Unified Bootloader, version 2 (PC/BIOS version)
ii grub-pc-bin 2.04-1ubuntu26.13 amd64 GRand Unified Bootloader, version 2 (PC/BIOS modules)
ii grub2-common 2.04-1ubuntu26.13 amd64 GRand Unified Bootloader (common files for version 2)

Brian Murray (brian-murray) wrote :

I also tested this update on an Intel NUC and I was able to reboot w/o any issues.

bdmurray@atom:~$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
bdmurray@atom:~$ uname -a
Linux atom 5.4.0-81-generic #91-Ubuntu SMP Thu Jul 15 19:09:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
bdmurray@atom:~$ dpkg -l | grep ii.*grub
ii grub-common 2.04-1ubuntu26.13 amd64 GRand Unified Bootloader (common files)
ii grub-efi-amd64 2.04-1ubuntu44.2 amd64 GRand Unified Bootloader, version 2 (EFI-AMD64 version)
ii grub-efi-amd64-bin 2.04-1ubuntu44.2 amd64 GRand Unified Bootloader, version 2 (EFI-AMD64 modules)
ii grub-efi-amd64-signed 1.167.2+2.04-1ubuntu44.2 amd64 GRand Unified Bootloader, version 2 (EFI-AMD64 version, signed)
ii grub2-common 2.04-1ubuntu26.13 amd64 GRand Unified Bootloader (common files for version 2)
ii ubuntu-recovery-grub-hotkey 1.1columbia1 all Ubuntu Recovery Grub Hotkey Configuration

Francis Ginther (fginther) wrote :

Adding verification-done as this does fix the problem in nearly all testing. I think there is a corner case or something specific to the testing which is causing these outliers.

tags: added: verification-done-focal
removed: verification-needed-focal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package grub2 - 2.04-1ubuntu26.13

---------------
grub2 (2.04-1ubuntu26.13) focal; urgency=medium

  [ Julian Andres Klode ]
  * unapply all patches, use gbp pq instead of git-dpm

  [ Dimitri John Ledkov ]
  * 10_linux: emit messages when initrdless boot is configured, attempted and
    fails triggering fallback. LP: #1901553
  * grub-common.service: port init.d script to systemd unit. Add warning
    message, when initrdless boot fails triggering fallback. LP: #1901553
  * debian/grub-common.service: change type to oneshot, add wantedby
    sleep.target, after sleep.target. The service will now start after resume
    from hibernation. (LP: #1929860)
  * grub-initrd-fallback.service: add wantedby sleep.target, after
    sleep.target. The service will now start after resume from hibernation.
    LP: #1929860
  * grub-initrd-fallback.service, debian/grub-common.service: only start units
    when booted with grub. Use presence of /boot/grub/grub.cfg as proxy. LP:
    #1925507

 -- Julian Andres Klode <email address hidden> Thu, 12 Aug 2021 11:18:25 +0200

Changed in grub2 (Ubuntu Focal):
status: Fix Committed → Fix Released
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for grub2 has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Changed in grub2 (Ubuntu Bionic):
status: New → In Progress
Steve Langasek (vorlon) wrote : Proposed package upload rejected

An upload of grub2 to bionic-proposed has been rejected from the upload queue for the following reason: "The debdiff for this is a mess, there is delta to all the packages under debian/patches, impossible to review; please reupload without the extra delta".

tags: added: foundations-todo
tags: removed: fr-1421
Changed in grub2 (Ubuntu Bionic):
assignee: nobody → Julian Andres Klode (juliank)
tags: removed: foundations-todo