Grub fails to load kernel from squashfs if mem < 1500mb

Bug #1878541 reported by Michael Vogt
This bug affects 28 people
Affects           Importance   Assigned to
snapd             High         Unassigned
grub2 (Ubuntu)    High         Unassigned
  Focal           Undecided    Unassigned
  Groovy          High         Unassigned

Bug Description

[Impact]

 * loopback command uses too much ram, resulting in OOM on small machines

[Test Case]

 * Download and copy kernel.snap from the amd64 pc image onto the ESP partition

 * Boot VM with secureboot, uefi and tpm and drop into grub recovery shell

 * observe ram usage of the machine (for example by using virt-manager graphs)

 * execute "loopback loop0 /path/to/kernel.snap"

 * observe ram usage of the machine again.

 * The RAM usage should stay almost constant with the patched grub just like it did in bionic. If it grows by the size of the kernel.snap (~500MB+), it is booting using buggy grub as shipped in focal GA.
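
The pass/fail criterion above can be sketched as a small check. This is only an illustration: the function name loopback_verdict and the 50% threshold are assumptions, not part of the test case. The RSS readings are the ones reported in the verification comment later in this report, assumed to be in KiB, and the 284M snap size is the UC20 figure quoted in the comments.

```python
def loopback_verdict(rss_before_kib, rss_after_kib, snap_size_kib, threshold=0.5):
    """Classify a grub build from RSS readings taken around `loopback`.

    Returns "buggy" if RSS grew by at least `threshold` of the snap size,
    otherwise "patched". The threshold is an arbitrary illustrative choice.
    """
    growth = rss_after_kib - rss_before_kib
    return "buggy" if growth >= threshold * snap_size_kib else "patched"


snap_kib = 284 * 1024  # assumed snap size: 284M, as quoted for uc20

print(loopback_verdict(128264, 422636, snap_kib))  # readings with the old grub2
print(loopback_verdict(129244, 129452, snap_kib))  # readings with 2.04-1ubuntu26.3
```

With these readings the first call reports "buggy" (growth of roughly 287 MiB) and the second reports "patched" (growth of 208 KiB).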

[Regression Potential]

 * This patch changes UEFI secureboot verifier behaviour for the loopback command. The whole loopback file is no longer read & stored into memory.

This changes the PCR values. However, Ubuntu has not yet been using or sealing against that PCR value. Also, the same PCR value normally changes on every kernel/grub update, so the normal resealing procedure after a grub update would accommodate this change of the PCR value.

The loopback devices as a whole are no longer measured into the TPM and cannot be attested. To resurrect such behavior, there is an upstream design plan to allow storing hashes of all blocks and validating them with a reduced memory requirement. Currently this is deemed out of scope, and of low interest/priority.
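
As a rough illustration of the idea of storing per-block hashes instead of the whole image, here is a minimal sketch. It is not the upstream design: the block size, the function names hash_blocks and read_verified, and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib
import io

BLOCK_SIZE = 4096  # illustrative block size, not taken from any grub design


def hash_blocks(stream, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per block, reading one block at a time."""
    digests = []
    while True:
        block = stream.read(block_size)
        if not block:
            break
        digests.append(hashlib.sha256(block).digest())
    return digests


def read_verified(stream, index, digests, block_size=BLOCK_SIZE):
    """Re-read block `index` and check it against its stored digest."""
    stream.seek(index * block_size)
    block = stream.read(block_size)
    if hashlib.sha256(block).digest() != digests[index]:
        raise ValueError(f"block {index} changed after it was hashed")
    return block


image = io.BytesIO(bytes(range(256)) * 64)  # 16 KiB stand-in for a disk image
digests = hash_blocks(image)
print(len(digests))                           # 4 blocks of 4096 bytes
print(len(read_verified(image, 2, digests)))  # 4096
```

Only the digests stay in memory (32 bytes per 4096-byte block in this sketch), and a block that changes between hashing and use is rejected when it is read back.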

[Other Info]

[Original bug report]

Booting a uc20 system fails early currently. The image used was:
http://cdimage.ubuntu.com/ubuntu-core/20/beta/20200513.2/

Attached is a screenshot of the debug output.

This appears to be some sort of regression with grub in 20.04 or with UEFI grub - this used to work in uc18.

Note that there is memory < 1500mb


Michael Vogt (mvo) wrote :
Michael Vogt (mvo)
summary: - uc20 image fails with 512mb ram
+ Grub fails to load kernel from squashfs if mem < 1500mb
description: updated
Michael Vogt (mvo)
tags: added: rls-gg-incoming
description: updated
Michael Vogt (mvo) wrote :

Dimitri suggested sorting the squashfs with the "-sort" option. I created the attached test for this, but it has no effect for me.

tags: added: uc20
Changed in snapd:
importance: Undecided → High
Michael Vogt (mvo)
Changed in grub2 (Ubuntu):
importance: Undecided → High
Dimitri John Ledkov (xnox) wrote :

Note that the test script specifies 512MB, which is quite small.
Previously, we wanted to ensure that the amd64 reference target is "a typical NUC with TPMv2.0 and secureboot"; at the time, typical NUC models had 2GB of RAM.

What is the target minimum RAM usage we must achieve?

Dimitri John Ledkov (xnox) wrote :

I started uc20 in a virsh domain while tracking peak memory usage, and modified the command line to boot with "rdinit=/bin/sh", meaning: boot to the unpacked initrd and start a busybox shell without doing anything else.

The RSS needed to get to that point was 744684, out of 2033104 available (I'm not sure which units virsh is using here, but it is ~740MB out of 2048MB).
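
For reference, a minimal conversion sketch, assuming the figures are in KiB (an assumption: the comment itself leaves the unit open). The helper name kib_to_mib is made up for the example.

```python
def kib_to_mib(kib):
    """Convert KiB to MiB (1 MiB = 1024 KiB)."""
    return kib / 1024


rss, available = 744684, 2033104  # figures quoted above, unit assumed to be KiB

print(f"{kib_to_mib(rss):.0f} MiB of {kib_to_mib(available):.0f} MiB "
      f"({rss / available:.1%})")
# 727 MiB of 1985 MiB (36.6%)
```

Under that assumption the result is in the same range as the rough "~740MB out of 2048MB" estimate.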

Note that on any other platform or mode, we do not loop-mount an xz-compressed snap. And we have stopped using lzma/xz for kernel image or module compression throughout Ubuntu.

The next step is to try booting with kernel.snap without compression, or unpacked.

Dimitri John Ledkov (xnox) wrote :

Sorting didn't change peak RSS much; it's at 742524.

Dimitri John Ledkov (xnox) wrote :

lzo compression ended up using more: 797568.

Also, it feels like we try to read the _whole_ of the snap prior to loading it.

As if a measurement of the whole squashfs / partition is taken.

Dimitri John Ledkov (xnox) wrote :

With an unpacked kernel.efi, booting to rdinit=/bin/sh, RSS usage is 456756.

So it feels as if the (loop) device is not freed by grub / shim / firmware.

Next up is to play with things interactively in the grub shell, to figure out which commands cause memory to balloon.

Or see if it can be freed after loading things from squashfs.

UC18 loads kernels from squashfs in under 512MB => compare whether grub in uc18 is better.

Dimitri John Ledkov (xnox) wrote :

UC18 size:
8100 kernel.img
3808 initrd.img

~12MB, loaded from .snap, on ext4

UC20 size:
48196 kernel.efi

~50MB, loaded from .snap, on fat

More than 4x larger

Dimitri John Ledkov (xnox) wrote :

The kernel snap sizes are roughly similar.

204M for uc18
284M for uc20

1.4x larger
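
The ratios quoted in this comment and the previous one can be checked with a few lines of arithmetic. The variable names are made up for the example, and the file sizes are assumed to be in KiB, which matches the "~12MB" and "~50MB" totals.

```python
uc18_boot_kib = 8100 + 3808  # kernel.img + initrd.img
uc20_boot_kib = 48196        # kernel.efi

boot_ratio = uc20_boot_kib / uc18_boot_kib
snap_ratio = 284 / 204       # uc20 vs uc18 kernel snap size, in M

print(round(boot_ratio, 2))  # 4.05
print(round(snap_ratio, 2))  # 1.39
```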

Dimitri John Ledkov (xnox) wrote :

The minimal reproducer I have is this:

1) Fetch the UC20 image from http://cdimage.ubuntu.com/ubuntu-core/20/edge/pending/
2) Boot to the grub cmdline prompt
3) Execute

loopback loop (hd0,gpt3)/pc-kernel_502.snap

(or use tab completion for the right kernel snap)

The equivalent command on a UC18 image (with bionic's grub) results in no additional memory being used, with the same kernel snap.

On the UC20 image, executing that command uses up 400MB of RAM, which does not appear to be reclaimed.

The underlying fs type appears to be irrelevant (UC18 had the kernel snap on ext4, UC20 has it on ESP/fat).

Changed in grub2 (Ubuntu):
status: New → Confirmed
Dimitri John Ledkov (xnox) wrote :

Seems to work fine under BIOS; loopback loop does not appear to use up any more memory.

It feels like a bug in EFI memory page allocation, where pages never get released. And/or in the max_agglomerate implementation under EFI.

Chris Coulson (chrisccoulson) wrote :

I did a bit of digging on this, and it seems to happen because the grub verifier module reads into memory the entire contents of any file that is opened via grub_file_open without the GRUB_FILE_TYPE_SKIP_SIGNATURE flag, or any file which doesn't have a type of GRUB_FILE_TYPE_SIGNATURE or GRUB_FILE_TYPE_VERIFY_SIGNATURE. It does this so that it can provide the file contents to the registered verifier modules, and then serve the verified contents to the grub file API from memory without having to load them from disk again (which would obviously be vulnerable to TOCTOU type bugs).

Configuring a loopback device via the loopback command opens the underlying disk image, which results in grub's verifier code reading the entire image into memory. In the case of booting a UC20 recovery system, the loopback image is the kernel snap squashfs. This doesn't happen with the UC18 version of grub because it doesn't ship the verifier module, which is pulled in in UC20 because of the TPM verifier module. (The TPM verifier just calculates a hash of the file contents and measures it to PCR9.)

I'm not sure that passing loopback image files through the verifier module is a sensible default. The loopback device is just another disk backend, and grub doesn't pass entire physical disk images through the verifier. It seems weird that loopback images would be treated differently, particularly because files opened from the filesystem within the loopback image will be passed through the verifier.

I tested a local build of grub with the attached patch, and was able to boot a UC20 recovery kernel via a loop mounted kernel snap squashfs in a VM with 512MB of RAM. I'm not sure if it's the correct fix for this though.
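
The behaviour described in this comment can be modelled with a toy example. This is not grub's code and not the attached patch: the class ToyVerifier and the skip_verification parameter are hypothetical names that only illustrate why buffering a file for verification costs memory proportional to its size, and why bypassing the verifier for the loopback image avoids that cost.

```python
import io


class ToyVerifier:
    """Toy stand-in for a verifier layer; not grub's actual code."""

    def __init__(self):
        self.buffered_bytes = 0  # bytes currently held in memory

    def open(self, raw, skip_verification=False):
        if skip_verification:
            # Pass the underlying stream through; nothing is buffered.
            return raw
        # Read everything once so it can be measured and then served
        # from memory, which avoids re-reading from disk (TOCTOU).
        data = raw.read()
        self.buffered_bytes += len(data)
        return io.BytesIO(data)


snap = bytes(1024 * 1024)  # 1 MiB stand-in for a kernel snap

verified = ToyVerifier()
verified.open(io.BytesIO(snap))
print(verified.buffered_bytes)  # 1048576

skipped = ToyVerifier()
skipped.open(io.BytesIO(snap), skip_verification=True)
print(skipped.buffered_bytes)  # 0
```

In the toy model, files later opened from inside the loopback image would still go through open() without the skip, which mirrors the point above that they are still passed through the verifier.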

Chris Coulson (chrisccoulson) wrote :

Hi Colin, I wouldn't mind hearing your thoughts on the previous comment.

tags: added: patch
tags: added: id-5ec540751c801c607c3d8c33
tags: removed: rls-gg-incoming
Julian Andres Klode (juliank) wrote :

The patch looks right to me.

Changed in grub2 (Ubuntu Groovy):
status: Confirmed → Triaged
Claudio Matsuoka (cmatsuoka) wrote :

Chris Coulson's patch should also solve the problem that breaks install on the Thinkcentre m920s with TPM enabled. The last printed message when booting with grub debug enabled is the type of the loopback file, and nothing happens after that. It finishes installing if you rmmod tpm.

Changed in grub2 (Ubuntu Groovy):
status: Triaged → In Progress
Julian Andres Klode (juliank) wrote :

An easy minimal test case would be appreciated. I guess I could just put grub into a directory and then tftp boot that inside qemu, and add a large file in there or something? (or use -vfat on a dir)

Changed in grub2 (Ubuntu Groovy):
status: In Progress → Fix Committed
Julian Andres Klode (juliank) wrote :
description: updated
Changed in snapd:
status: New → In Progress
Julian Andres Klode (juliank) wrote :
Changed in grub2 (Ubuntu Focal):
status: New → In Progress
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package grub2 - 2.04-1ubuntu28

---------------
grub2 (2.04-1ubuntu28) groovy; urgency=medium

  * Ensure that grub-multi-install can always find templates (LP: #1879948)
  * Fix changelog entries for security update

 -- Julian Andres Klode <email address hidden> Mon, 10 Aug 2020 15:07:29 +0200

Changed in grub2 (Ubuntu Groovy):
status: Fix Committed → Fix Released
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Michael, or anyone else affected,

Accepted grub2 into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/grub2/2.04-1ubuntu26.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in grub2 (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-focal
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (grub2/2.04-1ubuntu26.3)

All autopkgtests for the newly accepted grub2 (2.04-1ubuntu26.3) for focal have finished running.
The following regressions have been reported in tests triggered by the package:

ubuntu-image/1.9+20.04ubuntu1 (arm64)
grubzfs-testsuite/0.4.10 (amd64)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/focal/update_excuses.html#grub2

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Dimitri John Ledkov (xnox) wrote :

Booted an existing core20 VM with the old grub2.
RSS went up from 128264 to 422636 after executing loopback loop1 (hd0,gpt2)/snaps/pc-kernel_565.snap.

Replaced grubx64.efi with the one from grub-efi-amd64-signed_1.142.5+2.04-1ubuntu26.3_amd64.deb.

The loopback command was very quick, and RSS went up from 129244 to just 129452.

So this is good!

tags: added: verification-done verification-done-focal
removed: verification-needed verification-needed-focal
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for grub2 has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package grub2 - 2.04-1ubuntu26.3

---------------
grub2 (2.04-1ubuntu26.3) focal; urgency=medium

  * 2.04-1ubuntu27 and 2.04-1ubuntu28 folded together for focal
  * debian/patches/ubuntu-flavour-order.patch:
    - Add a (hidden) GRUB_FLAVOUR_ORDER setting that can mark certain kernel
      flavours as preferred, and specify an order between those preferred
      flavours (LP: #1882663)
  * debian/patches/ubuntu-zfs-enhance-support.patch:
    - Use version_find_latest for ordering kernels, so it also supports
      the GRUB_FLAVOUR_ORDER setting.
  * debian/patches/ubuntu-dont-verify-loopback-images.patch:
    - disk/loopback: Don't verify loopback images (LP: #1878541),
      Thanks to Chris Coulson for the patch
  * debian/patches/ubuntu-recovery-dis_ucode_ldr.patch
    - Pass dis_ucode_ldr to kernel for recovery mode (LP: #1831789)
  * debian/patches/ubuntu-add-initrd-less-boot-fallback.patch:
    - Merge changes from xnox to fix multiple initrds support (LP: #1878705)
  * debian/patches/ubuntu-clear-invalid-initrd-spacing.patch:
    - Remove, no longer needed thanks to xnox's patch
  * Ensure that grub-multi-install can always find templates (LP: #1879948)

 -- Julian Andres Klode <email address hidden> Mon, 17 Aug 2020 16:04:31 +0200

Changed in grub2 (Ubuntu Focal):
status: Fix Committed → Fix Released
tags: added: fr-167
Michael Vogt (mvo) wrote :

This is fixed now.

Changed in snapd:
status: In Progress → Fix Released