Kernel sometimes panics during early boot if CPU microcode archive prepended to initramfs

Bug #1743798 reported by David McBride
This bug affects 5 people
Affects: linux-hwe (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

As part of my response to the recent Meltdown and Spectre security issues, I've started deploying the intel-microcode package (initially, version 3.20170707.1~ubuntu16.04.0, though I realise this does not include recent security-related fixes) to desktops and servers equipped with Intel CPUs.

This has caused machine boots to sometimes fail, though the behaviour does not appear deterministic.

The error reported by the kernel is:

 initramfs unpacking failed: junk in compressed archive

This then immediately leads to the kernel panicking, as the initramfs is needed for mounting the local root filesystem.

(Fortunately, I have set the panic=300 kernel command-line option, so physical machines that panic in this way will auto-reboot after 5 minutes, and can thus be rescued from afar via network boot.)
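For anyone wanting to confirm which panic timeout a machine actually booted with, here is a minimal sketch that extracts the `panic=` value from a kernel command line. The function name is hypothetical, and it assumes the usual semantics of the option: a positive value is the number of seconds to wait before rebooting, zero means wait forever, and a negative value means reboot immediately.

```python
def panic_timeout_seconds(cmdline):
    """Return the integer value of the last panic=N option, or None if absent.

    `cmdline` is the kernel command line as a single string, for example
    the contents of /proc/cmdline on a running system.
    """
    timeout = None
    for token in cmdline.split():
        if token.startswith("panic="):
            try:
                timeout = int(token[len("panic="):])
            except ValueError:
                # Ignore malformed values rather than guessing.
                pass
    return timeout


# Illustrative command line (made up for this example, not from an affected host).
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro panic=300 quiet"
timeout = panic_timeout_seconds(example)
minutes = timeout / 60  # 300 seconds is the 5 minutes mentioned above
```

On a live system the same function could be fed the text read from /proc/cmdline.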

I've seen these failures on two different varieties of desktop (one HP/Compaq, one Dell), and also on VMs hosted by VMware. I believe that this problem is a non-deterministic race condition during the machine's early boot sequence, probably in the kernel, as the same machine with the same disk contents can exhibit either working or failing behaviour on subsequent boot attempts.

Unfortunately, this particular error message appears in three different places in init/initramfs.c, so it's not precisely clear what specific problem is occurring.

This problem has been difficult to reproduce reliably. Machines that are affected by this issue typically present it on most boot attempts, but this cannot be relied on.

Attempting to gather more information from the kernel via the 'debug' command-line option produces more data, but this is difficult to capture. Attempting to also add "console=ttyS0" on a VM that was reliably presenting this problem caused the error to stop triggering, presumably due to changed timing.

The intel-microcode package works by prepending a CPIO archive to the prepared initramfs image. This archive contains the microcode files, with predictable names, for early application by the kernel.
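To make the compound structure concrete, here is a rough sketch of how one might inspect an initramfs file from userspace: it walks a leading cpio archive to its `TRAILER!!!` entry, skips any zero padding that follows, and labels the next segment by its magic bytes. It assumes the leading archive uses the "newc" cpio format (the format used for initramfs images); the function names and the short table of magics are my own choices for illustration, and this is not a reimplementation of what init/initramfs.c does.

```python
NEWC_MAGIC = b"070701"  # ASCII magic of the "newc" cpio format
HEADER_LEN = 110        # 6-byte magic + 13 fields of 8 hex digits each

# A few well-known magics, used only to label the segment after the cpio.
MAGICS = [
    (b"\x1f\x8b", "gzip"),
    (b"\xfd\x37\x7a\x58\x5a\x00", "xz"),
    (b"\x28\xb5\x2f\xfd", "zstd"),
    (b"\x02\x21\x4c\x18", "lz4 (legacy)"),
    (b"BZh", "bzip2"),
    (NEWC_MAGIC, "cpio (newc)"),
]


def _align4(offset):
    """Round up to the 4-byte boundary that newc entries are padded to."""
    return (offset + 3) & ~3


def first_cpio_end(data):
    """Return the offset just past the TRAILER!!! entry of a leading newc archive."""
    pos = 0
    while True:
        header = data[pos:pos + HEADER_LEN]
        if len(header) < HEADER_LEN or not header.startswith(NEWC_MAGIC):
            raise ValueError("no newc cpio header at offset %d" % pos)
        filesize = int(header[54:62], 16)   # 7th header field
        namesize = int(header[94:102], 16)  # 12th header field, includes the NUL
        name_start = pos + HEADER_LEN
        name = data[name_start:name_start + namesize].rstrip(b"\0")
        data_start = _align4(name_start + namesize)
        pos = _align4(data_start + filesize)
        if name == b"TRAILER!!!":
            return pos


def describe_layout(data):
    """Return (cpio_end, next_segment_offset, label_of_next_segment)."""
    cpio_end = first_cpio_end(data)
    nxt = cpio_end
    while nxt < len(data) and data[nxt] == 0:  # skip zero padding
        nxt += 1
    if nxt >= len(data):
        return cpio_end, nxt, "end of file"
    for magic, label in MAGICS:
        if data.startswith(magic, nxt):
            return cpio_end, nxt, label
    return cpio_end, nxt, "unknown"
```

Given the bytes of an image under /boot (read in binary mode), `describe_layout` reports where the leading archive ends and what appears to follow it. An image without a prepended cpio archive raises `ValueError` at offset 0.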

See also: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/x86/microcode.txt

Removing the intel-microcode package, and thus regenerating initramfs files without any CPIO archive prepended to them, appears to prevent this issue from triggering. My suspicion is that the kernel is failing to handle this compound archive structure in a reliable way.

However, it's conceivable that this problem is not in the Linux kernel, but in the GRUB2 bootloader in use on these machines. As I understand things, it is the responsibility of GRUB2 to read the kernel and initramfs files from disk and load them both into memory before starting the kernel. The defect may therefore be that GRUB2 is failing to reliably parse the btrfs root filesystem data structures, in which case the kernel is correctly rejecting an invalid initramfs payload passed to it.

However, given I've been successfully using GRUB2 and btrfs in this way without issue for some years with a variety of kernels and initramfs configurations, this strikes me as being less likely.

I have no reason to believe that this issue is limited to this (major) version of the kernel.

David McBride (david-mcbride) wrote :
Cody J. Egan (codyegan14) wrote :

This exact same issue has happened to me as well. I attempted to run a live CD of Ubuntu 19.04 on an HP g72 notebook PC and came across this exact error.

Nikolas Zimmermann (nzimmermann) wrote :

I have the very same issue with a ThinkPad P53, two days old, and Ubuntu 19.10.
The kernel panic looks the same.

I tried many BIOS settings, without any luck - the problem persists.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-hwe (Ubuntu):
status: New → Confirmed
Liam Proven (lproven) wrote :

I had the same issue with 20.04 on a Thinkpad X220.

I managed to resolve it by installing the HWE kernel, adding a dedicated swap partition on another drive, purging ZRAM, and rebuilding my `initrd`.
