cc_grub_dpkg was fixed to support NVMe drives, but the fix didn't clear cc_grub_dpkg's state and didn't re-run it on upgrades
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| cloud-init (Ubuntu) | Fix Released | Undecided | Dan Watkins | |
| Xenial | Fix Released | Undecided | Unassigned | |
| Bionic | Fix Released | Undecided | Unassigned | |
| Focal | Fix Released | Undecided | Unassigned | |
| Groovy | Fix Released | Undecided | Dan Watkins | |
Bug Description
=== Begin SRU Template ===
[Impact]
Older versions of cloud-init could misconfigure grub on nvme devices,
which could prevent instances from booting after a grub upgrade.
[Test Case]
For focal, bionic, and xenial verify the following:
1. on an affected instance, test that installing the new version of cloud-init appropriately updates debconf
2. on an affected instance, modify the debconf settings and test that installing the new version of cloud-init does not touch those values
3. in a container, confirm that cloud-init does not touch the values
4. in an unaffected instance (i.e. one without an NVMe root), confirm that cloud-init does not touch the values
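As context for the four cases above, the affected/unaffected distinction can be sketched as a small shell helper (hypothetical, not part of cloud-init; the manual steps below use lsblk and debconf-show directly):

```shell
# Hypothetical helper: an instance counts as "affected" when its root
# disk is NVMe but debconf still records a different (pre-fix) grub
# install device such as /dev/sda.
is_affected() {
    root_disk="$1"    # e.g. from: lsblk -no PKNAME "$(findmnt -no SOURCE /)"
    configured="$2"   # e.g. from: debconf-show grub-pc | grep install_devices
    case "$root_disk" in
        /dev/nvme*) [ "$configured" != "$root_disk" ] ;;
        *) return 1 ;;  # non-NVMe root: not affected by this bug
    esac
}
```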
Steps for test 1:
# Find an old affected image with
aws ec2 describe-images --filters "Name=name,
# Launch an AWS instance with the affected image-id
# After startup, connect via SSH, then
# Verify we're on an nvme device
lsblk | grep nvme
# Verify install_devices set incorrectly
debconf-show grub-pc | grep "install_devices:"
# update cloud-init to proposed
mirror=http://
echo deb $mirror $(lsb_release -sc)-proposed main | tee /etc/apt/
apt-get update -q
apt-get install -qy cloud-init
# Verify "Reconfiguring grub" message in upgrade output
# Verify install_devices set correctly
debconf-show grub-pc | grep "install_devices:"
# Verify that after reboot we can still connect
Steps for test 2:
# Find an old affected image with
aws ec2 describe-images --filters "Name=name,
# Launch an AWS instance with the affected image-id
# After startup, connect via SSH, then
# Verify we're on an nvme device
lsblk | grep nvme
# Verify install_devices set incorrectly
debconf-show grub-pc | grep "install_devices:"
# Update install device to something (anything) else
echo 'set grub-pc/
# update cloud-init to proposed
mirror=http://
echo deb $mirror $(lsb_release -sc)-proposed main | tee /etc/apt/
apt-get update -q
apt-get install -qy cloud-init
# Verify no "Reconfiguring grub" message in upgrade output
# Verify install_devices not changed
debconf-show grub-pc | grep "install_devices:"
Steps for test 3:
# Launch a container from an affected image
lxc launch <image> <name>
# Obtain a bash shell in the container
lxc exec <name> bash
# Check install_devices
debconf-show grub-pc | grep "install_devices:"
# Update cloud-init to proposed
mirror=http://
echo deb $mirror $(lsb_release -sc)-proposed main | tee /etc/apt/
apt-get update -q
apt-get install -qy cloud-init
# Verify no "Reconfiguring grub" message in upgrade output
# Verify install_devices not changed
debconf-show grub-pc | grep "install_devices:"
Steps for test 4:
# Launch GCE image with:
gcloud compute instances create falcon-test --image <image> --image-project ubuntu-os-cloud --zone=
# After startup, connect via SSH, then
# Verify we're not on an nvme device
lsblk | grep nvme
# Check install_devices
debconf-show grub-pc | grep "install_devices:"
# update cloud-init to proposed
# Verify "Reconfiguring grub" message not in upgrade output
# Verify install_devices not changed
debconf-show grub-pc | grep "install_devices:"
# Verify that after reboot we can still connect
[Regression Potential]
If a user manually configured their system in such a way that both devices
exist and the configuration matches our error condition, the grub install
device could be reconfigured incorrectly.
[Other Info]
Pull request: https:/
Upstream commit: https:/
==== Original Description ====
cc_grub_dpkg was fixed to support NVMe drives, but the fix didn't clear cc_grub_dpkg's state and didn't re-run it on upgrades.
However, that only fixed the issue for newly first-booted NVMe instances.
All existing NVMe instances on which cloud-init has already run are still broken, and will fail to apply the latest grub2 update for the BootHole mitigation.
Please add maintainer-script changes to re-run cc_grub_dpkg, once only, when cloud-init is upgraded to a new SRU, to ensure that cc_grub_dpkg has been re-run once since the NVMe fixes.
You could guard this call so it only runs when the grub-pc devices recorded in the debconf database do not exist on the instance (i.e. debconf has /dev/sda, yet /dev/sda does not exist).
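That guard could look roughly like the following sketch (illustrative only; the function name and flow are assumptions, not the shipped maintainer-script code):

```shell
# Sketch of the suggested guard: only re-run cc_grub_dpkg when a device
# recorded in debconf's grub-pc/install_devices no longer exists.
needs_reconfigure() {
    configured="$1"   # e.g. "/dev/sda" or "/dev/sda, /dev/sdb" from debconf
    for dev in $(echo "$configured" | tr ',' ' '); do
        # A recorded device that is missing means the config is stale.
        [ -e "$dev" ] || return 0
    done
    return 1   # every recorded device exists; leave debconf alone
}
```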
information type: Public → Public Security
tags: added: regression-update
Changed in cloud-init (Ubuntu Groovy):
status: New → In Progress
assignee: nobody → Dan Watkins (oddbloke)
description: updated
OK, so the issue we're dealing with here is that bug 1877491 fixed the grub install device for _new_ NVMe instances, but it did not fix it on existing NVMe instances. So, for existing instances, they will still have an incorrect grub install device configured (something like /dev/sda).
grub has two parts: the core and its modules. These two components are expected to be updated in lockstep. If each component is using a different ABI (i.e. they are not in lockstep), then systems will fail to boot. The components are kept in lockstep by the grub packaging; it will install the core to the grub install device(s) to ensure this.
For NVMe systems which have an incorrect grub install device configured (i.e. any which were launched before 2020/07/15), the grub packaging will fail to perform this core installation. This means that the core and modules will be using incompatible ABIs, so such systems will fail to boot when next rebooted.
(Note that the core/modules ABI does not change on every update to the grub package, so this mismatched boot failure will not be observed on every grub update. This bug has been filed, however, because we _do_ have such a grub update pending/in progress.)
There is a grub bug (bug 1889556) for handling the case where the core and modules are mismatched, but the solution there will require manual user intervention. cloud-init can fix this for NVMe drives non-interactively by redetermining the grub install devices in its postinst using its existing logic.
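The redetermination turns on mapping the root partition back to its parent disk, which differs for NVMe (partition names gain a `p` separator). A minimal sketch of that mapping, assuming simple device naming (hypothetical helper, not cloud-init's actual code):

```shell
# Map a root partition device to its parent disk, so the result can be
# written back to debconf's grub-pc/install_devices.
parent_disk() {
    case "$1" in
        /dev/nvme*p[0-9]*) echo "${1%p[0-9]*}" ;;  # /dev/nvme0n1p1 -> /dev/nvme0n1
        /dev/[sv]d*[0-9])  echo "${1%%[0-9]*}" ;;  # /dev/sda1 -> /dev/sda
        *)                 echo "$1" ;;            # already a whole disk
    esac
}
```

On a live system the partition would come from something like `findmnt -no SOURCE /`, and the result would be fed to debconf non-interactively rather than prompting the user.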