system fails to boot after refreshing core20 & a bad kernel at the same time

Bug #1978785 reported by Robert Liu
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OEM Priority Project
Fix Committed
Critical
Unassigned
snapd
Fix Released
Undecided
Alfonso Sanchez-Beato

Bug Description

[Summary]
When a UC20 system refreshes kernel and core20 snaps at the same time, if the kernel is bad, the system will not boot (this is expected).
After restarting the system, it will use the previous kernel snap. Then the system will not only revert the kernel snap, but also the core20 snap.
It will reboot automatically then fail to boot. (see the attached screenshot.

[Step to reproduce]
1. use https://cdimage.ubuntu.com/ubuntu-core/20/dangerous-beta/20220622.1/ubuntu-core-20-amd64.img.xz as the testing image
2. follow https://ubuntu.com/download/kvm to boot the system and finish first-boot process
$ qemu-system-x86_64 -smp 2 -m 2048 -net nic,model=virtio -net user,hostfwd=tcp::8022-:22,hostfwd=tcp::8090-:80 -vga qxl -drive file=/usr/share/OVMF/OVMF_CODE.fd,if=pflash,format=raw,unit=0,readonly=on -drive file=ubuntu-core-20-amd64.img,cache=none,format=raw,id=disk1,if=none -device virtio-blk-pci,drive=disk1,bootindex=1 -machine accel=kvm
3. login the system and make sure all snaps are up to date before testing.
$ sudo snap refresh
4. run these commands:
# to have a new kernel snap
$ snap switch pc-kernel --channel 22/stable
"pc-kernel" switched to the "22/stable" channel
# to have a new core20 snap
# depends on the release, the channel could be stable or edge
$ snap switch core20 --stable
"core20" switched to the "latest/edge" channel
$ snap refresh --list
Name Version Rev Size Publisher Notes
core20 20220614 1550 64MB canonical✓ base
pc-kernel 5.15.0-35.36.1 1021 289MB canonical✓ kernel
$ snap refresh
snapd is about to reboot the system
# postponse the shutdown
$ sudo shutdown -c
# make a bad kernel
$ sudo sh -c "cat /dev/null > /run/mnt/ubuntu-boot/EFI/ubuntu/try-kernel.efi"
$ sudo reboot
4. after reboot, GRUB will fail to boot. Close the qemu window and restart it
5. The system will continue the boot process, reboot again and later show:
  error: base snap "core20_[rev].snap" does not exist on ubuntu-data

example input and output: https://pastebin.ubuntu.com/p/z2R8bGf7Gr/

[Expected result]
The system should boot correctly with the previous kernel. (I wonder if it make sense to revert core20 as well)

[Actual result]
The system fails to boot and there is no way to recover.

[Fail rate]
100%

[Additional information]
I also observed this issue on different platform.

Tags: oem-priority
Revision history for this message
Robert Liu (robertliu) wrote :
tags: added: oem-priority
Revision history for this message
Robert Liu (robertliu) wrote :
Revision history for this message
Robert Liu (robertliu) wrote :

in ubuntu-date:system-data/var/lib/snapd/snaps/ of the disk image, there's no core20_1550.snap.

$ tree img/system-data/var/lib/snapd/snaps/
img/system-data/var/lib/snapd/snaps/
├── core20_1518.snap
├── pc_115.snap
├── pc-kernel_1032.snap
└── snapd_16010.snap

Revision history for this message
Maciej Borzecki (maciek-borzecki) wrote :

Would it be possible to fetch /var/lib/snapd/modeenv from the disk image?

Revision history for this message
Michael Vogt (mvo) wrote :
Download full text (3.5 KiB)

Thanks for your bugreport. I started on reproducing this but the current instructions have an isssue. The command "sudo sh -c 'cat /dev/null > kernel.efi'" destroys the current kernel (the good kernel). So trying to revert to that will always fail.

To simulate a bad kernel boot you will have to stop the shutdown after the refresh with "shutdown -c" and then use "sudo sh -c "cat /dev/null > /run/mnt/ubuntu-boot/EFI/ubuntu/try-kernel.efi""

With that I see that the try-boot fails and I need to power-cycle the system in grub. then it rollsback though:
"""
mvo@ubuntu:~$ snap change 6
Status Spawn Ready Summary
Done today at 10:41 UTC today at 10:47 UTC Ensure prerequisites for "pc-kernel" are available
Undone today at 10:41 UTC today at 10:47 UTC Download snap "pc-kernel" (1021) from channel "22/stable"
Done today at 10:41 UTC today at 10:47 UTC Fetch and check assertions for snap "pc-kernel" (1021)
Undone today at 10:41 UTC today at 10:47 UTC Mount snap "pc-kernel" (1021)
Undone today at 10:41 UTC today at 10:47 UTC Run pre-refresh hook of "pc-kernel" snap if present
Undone today at 10:41 UTC today at 10:47 UTC Stop snap "pc-kernel" services
Undone today at 10:41 UTC today at 10:47 UTC Remove aliases for snap "pc-kernel"
Undone today at 10:41 UTC today at 10:47 UTC Make current revision for snap "pc-kernel" unavailable
Done today at 10:41 UTC today at 10:47 UTC Update assets from kernel "pc-kernel" (1021)
Undone today at 10:41 UTC today at 10:47 UTC Copy snap "pc-kernel" data
Undone today at 10:41 UTC today at 10:47 UTC Setup snap "pc-kernel" (1021) security profiles
Undone today at 10:41 UTC today at 10:47 UTC Make snap "pc-kernel" (1021) available to the system
Error today at 10:41 UTC today at 10:47 UTC Automatically connect eligible plugs and slots of snap "pc-kernel"
Hold today at 10:41 UTC today at 10:47 UTC Set automatic aliases for snap "pc-kernel"
Hold today at 10:41 UTC today at 10:47 UTC Setup snap "pc-kernel" aliases
Hold today at 10:41 UTC today at 10:47 UTC Run post-refresh hook of "pc-kernel" snap if present
Hold today at 10:41 UTC today at 10:47 UTC Start snap "pc-kernel" (1021) services
Hold today at 10:41 UTC today at 10:47 UTC Clean up "pc-kernel" (1021) install
Hold today at 10:41 UTC today at 10:47 UTC Run configure hook of "pc-kernel" snap if present
Hold today at 10:41 UTC today at 10:47 UTC Run health check of "pc-kernel" snap
Done today at 10:41 UTC today at 10:47 UTC Handling re-refresh of "pc-kernel" as needed

......................................................................
Update assets from kernel "pc-kernel" (1021)

2022-06-20T10:42:43Z INFO No gadget assets update needed

......................................................................
Make snap "pc-kernel" (1021) available to the system

2022-06-20T10:42:43Z INFO Requested system restart.

......................................................................
Automatically connect eligible plugs and slots of snap "pc-kernel"

2022-06-20T10:42:43Z INFO Waiting for automatic snapd restart...
2022-06-20T10:42:43Z INFO Waiting for automatic ...

Read more...

Changed in snapd:
status: New → Incomplete
Robert Liu (robertliu)
description: updated
Revision history for this message
Robert Liu (robertliu) wrote :

I've update the reproduce instructions a bit.

The original image is not available now, so I used the latest (20220622.1) version and still replicated the issue.

content of modeenv:
mode=run
current_recovery_systems=20220622
good_recovery_systems=20220622
base=core20_1518.snap
try_base=core20_1550.snap
base_status=try
current_kernels=pc-kernel_1049.snap
model=canonical/ubuntu-core-20-amd64-dangerous
grade=dangerous
model_sign_key_id=9tydnLa6MTJ-jaQTFUXEwHl1yRx7ZS4K5cyFDhYDcPzhS7uyEkDxdUjg9g08BtNn
current_kernel_command_lines=["snapd_recovery_mode=run console=ttyS0 console=tty1 panic=-1"]

Robert Liu (robertliu)
description: updated
Revision history for this message
Michael Vogt (mvo) wrote :

Thank you! I can reproduce this now too. I see exactly the same "snap changes" output.

What is interesting is that we have a test that tries to simulate this behavior in our code (TestUpdateKernelBaseSingleRebootKernelUndo). However something in this test is misbehaving.

The real failure error in the auto-connect task and reports a rollback across reboots for the kernel:
"""
Error today at 10:10 UTC today at 10:12 UTC Automatically connect eligible plugs and slots of snap "pc-kernel"
"""

However the test does not show this, it checks that the rollback across reboots is detect for the "core20" snap - something that is not observed on the real device and that is also counter-intuitive as the failure happens on the kernel. So something in the test setup behaves differently from the real device and that seems to be screwing up the test.

What is interesting is that the undo path on the real device shows:
"""
90 Automatically connect eligible plugs and slots of snap "pc-kernel"
  2022-06-22T13:25:35Z ERROR cannot finish pc-kernel installation, there was a rollback across reboot
---
105 Make current revision for snap "core20" unavailable
  2022-06-22T13:25:36Z INFO Requested system restart.
"""
i.e. the undo of the core20 triggers another reboot.

Revision history for this message
Michael Vogt (mvo) wrote :

It turns out our existing manager test was testing the slightly wrong thing - it simulated only that both the kernel and the base fail at the same time. I added a test that tests this issue now in https://github.com/snapcore/snapd/commit/6a260bc5cdf7452d19f087214b4484e84e17cfab and I see the problematic behavior with that now too - i.e. that the "base" is set to try and the new core20 that is still set as base got removed from the system.

Revision history for this message
Michael Vogt (mvo) wrote :

With the new/updated test it's more clear now what is going on:

1. system refreshes core20/kernel
2. kernel fails to boot, system gets reset
3. bootloader falls back to "good" kernel, boots again
4. system boots up with "try_base=new-core20", that works fine
5. bootOk runs, detects that modeenv is fine and updates modeenv to not have "trying" anymore
6. snapd detects rollback-accross-reboot for the kernel snap
7. runs undo on kernel+base
8. the undo of the base triggers
   a) a rollback to the old base which now triggers a new "base_status=try" cleared that status earlier already
   b) a cleanup of the "new" base because that got undone because of the kernel+base undo
   c) now the system reboots but has "base=new-core (no longer on the system)", "try_base=old-core"
9. the system fails to boot because in snap-bootstrap we error hard (bootstate20.go:genericInitramfsSelectSnap) if the base=snap snap is missing.

Revision history for this message
Michael Vogt (mvo) wrote :

I'm actually not 100% certain what the best/right fix for this is.

In the old days of uc16/uc18 we had a combined "snap_mode" that went from try->trying->"" and if that failed it would unset both snap_try_kernel and snap_try_base. However in uc20 we have a "base_status=try" and a independent bootloader setting for the "kernel_status=try".

For robustness we should probably:
1. ensure that no base/kernel that is in use can be cleaned up in the undo code

Changed in snapd:
assignee: nobody → Alfonso Sanchez-Beato (alfonsosanchezbeato)
status: Incomplete → In Progress
Revision history for this message
Alfonso Sanchez-Beato (alfonsosanchezbeato) wrote :
Michael Vogt (mvo)
Changed in snapd:
status: In Progress → Fix Committed
Rex Tsai (chihchun)
Changed in oem-priority:
importance: Undecided → Critical
status: New → Fix Committed
Michael Vogt (mvo)
Changed in snapd:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.