focal: backport kexec fallback patch

Bug #1969365 reported by Dan Watkins
Affects            Status         Importance   Assigned to   Milestone
systemd (Ubuntu)   Fix Released   Undecided    Unassigned
  Focal            Fix Released   Low          Unassigned

Bug Description

It would be great if focal's systemd could have https://github.com/systemd/systemd/commit/71180f8e57f8fbb55978b00a13990c79093ff7b3 backported to it.

[Impact]

We have observed that kexec'ing to another kernel fails because the drive containing the `kexec` binary has already been unmounted by the time systemd tries to run it, as the console shows:

         Starting Reboot via kexec...
[ 163.960938] shutdown[1]: (sd-kexec) failed with exit status 1.
[ 163.963463] reboot: Restarting system

[Test Plan]

1) Launch a 20.04 instance
2) `apt-get install kexec-tools`
3) In `/boot`, filling in whatever <cmdline> needed in your environment:

kexec -l vmlinuz --initrd initrd.img --append '<cmdline>'

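4) `reboot`
5) Watch the console: an unpatched systemd logs "(sd-kexec) failed with exit status 1." followed by "reboot: Restarting system", whereas a patched one logs "kexec_core: Starting new kernel".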

(I have reproduced this in a single-disk VM, so I assume it reproduces ~everywhere: if not, `apt-get remove kexec-tools` before the `reboot` could be used to emulate the unmounting.)
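
Before rebooting, it can help to confirm that step 3 actually staged a kernel. A minimal sketch, assuming the running kernel exposes `/sys/kernel/kexec_loaded` (it reads `1` once `kexec -l` has succeeded); the function name and its optional path argument are hypothetical, the latter only there so the check can be tried against a plain file:

```shell
# Print "loaded" and return 0 if a kexec image is staged; otherwise print
# "not-loaded" and return 1.
# $1 (hypothetical, optional): path to read instead of the real sysfs flag.
kexec_image_loaded() {
    flag_file="${1:-/sys/kernel/kexec_loaded}"
    if [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]; then
        echo "loaded"
        return 0
    fi
    echo "not-loaded"
    return 1
}
```

For example, `kexec_image_loaded && sudo reboot` only proceeds to the reboot once a kernel has been loaded.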

[Where problems could occur]

Users could inadvertently be relying on the current behaviour: if they have configured their systems to kexec, they currently will be rebooting normally, and this patch would cause them to start actually kexec'ing.

[Other info]

We currently maintain a systemd tree that adds only this patch on top of focal's, so the patch has already received a good deal of testing from us on focal.

This patch landed in v246, so it's already present in supported releases later than focal.

Nick Rosbrook (enr0n) wrote :

The patch for this is indeed present in Jammy and newer. I don't currently see a strong enough reason to SRU this to Focal, but if you or someone else thinks it's important, feel free to explain here.

Changed in systemd (Ubuntu):
status: New → Fix Released
Dan Watkins (oddbloke) wrote :

Thanks for the reply, Nick!

I think it's important enough to land because:

* you cannot execute `kexec` correctly on an Ubuntu 20.04 system without this patch (it will fall back to performing a full reboot),
* kexec can be used to reduce downtime for critical systems which take a long time to reboot (e.g. because they have a lot of hardware to initialise), and
* kexec-tools is in main (and has been since at least trusty), which indicates to me that kexec is expected to work on Ubuntu.

I'd also add that the patch is three lines in a code path which is only used by people opting into using `kexec`, so the potential downside is pretty minimal.

(I'll set the bug back to New for now, until you have a chance to respond.)

Changed in systemd (Ubuntu):
status: Fix Released → New
Nick Rosbrook (enr0n)
Changed in systemd (Ubuntu):
status: New → Fix Released
Changed in systemd (Ubuntu Focal):
importance: Undecided → Low
Nick Rosbrook (enr0n) wrote :

Fair enough. Thanks for the justification and for filling out the SRU template already.

Changed in systemd (Ubuntu Focal):
status: New → Triaged
tags: added: systemd-sru-next
Dan Watkins (oddbloke) wrote :

Thanks Nick, much appreciated!

Nick Rosbrook (enr0n) wrote :

The test case with removing kexec-tools before rebooting works for me. But I can only reproduce the issue by doing that. Can you share more about your setup so we can understand why exactly you hit this?

I think that having this fallback makes sense, and is fine for an SRU, but it would be good to understand the root cause better.

Brian Murray (brian-murray) wrote : Please test proposed package

Hello Dan, or anyone else affected,

Accepted systemd into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/systemd/245.4-4ubuntu3.23 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in systemd (Ubuntu Focal):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-focal
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (systemd/245.4-4ubuntu3.23)

All autopkgtests for the newly accepted systemd (245.4-4ubuntu3.23) for focal have finished running.
The following regressions have been reported in tests triggered by the package:

casync/2+20190213-1 (armhf)
gvfs/1.44.1-1ubuntu1.2 (ppc64el)
linux-gcp-5.15/5.15.0-1048.56~20.04.1 (arm64)
linux-hwe-5.15/5.15.0-91.101~20.04.1 (armhf)
linux-oracle-5.15/5.15.0-1049.55~20.04.1 (arm64)
mariadb-10.3/1:10.3.38-0ubuntu0.20.04.1 (armhf)
netplan.io/0.104-0ubuntu2~20.04.4 (s390x)
puppet/5.5.10-4ubuntu3 (armhf)
upower/0.99.11-1build2 (armhf)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/focal/update_excuses.html#systemd

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Dan Watkins (oddbloke) wrote :

Apologies, I completely missed your comment, Nick! I was just able to reproduce this using uvtool.

To launch the VM (and monitor the console output):

uvt-simplestreams-libvirt sync release=focal arch=amd64
uvt-kvm create firsttest release=focal
virsh console firsttest

Then, within the instance via `uvt-kvm ssh firsttest`:

sudo apt update
sudo apt install kexec-tools
sudo reboot # I wanted to check `virsh console` was getting output

After reboot and reconnecting:

cd /boot
sudo kexec -l vmlinuz --initrd initrd.img --append 'BOOT_IMAGE=/boot/vmlinuz-5.4.0-167-generic root=LABEL=cloudimg-rootfs ro console=tty1 console=ttyS0'
sudo reboot

I then observed the following in `virsh console`'s output (note the "Unmount All Filesystems", which is what I think is causing the problem):

[ OK ] Reached target Unmount All Filesystems.
         Stopping Monitoring of LVM…meventd or progress polling...
         Stopping Device-Mapper Multipath Device Controller...
[ OK ] Stopped Create Static Device Nodes in /dev.
[ OK ] Stopped Create System Users.
[ OK ] Stopped Remount Root and Kernel File Systems.
[ OK ] Stopped File System Check on Root Device.
[ OK ] Stopped Device-Mapper Multipath Device Controller.
[ OK ] Stopped Monitoring of LVM2… dmeventd or progress polling.
[ OK ] Reached target Shutdown.
[ OK ] Reached target Final Step.
         Starting Reboot via kexec...
[ 53.030829] systemd-udevd[372]: proc_inode_cache(1258:kexec.service): Worker [960] did not accept message, killing the worker: Connection refused
[ 53.032915] systemd-udevd[372]: proc_inode_cache(1258:kexec.service): Worker [954] did not accept message, killing the worker: Connection refused
[ 53.034697] systemd-udevd[372]: proc_inode_cache(1258:kexec.service): Worker [953] did not accept message, killing the worker: Connection refused
[ 53.036756] systemd-udevd[372]: proc_inode_cache(1258:kexec.service): Worker [955] did not accept message, killing the worker: Connection refused
[ 53.039144] systemd-udevd[372]: proc_inode_cache(1258:kexec.service): Worker [956] did not accept message, killing the worker: Connection refused
[ 53.041049] systemd-udevd[372]: proc_inode_cache(1258:kexec.service): Worker [958] did not accept message, killing the worker: Connection refused
[ 53.042673] systemd-udevd[372]: proc_inode_cache(1258:kexec.service): Worker [961] did not accept message, killing the worker: Connection refused
[ 53.123845] shutdown[1]: (sd-kexec) failed with exit status 1.
[ 53.138542] reboot: Restarting system
[ 0.000000] Linux version 5.4.0-167-generic (buildd@lcy02-amd64-010) (gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.2)) #184-Ubuntu SMP Tue Oct 31 09:21:49 UTC 2023 (Ubuntu 5.4.0-167.184-generic 5.4.252)

(I initially tried to re-reproduce using an Incus VM (and so a community image rather than a cloud-images one) and could not do so. Executing the above `kexec` command produced "ima: impossible to appraise a kernel image without a file descriptor; try using kexec_file_load syscall." in the console. Adding `-s`/`--kexec-file-syscall` to the kexec command caused it to exit zero, and the VM did then successfully kexec on `reboot`.)

Dan Watkins (oddbloke) wrote :

In the VM created in my above comment, I enabled proposed, installed the new systemd and rebooted. After that, I re-ran:

cd /boot
sudo kexec -l vmlinuz --initrd initrd.img --append 'BOOT_IMAGE=/boot/vmlinuz-5.4.0-167-generic root=LABEL=cloudimg-rootfs ro console=tty1 console=ttyS0'
sudo reboot

And observed the following in the console:

[ OK ] Reached target Final Step.
         Starting Reboot via kexec...
[ 102.782246] systemd-udevd[369]: anon_vma(937:udisks2.service): Worker [960] did not accept message, killing the worker: Connection refused
[ 102.784224] systemd-udevd[369]: anon_vma(937:udisks2.service): Worker [963] did not accept message, killing the worker: Connection refused
[ 102.786398] systemd-udevd[369]: anon_vma(937:udisks2.service): Worker [966] did not accept message, killing the worker: Connection refused
[ 102.788390] systemd-udevd[369]: anon_vma(937:udisks2.service): Worker [968] did not accept message, killing the worker: Connection refused
[ 102.789954] systemd-udevd[369]: anon_vma(937:udisks2.service): Worker [962] did not accept message, killing the worker: Connection refused
[ 102.791562] systemd-udevd[369]: anon_vma(937:udisks2.service): Worker [961] did not accept message, killing the worker: Connection refused
[ 102.793198] systemd-udevd[369]: anon_vma(937:udisks2.service): Worker [964] did not accept message, killing the worker: Connection refused
[ 102.794669] systemd-udevd[369]: anon_vma(937:udisks2.service): Worker [967] did not accept message, killing the worker: Connection refused
[ 102.796841] systemd-udevd[369]: anon_vma(937:udisks2.service): Worker [965] did not accept message, killing the worker: Connection refused
[ 102.904092] kexec_core: Starting new kernel
[ 0.000000] Linux version 5.4.0-167-generic (buildd@lcy02-amd64-010) (gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.2)) #184-Ubuntu SMP Tue Oct 31 09:21:49 UTC 2023 (Ubuntu 5.4.0-167.184-generic 5.4.252)

So I can confirm that the new systemd addresses this bug.
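
The two outcomes seen in the console can also be told apart mechanically from a saved log. A rough sketch, assuming the log contains the exact kernel messages quoted in this bug; `classify_reboot` is a hypothetical helper name, not part of any package:

```shell
# Read console output on stdin and print how the machine restarted:
#   kexec        - the kernel logged "kexec_core: Starting new kernel"
#   full-reboot  - the kernel logged "reboot: Restarting system"
#   unknown      - neither message was found
# If both messages appear (e.g. a log spanning several reboots), "kexec" wins.
classify_reboot() {
    log="$(cat)"
    case "$log" in
        *"kexec_core: Starting new kernel"*) echo "kexec" ;;
        *"reboot: Restarting system"*)       echo "full-reboot" ;;
        *)                                   echo "unknown" ;;
    esac
}
```

Running `classify_reboot < console.log`, where `console.log` stands for wherever the `virsh console` output was captured, prints one of the three labels.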

tags: added: verification-done-focal
removed: verification-needed-focal
Nick Rosbrook (enr0n) wrote :

Thanks for the verification, Dan!

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 245.4-4ubuntu3.23

---------------
systemd (245.4-4ubuntu3.23) focal; urgency=medium

  [ Nick Rosbrook ]
  * core/device: ignore DEVICE_FOUND_UDEV bit on switching root (LP: #2037281)
    File: debian/patches/lp2037281-core-device-ignore-DEVICE_FOUND_UDEV-bit-on-switching-roo.patch
    https://git.launchpad.net/~ubuntu-core-dev/ubuntu/+source/systemd/commit/?id=7793563bb38a84a3dc6bc0da1c08546c3b915ab8
  * dns-query: bump CNAME_MAX to 16 (LP: #2024009)
    File: debian/patches/lp2024009-dns-query-bump-CNAME_MAX-to-16.patch
    https://git.launchpad.net/~ubuntu-core-dev/ubuntu/+source/systemd/commit/?id=193899d103d44c642d362e9916b14df844ec702f
  * Fall back to kexec when no kexec binary exists (LP: #1969365)
    File: debian/patches/lp1969365-Fall-back-to-kexec-when-no-kexec-binary-exists.patch
    https://git.launchpad.net/~ubuntu-core-dev/ubuntu/+source/systemd/commit/?id=3934f3794427dee4e72824998dd4c6e6d5875289
  * test: ignore LXC filesystem when checking for writable locations (LP: #2029352)
    File: debian/patches/lp2029352-test-ignore-LXC-filesystem-when-checking-for-writable-loc.patch
    https://git.launchpad.net/~ubuntu-core-dev/ubuntu/+source/systemd/commit/?id=70facbfbf54c4ffb31ba392dbe3fec3084fdf3bc

  [ Heitor Alves de Siqueira ]
  * core/mount: adjust deserialized state based on /proc/self/mountinfo (LP: #1837227)
    Author: Heitor Alves de Siqueira
    File: debian/patches/lp1837227-core-mount-adjust-deserialized-state-based-on-proc-self-m.patch
    https://git.launchpad.net/~ubuntu-core-dev/ubuntu/+source/systemd/commit/?id=a0a749953d309f48bc45140102adf205d1071c4d

 -- Nick Rosbrook <email address hidden> Tue, 21 Nov 2023 16:10:21 -0500

Changed in systemd (Ubuntu Focal):
status: Fix Committed → Fix Released
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for systemd has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
