Ubuntu
util-linux package

livecd-rootfs uses losetup -P for theoretically reliable/synchronous partition setup but it's not reliable

Mantic (23.10)
Bug #2045586

Bug #2045586 reported by Steve Langasek on 2023-12-04

This bug affects 1 person

	Status	Importance	Assigned to
linux (Ubuntu)	New	Undecided	Unassigned
Jammy	New	Undecided	Unassigned
Mantic	Won't Fix	Undecided	Unassigned
Noble	New	Undecided	Unassigned
livecd-rootfs (Ubuntu)	Fix Released	Undecided	Unassigned
Jammy	Fix Released	Undecided	Unassigned
Mantic	Won't Fix	Undecided	Unassigned
Noble	Fix Released	Undecided	Unassigned
util-linux (Ubuntu)	New	Undecided	Unassigned
Jammy	New	Undecided	Unassigned
Mantic	Won't Fix	Undecided	Unassigned
Noble	New	Undecided	Unassigned

Bug Description

[impact]
In mantic, we migrated livecd-rootfs to use losetup -P instead of kpartx, with the expectation that this would give us a reliable, race-free way of loop-mounting partitions from a disk image during image build.

In noble, we are finding that it is no longer reliable, and in fact fails rather often.

It is most noticeable with riscv64 builds, which is the architecture where we most frequently ran into problems before with kpartx. The first riscv64+generic build in noble where the expected loop partition device is not available is

https://launchpad.net/~ubuntu-cdimage/+livefs/ubuntu/noble/cpc/+build/531790

The failure is however not unique to riscv64, and the autopkgtest for the latest version of livecd-rootfs (24.04.7) - an update that specifically tries to add more debugging code for this scenario - has also failed on ppc64el.

https://autopkgtest.ubuntu.com/packages/l/livecd-rootfs/noble/ppc64el

The first failure happened on November 16. While there has been an update to the util-linux package in noble, this did not land until November 23.

The losetup usage has been backported to Jammy, and sees frequent failures there.

[test case]
The autopkgtests will provide enough confidence that the changes are not completely broken. Whether the change helps with the races on riscv can be "tested in prod" just as well as any other way.

[regression potential]
If the backport has been done incorrectly, image builds can fail (and the autopkgtests will fail if it has been completely bungled). This can be quickly handled. There is no foreseeable way for this to result in successful builds but broken images, which would be a much more difficult failure mode to unpick.

See original description

Tags:

Related branches

~simpoir/livecd-rootfs/+git/livecd-rootfs:lp2064175-jammy-buildd-grub-udev

Superseded for merging into livecd-rootfs:ubuntu/master

Canonical Foundations Team: Pending requested 2024-04-29

Diff: 1974 lines (+1042/-17) (has conflicts)

76 files modified

debian/changelog (+314/-3)
debian/control (+3/-0)
live-build/apparmor/5.19/capability (+1/-0)
live-build/apparmor/5.19/caps/mask (+1/-0)
live-build/apparmor/5.19/dbus/mask (+1/-0)
live-build/apparmor/5.19/domain/attach_conditions/xattr (+1/-0)
live-build/apparmor/5.19/domain/change_hat (+1/-0)
live-build/apparmor/5.19/domain/change_hatv (+1/-0)
live-build/apparmor/5.19/domain/change_onexec (+1/-0)
live-build/apparmor/5.19/domain/change_profile (+1/-0)
live-build/apparmor/5.19/domain/computed_longest_left (+1/-0)
live-build/apparmor/5.19/domain/fix_binfmt_elf_mmap (+1/-0)
live-build/apparmor/5.19/domain/post_nnp_subset (+1/-0)
live-build/apparmor/5.19/domain/stack (+1/-0)
live-build/apparmor/5.19/domain/version (+1/-0)
live-build/apparmor/5.19/file/mask (+1/-0)
live-build/apparmor/5.19/ipc/posix_mqueue (+1/-0)
live-build/apparmor/5.19/mount/mask (+1/-0)
live-build/apparmor/5.19/namespaces/pivot_root (+1/-0)
live-build/apparmor/5.19/namespaces/profile (+1/-0)
live-build/apparmor/5.19/network/af_mask (+1/-0)
live-build/apparmor/5.19/network/af_unix (+1/-0)
live-build/apparmor/5.19/network_v8/af_mask (+1/-0)
live-build/apparmor/5.19/policy/set_load (+1/-0)
live-build/apparmor/5.19/policy/versions/v5 (+1/-0)
live-build/apparmor/5.19/policy/versions/v6 (+1/-0)
live-build/apparmor/5.19/policy/versions/v7 (+1/-0)
live-build/apparmor/5.19/policy/versions/v8 (+1/-0)
live-build/apparmor/5.19/ptrace/mask (+1/-0)
live-build/apparmor/5.19/query/label/data (+1/-0)
live-build/apparmor/5.19/query/label/multi_transaction (+1/-0)
live-build/apparmor/5.19/query/label/perms (+1/-0)
live-build/apparmor/5.19/rlimit/mask (+1/-0)
live-build/apparmor/5.19/signal/mask (+1/-0)
live-build/apparmor/6.2/capability (+1/-0)
live-build/apparmor/6.2/caps/mask (+1/-0)
live-build/apparmor/6.2/dbus/mask (+1/-0)
live-build/apparmor/6.2/domain/attach_conditions/xattr (+1/-0)
live-build/apparmor/6.2/domain/change_hat (+1/-0)
live-build/apparmor/6.2/domain/change_hatv (+1/-0)
live-build/apparmor/6.2/domain/change_onexec (+1/-0)
live-build/apparmor/6.2/domain/change_profile (+1/-0)
live-build/apparmor/6.2/domain/computed_longest_left (+1/-0)
live-build/apparmor/6.2/domain/fix_binfmt_elf_mmap (+1/-0)
live-build/apparmor/6.2/domain/post_nnp_subset (+1/-0)
live-build/apparmor/6.2/domain/stack (+1/-0)
live-build/apparmor/6.2/domain/version (+1/-0)
live-build/apparmor/6.2/file/mask (+1/-0)
live-build/apparmor/6.2/ipc/posix_mqueue (+1/-0)
live-build/apparmor/6.2/mount/mask (+1/-0)
live-build/apparmor/6.2/namespaces/pivot_root (+1/-0)
live-build/apparmor/6.2/namespaces/profile (+1/-0)
live-build/apparmor/6.2/network/af_mask (+1/-0)
live-build/apparmor/6.2/network/af_unix (+1/-0)
live-build/apparmor/6.2/network_v8/af_mask (+1/-0)
live-build/apparmor/6.2/policy/set_load (+1/-0)
live-build/apparmor/6.2/policy/versions/v5 (+1/-0)
live-build/apparmor/6.2/policy/versions/v6 (+1/-0)
live-build/apparmor/6.2/policy/versions/v7 (+1/-0)
live-build/apparmor/6.2/policy/versions/v8 (+1/-0)
live-build/apparmor/6.2/ptrace/mask (+1/-0)
live-build/apparmor/6.2/query/label/data (+1/-0)
live-build/apparmor/6.2/query/label/multi_transaction (+1/-0)
live-build/apparmor/6.2/query/label/perms (+1/-0)
live-build/apparmor/6.2/rlimit/mask (+1/-0)
live-build/apparmor/6.2/signal/mask (+1/-0)
live-build/auto/build (+13/-0)
live-build/auto/config (+198/-0)
live-build/buildd/hooks/02-disk-image-uefi.binary (+22/-14)
live-build/functions (+68/-0)
live-build/lb_binary_layered (+4/-0)
live-build/ubuntu-cpc/hooks.d/base/disk-image-uefi.binary (+313/-0)
live-build/ubuntu-cpc/hooks.d/base/riscv64/grub/cmdline.cfg (+4/-0)
live-build/ubuntu-cpc/hooks.d/chroot/999-cpc-fixes.chroot (+32/-0)
live-build/ubuntu-server/hooks/01-unminimize.chroot_early (+4/-0)
live-build/ubuntu-server/hooks/03-kernel-metapkg.chroot_early (+3/-0)

~mwhudson/livecd-rootfs/+git/livecd-rootfs:lp-2045586-jammy

Merged into livecd-rootfs:ubuntu/jammy at revision 43133764fd1a035fbb860610a9ca502a7df6f2df

Canonical Foundations Team: Pending requested 2024-02-18

~mwhudson/livecd-rootfs/+git/livecd-rootfs:lp-2045586-jammy

Superseded for merging into livecd-rootfs:ubuntu/master

Canonical Foundations Team: Pending requested 2024-02-18

Diff: 1898 lines (+996/-14) (has conflicts)

76 files modified

debian/changelog (+292/-3)
debian/control (+3/-0)
live-build/apparmor/5.19/capability (+1/-0)
live-build/apparmor/5.19/caps/mask (+1/-0)
live-build/apparmor/5.19/dbus/mask (+1/-0)
live-build/apparmor/5.19/domain/attach_conditions/xattr (+1/-0)
live-build/apparmor/5.19/domain/change_hat (+1/-0)
live-build/apparmor/5.19/domain/change_hatv (+1/-0)
live-build/apparmor/5.19/domain/change_onexec (+1/-0)
live-build/apparmor/5.19/domain/change_profile (+1/-0)
live-build/apparmor/5.19/domain/computed_longest_left (+1/-0)
live-build/apparmor/5.19/domain/fix_binfmt_elf_mmap (+1/-0)
live-build/apparmor/5.19/domain/post_nnp_subset (+1/-0)
live-build/apparmor/5.19/domain/stack (+1/-0)
live-build/apparmor/5.19/domain/version (+1/-0)
live-build/apparmor/5.19/file/mask (+1/-0)
live-build/apparmor/5.19/ipc/posix_mqueue (+1/-0)
live-build/apparmor/5.19/mount/mask (+1/-0)
live-build/apparmor/5.19/namespaces/pivot_root (+1/-0)
live-build/apparmor/5.19/namespaces/profile (+1/-0)
live-build/apparmor/5.19/network/af_mask (+1/-0)
live-build/apparmor/5.19/network/af_unix (+1/-0)
live-build/apparmor/5.19/network_v8/af_mask (+1/-0)
live-build/apparmor/5.19/policy/set_load (+1/-0)
live-build/apparmor/5.19/policy/versions/v5 (+1/-0)
live-build/apparmor/5.19/policy/versions/v6 (+1/-0)
live-build/apparmor/5.19/policy/versions/v7 (+1/-0)
live-build/apparmor/5.19/policy/versions/v8 (+1/-0)
live-build/apparmor/5.19/ptrace/mask (+1/-0)
live-build/apparmor/5.19/query/label/data (+1/-0)
live-build/apparmor/5.19/query/label/multi_transaction (+1/-0)
live-build/apparmor/5.19/query/label/perms (+1/-0)
live-build/apparmor/5.19/rlimit/mask (+1/-0)
live-build/apparmor/5.19/signal/mask (+1/-0)
live-build/apparmor/6.2/capability (+1/-0)
live-build/apparmor/6.2/caps/mask (+1/-0)
live-build/apparmor/6.2/dbus/mask (+1/-0)
live-build/apparmor/6.2/domain/attach_conditions/xattr (+1/-0)
live-build/apparmor/6.2/domain/change_hat (+1/-0)
live-build/apparmor/6.2/domain/change_hatv (+1/-0)
live-build/apparmor/6.2/domain/change_onexec (+1/-0)
live-build/apparmor/6.2/domain/change_profile (+1/-0)
live-build/apparmor/6.2/domain/computed_longest_left (+1/-0)
live-build/apparmor/6.2/domain/fix_binfmt_elf_mmap (+1/-0)
live-build/apparmor/6.2/domain/post_nnp_subset (+1/-0)
live-build/apparmor/6.2/domain/stack (+1/-0)
live-build/apparmor/6.2/domain/version (+1/-0)
live-build/apparmor/6.2/file/mask (+1/-0)
live-build/apparmor/6.2/ipc/posix_mqueue (+1/-0)
live-build/apparmor/6.2/mount/mask (+1/-0)
live-build/apparmor/6.2/namespaces/pivot_root (+1/-0)
live-build/apparmor/6.2/namespaces/profile (+1/-0)
live-build/apparmor/6.2/network/af_mask (+1/-0)
live-build/apparmor/6.2/network/af_unix (+1/-0)
live-build/apparmor/6.2/network_v8/af_mask (+1/-0)
live-build/apparmor/6.2/policy/set_load (+1/-0)
live-build/apparmor/6.2/policy/versions/v5 (+1/-0)
live-build/apparmor/6.2/policy/versions/v6 (+1/-0)
live-build/apparmor/6.2/policy/versions/v7 (+1/-0)
live-build/apparmor/6.2/policy/versions/v8 (+1/-0)
live-build/apparmor/6.2/ptrace/mask (+1/-0)
live-build/apparmor/6.2/query/label/data (+1/-0)
live-build/apparmor/6.2/query/label/multi_transaction (+1/-0)
live-build/apparmor/6.2/query/label/perms (+1/-0)
live-build/apparmor/6.2/rlimit/mask (+1/-0)
live-build/apparmor/6.2/signal/mask (+1/-0)
live-build/auto/build (+7/-0)
live-build/auto/config (+199/-0)
live-build/buildd/hooks/02-disk-image-uefi.binary (+19/-11)
live-build/functions (+52/-0)
live-build/lb_binary_layered (+4/-0)
live-build/ubuntu-cpc/hooks.d/base/disk-image-uefi.binary (+313/-0)
live-build/ubuntu-cpc/hooks.d/base/riscv64/grub/cmdline.cfg (+4/-0)
live-build/ubuntu-cpc/hooks.d/chroot/999-cpc-fixes.chroot (+32/-0)
live-build/ubuntu-server/hooks/01-unminimize.chroot_early (+4/-0)
live-build/ubuntu-server/hooks/03-kernel-metapkg.chroot_early (+3/-0)

~dannf/livecd-rootfs/+git/livecd-rootfs:udevadm-lock

Merged into livecd-rootfs:ubuntu/master at revision 3a00ad5263b2429a15e4915c4124d87b0d59dbfb

Michael Hudson-Doyle: Approve on 2024-01-31

~mwhudson/livecd-rootfs/+git/livecd-rootfs:lp-2045586

Merged into livecd-rootfs:ubuntu/master at revision 57592e6dc1ee59c1e98ac8ced0263ba3252c6972

Dan Bungert: Approve on 2024-01-24

Revision history for this message

Steve Langasek (vorlon) wrote on 2023-12-04:

November 16 was 2 days after livecd-rootfs 24.04.4 landed in the noble release pocket, superseding 24.04.2.

The code delta between 24.04.2 and 24.04.4 includes removal of support for "legacy" images (SUBPROJECT=legacy) which doesn't apply here; and some reorganization of code related to "preinstalled" images which could affect the riscv64+generic image, that is a preinstalled image using the cpc project, but there were no code changes touching any of the image partitioning code so it's unclear how those code changes could have introduced this bug.

Revision history for this message

Steve Langasek (vorlon) wrote on 2023-12-04:

Failing build had kernel

Kernel version: Linux bos03-riscv64-014 5.19.0-1021-generic #23~22.04.1-Ubuntu SMP Thu Jun 22 12:49:35 UTC 2023 riscv64

The build immediately before the first failure had kernel

Kernel version: Linux riscv64-qemu-lgw01-069 5.13.0-1019-generic #21~20.04.1-Ubuntu SMP Thu Mar 24 22:36:01 UTC 2022 riscv64

So maybe this is a kernel regression?

Revision history for this message

Steve Langasek (vorlon) wrote on 2023-12-09:

https://launchpad.net/~ubuntu-cdimage/+livefs/ubuntu/noble/cpc/+build/544490 is a log from a build with a new livecd-rootfs that spits out more debugging info on failure.

+ sgdisk binary/boot/disk-uefi.ext4 --print
Disk binary/boot/disk-uefi.ext4: 9437184 sectors, 4.5 GiB
Sector size (logical): 512 bytes
Disk identifier (GUID): CD1DD3AE-E4C8-4C5F-BD64-9236C39B9824
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 9437150
Partitions will be aligned on 2-sector boundaries
Total free space is 0 sectors (0 bytes)

Number Start (sector) End (sector) Size Code Name
   1 235520 9437150 4.4 GiB 8300
  12 227328 235519 4.0 MiB 8300 CIDATA
  13 34 2081 1024.0 KiB FFFF loader1
  14 2082 10239 4.0 MiB FFFF loader2
  15 10240 227327 106.0 MiB EF00
+ mount_image binary/boot/disk-uefi.ext4 1
+ trap clean_loops EXIT
+ backing_img=binary/boot/disk-uefi.ext4
+ local rootpart=1
++ losetup --show -f -P -v binary/boot/disk-uefi.ext4
+ loop_device=/dev/loop5
+ '[' '!' -b /dev/loop5 ']'
+ rootfs_dev_mapper=/dev/loop5p1
+ '[' '!' -b /dev/loop5p1 ']'
+ echo '/dev/loop5p1 is not a block device'
/dev/loop5p1 is not a block device
+ ls -l /dev/loop5p1 /dev/loop5p12
brw------- 1 root root 259, 2 Dec 9 04:16 /dev/loop5p1
brw------- 1 root root 259, 3 Dec 9 04:16 /dev/loop5p12
+ exit 1

This clearly shows that:
- there are 5 partitions on the image being passed to losetup
- after losetup exits, /dev/loop5p1 is not present
- after this check fails, an ls of /dev/loop5p* shows devices present for two of the partitions - including /dev/loop5p1 that we were looking for in the first place - but not all 5.

So this definitely means we have a race after calling losetup -P.

Is this the expected behavior from the kernel? How do we make this race-free?

https://launchpad.net/~ubuntu-cdimage/+livefs/ubuntu/noble/cpc/+build/544490 is a log from a build with a new livecd-rootfs that spits out more debugging info on failure.

Number  Start (sector)    End (sector)  Size       Code  Name
   1          235520         9437150   4.4 GiB     8300  
  12          227328          235519   4.0 MiB     8300  CIDATA
  13              34            2081   1024.0 KiB  FFFF  loader1
  14            2082           10239   4.0 MiB     FFFF  loader2
  15           10240          227327   106.0 MiB   EF00  
+ mount_image binary/boot/disk-uefi.ext4 1
+ trap clean_loops EXIT
+ backing_img=binary/boot/disk-uefi.ext4
+ local rootpart=1
++ losetup --show -f -P -v binary/boot/disk-uefi.ext4
+ loop_device=/dev/loop5
+ '[' '!' -b /dev/loop5 ']'
+ rootfs_dev_mapper=/dev/loop5p1
+ '[' '!' -b /dev/loop5p1 ']'
+ echo '/dev/loop5p1 is not a block device'
/dev/loop5p1 is not a block device
+ ls -l /dev/loop5p1 /dev/loop5p12
brw------- 1 root root 259, 2 Dec  9 04:16 /dev/loop5p1
brw------- 1 root root 259, 3 Dec  9 04:16 /dev/loop5p12
+ exit 1

So this definitely means we have a race after calling losetup -P.

Is this the expected behavior from the kernel?  How do we make this race-free?

Revision history for this message

Andy Whitcroft (apw) wrote on 2023-12-09:

Was there any systemd/udev change in this timeframe? As the device files are very much connected to those.

Revision history for this message

Steve Langasek (vorlon) wrote on 2023-12-09: Re: [Bug 2045586] Re: livecd-rootfs uses losetup -P for theoretically reliable/synchronous partition setup but it's not reliable in noble

On Sat, Dec 09, 2023 at 05:13:28PM -0000, Andy Whitcroft wrote:
> Was there any systemd/udev change in this timeframe? As the device
> files are very much connected to those.

My understanding is that these devices are supposed to be created directly
by the kernel on devtmpfs and NOT via udev, which is part of how we expected
to fix the earlier races.

And systemd did not change in this time frame in any release. If there was
a change to the HOST udev in this timeframe causing a regression because a
new base image was published that includes a newer udev, we don't have
visibility on it.

Revision history for this message

Steve Langasek (vorlon) wrote on 2023-12-09: Re: livecd-rootfs uses losetup -P for theoretically reliable/synchronous partition setup but it's not reliable in noble

Oh. To the question of whether there was a systemd change in this window: yes absolutely, because this is the point at which the riscv64 builders moved from lgw manually-operated qemu with a 20.04 guest image, to bos03 openstack-operated qemu with a 22.04 guest image.

Which is also why we've moved from 5.13.0-1019-generic to 5.19.0-1021-generic.

But again, it was my understanding that these devices are supposed to be created synchronously WITHOUT the involvement of udev. In fact, we had to make launchpad-buildd changes to make use of these devices at all because udev would NOT set them up for us.

So if these are now being set up via udev, that's a significant departure from expectations and it's not clear we even CAN have synchronous behavior given that they would be set up by the host udev and not the udev in the lxd container!

Revision history for this message

Dimitri John Ledkov (xnox) wrote on 2023-12-11:

my expectation is that udev should be running (somewhere, not sure if it needs to be both the host and the lxd guest) and that it should process the device using locks https://systemd.io/BLOCK_DEVICE_LOCKING/.

After that is done, the device should be safe to operate on, in a consistent manner.

After all,

       -P, --partscan
           Force the kernel to scan the partition table on a newly created
           loop device. Note that the partition table parsing depends on
           sector sizes. The default is sector size is 512 bytes, otherwise
           you need to use the option --sector-size together with --partscan.

Is only asking kernel to scan the device; to then generate "kernel udev" events; for then udev to wakeup and process/emit "udev udev" events; and create the required device nodes.

We have always been fixing and supporting running udev inside the lxd containers, because of such things (in contexts of priviledged containers, but outside of lp-buildd) to make all of this work.

Revision history for this message

Emil Renner Berthing (esmil) wrote on 2023-12-11 (last edit on 2023-12-11):

I don't have a good explanation, but in the past I've "fixed" such races by adding a `sync "$loop_device"` before using any of the newly created partitions. Maybe it's worth trying.

Revision history for this message

Michael Hudson-Doyle (mwhudson) wrote on 2023-12-11: Re: [Bug 2045586] Re: livecd-rootfs uses losetup -P for theoretically reliable/synchronous partition setup but it's not reliable in noble

> Is only asking kernel to scan the device; to then generate "kernel udev"
> events; for then udev to wakeup and process/emit "udev udev" events; and
> create the required device nodes.
>

It's not udev that creates nodes like /dev/loop1p1 though is it? That's
devtmpfs surely.

Revision history for this message

dann frazier (dannf) wrote on 2024-01-22: Re: livecd-rootfs uses losetup -P for theoretically reliable/synchronous partition setup but it's not reliable in noble

#10

Download full text (5.2 KiB)

I ran into this on jammy/amd64: https://autopkgtest.ubuntu.com/results/autopkgtest-jammy/jammy/amd64/l/livecd-rootfs/20240121_173406_e4f9a@/log.gz

I downloaded all of the amd64 failures and searched for this failure pattern. These were the kernels that were running at the time:

"Linux 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023"
"Linux 6.2.0-21-generic #21-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 14 12:34:02 UTC 2023"
"Linux 6.3.0-7-generic #7-Ubuntu SMP PREEMPT_DYNAMIC Thu Jun 8 16:02:30 UTC 2023"
"Linux 6.5.0-9-generic #9-Ubuntu SMP PREEMPT_DYNAMIC Sat Oct 7 01:35:40 UTC 2023"
"Linux 6.6.0-14-generic #14-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 30 10:27:29 UTC 2023"

Here's the count of failures per image type:
     12 017-disk-image-uefi.binary
      3 018-disk-image.binary
      3 020-kvm-image.binary
      1 023-vagrant.binary
      1 024-vagrant.binary

I can confirm that /dev/loop0p1 is created by devtmpfs. This surprised me because I'd never actually need to know what devtmpfs was, and I saw devices being created even though I had SIGSTOP'd systemd-udevd. But watching udevadm monitor and forktrace output convinced me.

I had a theory that something was opening the first created partition before all partitions were created. loop_reread_partitions() can fail without returning an error to userspace:
https://elixir.bootlin.com/linux/v5.15.147/source/drivers/block/loop.c#L676

that could happen if bdev_disk_changed() aborts because it finds another partition on the device is open:
https://elixir.bootlin.com/linux/v5.15.147/source/block/partitions/core.c#L662

But then we should see this in dmesg:
pr_warn("%s: partition scan of loop%d (%s) failed (rc=%d)\n"

I added dmesg calls to check that:
https://autopkgtest.ubuntu.com/results/autopkgtest-jammy-dannf-test/jammy/amd64/l/livecd-rootfs/20240122_161631_62ecd@/log.gz

.. but no such message appeared, so that's not it. But what *is* interesting there is that it shows *2* partition scan lines:

1248s [ 990.855361] loop0: detected capacity change from 0 to 4612096
1248s [ 990.855628] loop0: p1 p14 p15
1248s [ 990.874241] loop0: p1 p14 p15

Previously we just saw 1:

1189s [ 932.268459] loop0: detected capacity change from 0 to 4612096
1189s [ 932.268715] loop0: p1 p14 p15

That only gets printed when bdev_disk_changed() is called. So do we have 2 racing callers?

One thing that seems off is that loop_configure() unsuppresses uevents for the full device before the partition scan, but loop_change_fd() waits until the partition scan is complete. Shouldn't they be following the same pattern? I wonder if that could cause the following race:

[livecd-rootfs] losetup creates /dev/loop0
[livecd-rootfs] kernel sends uevent for /dev/loop0
[livecd-rootfs] /dev/loop0p* appear in devtmpfs
      [udev] receives uevent for loop0
      [udev] partprobe /dev/loop0
[livecd-rootfs] losetup exit(0)
      [partprobe] /dev/loop0p* cleared
[livecd-rootfs] check for /dev/loop0p1 FAILS
      [partprobe] /dev/loop0p* recreated

I tried checking for this using ftrace in a local jammy VM. I haven't been able to reproduce this in a local VM, but I wanted to see what happens in a normal losetup.. er.....

I ran into this on jammy/amd64: https://autopkgtest.ubuntu.com/results/autopkgtest-jammy/jammy/amd64/l/livecd-rootfs/20240121_173406_e4f9a@/log.gz

I downloaded all of the amd64 failures and searched for this failure pattern. These were the kernels that were running at the time:

"Linux 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023"
"Linux 6.2.0-21-generic #21-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 14 12:34:02 UTC 2023"
"Linux 6.3.0-7-generic #7-Ubuntu SMP PREEMPT_DYNAMIC Thu Jun  8 16:02:30 UTC 2023"
"Linux 6.5.0-9-generic #9-Ubuntu SMP PREEMPT_DYNAMIC Sat Oct  7 01:35:40 UTC 2023"
"Linux 6.6.0-14-generic #14-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 30 10:27:29 UTC 2023"

Here's the count of failures per image type:
     12 017-disk-image-uefi.binary
      3 018-disk-image.binary
      3 020-kvm-image.binary
      1 023-vagrant.binary
      1 024-vagrant.binary

I had a theory that something was opening the first created partition before all partitions were created. loop_reread_partitions() can fail without returning an error to userspace:
 https://elixir.bootlin.com/linux/v5.15.147/source/drivers/block/loop.c#L676

that could happen if bdev_disk_changed() aborts because it finds another partition on the device is open:
 https://elixir.bootlin.com/linux/v5.15.147/source/block/partitions/core.c#L662

But then we should see this in dmesg:
  pr_warn("%s: partition scan of loop%d (%s) failed (rc=%d)\n"

I added dmesg calls to check that:
https://autopkgtest.ubuntu.com/results/autopkgtest-jammy-dannf-test/jammy/amd64/l/livecd-rootfs/20240122_161631_62ecd@/log.gz

.. but no such message appeared, so that's not it. But what *is* interesting there is that it shows *2* partition scan lines:

1248s [  990.855361] loop0: detected capacity change from 0 to 4612096
1248s [  990.855628]  loop0: p1 p14 p15
1248s [  990.874241]  loop0: p1 p14 p15

Previously we just saw 1:

1189s [  932.268459] loop0: detected capacity change from 0 to 4612096
1189s [  932.268715]  loop0: p1 p14 p15

That only gets printed when bdev_disk_changed() is called. So do we have 2 racing callers?

I tried checking for this using ftrace in a local jammy VM. I haven't been able to reproduce this in a local VM, but I wanted to see what happens in a normal losetup.. er... setup.
 
>>>>> First I used losetup to create the device:

root@dannf-livecd-rootfs-debug:/sys/kernel/debug/tracing# loopdev="$(losetup --show -f -P -v /home/ubuntu/disk.img)"
root@dannf-livecd-rootfs-debug:/sys/kernel/debug/tracing# cat trace
# tracer: function
#
# entries-in-buffer/entries-written: 1/1   #P:1
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
         losetup-1996    [000] .....   657.573994: bdev_disk_changed <-loop_reread_partitions

>>>>> Only the expected bdev_disk_change() call
>>>>> Then I remove the device:

root@dannf-livecd-rootfs-debug:/sys/kernel/debug/tracing# losetup -v -d $loopdev
root@dannf-livecd-rootfs-debug:/sys/kernel/debug/tracing# cat trace
# tracer: function
#
# entries-in-buffer/entries-written: 3/3   #P:1
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
         losetup-1996    [000] .....   657.573994: bdev_disk_changed <-loop_reread_partitions
   systemd-udevd-2524    [000] .....   680.555336: bdev_disk_changed <-blkdev_get_whole
         losetup-2523    [000] .....   680.568473: bdev_disk_changed <-__loop_clr_fd

>>>>> udev did rescan the partitions... but only during removal. And its strange that it gets in there before losetup calls it itself.

Perhaps I should add this tracing into a debug livecd-rootfs and see if I can reproduce in our autopkgtest infra.

Revision history for this message

dann frazier (dannf) wrote on 2024-01-23:

#11

I ran the above test:
https://autopkgtest.ubuntu.com/results/autopkgtest-jammy-dannf-test/jammy/amd64/l/livecd-rootfs/20240123_035147_6470b@/log.gz

It does appear that systemd-udevd is trying to scan partitions at the same time as losetup:

1599s ++ losetup --show -f -P -v binary/boot/disk-uefi.ext4
1600s + loop_device=/dev/loop0
1600s + '[' '!' -b /dev/loop0 ']'
1600s + rootfs_dev_mapper=/dev/loop0p1
1600s + '[' '!' -b /dev/loop0p1 ']'
1600s + echo '/dev/loop0p1 is not a block device'
1600s /dev/loop0p1 is not a block device
1600s + echo '=== dmesg ==='
1600s === dmesg ===
1600s + dmesg -c
1600s [ 986.014824] EXT4-fs (loop0p1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
1600s [ 992.684380] EXT4-fs (loop0p1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
1600s [ 1043.171603] loop0: detected capacity change from 0 to 4612096
1600s [ 1043.171924] loop0: p1 p14 p15
1600s [ 1043.190421] loop0: p1 p14 p15
1600s + cat /sys/kernel/debug/tracing/trace
1600s # tracer: function
1600s #
1600s # entries-in-buffer/entries-written: 2/2 #P:4
1600s #
1600s # _-----=> irqs-off
1600s # / _----=> need-resched
1600s # | / _---=> hardirq/softirq
1600s # || / _--=> preempt-depth
1600s # ||| / _-=> migrate-disable
1600s # |||| / delay
1600s # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
1600s # | | | ||||| | |
1600s losetup-50167 [002] ..... 1043.176845: bdev_disk_changed <-loop_reread_partitions
1600s systemd-udevd-321 [000] ..... 1043.195003: bdev_disk_changed <-blkdev_get_whole
1600s + echo 0
1600s + ls -l /dev/loop0p1
1600s brw------- 1 root root 259, 3 Jan 23 03:51 /dev/loop0p1
1600s + exit 1
1600s + clean_loops

Maybe we just need something like this?

- /* enable and uncork uevent now that we are done */
- dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 0);
-
        loop_global_unlock(lo, is_loop);
        if (partscan)
                loop_reread_partitions(lo);

+ /* enable and uncork uevent now that we are done */
+ dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 0);
+
if (!(mode & FMODE_EXCL))
bd_abort_claiming(bdev, loop_configure);

I ran the above test:
  https://autopkgtest.ubuntu.com/results/autopkgtest-jammy-dannf-test/jammy/amd64/l/livecd-rootfs/20240123_035147_6470b@/log.gz

It does appear that systemd-udevd is trying to scan partitions at the same time as losetup:

1599s ++ losetup --show -f -P -v binary/boot/disk-uefi.ext4
1600s + loop_device=/dev/loop0
1600s + '[' '!' -b /dev/loop0 ']'
1600s + rootfs_dev_mapper=/dev/loop0p1
1600s + '[' '!' -b /dev/loop0p1 ']'
1600s + echo '/dev/loop0p1 is not a block device'
1600s /dev/loop0p1 is not a block device
1600s + echo '=== dmesg ==='
1600s === dmesg ===
1600s + dmesg -c
1600s [  986.014824] EXT4-fs (loop0p1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
1600s [  992.684380] EXT4-fs (loop0p1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
1600s [ 1043.171603] loop0: detected capacity change from 0 to 4612096
1600s [ 1043.171924]  loop0: p1 p14 p15
1600s [ 1043.190421]  loop0: p1 p14 p15
1600s + cat /sys/kernel/debug/tracing/trace
1600s # tracer: function
1600s #
1600s # entries-in-buffer/entries-written: 2/2   #P:4
1600s #
1600s #                                _-----=> irqs-off
1600s #                               / _----=> need-resched
1600s #                              | / _---=> hardirq/softirq
1600s #                              || / _--=> preempt-depth
1600s #                              ||| / _-=> migrate-disable
1600s #                              |||| /     delay
1600s #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
1600s #              | |         |   |||||     |         |
1600s          losetup-50167   [002] .....  1043.176845: bdev_disk_changed <-loop_reread_partitions
1600s    systemd-udevd-321     [000] .....  1043.195003: bdev_disk_changed <-blkdev_get_whole
1600s + echo 0
1600s + ls -l /dev/loop0p1
1600s brw------- 1 root root 259, 3 Jan 23 03:51 /dev/loop0p1
1600s + exit 1
1600s + clean_loops

Maybe we just need something like this?

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 48c530b83000e..52fda87f5d674 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1366,13 +1366,13 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
        if (partscan)
                lo->lo_disk->flags &= ~GENHD_FL_NO_PART_SCAN;
 
-       /* enable and uncork uevent now that we are done */
-       dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 0);
-
        loop_global_unlock(lo, is_loop);
        if (partscan)
                loop_reread_partitions(lo);
 
+       /* enable and uncork uevent now that we are done */
+       dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 0);
+
        if (!(mode & FMODE_EXCL))
                bd_abort_claiming(bdev, loop_configure);

dann frazier (dannf) on 2024-01-23

summary:

livecd-rootfs uses losetup -P for theoretically reliable/synchronous
- partition setup but it's not reliable in noble
+ partition setup but it's not reliable

Revision history for this message

Michael Hudson-Doyle (mwhudson) wrote on 2024-01-24:

#12

Amazing debugging Dann. Until we can get a kernel fix, what's the way forward here? Run losetup without -P, run udevadm settle, run partprobe on the device (then maybe run udevadm settle again??)

Revision history for this message

dann frazier (dannf) wrote on 2024-01-24: Re: [Bug 2045586] Re: livecd-rootfs uses losetup -P for theoretically reliable/synchronous partition setup but it's not reliable

#13

On Tue, Jan 23, 2024 at 8:55 PM Michael Hudson-Doyle
<email address hidden> wrote:
>
> Amazing debugging Dann. Until we can get a kernel fix, what's the way
> forward here? Run losetup without -P, run udevadm settle, run partprobe
> on the device (then maybe run udevadm settle again??)

That seems logical. I don't see the need for the 2nd udevadm settle -
but I also don't see the harm.

I've a test fix kernel building in a PPA. I think Łukasz mentioned
that there's a way to have an autopkgtest run on a specific kernel?
Can someone give me those steps?

-dann

Revision history for this message

Michael Hudson-Doyle (mwhudson) wrote on 2024-01-24:

#14

Here's my workaround then https://code.launchpad.net/~mwhudson/livecd-rootfs/+git/livecd-rootfs/+merge/459380

Revision history for this message

Launchpad Janitor (janitor) wrote on 2024-01-25:

#15

This bug was fixed in the package livecd-rootfs - 24.04.21

---------------
livecd-rootfs (24.04.21) noble; urgency=medium

* live-build/functions: avoid losetup -P as it appears to race with udev and
do it a bit more by-hand instead. (LP: #2045586)

-- Michael Hudson-Doyle <email address hidden> Thu, 25 Jan 2024 10:28:38 +1300

Changed in livecd-rootfs (Ubuntu):
status:	New → Fix Released

Revision history for this message

dann frazier (dannf) wrote on 2024-01-26:

#16

That kernel change doesn't fix the issue:

https://autopkgtest.ubuntu.com/results/autopkgtest-noble-dannf-loop/noble/amd64/l/livecd-rootfs/20240125_203808_8b5c9@/log.gz

Which actually didn't surprise me after thinking about it. systemd-udevd is going to ask for a partition reread when it gets the uevent for loop0 at some point - it doesn't matter if we hold it off until losetup's reread completes. There will always be a window where the partition device files disappear and reappear.

Perhaps we just need a `udevadm wait $rootfs_dev_mapper --settle`.

Revision history for this message

dann frazier (dannf) wrote on 2024-01-28:

#17

I was bothered by the fact that we only sometimes see the double partition rescan in dmesg. If udev always rescans the partitions of a full block devices, shouldn't we always see those messages twice?

It turns out that udev doesn't rescan partitions just because a new block device appears. When it gets a uevent for a full block device, it registers an inotify watch on the device file for any process closing a file that was open in write mode. Presumably that process could've changed the partitions so, when that inotify event fires, it requests a partition rescan.

In our case, losetup has opened /dev/loop0 in write mode. When it calls into the LOOP_CONFIGURE ioctl(), a uevent will fire for /dev/loop0. If udev processes that event and adds the inotify watch before losetup closes the device, then closing the device will trigger the inotify event and udev will rescan.

I don't see a way to order operations to prevent this race. And I don't think we can "settle" our way out of it, because its not a uevent we're racing with, its the inotify event. But it looks like systemd may already have a solution for that:
https://systemd.io/BLOCK_DEVICE_LOCKING/

I think as long as we run all of our commands that access the partition device files under `udevadm lock -d <fulldevice>`, then systemd should not trigger a partition reread underneath us. Here's what I'm thinking:

https://code.launchpad.net/~dannf/livecd-rootfs/+git/livecd-rootfs/+merge/459549

Of course, udevadm lock didn't exist until after jammy. And all versions that do have it, including the one in v255 in noble-proposed, are broken and need this patch backport:

https://github.com/systemd/systemd/commit/ba340e2a75a0a16031fcb7efa05cfd250e859f17

Revision history for this message

dann frazier (dannf) wrote on 2024-01-29:

#18

I updated my livecd-rootfs PPA test package that runs this section of code in a loop to use this pattern, and it survived until the autopkgtest timeout - 304 iterations:

https://autopkgtest.ubuntu.com/results/autopkgtest-noble-dannf-loop/noble/amd64/l/livecd-rootfs/20240128_061640_5c909@/log.gz

...when it previously was failing reliably within a few iterations.

They `udevadm lock` fix is now in unstable, so the next sync should pull it in.

However, I don't know that we really need any features of `udevadm lock` beyond the actual flock'ing of the device. We could presumably do that with `flock` for older releases. Or we could consider SRU'ing back the udevadm lock command to jammy as a low priority SRU. The way it was introduced looks very non-invasive:
https://github.com/systemd/systemd/commit/8b12a516e9304f720e07eeec5d25c5b7f7104bc9
But I haven't gone through the bug fixes since to ensure it remained that way.

Revision history for this message

Michael Hudson-Doyle (mwhudson) wrote on 2024-01-31:

#19

Oh wait, we've been through something very like this before https://bugs.launchpad.net/cloud-initramfs-tools/+bug/1834875. I suspect a judicious application of flock may be the most correct solution available.

Revision history for this message

dann frazier (dannf) wrote on 2024-01-31:

#20

On Tue, Jan 30, 2024 at 6:21 PM Michael Hudson-Doyle
<email address hidden> wrote:
>
> Oh wait, we've been through something very like this before
> https://bugs.launchpad.net/cloud-initramfs-tools/+bug/1834875. I suspect
> a judicious application of flock may be the most correct solution
> available.

ACK - I'll switch over to flock and run another test.

Revision history for this message

dann frazier (dannf) wrote on 2024-01-31:

#21

The test passed w/ flock as well:
https://autopkgtest.ubuntu.com/results/autopkgtest-noble-dannf-loop/noble/amd64/l/livecd-rootfs/20240131_104633_df869@/log.gz

274 successful iterations before the timeout killed it.

I've updated the MP:
https://code.launchpad.net/~dannf/livecd-rootfs/+git/livecd-rootfs/+merge/459549

Revision history for this message

dann frazier (dannf) wrote on 2024-02-05:

#22

While the above MP is now merged, I still see additional potential races in the code. For example, anything calling mount_partition() for the first partition, and also maybe bug 2030771. I don't see a good way to solve that w/o `udevadm lock` because these functions don't currently know what the full block device - and I don't think we want to implement that logic ourselves.

I suggest we move from flock to `udevadm lock` once it is available in noble. If that proves to be stable, then perhaps we consider backporting that to jammy's systemd.

Revision history for this message

Catherine Redfield (catred) wrote on 2024-02-09:

#23

I've been seeing this bug as far back as losetup is used (jammy) and the previously merged fix has a significant improvement on successful builds (0/3 succeeded without the patch applied vs 5/5 succeeded with the patch applied this week) in jammy. Is someone already looking to SRU that patch back to mantic and jammy or should I work on it? Even if it's not perfectly, I think it would be extremely benefitial.

Revision history for this message

dann frazier (dannf) wrote on 2024-02-09:

#24

I'm glad to hear you are seeing an improvement! The current implementation is still racy as mentioned in Comment #22. I did a search of the logs I downloaded, and I believe this one shows that the mount_partition() race isn't just theoretical:

https://autopkgtest.ubuntu.com/results/autopkgtest-mantic/mantic/amd64/l/livecd-rootfs/20230928_132941_a6800@/log.gz

So my suggestion is that we:

1) wait for systemd 255.3 to be merged in noble to fix the issue w/ `udevadm lock`
2) convert livecd-rootfs to use udevadm lock in noble, addressing the remaining known races.
3) backport `udevadm lock` to jammy (likely as a low-priority SRU)
4) backport the`udevadm lock` support to livecd-rootfs

I realize that'll take some time, so if you want to SRU the partial fix as a stop-gap, feel free.

Revision history for this message

Michael Hudson-Doyle (mwhudson) wrote on 2024-02-18:

#25

Here is a backport of the partial fix to jammy https://code.launchpad.net/~mwhudson/livecd-rootfs/+git/livecd-rootfs/+merge/460729

I'm not sure I am best placed to update this bug description to match the SRU template though -- I'm not sure which builds are being affected in practice and so should be incorporated into the test plan.

Michael Hudson-Doyle (mwhudson) on 2024-02-18

description:

updated

Revision history for this message

Łukasz Zemczak (sil2100) wrote on 2024-02-19: Please test proposed package

#26

Hello Steve, or anyone else affected,

Accepted livecd-rootfs into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/livecd-rootfs/2.765.39 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in livecd-rootfs (Ubuntu Jammy):
status:	New → Fix Committed
tags:	added: verification-needed verification-needed-jammy

Revision history for this message

Łukasz Zemczak (sil2100) wrote on 2024-02-19:

#27

Releasing early as we're in the middle of a point-release. Autopkgtests look good, the actual testing will happen on the RCs.

Revision history for this message

Launchpad Janitor (janitor) wrote on 2024-02-19:

#28

This bug was fixed in the package livecd-rootfs - 2.765.39

---------------
livecd-rootfs (2.765.39) jammy; urgency=medium

  [ dann frazier ]
  * Use flock to avoid races with systemd-udevd that cause loop device
    partitions to briefly disappear. (LP: #2045586)

-- Michael Hudson-Doyle <email address hidden> Mon, 19 Feb 2024 09:25:09 +1300

Changed in livecd-rootfs (Ubuntu Jammy):
status:	Fix Committed → Fix Released

Revision history for this message

Catherine Redfield (catred) wrote on 2024-02-21:

#29

I tested the flock-based solution with some of the CPC pipelines in jammy and saw consistently clean builds (30 successful images built yesterday). Thank you very much for everyone's hard work debugging and fixing this race condition!

Revision history for this message

Brian Murray (brian-murray) wrote on 2024-07-16:

#30

Ubuntu 23.10 (Mantic Minotaur) has reached end of life, so this bug will not be fixed for that specific release.

Changed in linux (Ubuntu Mantic):
status:	New → Won't Fix
Changed in livecd-rootfs (Ubuntu Mantic):
status:	New → Won't Fix
Changed in util-linux (Ubuntu Mantic):
status:	New → Won't Fix

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.

Ubuntuutil-linux package

livecd-rootfs uses losetup -P for theoretically reliable/synchronous partition setup but it's not reliable

Bug Description

Related branches

Other bug subscribers

Remote bug watches

Ubuntu
util-linux package