livecd-rootfs focal hangs during kpartx (ubuntu-cpc project)

Bug #1852518 reported by Robert C Jennings
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
launchpad-buildd
Fix Released
Undecided
Unassigned

Bug Description

Focal livecd-rootfs builds of the ubuntu-cpc project hang indefinitely when 'kpartx -av binary/boot/disk.ext4' is called. An strace shows that kpartx is sleeping waiting on netlink.

In the build container the kpartx process is in the sleeping state.
The container journalctl output has messages like:
  systemd-udevd[19698]: dm-0: Failed to process device, ignoring: No such file or directory
The host journalctl output has messages like:
  kernel: blk_update_request: I/O error, dev loop0, sector 0

I'll attach the build output up to the hang as https://bugs.launchpad.net/launchpad-buildd/+bug/1852518/+attachment/5305276/+files/binary.log

The lxd environment is a privileged container and udev won't create /dev/dm-# devices in /dev because devtmpfs is used in the host but tmpfs is used in the container. There is some change in behavior after Oct 30th in Focal that exposes this failure. In our image manifests we see the following changes between the working and failing build:

(MAN is the version of the package in the manifest for the working build while ARCHIVE is the version in the archive on the day of the first failing build)

bzip2:amd64 (MAN: 1.0.6-9.2) (ARCHIVE: 1.0.8-2)
gcc-9-base:amd64 (MAN: 9.2.1-12ubuntu1) (ARCHIVE: 9.2.1-15ubuntu1)
libbz2-1.0:amd64 (MAN: 1.0.6-9.2) (ARCHIVE: 1.0.8-2)
libgcc1:amd64 (MAN: 1:9.2.1-12ubuntu1) (ARCHIVE: 1:9.2.1-15ubuntu1)
libnss-systemd:amd64 (MAN: 242-7ubuntu3) (ARCHIVE: 243-2ubuntu1)
libpam-systemd:amd64 (MAN: 242-7ubuntu3) (ARCHIVE: 243-2ubuntu1)
libsoup2.4-1:amd64 (MAN: 2.68.2-0ubuntu1) (ARCHIVE: 2.68.2-1)
libstdc++6:amd64 (MAN: 9.2.1-12ubuntu1) (ARCHIVE: 9.2.1-15ubuntu1)
libsystemd0:amd64 (MAN: 242-7ubuntu3) (ARCHIVE: 243-2ubuntu1)
libudev1:amd64 (MAN: 242-7ubuntu3) (ARCHIVE: 243-2ubuntu1)
netcat-openbsd:amd64 (MAN: 1.203-1) (ARCHIVE: 1.203-2)
python3-distupgrade:amd64 (MAN: 1:19.10.15) (ARCHIVE: 1:20.04.1)
systemd:amd64 (MAN: 242-7ubuntu3) (ARCHIVE: 243-2ubuntu1)
systemd-sysv:amd64 (MAN: 242-7ubuntu3) (ARCHIVE: 243-2ubuntu1)
ubuntu-release-upgrader-core:amd64 (MAN: 1:19.10.15) (ARCHIVE: 1:20.04.1)
udev:amd64 (MAN: 242-7ubuntu3) (ARCHIVE: 243-2ubuntu1)

I tried back-leveling just udev:

# dpkg -i udev_242-7ubuntu3_amd64.deb libudev1_242-7ubuntu3_amd64.deb
(Reading database ... 20790 files and directories currently installed.)
Preparing to unpack udev_242-7ubuntu3_amd64.deb ...
Unpacking udev (242-7ubuntu3) over (242-7ubuntu3) ...
dpkg: warning: downgrading libudev1:amd64 from 243-3ubuntu1 to 242-7ubuntu3
Preparing to unpack libudev1_242-7ubuntu3_amd64.deb ...
Unpacking libudev1:amd64 (242-7ubuntu3) over (243-3ubuntu1) ...
Setting up libudev1:amd64 (242-7ubuntu3) ...
Setting up udev (242-7ubuntu3) ...
update-initramfs: deferring update (trigger activated)
Processing triggers for systemd (243-3ubuntu1) ...
Processing triggers for libc-bin (2.30-0ubuntu2) ...
Processing triggers for initramfs-tools (0.133ubuntu11) ...
# systemctl restart systemd-udevd.service

The problem still recreated with kpartx hanging in the build with this change but maybe still be close. However, I am able to address the kpartx hang by making the /dev/dm-X nodes in launchpad-buildd or changing the livecd-rootfs hooks to call 'kpartx -anv binary/boot/disk.ext4; dmsetup --noudevsync mknodes' because the -n flag will return before the partitions are created.

On that kpartx invocation change, it doesn't appear that kpartx has changed in focal so a behavior it depends on appears to have changes. Also I'd like to avoid changing the pattern for calling kpartx due to use in various locations of livecd-rootfs.

So I'm going to propose a patch to launchpad-buildd to add the /dev/dm-X nodes so we can have focal builds again, but we also should dig into the underlying cause of kpartx waiting indefinitely as that isn't excellent behavior.

Related branches

Revision history for this message
Robert C Jennings (rcj) wrote :
description: updated
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

I managed to reproduce this in a xenial vm on my machine. The problem appears with udev 243 it seems (installing udev and libudev 242-7ubuntu3 makes it work again).

When kpartx works it prints this message:

root@c:~# kpartx -av image.img
/dev/mapper/loop0p1 not set up by udev: Falling back to direct node creation.
add map loop0p1 (252:0): 0 2088960 linear 7:0 4096

so for some reason it's not getting to that fallback. FWIW, that fallback code is in libdevmapper (src:lvm2).

tags: added: id-5dc15a912d7f90886388bfd5
Revision history for this message
Robert C Jennings (rcj) wrote :

I have been using https://github.com/chrisglass/ubuntu-old-fashioned to accurately recreate the lxd profile/environment as done in the buildds. This is a privileged container with particular settings https://bazaar.launchpad.net/~canonical-launchpad-branches/launchpad-buildd/trunk/view/head:/lpbuildd/target/lxd.py#L287 and manually created device nodes https://bazaar.launchpad.net/~canonical-launchpad-branches/launchpad-buildd/trunk/view/head:/lpbuildd/target/lxd.py#L438 that may be hard to duplicate accurately by hand.

I've edited /usr/bin/old-fashioned-image-build further to hold the builder environment between setup and build. That gives you the ability to 'sudo lxc exec lp-focal-amd64 bash' and experiment with backleveling packages to help bisect or to run other manual experiments to recreate without the build.

--- ubuntu-old-fashioned/old-fashioned-image-build 2019-11-13 21:25:10.296000000 +0000
+++ /usr/bin/old-fashioned-image-build 2019-11-14 13:56:24.897891670 +0000
@@ -209,6 +209,12 @@
 # buildlivefs (below) overwriting changes
 lxc exec lp-$SERIES-${ARCH} -- apt-get install -y livecd-rootfs

+touch /tmp/hold
+while [ -f /tmp/hold ] ; do
+ echo "Waiting for /tmp/hold to be removed"
+ sleep 30
+done
+
 # Remove and recreate the livecd-rootfs code we will be replacing with code
 # in our current directory
 lxc exec lp-$SERIES-${ARCH} -- rm -rf /usr/share/livecd-rootfs

My steps to recreate in the environment are:

truncate foo 4M
gdisk foo
# Create a partition
kpartx -av foo
# recreate hang

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

I'm attaching a more self-contained test script that I've used to reproduce this on a xenial VM (it doesn't work with the lxd in eoan, for reasons trivial or deep, I have no idea).

Revision history for this message
Balint Reczey (rbalint) wrote :

This is the end of the strace interleaved with journalct output:

...
ioctl(4, DM_DEV_STATUS, {version=4.0.0, data_size=16384, name="loop0p1", flags=DM_EXISTS_FLAG|DM_SKIP_BDGET_FLAG} => {version=4.34.0, data_size=16384, name="loop0p1", flags=DM_EXISTS_FLAG|DM_SKIP_BDGET_FLAG}) = -1 ENXIO (No such device or address)
ioctl(4, DM_DEV_STATUS, {version=4.0.0, data_size=16384, uuid="part1-devnode_7:0_Wh5pYvM", flags=DM_EXISTS_FLAG} => {version=4.34.0, data_size=16384, uuid="part1-devnode_7:0_Wh5pYvM", flags=DM_EXISTS_FLAG}) = -1 ENXIO (No such device or address)
openat(AT_FDCWD, "/dev/urandom", O_RDONLY) = 5
read(5, "r\5", 2) = 2
semget(0xd4d0572, 1, IPC_CREAT|IPC_EXCL|0600) = 0
semctl(0, 0, SETVAL, 0x7ffd00000001) = 0
semctl(0, 0, GETVAL, NULL) = 1
close(5) = 0
semop(0, [{0, 1, 0}], 1) = 0
semctl(0, 0, GETVAL, NULL) = 2
ioctl(4, DM_DEV_CREATE, {version=4.0.0, data_size=16384, name="loop0p1", uuid="part1-devnode_7:0_Wh5pYvM", flags=DM_EXISTS_FLAG} => {version=4.34.0, data_size=305, dev=makedev(0xfc, 0), name="loop0p1", uuid="part1-devnode_7:0_Wh5pYvM", target_count=0, open_count=0, event_nr=0, flags=DM_EXISTS_FLAG}) = 0
ioctl(4, DM_TABLE_LOAD, {version=4.0.0, data_size=16384, data_start=312, name="loop0p1", target_count=1, flags=DM_EXISTS_FLAG, ...} => {version=4.34.0, data_size=305, data_start=312, dev=makedev(0xfc, 0), name="loop0p1", uuid="part1-devnode_7:0_Wh5pYvM", target_count=0, open_count=0, event_nr=0, flags=DM_EXISTS_FLAG|DM_INACTIVE_PRESENT_FLAG}) = 0
ioctl(4, DM_DEV_SUSPEND, {version=4.0.0, data_size=16384, name="loop0p1", event_nr=4195698, flags=DM_EXISTS_FLAG|DM_SKIP_BDGET_FLAG} => {version=4.34.0, data_size=305, dev=makedev(0xfc, 0), name="loop0p1", uuid="part1-devnode_7:0_Wh5pYvM", target_count=1, open_count=0, event_nr=0, flags=DM_EXISTS_FLAG|DM_ACTIVE_PRESENT_FLAG|DM_SKIP_BDGET_FLAG|DM_UEVENT_GENERATED_FLAG}) = 0
semget(0xd4d0572, 1, 000) = 0
semctl(0, 0, GETVAL, NULL) = 2
semop(0, [{0, -1, IPC_NOWAIT}], 1) = 0
semop(0, [{0, 0, 0}], 1Nov 14 19:31:53 workable-sunfish-ubuntu-bartender systemd-udevd[19340]: dm-0: Failed to process device, ignoring: No such file or directory
Nov 14 19:31:53 workable-sunfish-ubuntu-bartender systemd-udevd[19339]: dm-0: Failed to process device, ignoring: No such file or directory

@rcj, @mwhudson: I suggest applying the workaround in livecd-rootfs ('kpartx -anv binary/boot/disk.ext4; dmsetup --noudevsync mknodes') to get the builds going quickly (@rcj as I saw there is only one other script in use to fix) and I bisect the issue in systemd in parallel and we will see where this should be fixed.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

So in my (parallel) debugging it seems to me that notification from udev that processing of the new device is not getting through. If you let kpartx -sav hang in one window and run "dmsetup udevcomplete_all" in another, the kpartx completes successfully.

The notification is supposed to be being sent by the "dmsetup udevcomplete" command in 95-dm-notify.rules. udev seems to _say_ it's running this rule:

dm-0: /usr/lib/udev/rules.d/95-dm-notify.rules:12 RUN '/sbin/dmsetup udevcomplete $env{DM_COOKIE}'

appears in systemd-udevd --debug output, but when I hack the rule to have side effects, it's clearly not being run (it _is_ run when you run kpartx -d). So why would udev print the above line but not actually run the command?

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Ah hah, it's this upstream commit: https://github.com/systemd/systemd/commit/99058cd66ad30707d8eda32d6b256e5c05bbf46a

commit 99058cd66ad30707d8eda32d6b256e5c05bbf46a
Author: Yu Watanabe <email address hidden>
Date: Wed Jun 5 03:09:20 2019 +0900

    udev: propagate errors in udev_event_execute_rules()

    And do not set initialized flag to the device.

This means that because /dev/dm-0 does not get created in this situation, udev gives up before running the RUN command that was accumulated during rule processing.

The upstream change seems to make sense really but the failure mode is pretty unfortunate. I think for the purpose of getting livecd-rootfs builds going again we could/should switch to a wrapper around mount -o loop,offset=xxx and avoid all this devicemapper complexity...

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Oh, I think we should cherry-pick 67acde4869a9505f9721e31fa5167c82445e0e12:

commit 67acde4869a9505f9721e31fa5167c82445e0e12
Author: Yu Watanabe <email address hidden>
Date: Thu Sep 26 21:33:59 2019 +0900

    udevadm trigger: do not propagate EACCES and ENODEV

    Inside container, writing file returns EACCESS. Moreover, some devices
    return ENODEV rather than EACCES. So, let's also ignore these two
    error causes.

    Closes #13652.

Fixing an issue reported by our very own ddstreet...

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Wait that last comment is clearly nuts.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :
Revision history for this message
Balint Reczey (rbalint) wrote :

Systemd upstream's position is that udev is not supported in containers which makes it practically impossible to support it in Ubuntu only. Nor should livecd-rootfs collect all the trick to keep image building tools operable in containers somehow.

I suggest applying the workaround in LP to get image builds going now with the current LXD in VM setup and going forward implement running livecd-rootfs in VMs instead of in privileged containers in VMs (running a different release of Ubuntu compared to the container).

I also suggest adopting the policy of not accepting any component to the Ubuntu images that can't be installed in a chroot. This also helps in better isolation of the built image from the build environment.

Revision history for this message
Balint Reczey (rbalint) wrote :

Also livecd-rootfs is continuously tested to work in VMs in its autopkgtest, but it is not tested in an LXD running on an other version of Ubuntu scenario when dependencies change.

Revision history for this message
Balint Reczey (rbalint) wrote :

The test of no udev in containers can be followed here: https://bileto.ubuntu.com/#/ticket/3840

Revision history for this message
Colin Watson (cjwatson) wrote :

Using VMs for this is unlikely to happen: trying to use nested VMs would be ... optimistic, and anything else would involve a complete redesign of the build farm.

Revision history for this message
Colin Watson (cjwatson) wrote :

Fixed in launchpad-buildd 183, thanks to Robert C Jennings and Michael Hudson-Doyle.

Changed in launchpad-buildd:
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.