LXD build of edubuntu raspberry pi images fail

Bug #2049695 reported by Erich Eickmeyer
This bug affects 1 person
Affects: Ubuntu Image
Status: Triaged
Importance: Critical
Assigned to: Unassigned

Bug Description

With our first test build of Edubuntu Raspberry Pi images using the Launchpad infrastructure, while the call to ubuntu-image was successful, the build was unsuccessful. Logs at https://launchpadlibrarian.net/709847987/buildlog_ubuntu_noble_arm64_raspi_edubuntu-preinstalled_BUILDING.txt.gz

Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Hey Erich! Can you share your image definition yaml with us?

Paul Mars (upils)
Changed in ubuntu-image:
assignee: nobody → Paul Mars (upils)
importance: Undecided → High
importance: High → Critical
Revision history for this message
Erich Eickmeyer (eeickmeyer) wrote :

Hi Lukasz! It's in https://git.launchpad.net/ubuntu-images/tree/edubuntu-pi-arm64.yaml?h=noble, so already part of the infra.

Paul Mars (upils)
tags: added: foundations-todo
Paul Mars (upils)
Changed in ubuntu-image:
status: New → Triaged
Revision history for this message
Paul Mars (upils) wrote :

As mentioned by Steve, some investigation was already done when a similar bug occurred in curtin.

See mwhudson's explanation of the matter: https://git.launchpad.net/curtin/tree/curtin/util.py#n534

In the end, the method using "mount --make-rprivate" followed by a call to "umount --recursive" was dismissed there because curtin relied on not using this behavior.

This is currently the method we use in ubuntu-image. The error is a bit misleading: the function that unmounts /chroot/dev fails first, then a cleanup function tries to remove the workdir (and therefore /chroot/dev) and fails in turn because the unmount did not succeed.
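For illustration, the teardown sequence described above can be sketched as follows. This is a minimal sketch and not the ubuntu-image implementation: the function names and the DRY_RUN switch are hypothetical, the latter added only so the sequence can be inspected without root privileges.

```shell
# Run a command, or only print it when DRY_RUN is set.
# DRY_RUN is a hypothetical switch for inspecting the sequence safely.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: unmount one mount point and everything below it.
teardown_mount() {
    # First make the subtree private so unmount events do not propagate
    # to other mount namespaces or peers, then unmount recursively.
    run mount --make-rprivate "$1" && run umount --recursive "$1"
}
```

With DRY_RUN=1, calling "teardown_mount /path/to/chroot/dev" prints the two commands instead of running them. A "target is busy" error from the second command means something still holds a reference inside the mount.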

So the interesting bit is "umount: /tmp/ubuntu-image-a610cb9f-2939-4dca-91d5-592ad944f29e/chroot/dev: target is busy".

I am trying to reproduce while avoiding the 22min package installation but so far the "--make-rprivate" method is working properly on my machine.

Revision history for this message
Paul Mars (upils) wrote :

It looks like mwhudson solved a similar issue back in December. See https://git.launchpad.net/livecd-rootfs/tree/live-build/functions#n132

The changelog entry https://git.launchpad.net/livecd-rootfs/tree/debian/changelog#n83 does not mention the specific issue though.

Revision history for this message
Paul Mars (upils) wrote :

Using `sudo ubuntu-image classic --debug --workdir ~/scratch/edubuntu-pi-arm64 projects/ubuntu-images/edubuntu-pi-arm64.yaml` I was unable to reproduce on my amd64 machine (the build took almost 3h so I could only test once for now). The `install_packages` state completed without issue.

I do not have an arm64 machine at hand for now so I will try with canonistack.

Revision history for this message
Paul Mars (upils) wrote :

It also looks very similar to LP: #2033582. But that fix should also cover issues with /dev, so I guess the fix was either incomplete or there is another specific issue with something mounted or used in /dev.

Looking at the livecd-rootfs code, a similar issue occurred there and the mountpoints are now set up with:

    mount dev-live -t devtmpfs "$mountpoint/dev"
    mount devpts-live -t devpts -o nodev,nosuid "$mountpoint/dev/pts"
    mount proc-live -t proc "$mountpoint/proc"
    mount sysfs-live -t sysfs "$mountpoint/sys"

Two things to note here:

- /dev was previously mounted with 'mount --rbind /dev "$mountpoint/dev"' but it is now mounted directly
- dev-live and devpts-live are mounted instead of /dev

I was able to reproduce the issue in a canonistack VM. But once the build had failed, I was able to run the same commands ubuntu-image executes to unmount /dev (and the other mountpoints) and they worked. So I suspect a race condition somewhere that prevents /dev from being unmounted right after the package installation.

Again looking at the livecd-rootfs code I notice that `udevadm settle` is called in different places when dealing with partition umounting to make sure udev scripts are done.

I will try calling this before umounting /dev.

Revision history for this message
Paul Mars (upils) wrote :

The canonistack VM is slow. A second try took more than 6h.
I am now trying with GCP VMs (see [0]), but this is also slow so far.

[0] https://github.com/canonical/ubuntu-image/pull/178

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

My experience has certainly been that mounting another copy of these filesystems works better than bind mounting things. I notice that you say livecd-rootfs used to use --rbind and switched to direct mount -- but ubuntu-image uses --bind, not --rbind. Please let me know how much coffee is required to understand the difference between --bind and --rbind while reading the mount(8) man page.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

FWIW, here is the code I worked my way to for livefs-editor: https://github.com/mwhudson/livefs-editor/blob/main/livefs_edit/context.py#L157 -- it works most of the time! I think there might be some kernel refcounting bugs around overlayfs that prevent clean unmounts in some cases but I really hope ubuntu-image isn't running into those (yet?!)

Revision history for this message
Paul Mars (upils) wrote :

Thank you for the link to livefs-editor.

I think I understand the difference between --bind and --rbind (and yes, I needed to read it twice), and I think switching from one to the other would probably not solve the current bug.

So I will implement the solution adopted in livecd-rootfs and livefs-editor (directly mount the various filesystems with the appropriate type). At least we will have the same behavior everywhere.

Paul Mars (upils)
Changed in ubuntu-image:
status: Triaged → In Progress
Revision history for this message
Paul Mars (upils) wrote :

I notice you fall back to calling "umount -l" if "umount -R" fails when tearing down (see https://github.com/mwhudson/livefs-editor/blob/main/livefs_edit/context.py#L278).

This option was very tempting from the beginning but I fear, in the image building context, it will hide some hard to debug issues in the future.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Yes, I guess we might need to fall back to that but I think we should try as hard as we can not to!

Revision history for this message
Paul Mars (upils) wrote :

I was able to reproduce the issue again, even with my fix.

And this time unmounting the /dev of the chroot is not even possible after the build. So I ran fuser and based on the result it looks like at some point the real /dev of my machine was mounted.

 % sudo fuser -v -m /home/paul/scratch/edubuntu-pi-arm64/chroot/dev/
                     USER PID ACCESS COMMAND
/home/paul/scratch/edubuntu-pi-arm64/chroot/dev:
                     root kernel swap /dev/dm-1
                     root kernel mount /dev
                     root 110 .rc.. kdevtmpfs
                     root 6159 f.... lxd
                     paul 6996 ....m pulseaudio
                     paul 7125 ....m Xorg
                     paul 7270 ....m gsd-xsettings
                     paul 7274 ....m gsd-power
                     paul 7285 ....m gsd-color
                     paul 7288 ....m gnome-flashback
                     paul 7315 ....m gsd-keyboard
                     paul 7316 ....m gsd-wacom
                     paul 7319 ....m gsd-media-keys
                     paul 7691 ....m picom
                     paul 7937 ....m mattermost-desk
                     paul 7954 ....m xdg-desktop-por
                     paul 10083 ....m Vivaldi-Gpu
                     messagebus 490522 F.... dbus-daemon
                     messagebus 490530 F.... gvfsd
                     messagebus 490539 F.... gvfsd-fuse
                     messagebus 490551 F.... dconf-service

I do not understand how this is possible.

Revision history for this message
Paul Mars (upils) wrote :

Well, I added edubuntu build in our CI test suite and now this is failing because the builds are using all the disk space.

See https://github.com/canonical/ubuntu-image/actions/runs/7665245789/job/20891977541?pr=178

I will merge my solution anyway because it seems to improve things. Erich, you can try it once the new version is available in latest/edge, while I try to get these builds to work properly together.

Revision history for this message
Paul Mars (upils) wrote :

One build succeeded (on jammy) and 2 failed (mantic and focal) without a clear reason, but the failures do not seem related to this bug. I am not claiming victory yet but, Erich, I confirm you can try the latest release in latest/edge (3.2+snap1).

Revision history for this message
Erich Eickmeyer (eeickmeyer) wrote :

Thanks, Paul. The idea is to get this running on Noble daily builds automated with the Launchpad system which the Release Team can trigger (not myself yet), and Steve was helping me with at this point. I'm not 100% sure how that works, but I believe it requires some sort of SSH access to the Launchpad infrastructure to try.

Revision history for this message
Paul Mars (upils) wrote :

During a build I monitored the mountpoints. The package installation only created one more mountpoint than what we setup before chrooting: /chroot/var/lib/gdm3/.gvfs

We do need to clean it. I will fix u-i to detect and unmount "new" mounts, but this is not the source of our problem I think.
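One way to detect such "new" mounts is to snapshot the mount table before and after the package installation and compare the two. The sketch below is only an illustration in shell, not the ubuntu-image change itself: the function names are hypothetical, and the optional second argument exists only so the parsing can be tried on a saved copy of the table. It relies on the mount point being the fifth field of each /proc/self/mountinfo line, and it does not decode the octal escapes (such as \040 for a space) used in that file.

```shell
# List the mount points strictly below a directory, one per line, sorted.
# $1: directory, $2: mount table to read (defaults to the live one).
list_mounts_under() (
    root=${1%/}
    table=${2:-/proc/self/mountinfo}
    awk -v root="$root" 'index($5, root "/") == 1 { print $5 }' "$table" | sort -u
)

# Print the mount points listed in the "after" snapshot but not in the
# "before" snapshot. Both files must come from list_mounts_under.
new_mounts() (
    comm -13 "$1" "$2"
)

# Possible use around a package installation (requires root to unmount):
#   list_mounts_under "$chroot" > before.txt
#   ... install packages ...
#   list_mounts_under "$chroot" > after.txt
#   new_mounts before.txt after.txt | sort -r | while read -r m; do
#       umount "$m"
#   done
```

Sorting the result in reverse order unmounts nested mount points before their parents.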

No submounts were added under /dev (at least at the end I cannot see any).

I found 3 processes still using /chroot/dev/ (and more precisely /chroot/dev/null):

message+ 618909 0.0 0.0 800284 24684 ? Sl 13:07 0:00 /usr/libexec/qemu-binfmt/aarch64-binfmt-P /usr/libexec/gvfsd /usr/libexec/gvfsd
message+ 618930 0.0 0.0 649844 20528 ? Sl 13:07 0:00 /usr/libexec/qemu-binfmt/aarch64-binfmt-P /usr/libexec/dconf-service /usr/libexec/dconf-service
message+ 618901 0.0 0.0 230484 14156 ? Ssl 13:07 0:00 /usr/libexec/qemu-binfmt/aarch64-binfmt-P /usr/bin/dbus-daemon /usr/bin/dbus-daemon --syslog --fork --print-pid 4 --print-address 6 --session

These 3 processes are children of "/lib/systemd/systemd --user".

I do not understand why, and I do not understand why they are not executed from the chroot. I expected paths like /chroot/usr/libexec/dconf-service

I suppose these are leftover processes following a desktop related package install.

I see several ways to solve this:

- list and kill every process still using files in the chroot before unmounting. It feels a bit brutal and may cause issues later.
- understand which pkg installation is starting these processes. Given how many pkgs we are installing, it feels like looking for a needle in a haystack. I do not even know where to begin (but AFAIK the ubuntu raspi image is not affected, so this is a priori in one of the packages added for edubuntu)
- after reading the umount man page again, I am still convinced using the --lazy option is a bad idea, even though this is the "easy" solution I keep finding everywhere.
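As an illustration of the first option, the listing step could look like the sketch below. It is a rough shell approximation with a hypothetical function name: it only checks each process's root directory, working directory and open file descriptors, so unlike "fuser -m" it misses memory-mapped files, and it must run as root to see other users' processes. It only prints PIDs and kills nothing. The second argument defaults to /proc and is a parameter only so the function can be tried against a fake tree.

```shell
# Print the PIDs of processes whose root directory, working directory or
# open file descriptors point at $1 or at something below it.
pids_using_tree() (
    tree=${1%/}
    proc=${2:-/proc}
    for dir in "$proc"/[0-9]*; do
        [ -d "$dir" ] || continue
        for link in "$dir/root" "$dir/cwd" "$dir"/fd/*; do
            # Unreadable or missing links are skipped silently.
            target=$(readlink "$link" 2>/dev/null) || continue
            case $target in
                "$tree"|"$tree"/*)
                    echo "${dir##*/}"
                    break   # one match is enough for this PID
                    ;;
            esac
        done
    done
)
```

For example, "pids_using_tree /path/to/workdir/chroot" would be expected to list processes like the three shown above, since they keep /chroot/dev/null open.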

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

It appears ubuntu-image does not create a policy-rc.d file to prevent services from starting when installing packages into the chroot. I'm not sure anything about policy-rc.d is very well documented but see

https://people.debian.org/~hmh/invokerc.d-policyrc.d-specification.txt
https://git.launchpad.net/ubuntu/+source/live-build/tree/scripts/build/lb_chroot_sysv-rc?h=applied/ubuntu/noble
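For readers unfamiliar with the mechanism, a minimal sketch of such a file follows. invoke-rc.d consults /usr/sbin/policy-rc.d, and an exit status of 101 means "action forbidden by policy" in the specification linked above. The function name is hypothetical and this is not the code from live-build or ubuntu-image.

```shell
# Write a policy-rc.d into the chroot that denies every invoke-rc.d action.
# $1: path to the chroot directory.
install_policy_rc_d() (
    chroot_dir=$1
    mkdir -p "$chroot_dir/usr/sbin"
    # 101 = action forbidden by policy
    printf '%s\n' '#!/bin/sh' 'exit 101' > "$chroot_dir/usr/sbin/policy-rc.d"
    chmod 755 "$chroot_dir/usr/sbin/policy-rc.d"
)
```

The file should be removed again before the image is finalized, otherwise services would also be denied on the booted system.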

Revision history for this message
Paul Mars (upils) wrote :

A PR https://github.com/canonical/ubuntu-image/pull/183 is in progress now. The policy-rc.d part should now be ready and I am working on detecting and umounting "new" mounts.

Revision history for this message
Paul Mars (upils) wrote :

The build is failing the same way with the policy-rc.d file :/.

I will keep this change because I suppose it is still beneficial and will prevent future bugs, but either the package responsible is not respecting the invoke-rc.d spec, or the culprit is not a "daemon/service" but something executed in the background as a side effect of a package installation.

Revision history for this message
Steve Langasek (vorlon) wrote :

> These 3 processes are children of "/lib/systemd/systemd --user".

Well, that would definitely not be a consequence of missing policy-rc.d since the issue here is that there's somehow a user session inside the chroot rather than there being system services running.

> message+ 618909 0.0 0.0 800284 24684 ? Sl 13:07 0:00 /usr/libexec/qemu-binfmt/aarch64-binfmt-P /usr/libexec/gvfsd /usr/libexec/gvfsd

The username is truncated here, but this matches the earlier ps output you had showing 'messagebus'. Which is also the only matching username on my system (which is not edubuntu).

This is a system user. I don't know why there would be a user session for this user.

Revision history for this message
Paul Mars (upils) wrote :

While digging in mmdebstrap code, I discovered that not only policy-rc.d is set to prevent daemon/services starting but also:

- /sbin/start-stop-daemon
- /sbin/initctl (found in debootstrap code)

Even though initctl is supposed to be specific to Upstart, I found several references to it in the resulting chroot (and to start-stop-daemon too). So some packages seem to still rely on them to start some daemon/services.

I will handle them like policy-rc.d and see if it makes any difference.
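A sketch of what "handling them like policy-rc.d" could mean: temporarily replace each binary with a stub that does nothing and succeeds, then put the original back once the packages are installed. The function names and the ".REAL" suffix are this sketch's own conventions, not taken from the ubuntu-image change.

```shell
# Replace a binary with a do-nothing stub, keeping the original as
# "<path>.REAL". Nothing happens if the binary does not exist.
stub_out() (
    path=$1
    if [ -e "$path" ]; then
        mv "$path" "$path.REAL"
        printf '%s\n' '#!/bin/sh' \
            "echo 'Warning: fake ${path##*/} called, doing nothing' >&2" \
            'exit 0' > "$path"
        chmod 755 "$path"
    fi
)

# Put the original binary back, if a stub was installed.
restore_stub() (
    path=$1
    if [ -e "$path.REAL" ]; then
        mv "$path.REAL" "$path"
    fi
)
```

It would be called as "stub_out $chroot/sbin/start-stop-daemon" and "stub_out $chroot/sbin/initctl" before installing packages, with the matching restore_stub calls afterwards.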

Revision history for this message
Paul Mars (upils) wrote :

Again, no luck, and again I will keep this to make the build more robust.

Looking at the log I see dbus-x11 and other dbus related packages are installed during the build.

I am investigating the dbus init and postinst scripts to understand whether we are ending up in a corner case where the daemon is starting. So far some cases use start-stop-daemon, so they should be correctly handled now, but some do custom things (like calling /usr/bin/dbus-launch).

I am also investigating mmdebstrap/debootstrap/livecd-rootfs code to understand why they are not affected.

Revision history for this message
Paul Mars (upils) wrote :

I did not find anything specific to dbus in mmdebstrap/debootstrap/livecd-rootfs.

I also checked that dbus is in the minimal seed, so it is already installed during other builds. Thus this is not the installation of dbus itself that is problematic.

I am now trying to build the edubuntu image after removing some extra-packages or seed to try pinpointing the problematic package.

I also noticed in the build log the following:

Setting up plymouth-theme-edubuntu (24.04.4) ...
update-alternatives: using /usr/share/plymouth/themes/edubuntu-logo/edubuntu-logo.plymouth to provide /usr/share/plymouth/themes/default.plymouth (default.plymouth) in auto mode
update-alternatives: using /usr/share/plymouth/themes/edubuntu-text/edubuntu-text.plymouth to provide /usr/share/plymouth/themes/text.plymouth (text.plymouth) in auto mode
update-initramfs: deferring update (trigger activated)
Authorization required, but no authorization protocol specified

(process:950787): dconf-CRITICAL **: 14:33:12.491: unable to create directory '/var/lib/gdm3/.cache/dconf': Permission denied. dconf will not work properly.

(process:950787): dconf-CRITICAL **: 14:33:12.501: unable to create directory '/var/lib/gdm3/.cache/dconf': Permission denied. dconf will not work properly.

(process:950787): dconf-CRITICAL **: 14:33:12.501: unable to create directory '/var/lib/gdm3/.cache/dconf': Permission denied. dconf will not work properly.

(process:950787): dconf-CRITICAL **: 14:33:12.509: unable to create directory '/var/lib/gdm3/.cache/dconf': Permission denied. dconf will not work properly.

I compared with the ubuntu-desktop-raspi-arm64 build (the closest one configuration-wise) and no such errors are visible there.

Looking at plymouth-theme-edubuntu.postinst from the edubuntu-artwork source package, I see:

if id -u gdm >/dev/null 2>&1; then
  sudo -u gdm dbus-launch gsettings set \
      org.gnome.login-screen logo '/usr/share/plymouth/edubuntu-logo.png' || true
fi

Some things are starting to make sense:

- the "Authorization required, but no authorization protocol specified" error is probably caused by the gdm user not having access to the X server (in fact I do not really know whether an X server is visible in the chroot). It may be solved with "xhost si:localuser:gdm"
- gsettings is starting dconf, and since the home of gdm is "/var/lib/gdm3", dconf tries to create .cache/dconf in it. But at this point this directory is owned by root. I checked on my machine and this directory is supposed to be owned by gdm.

Looking at the gdm3 package, creating and setting the right permissions on this directory is handled by debian/generate-config, which is supposed to be called by systemd when the service is starting. See gdm.service.in:

[Service]
ExecStartPre=/usr/share/gdm/generate-config
ExecStart=${sbindir}/gdm3
KillMode=mixed
Restart=always
RestartSec=1s
IgnoreSIGPIPE=no
BusName=org.gnome.DisplayManager
EnvironmentFile=-${LANG_CONFIG_FILE}
ExecReload=/usr/share/gdm/generate-config
ExecReload=/bin/kill -SIGHUP $MAINPID
KeyringMode=shared

But we now explicitly prevent services from starting, so this cannot be executed. Moreover, looking at the logs provided by Erich at the beginning...


Revision history for this message
Erich Eickmeyer (eeickmeyer) wrote :

> - For edubuntu-artwork, in plymouth-theme-edubuntu.postinst, do not rely on the dbus + gsettings way to set the logo. I do not know what would be a better solution though.

Right now it fails gracefully for the explicit reason that it fails during Raspi builds (notice the `|| true`). For that reason, I created edubuntu-raspi-firstboot to run that same code the first time it boots and then disable itself from running ever again.

That said, I doubt that's the holdup here as it basically returns an error code > 1, and it certainly wouldn't cause something to be mounted and fail to unmount.

Paul Mars (upils)
tags: removed: foundations-todo
Changed in ubuntu-image:
assignee: Paul Mars (upils) → nobody
status: In Progress → Triaged