netbooting the bionic live CD over NFS goes straight to maintenance mode

Bug #1755863 reported by Eric Desrochers on 2018-03-14
This bug affects 42 people
Affects: systemd (Ubuntu), with status tracked in Disco

Series   Status        Importance   Assigned to
Xenial   In Progress   Medium       Victor Tapia
Bionic   In Progress   Medium       Victor Tapia
Cosmic   In Progress   Medium       Victor Tapia
Disco    In Progress   Medium       Victor Tapia

Bug Description

[Impact]

Manually mounting a network share (NFS) and then masking its mount unit breaks the state handling of other units (and their dependencies).
Casper masks a mounted NFS share, blocking the normal boot process as described in the original description, but the issue comes from systemd.

[Test Case]

- NFS mount point at /media
root@iscsi-bionic:/home/ubuntu# mount | grep media
10.230.56.72:/tmp/mnt on /media type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.122.127,local_lock=none,addr=10.230.56.72)

- Test mount point (/test) defined in /etc/fstab:
root@iscsi-bionic:/home/ubuntu# cat /etc/fstab |grep test
tmpfs /test tmpfs nosuid,nodev 0 0
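
For reference, the /media share in this setup can be recreated with something like the following (server address and export path taken from the mount output above):

# recreate the NFS setup used by this test case
mount -t nfs4 10.230.56.72:/tmp/mnt /media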

1. If media.mount is not masked, everything works fine:

root@iscsi-bionic:/home/ubuntu# mount | grep test
root@iscsi-bionic:/home/ubuntu# systemctl status media.mount | grep Active
   Active: active (mounted) since Thu 2018-11-15 16:03:59 UTC; 3 weeks 6 days ago
root@iscsi-bionic:/home/ubuntu# systemctl status test.mount | grep Active
   Active: inactive (dead) since Thu 2018-12-13 10:33:52 UTC; 4min 11s ago
root@iscsi-bionic:/home/ubuntu# systemctl start test.mount
root@iscsi-bionic:/home/ubuntu# systemctl status test.mount | grep Active
   Active: active (mounted) since Thu 2018-12-13 10:38:13 UTC; 3s ago
root@iscsi-bionic:/home/ubuntu# mount | grep test
tmpfs on /test type tmpfs (rw,nosuid,nodev,relatime)
root@iscsi-bionic:/home/ubuntu# systemctl stop test.mount
root@iscsi-bionic:/home/ubuntu# systemctl status test.mount | grep Active
   Active: inactive (dead) since Thu 2018-12-13 10:38:32 UTC; 3s ago
root@iscsi-bionic:/home/ubuntu# mount | grep test

2. If media.mount is masked, other mounts fail:

root@iscsi-bionic:/home/ubuntu# systemctl mask media.mount
Created symlink /etc/systemd/system/media.mount → /dev/null.
root@iscsi-bionic:/home/ubuntu# systemctl start test.mount
Job for test.mount failed.
See "systemctl status test.mount" and "journalctl -xe" for details.
root@iscsi-bionic:/home/ubuntu# systemctl status test.mount | grep Active
   Active: failed (Result: protocol) since Thu 2018-12-13 10:40:06 UTC; 10s ago
root@iscsi-bionic:/home/ubuntu# mount | grep test
tmpfs on /test type tmpfs (rw,nosuid,nodev,relatime)
root@iscsi-bionic:/home/ubuntu# systemctl stop test.mount
root@iscsi-bionic:/home/ubuntu# systemctl status test.mount | grep Active
   Active: failed (Result: protocol) since Thu 2018-12-13 10:40:06 UTC; 25s ago
root@iscsi-bionic:/home/ubuntu# mount | grep test
tmpfs on /test type tmpfs (rw,nosuid,nodev,relatime)
root@iscsi-bionic:/home/ubuntu# systemctl start test.mount
Job for test.mount failed.
See "systemctl status test.mount" and "journalctl -xe" for details.
root@iscsi-bionic:/home/ubuntu# mount | grep test
tmpfs on /test type tmpfs (rw,nosuid,nodev,relatime)
tmpfs on /test type tmpfs (rw,nosuid,nodev,relatime)
root@iscsi-bionic:/home/ubuntu# systemctl stop test.mount
root@iscsi-bionic:/home/ubuntu# mount | grep test
tmpfs on /test type tmpfs (rw,nosuid,nodev,relatime)
tmpfs on /test type tmpfs (rw,nosuid,nodev,relatime)

[Regression potential]

Minimal. Originally, one failing mount point blocked the processing of the rest because of how the return codes were handled for every line in /proc/self/mountinfo. This patch removes that "dependency" and keeps each failure local to the affected mount point, allowing the rest to be processed normally.
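
To illustrate the idea, here is a minimal, self-contained C sketch of the pattern only (not the upstream diff; setup_unit() and the entry names are illustrative stand-ins, not systemd's real API):

/* keep_failure_local.c - sketch of the error-propagation pattern fixed
 * by the upstream commit referenced below in [Other Info]. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for mount_setup_unit(): the masked NFS mount (/cdrom) fails. */
static int setup_unit(const char *where) {
        if (strcmp(where, "/cdrom") == 0)
                return -EBUSY;
        printf("state updated for %s\n", where);
        return 0;
}

int main(void) {
        const char *mountinfo[] = { "/dev/mqueue", "/cdrom", "/tmp" };

        for (size_t i = 0; i < sizeof(mountinfo) / sizeof(*mountinfo); i++) {
                int k = setup_unit(mountinfo[i]);
                if (k < 0)
                        /* Old behaviour: the error was propagated up and the
                         * caller aborted the whole pass, so /tmp never got
                         * its state updated ("Mount process finished, but
                         * there is no mount" -> Result: protocol).
                         * Fixed behaviour: log and continue, keeping the
                         * failure local to /cdrom. */
                        fprintf(stderr, "failed to set up %s: %s\n",
                                mountinfo[i], strerror(-k));
        }
        return 0;
}

Before the fix, the propagated error made the caller give up on the state-update pass; with the error kept local, the remaining entries are still processed.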

[Other Info]

Upstream bug: https://github.com/systemd/systemd/issues/10874
Fixed upstream with commit: https://github.com/systemd/systemd/commit/c165888426ef99440418592a8cdbaff4b7c319b3

[Original Description]

netbooting the bionic live CD[1] over NFS goes straight to maintenance mode:

[1] http://cdimage.ubuntu.com/daily-live/current/

# casper.log
Begin: Adding live session user... ... dbus-daemon[568]: [session uid=999 pid=568] Activating service name='org.gtk.vfs.Daemon' requested by ':1.0' (uid=999 pid=569 comm="" label="unconfined")
dbus-daemon[568]: [session uid=999 pid=568] Successfully activated service 'org.gtk.vfs.Daemon'
dbus-daemon[568]: [session uid=999 pid=568] Activating service name='org.gtk.vfs.Metadata' requested by ':1.0' (uid=999 pid=569 comm="" label="unconfined")
fuse: device not found, try 'modprobe fuse' first
dbus-daemon[568]: [session uid=999 pid=568] Successfully activated service 'org.gtk.vfs.Metadata'

(gvfsd-metadata:580): GUdev-CRITICAL **: 16:28:56.270: g_udev_device_has_property: assertion 'G_UDEV_IS_DEVICE (device)' failed

(gvfsd-metadata:580): GUdev-CRITICAL **: 16:28:56.270: g_udev_device_has_property: assertion 'G_UDEV_IS_DEVICE (device)' failed
A connection to the bus can't be made
done.
Begin: Setting up init... ... done.

Eric Desrochers (slashd) wrote :

Attaching log from maintenance mode

summary: - fuse: device not found, try 'modprobe fuse' first
+ netbooting the bionic live CD over NFS goes straight for maintenance
+ mode :
summary: - netbooting the bionic live CD over NFS goes straight for maintenance
- mode :
+ netbooting the bionic live CD over NFS goes straight to maintenance mode
+ :
description: updated
Jean-Baptiste Lallement (jibel) wrote :

From the journal

[ 20.311413] ubuntu systemd[1]: sys-kernel-config.mount: Directory /sys/kernel/config to mount over is not empty, mounting anyway.
[ 20.311594] ubuntu systemd[1]: Mounting Kernel Configuration File System...
[ 20.313502] ubuntu mount[813]: mount: /sys/kernel/config: configfs already mounted on /sys/kernel/config.
[ 20.313793] ubuntu systemd[1]: sys-kernel-config.mount: Mount process exited, code=exited status=32
[ 20.313842] ubuntu systemd[1]: sys-kernel-config.mount: Failed with result 'exit-code'.
[ 20.314013] ubuntu systemd[1]: Failed to mount Kernel Configuration File System.
[ 20.325702] ubuntu systemd[1]: Mounting FUSE Control File System...
[ 20.325777] ubuntu mount[814]: mount: /sys/fs/fuse/connections: fusectl already mounted on /sys/fs/fuse/connections.
[ 20.326038] ubuntu systemd[1]: sys-fs-fuse-connections.mount: Mount process exited, code=exited status=32
[ 20.326089] ubuntu systemd[1]: sys-fs-fuse-connections.mount: Failed with result 'exit-code'.
[ 20.326248] ubuntu systemd[1]: Failed to mount FUSE Control File System.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in casper (Ubuntu):
status: New → Confirmed
Eric Desrochers (slashd) on 2018-03-20
Changed in casper (Ubuntu):
importance: Undecided → High
Eric Desrochers (slashd) wrote :

The sequence of failure seems to be the following:

-- Unit dev-mqueue.mount has failed.
-- Unit sys-kernel-debug.mount has failed.
-- Unit dev-hugepages.mount has failed.
-- Unit sys-kernel-config.mount has failed.
-- Unit sys-fs-fuse-connections.mount has failed.
-- Unit tmp.mount has failed.
-- Unit local-fs.target has failed.
-- Unit dns-clean.service has failed.
-- Unit systemd-resolved.service has failed.
-- Unit systemd-timesyncd.service has failed.
-- Unit sys-kernel-config.mount has failed.
-- Unit sys-fs-fuse-connections.mount has failed.

# journalctl -xb:
-- The limits controlling how much disk space is used by the journal may
-- be configured with SystemMaxUse=, SystemKeepFree=, SystemMaxFileSize=,
-- RuntimeMaxUse=, RuntimeKeepFree=, RuntimeMaxFileSize= settings in
-- /etc/systemd/journald.conf. See journald.conf(5) for details.
Apr 03 12:59:15 ubuntu systemd-modules-load[758]: Inserted module 'lp'
Apr 03 12:59:15 ubuntu systemd[1]: Failed to set up mount unit: Device or resource busy
Apr 03 12:59:15 ubuntu systemd[1]: dev-mqueue.mount: Mount process finished, but there is no mount.
Apr 03 12:59:15 ubuntu systemd[1]: dev-mqueue.mount: Failed with result 'protocol'.
Apr 03 12:59:15 ubuntu systemd[1]: Failed to mount POSIX Message Queue File System.
-- Subject: Unit dev-mqueue.mount has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit dev-mqueue.mount has failed.

Eric Desrochers (slashd) wrote :

It seems this recipe is enough to start GNOME (as a potential workaround until this gets fixed):

# systemctl mask tmp.mount
# ctrl-d
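
To undo the workaround later, the mask can be removed the same way:

# systemctl unmask tmp.mount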

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in systemd (Ubuntu):
status: New → Confirmed
Changed in casper (Ubuntu):
status: Confirmed → Fix Released
Changed in systemd (Ubuntu):
status: Confirmed → Fix Released

Does the status "Fix Released" from 2018-04-27 mean that the fix is already included in the Ubuntu 18.04 release, or will it first be included in 18.04.1?

Stephen Early (steve-greenend) wrote :

I've just checked: a NFS boot image built from the currently released Ubuntu 18.04 still has this bug. I can't see any relevant commits to casper or systemd, either.

Simone Scisciani (scisciani) wrote :

Sorry, by mistake I set the status to "Fix Released" and I cannot put it back to "Confirmed". The bug has not been fixed yet.

Eric Desrochers (slashd) wrote :

I set both back to 'Confirmed'.

Changed in casper (Ubuntu):
status: Fix Released → Confirmed
Changed in systemd (Ubuntu):
status: Fix Released → Confirmed
Woodrow Shen (woodrow-shen) wrote :

I can confirm that ubiquity can finish the installation after appending the systemd mask options "systemd.mask=dev-hugepages.mount systemd.mask=dev-mqueue.mount systemd.mask=sys-fs-fuse-connections.mount systemd.mask=sys-kernel-config.mount systemd.mask=sys-kernel-debug.mount systemd.mask=tmp.mount" to the kernel command line, and that a normal GNOME session starts for the user after reboot.

richud (richud.com) wrote :

Can confirm Woodrow Shen's workaround works for me too, (also tested ok with kubuntu and ubuntu-mate 18.04 automated deploys).

Woodrow Shen (woodrow-shen) wrote :

I have kept doing experiments (with Dell/HP laptops) and have some conclusions so far:

1. Why the issue happens (not the real root cause)
The fstab-generator automatically adds dependencies of type Before=local-fs.target to all mount units that refer to local mount points, and adds dependencies of type Wants= on local-fs.target for those mounts listed in /etc/fstab that have the auto mount option set[1] (see the example unit at the end of this comment).
Therefore, the emergency shell is triggered by local-fs.target, which depends on the several failing systemd mounts.

2. Two approaches for a workaround

1) append "systemd.mask=dev-hugepages.mount systemd.mask=dev-mqueue.mount systemd.mask=sys-fs-fuse-connections.mount systemd.mask=sys-kernel-config.mount systemd.mask=sys-kernel-debug.mount systemd.mask=tmp.mount" into kernel boot options.

2) Add "toram" to the kernel boot options.
This completely decompresses the filesystem into RAM, which requires 3-4x more RAM and is hence undesirable[2].

3. Trade-off between the workarounds
Until we find a real solution, the better workaround is "toram": it not only speeds up the installation but also avoids an unstable network connection to NFS, despite requiring more RAM.

4. Concerns about the real solution
I think the solution may be more complicated if we really want to fix this; ideally it has to cover both the normal case (e.g. booting from a USB stick) and the NFS-mount case, and satisfy the conditions that avoid the systemd 'dependency' and 'protocol' failures.

[1] https://www.freedesktop.org/software/systemd/man/systemd.special.html
[2] https://wiki.ubuntu.com/BootToRAM
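
For illustration, this is roughly the unit that systemd-fstab-generator writes for casper's "tmpfs /tmp tmpfs nosuid,nodev 0 0" line (sketched from the generator's documented behaviour; exact contents may differ per release):

# /run/systemd/generator/tmp.mount (illustrative)
[Unit]
SourcePath=/etc/fstab
Documentation=man:fstab(5) man:systemd-fstab-generator(8)

[Mount]
What=tmpfs
Where=/tmp
Type=tmpfs
Options=nosuid,nodev

The Wants= hook is a symlink at /run/systemd/generator/local-fs.target.wants/tmp.mount, and Before=local-fs.target is added implicitly for local mount points; when tmp.mount fails, local-fs.target fails with it and its OnFailure= pulls in emergency.target.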

David Coronel (davecore) wrote :

I confirm the "toram" workaround from Woodrow allows me to PXE netboot the most recent Ubuntu 18.04 Desktop amd64 ISO image.

Martin Bogomolni (martinbogo) wrote :

The "toram" workaround does not work for me attempting to boot on a SuperMicro X10DRW motherboard with 128GB of ram installed + SATA SSD. I have also added Woodrow Shen's workaround in the command line.

The difference is that I am attempting to install 18.04 server.

Environment is:

Debian "Wheezy" server runnig dnsmasq to provide DHCP and tftp service
Synced/mirrored Ubuntu 18.04 repository being served via nginx

-----

May 29 18:08:13 ubuntu systemd[1]: dev-mqueue.mount: Mount process finished, but there is no mount.
May 29 18:08:13 ubuntu systemd[1]: dev-mqueue.mount: Failed with result 'protocol'.
May 29 18:08:13 ubuntu systemd[1]: Failed to mount POSIX Message Queue File System.
-- Subject: Unit dev-mqueue.mount has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit dev-mqueue.mount has failed.

Woodrow Shen (woodrow-shen) wrote :

"toram" option only affected desktop with casper.

"toram" works to me and will also fix the hanging at shutdown/reboot.
but "toram" takes more that twice as long to boot into the desktop (~3minutes instead of ~1minute).

Jon K (drkatz) wrote :

Appears to still be affecting 18.04.1

Skye K (skyebuddha) wrote :

I am having a similar issue in 18.04 and 18.04.1. I am trying to netboot and retrieve the install files from NFS or HTTP. I have seen a few others with similar issues: https://askubuntu.com/a/1031024/837404 . When I try to install, it says that I am missing files, but when I check the Apache logs nothing has been requested, and no errors show in the console.

thh (thh01217) wrote :

Based on in-depth analysis, I found the cause of the error:

“Apr 03 12:59:15 ubuntu systemd[1]: Failed to set up mount unit: Device or resource busy”

call tree in systemd's mount.c & unit.c:

mount_dispatch_io
  -> mount_load_proc_self_mountinfo
     -> mount_setup_unit
        -> mount_setup_existing_unit
           -> mount_add_extras
              -> unit_set_default_slice
                 -> unit_set_slice:
                        if (unit_active_state(u) != UNIT_INACTIVE)
                                return -EBUSY;

unit_set_slice always returns -EBUSY, because when netbooting the nfsroot is always in the active state.

mount_dispatch_io gives up updating the mount states when mount_load_proc_self_mountinfo returns the error.

Finally, all systemd mount units fail and the system drops to the emergency shell.

This bug is still affecting me. Yesterday I downloaded ISO images of Ubuntu, Kubuntu and Xubuntu, and none of them can boot from the PXE server; they all drop me into emergency mode.

Lukas (lukas-wringer) wrote :

Is this bug lost, or just not assigned to anyone? It is still broken; of course there are workarounds, but no one knows their exact consequences.

Lukas (lukas-wringer) on 2018-09-22
no longer affects: ubiquity (Ubuntu)
Eric Desrochers (slashd) wrote :

I started looking at this problem from scratch, since it's been a while since I reported it...

It seems to go into emergency mode due to a failed attempt to start the unit "tmp.mount":

# /var/log/boot.log
emergency.target: Enqueued job emergency.target/start as 159
tmp.mount: Unit entered failed state.

Adding "systemd.mask=tmp.mount" in /tftpboot/pxelinux.cfg/default as a parameter did the trick to workaround the behaviour :

APPEND initrd=bionic-desktop-amd64/initrd root=/dev/nfs boot=casper netboot=nfs nfsroot=192.168.100.2:/bionic-desktop-amd64 splash systemd.mask=tmp.mount systemd.debug-shell=1 systemd.log_level=debug systemd.log_target=console console=ttyS0,38400 console=tty1 --

Note:
- Out of curiosity I also tested with Artful/17.10 (systemd 234) and it works, so it's possibly something between v234 and v237 that introduced this behaviour for tmp.mount, e.g. a change in mount handling, ...
- The problem is also reproducible in Cosmic, and in my testing journalctl was a little more verbose in Cosmic than it was in Bionic:

$ journalctl -a -u tmp.mount
-- Logs begin at Wed 2018-10-10 20:15:36 UTC, end at Wed 2018-10-10 20:15:43 UTC. --
Oct 10 20:15:36 ubuntu systemd[1]: tmp.mount: Directory /tmp to mount over is not empty, mounting anyway.
Oct 10 20:15:36 ubuntu systemd[1]: Mounting /tmp...
Oct 10 20:15:36 ubuntu systemd[1]: tmp.mount: Mount process finished, but there is no mount.
Oct 10 20:15:36 ubuntu systemd[1]: tmp.mount: Failed with result 'protocol'.
Oct 10 20:15:36 ubuntu systemd[1]: Failed to mount /tmp.

# src/core/mount.c
static void mount_enter_dead(Mount *m, MountResult f) {
        assert(m);

        if (m->result == MOUNT_SUCCESS)
                m->result = f;

        if (m->result != MOUNT_SUCCESS)
                log_unit_warning(UNIT(m), "Failed with result '%s'.", mount_result_to_string(m->result));
...
        switch (m->state) {

        case MOUNT_MOUNTING:
                /* Our mount point has not appeared in mountinfo. Something went wrong. */

                if (f == MOUNT_SUCCESS) {
                        /* Either /bin/mount has an unexpected definition of success,
                         * or someone raced us and we lost. */
                        log_unit_warning(UNIT(m), "Mount process finished, but there is no mount.");
                        f = MOUNT_FAILURE_PROTOCOL;
                }

and m->result is indeed equal to "MOUNT_FAILURE_PROTOCOL" ^

        [MOUNT_FAILURE_PROTOCOL] = "protocol",

I'll try to instrument things and create a custom ISO for further debugging/testing. This is where I am at the moment.

- Eric

Eric Desrochers (slashd) wrote :

So far I highly suspect this commit[1] to be the offending one; it would "fit" with the systemd version in Ubuntu bionic[2] versus the upstream introduction of the change:

$ git describe --contains 006aabbd05
v237~47^2~2

[1] 006aabbd0 mount: mountinfo event is supposed to always arrive before SIGCHLD

[2] rmadison
 systemd | 237-3ubuntu10 | bionic | source, amd64, arm64, armhf, i386, ppc64el, s390x
 systemd | 237-3ubuntu10.3 | bionic-updates | source, amd64, arm64, armhf, i386, ppc64el, s390x

Eric Desrochers (slashd) wrote :

With systemd on Xenial & Artful, there is no separate handling for the MOUNT_MOUNTING case, which reinforces why it works on these releases:

# src/core/mount.c
case MOUNT_MOUNTING:
case MOUNT_MOUNTING_DONE:
case MOUNT_MOUNTING_SIGKILL:
case MOUNT_MOUNTING_SIGTERM:

        if (f == MOUNT_SUCCESS)
                mount_enter_mounted(m, f);
        else if (m->from_proc_self_mountinfo)
                mount_enter_mounted(m, f);
        else
                mount_enter_dead(m, f);
        break;

So most likely it falls into the MOUNT_MOUNTING_SIGTERM case and then breaks out of the switch statement.

# src/basic/unit-def.h
MOUNT_MOUNTING, /* /usr/bin/mount is running, but the mount is not done yet. */

Eric Desrochers (slashd) wrote :

Additionally,

tmp.mount unit configuration :
https://github.com/systemd/systemd/blob/master/units/tmp.mount

# tmp.mount
--
..
ConditionPathIsSymbolicLink=!/tmp
..
--

ConditionPathIsSymbolicLink = verifies whether a certain path exists and is a symbolic link.
When there is an exclamation mark ("!"), the check is negated.

With "ConditionPathIsSymbolicLink=!/tmp" the unit is making sure /tmp is not a symbolic link; if /tmp exists and is a symbolic link, the condition fails and creates the situation we see here.

Eric Desrochers (slashd) wrote :

The "existing" /tmp come from casper code :

bionic/casper-1.394/scripts/casper-bottom/12fstab
...
cat > $FSTAB <<EOF
${UNIONFS} / ${UNIONFS} rw 0 0
tmpfs /tmp tmpfs nosuid,nodev 0 0
EOF
...

Eric Desrochers (slashd) wrote :

By reading this article: https://www.freedesktop.org/wiki/Software/systemd/APIFileSystems/

I'm really starting to think the easiest way to fix it is as described here:

...
/tmp as location for volatile, temporary userspace file system objects (X)
...
It is possible to disable the automatic mounting of some (but not all) of these file systems, if that is required. These are marked with (X) in the list above. You may disable them simply by masking them:
systemctl mask dev-hugepages.mount
...

I have tested by masking "tmp.mount", but the official documentation recommends "dev-hugepages.mount" instead.

PXE configuration line:
"
APPEND initrd=bionic-desktop-amd64/initrd root=/dev/nfs boot=casper netboot=nfs nfsroot=192.168.100.2:/bionic-desktop-amd64 splash systemd.mask=tmp.mount systemd.debug-shell=1 systemd.log_level=debug systemd.log_target=console console=ttyS0,38400 console=tty1 --
"

I'm starting to think that this may become the final solution to the new systemd behaviour, as indicated in the official documentation above, and not only a workaround.

Thoughts?

Eric Desrochers (slashd) wrote :

Could someone impacted test masking dev-hugepages.mount in their PXE configuration and let me know how it works?

systemd.mask=dev-hugepages.mount

Eric Desrochers (slashd) on 2018-10-11
tags: added: sts
Eric Desrochers (slashd) wrote :

According to the documentation, the recommendation is to mask dev-hugepages.mount, but in my current test doing so prevented the tmp.mount failure without preventing the drop into "Emergency mode". In my lab, masking tmp.mount gave better results.

Feel free to try both and let me know the outcome; results may vary from one setup to another.
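
For anyone comparing the two, these are the corresponding kernel command line additions (append them to your own PXE entry):

systemd.mask=tmp.mount
systemd.mask=dev-hugepages.mount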

- Eric

Marcel Partap (empee584) wrote :

Mmh, #32 is not enough... I had to convert /tmp to an overlay mount, because regardless of how early this is done, it seems there will already be some important file (custom_mounts.list, see /var/log/live/boot) on it.

Brian Nelson (bhnelson) wrote :

Eric,

The 'recommendation' for masking dev-hugepages you cite from that wiki page is clearly just an example of how you could disable one of the various mounts described there. I don't think it's a recommendation to fix anything in particular.

FWIW: Masking dev-hugepages doesn't seem to help much for me. Masking tmp seems to let the system boot up, but still the other mount services fail and systemd status is red 'degraded.'

I've ended up masking all affected mounts (per comments 12 and 14), with the addition of masking run-rpc_pipefs.mount too. This lets systemd boot up to the green 'running' state.

I'm still having problems logging into Gnome with a user with NFS home. I'm not sure if that's related to this issue or something else though. Still looking at that.

I think you're on the right track in comment 27. I get the feeling that somewhere along the line a result of 'this is already mounted' changed from a success to a failure in systemd, possibly due to the change in mount.c you pasted.

@Eric,
I just tested what you suggested in your comment #31 with the Ubuntu 18.10 release...
    KERNEL http://pxe-server/nfs/ubuntu-x64/casper/vmlinuz
    INITRD http://pxe-server/nfs/ubuntu-x64/casper/initrd
    APPEND nfsroot=192.168.1.1:/srv/nfs/ubuntu-x64 ro netboot=nfs file=/cdrom/preseed/ubuntu.seed boot=casper systemd.mask=dev-hugepages.mount -- debian-installer/language=de console-setup/layoutcode=de keyboard-configuration/layoutcode=de keyboard-configuration/variant=German

... but then I again ran straight into the emergency console.
(Not sure if that information is still helpful.)

At the moment "systemd.mask=tmp.mount" is still the best solution,
but it comes with these issues:
- a few red "FAILED" messages at boot time;
- at reboot/poweroff it often runs into "stop jobs" that run endlessly while shutting down;
- or it hangs forever at "Starting Shuts down the 'live' preinstalled system cleanly...";

The second best solution is "toram"; with it I never observed any "FAILED" message and never had the reboot/poweroff issues (maybe there are race conditions; with "toram" nothing has to be loaded from NFS over the network).
But "toram" takes a lot of time, because the whole squashfs image has to be loaded into RAM before it can be mounted.

BTW: Lubuntu 18.10 shows the same behavior as Ubuntu 18.10, but it does not show the hanging at reboot/poweroff or the endlessly running stop jobs.

Brian Nelson (bhnelson) wrote :

So I've found a complete work-around for this. I also found that this issue is NOT new in 18.04 as it also affects 16.x (and likely 15 and 17 too). However it is DIFFERENT in 18.04. More details below.

TL;DR:
You need to netboot with an initramfs that doesn't have 'scripts/casper-bottom/25disable_cdrom.mount' in it. This script masks the dynamically-generated cdrom.mount systemd unit (where the NFS mount goes). That causes all the issues described in this bug.

From whatever machine where netboot initramfs is created:

# Disable/block the problem script
mkdir -p /etc/initramfs-tools/scripts/casper-bottom
touch /etc/initramfs-tools/scripts/casper-bottom/25disable_cdrom.mount

# rebuild initramfs
update-initramfs -u

# Move/copy the new file to the netboot server
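
A way to verify the rebuilt image before copying it (the initrd path here is illustrative; adjust it to your kernel version). Because files under /etc/initramfs-tools/scripts override the same-named casper-provided scripts, the zero-byte stub should show up with size 0:

lsinitramfs -l /boot/initrd.img-$(uname -r) | grep 25disable_cdrom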

The issue here is that systemd isn't able to update its mount status properly. In the case of 18.04, all of the 'failed' mounts are actually successfully mounted. This includes /tmp. BUT systemd doesn't recognize that fact and marks them all as red/failed.

In 16.04 this issue is a bit different. When booting, all of the same mounts are again mounted successfully AND systemd shows them all as green/active. BUT if you try to stop/unmount any of them you will see a similar situation. The unmount will actually succeed, but systemd will report an unmount failure and continue to show the unit as green/active.

Per the call trace thh noted in comment #21:
From what I can tell, mount_load_proc_self_mountinfo iterates through every active mount on the system (some perhaps more than once). When it gets to the nfs-mount on /cdrom, it does fail in unit_set_slice and generate the "Failed to set up mount unit: Device or resource busy" error. For whatever reason, that failure seems to completely bork systemd's ability to update its mount status. Thus mounts get 'stuck' either mounted or not from systemd's perspective.

The failure seems to be caused by the fact that the cdrom.mount unit (NFS mount) is masked. Once it's unmasked the failure doesn't occur and all mounts work as expected. You can actually observe this from within a 'broken' boot at the emergency prompt:
rm /lib/systemd/system/cdrom.mount
systemctl daemon-reload
umount /tmp (ensure it's gone, there may be multiple mounts)
systemctl reset-failed tmp.mount
systemctl start tmp.mount
...and it will succeed.

I did verify this by actually booting from a 'real' DVD, and the problem doesn't happen there. It's something specific to having the image mounted over NFS and masking its unit.

For reference, the disable_cdrom.mount script was the solution for this bug
https://bugs.launchpad.net/ubuntu/+source/casper/+bug/1436715

Robert Giles (rgiles) wrote :

Brian,

Thanks for sleuthing out a fix for this; I wanted to add that this also seems to work for netbooting 18.10.

This issue also affected Linux Mint 19.2, and Brian's workaround and fix in #36 work for that distribution as well.

I used the workaround steps to resume the broken boot, then followed the fix steps and rebuilt the initramfs using:

# rebuild initramfs
/usr/sbin/update-initramfs.distrib -u

then copied the resulting .img to my netboot server.

Victor Tapia (vtapia) on 2018-12-13
description: updated
Victor Tapia (vtapia) on 2018-12-13
no longer affects: casper (Ubuntu)

hello Victor (vtapia),
Does it mean the next coming Ubuntu live release, 18.04.2, should PXE boot without any tweak?
Is it already in the "Ubuntu 19.04 (Disco Dingo)" daily build?

Victor Tapia (vtapia) wrote :

Hi,

I'm working on the patches for Disco, Cosmic, Bionic and Xenial. It's not in Disco yet, but the fix process will be tracked in this bug.

The attachment "disco-stop-mount-error-propagation.debdiff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
Dan Streetman (ddstreet) on 2018-12-18
Changed in systemd (Ubuntu Disco):
assignee: nobody → Victor Tapia (vtapia)
Changed in systemd (Ubuntu Cosmic):
assignee: nobody → Victor Tapia (vtapia)
Changed in systemd (Ubuntu Bionic):
assignee: nobody → Victor Tapia (vtapia)
Changed in systemd (Ubuntu Xenial):
assignee: nobody → Victor Tapia (vtapia)
Changed in systemd (Ubuntu Disco):
importance: Undecided → Medium
Changed in systemd (Ubuntu Cosmic):
importance: Undecided → Medium
Changed in systemd (Ubuntu Xenial):
importance: Undecided → Medium
Changed in systemd (Ubuntu Disco):
status: Confirmed → In Progress
Changed in systemd (Ubuntu Bionic):
importance: Undecided → Medium
status: New → In Progress
Changed in systemd (Ubuntu Cosmic):
status: New → In Progress
Changed in systemd (Ubuntu Xenial):
status: New → In Progress