zfs-import-cache.service slows boot by 60 seconds

Bug #1571761 reported by Scott Moser on 2016-04-18
This bug affects 4 people

Affects                      Importance  Assigned to
cloud-init (Ubuntu)          High        Unassigned
cloud-init (Ubuntu Xenial)   High        Scott Moser
zfs-linux (Ubuntu)           High        Unassigned
zfs-linux (Ubuntu Xenial)    High        Unassigned

Bug Description

Fresh uvt-kvm guest, then
$ sudo apt-get install zfsutils-linux
$ sudo reboot

The reboot will show "waiting for tasks" on the console, then after ~60 seconds it continues booting.
Logging in shows:

$ sudo systemd-analyze critical-chain zfs-mount.service
The time after the unit is active or started is printed after the "@" character.
The time the unit takes to start is printed after the "+" character.

zfs-mount.service +81ms
└─zfs-import-cache.service @1min 786ms +272ms
  └─systemd-udev-settle.service @494ms +1min 275ms
    └─systemd-udev-trigger.service @415ms +55ms
      └─systemd-udevd-control.socket @260ms
        └─-.mount @124ms
          └─system.slice @129ms
            └─-.slice @124ms

Possibly related, or at least some relevant discussion, at https://github.com/zfsonlinux/zfs/issues/2368

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: zfsutils-linux 0.6.5.6-0ubuntu8
ProcVersionSignature: User Name 4.4.0-18.34-generic 4.4.6
Uname: Linux 4.4.0-18-generic x86_64
NonfreeKernelModules: zfs zunicode zcommon znvpair zavl
ApportVersion: 2.20.1-0ubuntu1
Architecture: amd64
Date: Mon Apr 18 16:42:35 2016
ProcEnviron:
 TERM=screen.xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: zfs-linux
UpgradeStatus: No upgrade log present (probably fresh install)
modified.conffile..etc.sudoers.d.zfs: [deleted]

Scott Moser (smoser) on 2016-04-18
tags: added: systemd-boot
Scott Moser (smoser) wrote :

So:

a.) zfs-import-cache.service is 'After systemd-udev-settle.service'
b.) cloud-init installs /lib/udev/rules.d/79-cloud-init-net-wait.rules which uses IMPORT of /lib/udev/cloud-init-wait to block network device add events until cloud-init-local has written networking configuration

I'm pretty sure, then, that what is happening is:
1.) a network device gets added and cloud-init-wait starts its 60-second wait for cloud-init-local to finish.
2.) systemd-udev-settle runs, which waits until all queued events are processed.
3.) deadlock until the timeout from (1) expires.
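For reference, the blocking mechanism looks roughly like this (paraphrased from the description above; the actual shipped 79-cloud-init-net-wait.rules may differ in detail):

```
# Sketch only. IMPORT{program} runs the helper synchronously, so this
# "add" event stays in the udev queue until cloud-init-wait exits --
# and systemd-udev-settle waits for that queue to drain.
SUBSYSTEM=="net", ACTION=="add", IMPORT{program}="/lib/udev/cloud-init-wait"
```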

Scott Moser (smoser) on 2016-04-19
Changed in cloud-init (Ubuntu):
status: New → Confirmed
importance: Undecided → High
Martin Pitt (pitti) wrote :

There are a few angles of attack here:

└─zfs-import-cache.service @1min 786ms +272ms
  └─systemd-udev-settle.service @494ms +1min 275ms

This is a case of "you asked for it, you got it", really. udev-settle is a workaround for old software that wasn't written with hotplugging in mind. It's 2016, and we really should avoid introducing new dependencies on udev-settle, as it's conceptually broken and can't work. It simply does not work for any hotplugged device, for a device which the kernel just detects late, or for one which first needs to be assembled into an LVM or cryptsetup device, etc.

I think it would be much more elegant to instead create udev rules (or systemd units) which load the ZFS cache as soon as a ZFS block device is detected. Would that be possible/reasonable? I don't know anything about what import-cache does, but I hope/suppose it has some way of doing this on a per-device basis?

    └─systemd-udev-trigger.service @415ms +55ms

This is cloud-init's set of rules for blocking udev event processing until network devices get detected and cloud-init's naming rules get applied. This is a hairy topic, and Scott and I have already discussed it at length. I think there's a better solution here, but I doubt we'll completely turn this around by the release. This just illustrates one of the problems of the "wait for all hardware just to single out a particular device" approach from above: it waits on stuff which is completely unrelated to ZFS.

no longer affects: systemd (Ubuntu)
Richard Laager (rlaager) wrote :

@pitti: zfs-import-cache.service doesn't "load the ZFS cache". It imports zpools which are listed in the /etc/zfs/zpool.cache file. It is conditioned (ConditionPathExists) on the existence of /etc/zfs/zpool.cache. It seems to me that upstream ZoL is tending to deprecate the zpool.cache file.

In contrast is zfs-import-scan.service, which imports pools by device scanning. This is conditioned on /etc/zfs/zpool.cache NOT existing. So one or the other service runs, depending on whether you have a zpool.cache file.
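The mutually exclusive conditions look roughly like this (paraphrased; not the exact shipped units):

```
# zfs-import-cache.service -- runs only when a cache file exists
[Unit]
ConditionPathExists=/etc/zfs/zpool.cache

# zfs-import-scan.service -- runs only when it does not
[Unit]
ConditionPathExists=!/etc/zfs/zpool.cache
```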

Longer-term, I agree that we need some sort of solution that isn't "wait for all devices". Is there any prior art with btrfs or MD that's applicable here? (I know MD devices are assembled by the kernel in some circumstances. I think mdadm is involved in other cases.)

As a side note, zfs-import-scan.service hardcodes /dev/disk/by-id, which is probably not ideal. The reason for that is likely to work-around that `zpool import` is preferring /dev. See also: https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1571241/comments/8

Hello Richard,

thanks for the explanation of what these units do.

Richard Laager [2016-04-19 16:29 -0000]:
> @pitti: zfs-import-cache.service doesn't "load the ZFS cache". It
> imports zpools which are listed in the /etc/zfs/zpool.cache file. It is
> conditioned (ConditionPathExists) on the existence of
> /etc/zfs/zpool.cache. It seems to me that upstream ZoL is tending to
> deprecate the zpool.cache file.
>
> In contrast is zfs-import-scan.service, which imports pools by device
> scanning. This is conditioned on /etc/zfs/zpool.cache NOT existing. So
> one or the other service runs, depending on whether you have a
> zpool.cache file.

I'm curious, how are these actually being used? They can't be relevant
for the root file system, as that already needs to be set up in the
initrd. This unit does not sort itself before local-fs-pre.target
either, i. e. you also can't rely on having these done when stuff in
/etc/fstab is being mounted. The only thing that I see is that
zfs-mount.service.in is Before=local-fs.target, i. e. the zfs-import-*
bits will run in parallel with mounting fstab.

> Longer-term, I agree that we need some sort of solution that isn't "wait
> for all devices"

Indeed. The concept of "wait for all devices" hasn't been well-defined
for at least the past 20 years. Is zpool import idempotent in any way,
i. e. would it be safe (and also performant) to run it whenever a
block device is seen, instead of just once during the boot sequence?
If not, could it be made to accept a device name, and then limit its
actions to just that device? Then it would be straightforward to put
this into a udev rule and become hotpluggable.

Does this mean that ZFS isn't currently supported on hotpluggable
storage? (As there's nothing that would call zpool import on them)

> Is there any prior art with btrfs or MD that's applicable here? (I
> know MD devices are assembled by the kernel in some circumstances. I
> think mdadm is involved in other cases.)

Yes, there's plenty. E. g. mdadm ships
/lib/udev/rules.d/64-md-raid-assembly.rules which adds/removes devices
from a RAID array as they come and go. LVM is a bit more complicated,
its udev rules mostly just tell a daemon (lvmetad) about new devices
which then decides when it has enough pieces (PVs) to bring up a VG.

There's also some really simple stuff like 85-hdparm.rules which just
calls a script for every encountered block device.
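That style of rule is about as simple as it gets — something of this shape (the helper path here is invented for illustration):

```
# In the style of 85-hdparm.rules: run a helper for every block device
# as it appears; no settling or global ordering required.
ACTION=="add", SUBSYSTEM=="block", RUN+="/usr/local/sbin/zpool-member-check %k"
```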

Thanks,
Martin

Richard Laager (rlaager) wrote :

[quoted blocks trimmed and reordered for ease of reply]

> I'm curious, how are these actually being used? They can't be relevant
> for the root file system as that already needs to be set up in the
> initrd.

Correct. The root pool (with all of its filesystems) is handled in the initrd.
These units are for regular (non-root) pools.

> This unit does not sort itself before local-fs-pre.target
> that, i. e. you also can't rely on having these done when stuff in
> /etc/fstab is being mounted. The only thing that I see is that
> zfs-mount.service.in is Before=local-fs.target, i. e the zfs-import-*
> bits will run in parallel with mounting fstab.

In parallel with mounting fstab sounds reasonable, since mounting filesystems
from fstab and mounting filesystems from ZFS (which happens by default upon
import) are analogous.

> Does this mean that ZFS isn't currently supported on hotpluggable
> storage? (As there's nothing that would call zpool import on them)

In short, yes. If (after boot) you plug in a USB drive with a zpool on it, you
would need to manually import it. Personally, I'd like to see this improved in
the future.

> Is zpool import idempotent in any way,
> i. e. would it be safe (and also performant) to run it whenever a
> block device is seen, instead of once ever in the boot sequence?

I'm not super confident, but I think it would be idempotent, at least assuming
it is not run in parallel. But I doubt it would be performant. Running
zpool import examines all devices (all zfs_member devices if compiled with
libblkid support, assuming I understand the code correctly).

Additionally, if you're responding to device events one-by-one, for pools with
more than one disk, there's a question of whether you should import the pool
when disks are missing (within the level of redundancy where you *can* import
the pool, of course).

> LVM is a bit more complicated,
> its udev rules mostly just tell a daemon (lvmetad) about new devices
> which then decides when it has enough pieces (PVs) to bring up a VG.

This might be the right approach for ZFS long-term. I think the best behavior
is to import a pool:
  once all devices are present ||
  (once sufficient devices are present &&
   a short timeout has expired with no new disks arriving)

Richard Laager (rlaager) wrote :

@smoser: If this is a fresh install with no zpools, you shouldn't have a zpool.cache file, and so your chain should have zfs-import-scan.service, not zfs-import-cache.service. You'd still have the same problem, I expect, but that detail is a bit worrisome.

Martin Pitt (pitti) wrote :

Hello Richard,

Richard Laager [2016-04-20 0:47 -0000]:
> In parallel with mounting fstab sounds reasonable, since mounting filesystems
> from fstab and mounting filesystems from ZFS (which happens by default upon
> import) are analogous.

Ah, I see. So do file systems from ZFS have a separate equivalent to
/etc/fstab?

> But I doubt it would be performant. Running
> zpool import examines all devices
> [...]
> Additionally, if you're responding to device events one-by-one, for pools with
> more than one disk, there's a question of whether you should import the pool
> when disks are missing

I think these two make it sufficiently "device unspecific" to make a
simple "call this in an udev rule" impractical.

> I think the best behavior
> is to import a pool:
> once all devices are present ||

It's really best to completely drop the idea of "all devices". It just
doesn't work in a world where you even can (and do) hotplug PCI
devices (e. g. to replace a faulty disk in a server). Thinking in
these terms also quickly gets you down into a rabbit hole of race
conditions and too strong locks.

> (once sufficient devices are present &&

That sounds like the right approach. That's more or less what LVM or
mdadm are doing. For some RAID modes in mdadm it's possible that you
already assemble a device before all members have been detected (e. g.
RAID-1 with 3 redundant disks), but LVM gets along with the third
member being added afterwards.

> a short timeout has expired with no new disks arriving)

That again is a part I strongly recommend dropping, as it's not
well-defined.

Thanks!

Richard Laager (rlaager) wrote :

> Ah, I see. So do file systems from ZFS have a separate equivalent to
> /etc/fstab?

Yes, each filesystem has a "mountpoint" property. This can be set directly, and is also inherited automatically. For example, I have rpool's mountpoint set to / and then rpool/home automatically inherits /home as its mountpoint. I also have rpool/home/root with a custom mountpoint set of /root. For a more complete list, see section 3.3 of my HOWTO:
https://github.com/rlaager/zfs/wiki/HOWTO-install-Ubuntu-to-a-Native-ZFS-Root-Filesystem
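For illustration, the mountpoint property is managed with the zfs set/get commands (dataset names here match the example above):

```
# Set a custom mountpoint on one dataset (children inherit unless overridden):
zfs set mountpoint=/root rpool/home/root
# Show the effective mountpoints for the whole pool:
zfs get -r mountpoint rpool
```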

> It's really best to completely drop the idea of "all devices".

I think I might have done a bad job of explaining my point. Let me try again.

Imagine I have a raidz2 ("RAID-6", can survive two disks missing) with 6 disks. I want the daemon to import that pool as soon as all 6 disks are present. I also want it to import the pool if 4 (or 5) disks are present, but only after a short timeout where no new disks have been detected. That is, I do not want the pool to import in a degraded state when the first 4 disks are present, only to have the 5th and 6th show up milliseconds later.

The daemon would handle each disk one-by-one. It would track the information about discovered pools in memory. After each new disk is discovered and we've identified the pool it belongs to (C/glib pseudo-code follows):
handle_disk_arrival(disk) {
    /* TODO: read details from disk, initialize some pool object */

    if (has_sufficient_disks(pool)) {
        g_source_remove_by_funcs_user_data(import_pool, pool);
        if (has_all_disks(pool)) {
            g_timeout_add(0, import_pool, pool);
        } else {
            g_timeout_add(500, import_pool, pool);
        }
    }
}

import_pool(zpool *pool) {
    /* TODO: import the pool */
    return FALSE;
}
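The same "all disks now, or sufficient disks plus a quiet period" logic can be sketched in runnable form (Python; the class, method names, and disk counts are invented for illustration — real code would read pool membership from the disks themselves):

```python
# Sketch of the arrival/debounce policy described above: import a pool
# immediately once all member disks are present, or after a short quiet
# period once a sufficient (importable) subset is present.

class PoolTracker:
    def __init__(self, total_disks, min_disks, quiet_period=0.5):
        self.total = total_disks      # disks in a complete pool
        self.minimum = min_disks      # fewest disks that still allow import
        self.quiet = quiet_period     # debounce window in seconds
        self.present = set()
        self.last_arrival = None

    def disk_arrived(self, disk, now):
        self.present.add(disk)
        self.last_arrival = now

    def should_import(self, now):
        """True when the import criteria from the discussion are met."""
        if len(self.present) < self.minimum:
            return False
        if len(self.present) == self.total:
            return True                              # complete: import now
        return now - self.last_arrival >= self.quiet # degraded, but quiet

# Example: a 6-disk raidz2, importable with 4 of 6 disks present.
t = PoolTracker(total_disks=6, min_disks=4)
for i, disk in enumerate(["sda", "sdb", "sdc", "sdd"]):
    t.disk_arrived(disk, now=float(i))
print(t.should_import(now=3.1))  # False: 4 disks, quiet period not over
print(t.should_import(now=3.6))  # True: 4 disks and 0.5s of quiet
t.disk_arrived("sde", now=4.0)
t.disk_arrived("sdf", now=4.0)
print(t.should_import(now=4.0))  # True: all 6 disks, import immediately
```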

Martin Pitt (pitti) wrote :

Hello Richard,

Richard Laager [2016-04-20 17:59 -0000]:
> Yes, each filesystem has a "mountpoint" property.

I see, thanks. (Sorry, I'm just familiar with btrfs and subvolumes,
and they work differently). So I see how putting this into
local-fs.target in parallel with the fstab mounts makes sense.

> I want the daemon to import that pool as soon as all 6 disks are
> present. I also want it to import the pool if 4 (or 5) disks are
> present, but only after a short timeout where no new disks have been
> detected.

Aah, that makes much more sense than "wait for all devices", so that
was just a misunderstanding.

> That is, I do not want the pool to import in a degraded state when
> the first 4 disks are present, only to have the 5th and 6th show up
> milliseconds later.
>
> The daemon would handle each disk one-by-one. It would track the
> information about discovered pools in memory. After each new disk is
> discovered and we've identified the pool it belongs to (C/glib
> pseudo-code follows):

Doing this with a daemon would certainly work, it can do arbitrary
state keeping and timeouts. OTOH this is a bit of a central point of
failure/you need to write a new daemon/etc.

I haven't thought this through very thoroughly, but I have a gut
feeling this could even work in a decentralized and asynchronous way:
Whenever we get an uevent for a block device we call a helper (through
udev's RUN) which checks which zpool this belongs to and if there's at
least the minimal number of members available (but not all yet); if
so, it can start a systemd timer unit [1]
zpool-assemble@<devicename>.timer which will trigger some
zpool-assemble@<devicename>.service after, say, 500ms. But when that
helper sees (on the next uevent) that this was the last missing member
of the zpool, i. e. we now have all devices, it could immediately
start zpool-assemble@<devicename>.service. The timer then goes off
into nothing as the corresponding .service is already running (i. e.
it's harmless), or the udev helper could also explicitly stop it.

This is just a sketch, but if you are interested in exploring this I
recommend that we have a more synchronous chat (IRC, I'm pitti on
Freenode) or a hangout or so.

[1] Triggering a systemd unit from a udev rule is trivial,
    e. g. ENV{SYSTEMD_WANTS}+="zpool-assemble@%k.service" (with the
    appropriate checks for "is this a zfs block device").
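The unit pair from the sketch might look roughly like this (the unit names follow the sketch above; all contents here are invented for illustration):

```
# zpool-assemble@.timer -- debounce window started by the udev helper
[Timer]
OnActiveSec=500ms
Unit=zpool-assemble@%i.service

# zpool-assemble@.service -- the actual import, triggered either by the
# timer expiring or directly when the last member device shows up
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/zpool-assemble %i
```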

Richard Laager (rlaager) wrote :

You make an excellent point about single-point-of-failure. It wouldn't necessarily have to be a new daemon. ZFS already has a daemon, so this could just be additional functionality.

I've pointed Brian Behlendorf (the upstream ZFS-on-Linux lead developer) to this bug report. It'd be nice to get some official comments from upstream.

As I said, I definitely want to see something done, to make removable media work correctly. (Though how to unmount it is a different question.)

Richard Laager (rlaager) wrote :

The relevant issue upstream seems to be:
https://github.com/zfsonlinux/zfs/issues/4178

There's no answer there yet, but I wanted to get that link posted here.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.7~bzr1227-0ubuntu1

---------------
cloud-init (0.7.7~bzr1227-0ubuntu1) yakkety; urgency=medium

  * New upstream snapshot.
    - fix one more unit test to run inside buildd.

 -- Scott Moser <email address hidden> Sat, 04 Jun 2016 20:55:07 -0400

Changed in cloud-init (Ubuntu):
status: Confirmed → Fix Released
Scott Moser (smoser) wrote :

Hello,
An SRU upload of cloud-init for 16.04 that contains a fix for this bug has been made under bug 1595302. Please track that bug if you are interested.

Changed in cloud-init (Ubuntu Xenial):
status: New → Fix Released
status: Fix Released → In Progress
importance: Undecided → High
assignee: nobody → Scott Moser (smoser)
Mathew Hodson (mhodson) on 2016-07-08
Changed in zfs-linux (Ubuntu Xenial):
importance: Undecided → High
Scott Moser (smoser) wrote :

Fix is now released to xenial under bug 1595302. Daily cloud-images with this newer version of cloud-init should appear in the next few days. Any image with a serial number *newer* than 20160707 should have cloud-init at 0.7.7~bzr1246-0ubuntu1~16.04.1 .

Changed in cloud-init (Ubuntu Xenial):
status: In Progress → Fix Released
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in zfs-linux (Ubuntu Xenial):
status: New → Confirmed
Colin Ian King (colin-king) wrote :

Since this seems to be closed against bug 1595302, I'm going to mark this bug as fix released too.

Changed in zfs-linux (Ubuntu):
status: Confirmed → Fix Released
Changed in zfs-linux (Ubuntu Xenial):
status: Confirmed → Fix Released