snaps appear broken when /var/lib/snapd is a zfs dataset

Bug #1750059 reported by Sam Van den Eynde
This bug affects 3 people
Affects: snapd (Ubuntu)
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

ProblemType: Bug
ApportVersion: 2.20.7-0ubuntu3.7
Architecture: amd64
CurrentDesktop: ubuntu:GNOME
Date: Fri Feb 16 22:00:59 2018
DistroRelease: Ubuntu 17.10
Package: snapd 2.29.4.2+17.10
PackageArchitecture: amd64

I have a separate ZFS dataset mounted at /var/lib/snapd. After a reboot, snaps appear broken:

snap list
Name   Version  Rev   Developer  Notes
core            3887  canonical  broken
skype           13    skype      broken
slack           4     slack      broken

The only solution I have found to repair this (until the next reboot) is to uninstall and reinstall all snaps.

I have a system with /var/lib/snapd being a regular directory on the root filesystem, that does not show this behavior.

Both /var/lib/snapd and /snap seem consistent between both systems.

description: updated
Gustavo Silva (gsilvapt)
tags: added: amd64 artful snapd ubuntu-17.10
John Lenton (chipaca) wrote :

how is the mounting of /var/lib/snapd ordered in relation to starting of snapd?

Sam Van den Eynde (samvde) wrote :

I assume it has something to do with the mount timing as well.

It's stock/default behaviour. I see no direct link between snapd and zfs looking at the unit files. Both are wanted by the multi-user target but that's about it.

I have created a local copy (in /etc/systemd/system) of the snapd service, and made it dependent on zfs-mount.service and zfs-share.service, without success.

However, I am unfortunately not a systemd guru by far... If you could point out a simple directive e.g. to make snapd wait 30 seconds before starting I think we can pin this down fast.

John Lenton (chipaca) wrote :

Having snapd have an After= on the .mount unit that mounts /var/lib/snapd should be enough? I think. I haven't tested this (obviously).

Sam Van den Eynde (samvde) wrote :

I'll see if I have that available. ZFS scripts might do things a bit differently, I know there are a few outstanding actions wrt. systemd mount compatibility ("native generators").

Sam Van den Eynde (samvde) wrote :

So far no luck. I added an After=local-fs.target to the snapd unit file, but snaps still appear broken:

[Unit]
Description=Snappy daemon
Requires=snapd.socket
After=local-fs.target

[Service]
# Disabled because it breaks lxd
# (https://bugs.launchpad.net/snapd/+bug/1709536)
#Nice=-5
OOMScoreAdjust=-900
ExecStart=/usr/lib/snapd/snapd
EnvironmentFile=-/etc/environment
Restart=always
Type=notify

[Install]
WantedBy=multi-user.target

ZFS systemd mount scripts have Before=local-fs.target, so that can't be it:

[Unit]
Description=Mount ZFS filesystems
DefaultDependencies=no
After=systemd-udev-settle.service
After=zfs-import-cache.service
After=zfs-import-scan.service
After=systemd-remount-fs.service
Before=local-fs.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/zfs mount -a
WorkingDirectory=-/sbin/

[Install]
WantedBy=zfs-share.service
WantedBy=zfs.target

John Lenton (chipaca) wrote :

Could you run «systemd-analyze plot > /tmp/startup.svg» and attach the resulting /tmp/startup.svg?

Sam Van den Eynde (samvde) wrote :

I just pinned it down.

ZFS does not use systemd mount units. Both zfs(.target) and snapd(.service) are wanted by multi-user.target.

Snaps are mounted before the zfs dataset is mounted by zfs-mount.service, resulting in non-mounted snaps and therefore broken snaps.

I have not yet found a working solution to adapt the unit files to stop this from happening. But this definitely is the issue.

summary: - snaps are broken when /var/lib/snapd is a mounted directory
+ snaps appear broken when /var/lib/snapd is a zfs dataset
John Lenton (chipaca) wrote :

How about: create a oneshot service (Type=oneshot) that just polls until the zfs mount is done, and make it RequiredBy= and Before= the snap mount units.
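
The polling step suggested here could be sketched roughly as follows. This is a minimal illustration in Python, not something that was tested in this bug; the function name `wait_for_mount` and its timeout and interval defaults are hypothetical, and the check relies on `os.path.ismount` reporting the dataset's mountpoint as a mount point.

```python
import os
import time
from typing import Callable


def wait_for_mount(
    path: str,
    timeout: float = 30.0,
    interval: float = 0.5,
    is_mounted: Callable[[str], bool] = os.path.ismount,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until `path` is a mount point; return False once `timeout` expires.

    `is_mounted`, `sleep` and `clock` are injectable only so the loop can be
    exercised without real mounts or real waiting.
    """
    deadline = clock() + timeout
    while True:
        if is_mounted(path):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A oneshot unit's ExecStart= could call `wait_for_mount("/var/lib/snapd")` from a small wrapper script and exit non-zero when it returns False, so that units ordered after it do not start against an empty directory.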

Sam Van den Eynde (samvde) wrote :

The problem is that this becomes fragile as snaps get updated: the mount units are specific to a snap version.

I was thinking of the following workaround (to test the basic principle): create a mount unit for /var/lib/snapd. That should tick all boxes:
- it should be taken into account by systemd in the mount hierarchy
- it just needs the root filesystem to work (in my case), and that is mounted in the initrd
- it should not break zfs-mount.service
- the mount command provides zfs compatibility via zfsutil

Still a manual workaround if it works, but nevertheless quite manageable indeed, and it has to be done only once.

John Lenton (chipaca) wrote :

ah! if that'd work, perfect.

If not, my point about the oneshot service was that, if it worked, we could give all snap units an After= on some agreed-upon unit name, which would make this future-proof.

It might be a good idea independently; I'll discuss it with the team. But please let me know how it goes.

Sam Van den Eynde (samvde) wrote :

Hi, I tried this over noon and I have some good news. The workaround seems successful (I did a few reboots, and snaps did not break again).

However, I can imagine this being a difficult call to make: not everybody has zfs installed, let alone has /var/lib/snapd as a zfs dataset (although that actually makes a lot of sense ;-)).

So I only did one thing: I added the following mount unit as /etc/systemd/system/var-lib-snapd.mount:

[Unit]
Description=Mount unit for snapd
Before=snapd.service

[Mount]
What=zpool/var-lib-snapd
Where=/var/lib/snapd
Type=zfs
Options=zfsutil

[Install]
WantedBy=multi-user.target

Some impressions:
- this does not break zfs-mount.service (it will skip the already mounted dataset)
- this does not require changes to the mount units for installed snaps nor snapd.service
- this method uses zfsutil, which IGNORES the 'canmount' property of the zfs dataset
- if /var or /var/lib are zfs datasets as well, this will probably break again

Let me know if you want to test other scenarios or get more info, I'll set this up in my VM so I can test more easily.

John Lenton (chipaca) wrote :

I expect that if /var or /var/lib are zfs datasets, and you created the homologous .mount units, it would still work.

Is _that_ something you can test?

Either way, it would be good to have this documented over in the forum. I'll see if I can't write something up this evening.

Thank you for all your testing!

Sam Van den Eynde (samvde) wrote :

Yep, I will test that. /var can be tricky because of /var/run, so I'll use a VM for it.

Sam Van den Eynde (samvde) wrote :

Small update: I can't get the workaround to work consistently in a VM. While the directories are all mounted as they should be and snapd is started and running, snaps still appear "broken". I'm investigating further.

With everything I tried so far I'm not sure this makes a lot of sense though. If zfs is not integrated better in the native systemd scripts it seems this will stay unpredictable.

Sam Van den Eynde (samvde) wrote :

The workaround keeps on breaking in a VM. I'm at the point where I can't assist further without guidance.

Sam Van den Eynde (samvde) wrote :

Closing remark: the new systemd generators from upstream zfsonlinux fix this. Version 0.8.x has them (this needs Ubuntu kernel 5.2.0, or compiling zfs separately).

Changed in snapd (Ubuntu):
status: New → Fix Released
Andreas Wolf (andreaswolf) wrote :

I use Ubuntu 20.04 w/ ZFS for the whole system, and for me this still broke despite using zfsutils 0.8. I only have a ZFS dataset for /var, not for /var/lib/snapd—not sure if this contributes to the problem.

The .mount unit from #11 fixed it for me though, after I adjusted it to cover /var instead of /var/lib/snapd.

Sam Van den Eynde (samvde) wrote :

This should no longer be the case? Everything in /etc/zfs/zfs-list.cache/* should be dynamically handled by systemd. Does it contain entries for your pool?
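
One way to answer that question is to list what the cache files actually contain. The sketch below is a hedged Python illustration, not part of the ZFS tooling: the function name `cached_datasets` is hypothetical, and it assumes each cache file is named after a pool and holds tab-separated `zfs list -H` style lines whose first two columns are the dataset name and its mountpoint.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def cached_datasets(
    cache_dir: str = "/etc/zfs/zfs-list.cache",
) -> Dict[str, List[Tuple[str, str]]]:
    """Map each cache file name to its (dataset, mountpoint) pairs.

    Assumption: lines are tab-separated with the dataset name in column 1
    and the mountpoint in column 2; any further columns are ignored.
    """
    result: Dict[str, List[Tuple[str, str]]] = {}
    base = Path(cache_dir)
    if not base.is_dir():
        return result  # no cache directory at all
    for cache_file in sorted(base.iterdir()):
        if not cache_file.is_file():
            continue
        entries = []
        for line in cache_file.read_text().splitlines():
            fields = line.split("\t")
            if len(fields) >= 2 and fields[0]:
                entries.append((fields[0], fields[1]))
        result[cache_file.name] = entries
    return result
```

A pool that is missing from the result, or that maps to an empty list, would suggest that no mount units are being generated for its datasets.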
