snap refresh while command is running may cause issues

Bug #1616650 reported by Jamie Strandboge on 2016-08-24
This bug affects 11 people
Affects: snapd
Status: In Progress
Importance: Medium
Assigned to: Zygmunt Krynicki
Milestone: 2.39

Bug Description

In testing a desktop snap that saves state in $HOME on close, I noticed that if I `snap refresh` the snap while the command is running, it will try to save its state to the previous snap version's data directory. For the snap I was testing (a browser), this resulted in a very poor user experience: on restart the browser complained about an improper shutdown.

What is happening is that:
1. on launch the snap's HOME is set to SNAP_USER_DATA, which is something like /home/user/snap/foo/x1. The security policy correctly allows writes to SNAP_USER_DATA
2. on snap refresh to 'x2', the security policy for the running process is updated such that /home/user/snap/foo/x1 is read-only and /home/user/snap/foo/x2 is read/write
3. the environment of the command from '1' is not changed, so HOME (as well as SNAP_USER_DATA and SNAP_DATA) still uses 'x1' in its path
4. the command tries to shut down gracefully and save state to the 'x1' HOME, and security policy blocks it

Snappy's design for rollbacks relies on the previous SNAP_DATA and SNAP_USER_DATA directories not being writable and IMHO we should not change the policy to make other snap version's data dirs writable.

The design of the snappy state engine ensures (among other things) that there is only ever one security policy in place for the snap. In snappy 15.04 this problem was (intentionally) avoided because we used versioned snap security policy, such that the new policy would not apply until the next app invocation.

Gustavo and Zygmunt, you both advocated strongly for having only one version of the policy on disk and loaded in the kernel, and I recall bringing up this type of bug as a counter-argument. IIRC, for daemons we said that snapd could simply restart them (makes perfect sense). Have you thought of a mechanism for restarting non-daemons?

Jamie Strandboge (jdstrand) wrote :

Perhaps this was already in the design of the state engine? E.g., the state engine will periodically retry uninstalling an app if it is unable to unmount the squashfs. I wonder if the security policy load (or even the install) could be deferred until there were no processes running under that security label (easy to determine by examining /proc; see `ps -Z`).
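As a rough sketch (in Go, snapd's implementation language) of what such a check could look like: scan /proc for processes whose AppArmor label, as exposed in /proc/&lt;pid&gt;/attr/current, matches the snap's label prefix. The `hasLabelPrefix` and `pidsWithLabel` helpers are hypothetical illustrations, not snapd's actual API.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// hasLabelPrefix reports whether an AppArmor label string, as read from
// /proc/<pid>/attr/current (e.g. "snap.foo.bar (enforce)"), belongs to a
// snap whose labels start with the given prefix (e.g. "snap.foo.").
func hasLabelPrefix(label, prefix string) bool {
	return strings.HasPrefix(strings.TrimSpace(label), prefix)
}

// pidsWithLabel scans /proc for processes still confined under the snap's
// security label, roughly what inspecting `ps -Z` output would show.
func pidsWithLabel(prefix string) ([]string, error) {
	paths, err := filepath.Glob("/proc/[0-9]*/attr/current")
	if err != nil {
		return nil, err
	}
	var pids []string
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // the process may have exited already
		}
		if hasLabelPrefix(string(data), prefix) {
			// p is /proc/<pid>/attr/current; extract <pid>.
			pids = append(pids, filepath.Base(filepath.Dir(filepath.Dir(p))))
		}
	}
	return pids, nil
}

func main() {
	pids, err := pidsWithLabel("snap.foo.")
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	fmt.Println("processes still under the label:", pids)
}
```

Policy load could then be deferred until that list comes back empty.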

description: updated
Jamie Strandboge (jdstrand) wrote :

I had this happen again. The snap refreshed while I was offline and I ended up with 890531 sandbox denials and syslog at 288M before I came back online and stopped the snap in question. Marking confirmed.

Changed in snappy:
status: New → Confirmed
Michael Vogt (mvo) wrote :

This should be fixed now. We do a lazy unmount of mounted snaps, so anything still running will be removed in a delayed fashion.

Changed in snappy:
status: Confirmed → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for Snappy because there has been no activity for 60 days.]

Changed in snappy:
status: Incomplete → Expired
Jamie Strandboge (jdstrand) wrote :

@mvo - you marked this as Incomplete, stating "This should be fixed now. We do a lazy unmount of mounted snaps, so anything still running will be removed in a delayed fashion." But that is not what this bug is about. This bug is about the fact that a running application's environment points to versioned data directories that the application has write access to when it starts, but loses write access to after a refresh, because the application is not restarted when the refresh happens.

See https://forum.snapcraft.io/t/bug-saves-are-blocked-to-snap-user-data-if-snap-updates-when-it-is-already-running/3226/1

What we should probably do during a refresh is look in the freezer cgroup to see if there are any non-daemon processes running (daemons are already handled via systemd unit restarts). If so, delay the refresh (perhaps with a pop-up allowing the user to stop the application).
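As a sketch of that check (helper names are hypothetical and the implementation is illustrative, not snapd's actual code): parse the PID list from the snap's freezer cgroup, subtract the PIDs owned by the snap's service units, and delay the refresh if anything is left.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePids parses the newline-separated PID list found in a cgroup.procs
// file, such as /sys/fs/cgroup/freezer/snap.foo/cgroup.procs.
func parsePids(data string) []int {
	var pids []int
	for _, field := range strings.Fields(data) {
		if pid, err := strconv.Atoi(field); err == nil {
			pids = append(pids, pid)
		}
	}
	return pids
}

// nonDaemonPids returns the PIDs from the snap's freezer cgroup that are
// not accounted for by the snap's systemd service units. A non-empty
// result means the refresh should be delayed.
func nonDaemonPids(all, daemons []int) []int {
	known := make(map[int]bool, len(daemons))
	for _, pid := range daemons {
		known[pid] = true
	}
	var rest []int
	for _, pid := range all {
		if !known[pid] {
			rest = append(rest, pid)
		}
	}
	return rest
}

func main() {
	// Hypothetical contents of the two cgroup.procs files.
	all := parsePids("1201\n1305\n1410\n")
	daemons := parsePids("1201\n")
	fmt.Println(nonDaemonPids(all, daemons)) // prints [1305 1410]
}
```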

affects: snappy → snapd
Changed in snapd:
status: Expired → Confirmed
Michael Vogt (mvo) on 2018-01-02
Changed in snapd:
status: Confirmed → Triaged
importance: Undecided → Medium
Ernst Sjöstrand (ernstp) wrote :

My workstation usually has many weeks of uptime, and I leave my editors open so I can just pick up where I left off.
My editors happen to be snaps like Atom, IntelliJ, PyCharm etc.
So I run into this bug very frequently.

Zygmunt Krynicki (zyga) on 2018-07-19
Changed in snapd:
assignee: nobody → Zygmunt Krynicki (zyga)
Olivier Tilloy (osomon) wrote :

This is especially visible with the chromium snap. As described by Jamie, if the chromium snap is refreshed while running, at the next restart the application will complain that it wasn't shut down properly. Also, the GNOME dock loses track of the running application once the refresh is done.

This is regularly being reported by users on forums.

Zygmunt Krynicki (zyga) wrote :

I'm starting to work on this feature.

The general idea is that a package will only refresh when there are no processes running. I will document the feature separately on the forum and link it back here when ready. The feature will be based on the existing freezer cgroup that snapd manages for each snap.

Changed in snapd:
status: Triaged → In Progress
Zygmunt Krynicki (zyga) wrote :

We've re-designed the feature to be much simpler to understand.

<<<__DOC__

Simple approach first, covering the most common desktop case:

1. When the user requests a refresh interactively, we perform a synchronous, server-side soft verification that the request can go forward (for details about the verification, see below). If verification fails we return a synchronous error response; otherwise we create the usual change and return the async response.
2. In the refresh chain, before stopping services, we perform another soft verification. If the verification fails but we are still within the grace period for postponed refreshes, we fail the change (due to lanes, only the affected snaps will fail to refresh). If the last refresh is no longer within the grace period, we remember to kill all processes and carry on.
3. After stopping services we perform a hard verification. If that fails but we are still within the grace period, we restart the services we've stopped and fail the change, as above. If the grace period has elapsed, we kill all processes belonging to the snap and proceed with the refresh as usual.

Soft verification fails when the set of processes belonging to non-service applications is non-empty.
Hard verification fails when the set of processes belonging to the given snap is non-empty.

We can compute those sets by examining our freezer cgroup process list (/sys/fs/cgroup/freezer/snap.$SNAP_NAME/cgroup.procs) and the set of processes belonging to all the services that exist in the snap (by looking at /sys/fs/cgroup/systemd/system.slice/snap.$SNAP_NAME.*.service/cgroup.procs).

Once the simple approach is implemented, we can consider several improvements.

- We can initiate the refresh instantly after the last application process terminates, using the cgroup v1 or v2 notification mechanism.
- We can introduce new hooks that notify an application about a pending update.
- We can introduce session-level hooks via snapd and snap-userd to deliver messages to the sessions of logged-in users.
- We can pre-download the snap and perform the update in a special boot mode, matching similar work on recent desktop and server systems.
- We can add a mechanism that allows applications to take refresh-inhibit locks for critical operations for a bounded amount of time (independent of the logic above).

__DOC__

Zygmunt Krynicki (zyga) wrote :

I've targeted this at 2.39, where it should be available behind a feature flag. It may be available earlier, but 2.38 is likely to branch for release soon, so I think that's unrealistic.

Changed in snapd:
milestone: none → 2.39