Error during refresh of snapd leads to unusable system
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
snapd |
Fix Released
|
High
|
Unassigned | ||
snapd (Ubuntu) |
Fix Released
|
Undecided
|
Unassigned |
Bug Description
$ snap version
snap 2.47.1+20.10
snapd 2.47.1+20.10
series 16
ubuntu 20.10
kernel 5.8.0-22-generic
System fully up to date
Symptom: I cannot launch any snap. Running from a terminal shows for example:
$ spotify
cannot stat /var/lib/
snap list returns the list of snaps
snap changes reports one error:
ID Status Spawn Ready Summary
65 Error today at 13:04 CEST today at 13:06 CEST Auto-refresh snaps "snapd", "squirrelsql"
Details for this changes report:
$ snap change 65
Status Spawn Ready Summary
Done today at 13:04 CEST today at 13:06 CEST Ensure prerequisites for "snapd" are available
Undone today at 13:04 CEST today at 13:06 CEST Download snap "snapd" (9607) from channel "latest/stable"
Done today at 13:04 CEST today at 13:06 CEST Fetch and check assertions for snap "snapd" (9607)
Undone today at 13:04 CEST today at 13:06 CEST Mount snap "snapd" (9607)
Undone today at 13:04 CEST today at 13:06 CEST Run pre-refresh hook of "snapd" snap if present
Undone today at 13:04 CEST today at 13:06 CEST Stop snap "snapd" services
Undone today at 13:04 CEST today at 13:04 CEST Remove aliases for snap "snapd"
Undone today at 13:04 CEST today at 13:04 CEST Make current revision for snap "snapd" unavailable
Undone today at 13:04 CEST today at 13:04 CEST Copy snap "snapd" data
Error today at 13:04 CEST today at 13:04 CEST Setup snap "snapd" (9607) security profiles
Hold [...]
Full log from that time attached to the bug report in journal.log
$ ls -ld /snap/snapd/
-rwxr-xr-x 1 root root 2306928 sept. 4 18:33 /snap/snapd/
-rwxr-xr-x 1 root root 2306928 sept. 30 06:59 /snap/snapd/
-rwxr-xr-x 1 root root 2306928 sept. 30 06:59 /snap/snapd/
There is no version 8790
$ sudo ls -l /proc/70906/exe
[...]
lrwxrwxrwx 1 root root 0 oct. 14 15:02 /proc/70906/exe -> /usr/lib/
refreshing snapd manually "fixes" the issue.
As discussed on IRC with jibel, there was a package update happening around from 07:01:25 to 07:03:53 on Oct 12.
Looking at the logs, I suspect the following happened:
- a reexeced snapd was running service kicks in, calls systemctl reset-failed on snapd.service TO_REV, but that has no meaning, because on the last run, there was no update of snapd happening
- snapd package update requested systemctl restart snapd.service
- the snapd service hit a kill timeout (unclear why, there is a log that indicates a graceful shutdown did not happen after 25s, then systemd proceeds to kill the process, which exits due to SIGTERM)
- snapd.failure.
- snapd.failure starts 'previous' snapd with SNAPD_REVERT_
- package update calls systemctl start snapd
<at this point there are 2 snapd processes running, one under snap-failure, another started by postinst, state gets corrupted with both snapd processes modifying it>
- some time later (oct. 14 13:04:58) a refresh occurs in the process run from snap-failure, snapd exits, snapd.failure service finishes successfully
- snapd service is restarted and operates for a while
- reboot
- snapd service attempts to refresh the snapd snap, but is confused by the state not matching actual situation in the system
The problems we need to address is:
- make sure that is only one instance of snapd running
- investigate why snapd shutdown gets killed
- make sure that snapd started by snap-failure acts correctly when there was no snapd update