snapd became inactive, total system failure

Bug #1898934 reported by Roger Peppe
14
This bug affects 1 person
Affects Status Importance Assigned to Milestone
snapd
Confirmed
High
Unassigned

Bug Description

After a power cut, my previously stable Ubuntu Core raspberry pi failed to restart the service that it was responsible for.

I could log in remotely, but the snap command wasn't working.
On further inspection, /usr/bin/snap links to /snap/snapd/current, but there is no "current" file in that directory.

Last time it was known to be working was on 2020-10-07.

I've attached a copy of /var/lib/snapd/state.json with the sensitive-looking parts redacted.

Other info that might be helpful:

output of "systemctl status -l core18.start-snapd.service": https://paste.ubuntu.com/p/5xTNkyqdqg/

output of "journalctl --list-boots ": https://paste.ubuntu.com/p/2tRSFgP7bB/

output of "cd /snap/snapd; for i in *; do; systemctl status snap-snapd-$i.mount; done": https://paste.ubuntu.com/p/mwjp9bJ3Tt/

output of "df": https://paste.ubuntu.com/p/McFB3WDSbP/

output of "systemctl --failed": https://paste.ubuntu.com/p/WJf9vCyKWm/

Revision history for this message
Roger Peppe (rogpeppe) wrote :
Zygmunt Krynicki (zyga)
information type: Public → Private
Revision history for this message
Michael Vogt (mvo) wrote :

Thanks for your bugreport. The output of:
$ journalctl -u snapd.service
$ journalctl -u snapd.failure.service
would be super interessting.

Revision history for this message
Paweł Stołowski (stolowski) wrote :

So, two quick observations that may be useful:

- "snapd" in the state doesn't have active: true flag, that means missing symlink kinda matches the state. That could be a problem reboot in unexpected moment, but could also be a problem with undo handling somewhere. I'm currently investigating an issue of missing with missing undo on snap remove/disable (which also restore symlinks and active flag), but it's probably not exactly this particular case, and only manual remove/disable.

- attached state.json doesn't have changes anymore to look at which is a bit unfortunate, but we have timings (yay). Those may give some clue about some ops happening recently.

Revision history for this message
Paweł Stołowski (stolowski) wrote :

I've analyzed timings data from the state, it seems the last critical change happened on 2020-10-05 12:58:27, and it also matches one of the (scheduled) reboots (-3 d15798f39bcf4221a3ed50ebf65fd5b1 Mon 2020-09-07 15:09:58 UTC—Mon 2020-10-05 12:58:15 UTC).

This change (change-id = 37) was a refresh of 4 snaps, including snapd, it failed for some reason and was undone, and undo included undoing of unlink-current-snap for snapd and generating snapd wrappers. Part of the undo (in particular for unlink-current-snap) was done after the reboot on 2020-10-05 12:58:15. It seems that undoing succeeded and the system was operational, there were two more reboots after this date:

-1 18476449d48e4ee5975a7e1fe1fa0d61 Tue 2020-10-06 08:10:01 UTC—Tue 2020-10-06 15:20:32 UTC
 0 1bbf6ec4670845ccaf94f2d788e2bb89 Tue 2020-10-06 15:25:51 UTC—Wed 2020-10-07 20:57:24 UTC

although they don't seem to be scheduled by snapd to me, because the aforementioned change 37 is the last one (confirmed by "last-change-id": 37).

With a bit of jq magic, timings for this change can be sorted and distilled like this (attaching the result for convenience):

$ jq '.data.timings|=sort_by(."start-time") | .data["timings"][] | select ((.tags["change-id"] == "37") and (.tags["task-status"] == "Undoing"))' < state.json

Snapd survived these two reboots and worked till 2020-10-07 as mentioned by Roger, and timings confirm that; the only activity in these two days was from ensure loops (refresh-hints, refresh-catalogs).
All this suggests that till the fatal power loss, snapd would start after reboot, meaning symlink was present. It's evident from the state.json though, that we lost active:true flag from snapd at some moment.

Zygmunt Krynicki (zyga)
Changed in snapd:
status: New → Confirmed
importance: Undecided → High
Zygmunt Krynicki (zyga)
summary: - snapd lost current symlink, total system failure
+ snapd became inactive, total system failure
Zygmunt Krynicki (zyga)
information type: Private → Public
Revision history for this message
Roger Peppe (rogpeppe) wrote :

OK, here's the output of the journalctl commands you requested:

Revision history for this message
Paweł Stołowski (stolowski) wrote :

Attaching some extra info as I got access to the affected system.

Two quick observations (both included in the attached document):
- fsck.vfat is not on PATH (apparently it's an old core18 image that didn't manage to refresh to have a fix for this), it's only present on the core image.
- snapd service is running (but no current symlink is present as noted earlier).
- boot parition is not mounted because it was manually unmounted in one of the previous debugging sessions.

Revision history for this message
Zygmunt Krynicki (zyga) wrote :

There are derivative bugs that we filed as a consequence of this analysis:

- https://bugs.launchpad.net/snapd/+bug/1899665 (snapd becoming inactive)
- https://bugs.launchpad.net/snapd/+bug/1899664 (snapd fails to install on old core18 revision)

We will have to file one more, related to bootloader discovery failure we noticed later.

Revision history for this message
Paweł Stołowski (stolowski) wrote :

One other issue from this session: https://bugs.launchpad.net/snapd/+bug/1899769

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers