Comment 4 for bug 1898934

Paweł Stołowski (stolowski) wrote : Re: snapd lost current symlink, total system failure

I've analyzed the timings data from the state. The last critical change appears to have happened on 2020-10-05 12:58:27, which matches one of the (scheduled) reboots (-3 d15798f39bcf4221a3ed50ebf65fd5b1 Mon 2020-09-07 15:09:58 UTC—Mon 2020-10-05 12:58:15 UTC).

This change (change-id = 37) was a refresh of 4 snaps, including snapd. It failed for some reason and was undone; the undo included undoing unlink-current-snap for snapd and generating the snapd wrappers. Part of the undo (in particular for unlink-current-snap) was done after the reboot on 2020-10-05 12:58:15. The undo seems to have succeeded and the system was operational; there were two more reboots after this date:

-1 18476449d48e4ee5975a7e1fe1fa0d61 Tue 2020-10-06 08:10:01 UTC—Tue 2020-10-06 15:20:32 UTC
 0 1bbf6ec4670845ccaf94f2d788e2bb89 Tue 2020-10-06 15:25:51 UTC—Wed 2020-10-07 20:57:24 UTC

although they don't appear to have been scheduled by snapd, since the aforementioned change 37 is the last one (confirmed by "last-change-id": 37).

With a bit of jq magic, timings for this change can be sorted and distilled like this (attaching the result for convenience):

$ jq '.data.timings|=sort_by(."start-time") | .data["timings"][] | select ((.tags["change-id"] == "37") and (.tags["task-status"] == "Undoing"))' < state.json
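For anyone without jq at hand, the same selection can be sketched in Python. This is only a rough equivalent of the filter above, not something snapd ships: the helper names (load_state, undoing_timings) are made up for illustration, and it assumes each timings entry carries a "start-time" string and a "tags" map, as the jq expression implies.

```python
import json
from typing import Any, Dict, List


def load_state(path: str = "state.json") -> Dict[str, Any]:
    """Load a copy of snapd's state file (the path is an assumption)."""
    with open(path) as f:
        return json.load(f)


def undoing_timings(state: Dict[str, Any], change_id: str = "37") -> List[Dict[str, Any]]:
    """Return the timings entries of one change in "Undoing" status, oldest first."""
    timings = state.get("data", {}).get("timings", [])
    selected = [
        t for t in timings
        if t.get("tags", {}).get("change-id") == change_id
        and t.get("tags", {}).get("task-status") == "Undoing"
    ]
    # Timestamps are compared as plain strings, which is also what
    # jq's sort_by does with string values.
    return sorted(selected, key=lambda t: t.get("start-time", ""))


# Usage, assuming state.json is in the current directory:
#   for entry in undoing_timings(load_state()):
#       print(json.dumps(entry, indent=2))
```

Filtering before sorting gives the same result as the jq pipeline, which sorts first and then selects.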

Snapd survived these two reboots and worked until 2020-10-07, as mentioned by Roger, and the timings confirm that: the only activity in these two days came from ensure loops (refresh-hints, refresh-catalogs).
All this suggests that, up to the fatal power loss, snapd started after each reboot, meaning the current symlink was present. It is evident from state.json, though, that snapd lost its active:true flag at some point.
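A quick way to check for the missing flag in a copy of state.json could look like the sketch below. It assumes that per-snap state lives under data.snaps, keyed by snap name, and that "active" is a boolean which may simply be absent when false; the function name inactive_snaps is hypothetical.

```python
from typing import Any, Dict, List


def inactive_snaps(state: Dict[str, Any]) -> List[str]:
    """List snaps whose state entry has no truthy "active" flag."""
    # Assumption: per-snap state is stored under data.snaps, keyed by name.
    snaps = state.get("data", {}).get("snaps", {})
    # A missing "active" key is treated the same as active: false.
    return sorted(
        name for name, snapst in snaps.items()
        if not snapst.get("active", False)
    )


# Usage, assuming state.json is in the current directory:
#   import json
#   with open("state.json") as f:
#       print(inactive_snaps(json.load(f)))
```

If snapd shows up in the returned list, that would match the lost active:true flag described above.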