tests/core/snapd-failover randomly fails
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
snapd |
Fix Released
|
Undecided
|
Alberto Mardegan |
Bug Description
I've been noticing rare but consistent failure of the spread test "tests/
- https:/
I've created a PR to disable the test (https:/
I can reproduce the failure once in about 20 test runs, which I execute like this:
spread -debug -repeat 40 google:
I've added some more debugging to snapd (my changes: https:/
=======
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: taskrunner.go:439: DEBUG: Running task 101 on Undo: Copy snap "snapd" data
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: taskrunner.go:439: DEBUG: Running task 100 on Undo: Make current revision for snap "snapd" unavailable
Nov 26 08:40:20 nov260653-845046 snapd[3084]: taskrunner.go:439: DEBUG: Running task 103 on Undoing: Make snap "snapd" (unset) available to the system
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: link.go:69: updateCurrentSy
Nov 26 08:40:20 nov260653-845046 snapd[3084]: link.go:69: updateCurrentSy
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: link.go:58: Running stat for "/snap/
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: link.go:64: IsDir: false, Mode: Lrwxrwxrwx - 134218239
Nov 26 08:40:20 nov260653-845046 snapd[3084]: link.go:58: Running stat for "/snap/
Nov 26 08:40:20 nov260653-845046 snapd[3084]: link.go:61: Stat of "/snap/
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: link.go:58: Running stat for "/snap/
Nov 26 08:40:20 nov260653-845046 snapd[3084]: link.go:74: Cannot remove "/snap/
Nov 26 08:40:20 nov260653-845046 snapd[3084]: link.go:58: Running stat for "/snap/
Nov 26 08:40:20 nov260653-845046 snapd[3084]: link.go:61: Stat of "/snap/
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: link.go:61: Stat of "/snap/
Nov 26 08:40:20 nov260653-845046 snapd[3084]: link.go:58: Running stat for "/snap/
Nov 26 08:40:20 nov260653-845046 snapd[3084]: link.go:64: IsDir: false, Mode: Lrwxrwxrwx - 134218239
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: task.go:337: DEBUG: 2021-11-
Nov 26 08:40:20 nov260653-845046 snapd[3084]: task.go:337: DEBUG: 2021-11-
Nov 26 08:40:20 nov260653-845046 snapd[3084]: taskrunner.go:271: [change 9 "Make snap \"snapd\" (unset) available to the system" task] failed: symlink x1 /snap/snapd/
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: daemon.go:509: gracefully waiting for running hooks
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: daemon.go:511: done waiting for running hooks
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: taskrunner.go:439: DEBUG: Running task 99 on Undo: Remove aliases for snap "snapd"
=======
As the logs show, there are two processes, snapd and snap-failure, running the very same updateCurrentSy
It looks like these two processes are not meant to be run at the same time; I need to investigate whether it's the test to be wrong, or if we need to add some locking into snapd.
tags: | added: flaky-tests |
Changed in snapd: | |
assignee: | nobody → Alberto Mardegan (mardy) |
status: | New → In Progress |
Changed in snapd: | |
status: | In Progress → Fix Committed |
Changed in snapd: | |
status: | Fix Committed → Fix Released |
Trying to read the attached logs, and reconstruct what happened (I'm omitting most lines, and copying only those that seem essential to understand the flow):
- the tests install a broken release of snapd (x2):
snapd[2511]: link.go:69: updateCurrentSy mlinks mountdir "/snap/snapd/x2" 26T10:28: 39Z INFO Requested daemon restart (snapd snap). x2/usr/ lib/snapd/ snapd: Exec format error
snapd[2511]: task.go:337: DEBUG: 2021-11-
systemd[1]: snapd.service: Service hold-off time over, scheduling restart.
systemd[1]: Stopped Snap Daemon.
systemd[1]: Starting Snap Daemon...
systemd[2907]: snapd.service: Failed at step EXEC spawning /snap/snapd/
- the above is expected, this revision of snapd is intentionally broken to test the snapd recovery
systemd[1]: snapd.service: Unit entered failed state.
systemd[1]: snapd.service: Triggering OnFailure= dependencies.
systemd[1]: Starting Failure handling of the snapd snap...
- snap-failure[2912] starts:
snap-failure[2912]: cmd_snapd.go:136: stopping snapd socket
systemd[1]: snapd.service: Service hold-off time over, scheduling restart.
- I wonder if the line above is a sign that the systemd restart mechanism is conflicting with our recovery. Anyway, stopping the socket seems to cause systemd do some crazy stuff:
systemd[1]: Stopped Snap Daemon. service] x1/usr/ lib/snapd/ snapd
systemd[1]: Starting Snap Daemon...
systemd[1]: Stopped Wait until snapd is fully seeded [this is snapd.seeded.
systemd[1]: Stopped Snap Daemon.
systemd[1]: Closed Socket activation for snappy daemon.
snap-failure[2912]: cmd_snapd.go:143: restoring invoking snapd from: /snap/snapd/
- so, all the above happens when snap-failure just issued the "systemctl stop snapd.socket"; but it does not seem to be worrying, because it is not really causing snapd to start.
snap-failure[2912]: activation.go:64: DEBUG: socket "/run/snapd.socket" was not activated; listening snap.socket" was not activated; listening
snap-failure[2912]: activation.go:64: DEBUG: socket "/run/snapd-
- the ephimeral snapd started by snap-failure realized that we are recovering from a failure:
snap-failure[2912]: task.go:337: DEBUG: 2021-11- 26T10:28: 40Z ERROR there was a snapd rollback across the restart
snap-failure[2912]: taskrunner.go:439: DEBUG: Running task 103 on Undo: Make snap "snapd" (unset) available to the system
systemd[1]: Reloading.
- At this point the logs show quite a deal of activity from systemd, which reloads the apparmor profiles (and from their path, we can see that there are still some leftovers from the "x2" revision),
systemd[1]: Stopped Load AppArmor profiles managed internally by snapd.
systemd[1]: Started Load AppArmor profiles managed internally by snapd.
- snapd.core- fixup.service is restarted,
systemd[1]: Stopped Automatically repair incorrect owner/permissions on core devices.
systemd[1]: Started Automatically repair incorrect owner/permissions on core devices.
- the snapd.socket unit is started (not clear by whom),
systemd[1]: Starting Socket activation for snappy daemon.
systemd[1]: Listening on Socket activation for snappy daemon.
- and, more worryingly, snapd is started as well:
systemd[1]: Starting Snap Daemon...
snapd[3092]: AppArm...