tests/core/snapd-failover randomly fails

Bug #1952404 reported by Alberto Mardegan
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
snapd
Fix Released
Undecided
Alberto Mardegan

Bug Description

I've been noticing rare but consistent failure of the spread test "tests/core/snapd-failover" in our CI:
- https://github.com/snapcore/snapd/runs/4289056946?check_suite_focus=true (in case the log expires, I've reposted it here: https://pastebin.ubuntu.com/p/qz3hrCGh24/)

I've created a PR to disable the test (https://github.com/snapcore/snapd/pull/11108), and this bug report is to collect some information about it and hopefully fix the failure.

I can reproduce the failure once in about 20 test runs, which I execute like this:
spread -debug -repeat 40 google:ubuntu-core-16-64:tests/core/snapd-failover

I've added some more debugging to snapd (my changes: https://pastebin.ubuntu.com/p/pVXhRCqGjH/) and it's now clear that the error occurs in the updateCurrentSymlinks() function, in the last line, where the symbolic link is created: https://github.com/snapcore/snapd/blob/master/overlord/snapstate/backend/link.go#L91

=======
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: taskrunner.go:439: DEBUG: Running task 101 on Undo: Copy snap "snapd" data
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: taskrunner.go:439: DEBUG: Running task 100 on Undo: Make current revision for snap "snapd" unavailable
Nov 26 08:40:20 nov260653-845046 snapd[3084]: taskrunner.go:439: DEBUG: Running task 103 on Undoing: Make snap "snapd" (unset) available to the system
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: link.go:69: updateCurrentSymlinks mountdir "/snap/snapd/x1"
Nov 26 08:40:20 nov260653-845046 snapd[3084]: link.go:69: updateCurrentSymlinks mountdir "/snap/snapd/x1"
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: link.go:58: Running stat for "/snap/snapd/current"
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: link.go:64: IsDir: false, Mode: Lrwxrwxrwx - 134218239
Nov 26 08:40:20 nov260653-845046 snapd[3084]: link.go:58: Running stat for "/snap/snapd/current"
Nov 26 08:40:20 nov260653-845046 snapd[3084]: link.go:61: Stat of "/snap/snapd/current" failed: lstat /snap/snapd/current: no such file or directory
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: link.go:58: Running stat for "/snap/snapd/current"
Nov 26 08:40:20 nov260653-845046 snapd[3084]: link.go:74: Cannot remove "/snap/snapd/current": remove /snap/snapd/current: no such file or directory
Nov 26 08:40:20 nov260653-845046 snapd[3084]: link.go:58: Running stat for "/snap/snapd/current"
Nov 26 08:40:20 nov260653-845046 snapd[3084]: link.go:61: Stat of "/snap/snapd/current" failed: lstat /snap/snapd/current: no such file or directory
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: link.go:61: Stat of "/snap/snapd/current" failed: lstat /snap/snapd/current: no such file or directory
Nov 26 08:40:20 nov260653-845046 snapd[3084]: link.go:58: Running stat for "/snap/snapd/current"
Nov 26 08:40:20 nov260653-845046 snapd[3084]: link.go:64: IsDir: false, Mode: Lrwxrwxrwx - 134218239
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: task.go:337: DEBUG: 2021-11-26T08:40:20Z INFO Requested daemon restart (snapd snap).
Nov 26 08:40:20 nov260653-845046 snapd[3084]: task.go:337: DEBUG: 2021-11-26T08:40:20Z ERROR symlink x1 /snap/snapd/current: file exists
Nov 26 08:40:20 nov260653-845046 snapd[3084]: taskrunner.go:271: [change 9 "Make snap \"snapd\" (unset) available to the system" task] failed: symlink x1 /snap/snapd/current: file exists
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: daemon.go:509: gracefully waiting for running hooks
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: daemon.go:511: done waiting for running hooks
Nov 26 08:40:20 nov260653-845046 snap-failure[2905]: taskrunner.go:439: DEBUG: Running task 99 on Undo: Remove aliases for snap "snapd"
=======

As the logs show, there are two processes, snapd and snap-failure, running the very same updateCurrentSymlinks() function at the same time. So, one of the process removes the /snap/snapd/current symlink, the other process creates it, and then the first process again tries to create it and fails.

It looks like these two processes are not meant to be run at the same time; I need to investigate whether it's the test to be wrong, or if we need to add some locking into snapd.

Tags: flaky-tests
Revision history for this message
Alberto Mardegan (mardy) wrote :
Revision history for this message
Alberto Mardegan (mardy) wrote :
Download full text (5.7 KiB)

Trying to read the attached logs, and reconstruct what happened (I'm omitting most lines, and copying only those that seem essential to understand the flow):

- the tests install a broken release of snapd (x2):

snapd[2511]: link.go:69: updateCurrentSymlinks mountdir "/snap/snapd/x2"
snapd[2511]: task.go:337: DEBUG: 2021-11-26T10:28:39Z INFO Requested daemon restart (snapd snap).
systemd[1]: snapd.service: Service hold-off time over, scheduling restart.
systemd[1]: Stopped Snap Daemon.
systemd[1]: Starting Snap Daemon...
systemd[2907]: snapd.service: Failed at step EXEC spawning /snap/snapd/x2/usr/lib/snapd/snapd: Exec format error

- the above is expected, this revision of snapd is intentionally broken to test the snapd recovery

systemd[1]: snapd.service: Unit entered failed state.
systemd[1]: snapd.service: Triggering OnFailure= dependencies.
systemd[1]: Starting Failure handling of the snapd snap...

- snap-failure[2912] starts:

snap-failure[2912]: cmd_snapd.go:136: stopping snapd socket
systemd[1]: snapd.service: Service hold-off time over, scheduling restart.

- I wonder if the line above is a sign that the systemd restart mechanism is conflicting with our recovery. Anyway, stopping the socket seems to cause systemd do some crazy stuff:

systemd[1]: Stopped Snap Daemon.
systemd[1]: Starting Snap Daemon...
systemd[1]: Stopped Wait until snapd is fully seeded [this is snapd.seeded.service]
systemd[1]: Stopped Snap Daemon.
systemd[1]: Closed Socket activation for snappy daemon.
snap-failure[2912]: cmd_snapd.go:143: restoring invoking snapd from: /snap/snapd/x1/usr/lib/snapd/snapd

- so, all the above happens when snap-failure just issued the "systemctl stop snapd.socket"; but it does not seem to be worrying, because it is not really causing snapd to start.

snap-failure[2912]: activation.go:64: DEBUG: socket "/run/snapd.socket" was not activated; listening
snap-failure[2912]: activation.go:64: DEBUG: socket "/run/snapd-snap.socket" was not activated; listening

- the ephimeral snapd started by snap-failure realized that we are recovering from a failure:

snap-failure[2912]: task.go:337: DEBUG: 2021-11-26T10:28:40Z ERROR there was a snapd rollback across the restart
snap-failure[2912]: taskrunner.go:439: DEBUG: Running task 103 on Undo: Make snap "snapd" (unset) available to the system
systemd[1]: Reloading.

- At this point the logs show quite a deal of activity from systemd, which reloads the apparmor profiles (and from their path, we can see that there are still some leftovers from the "x2" revision),

systemd[1]: Stopped Load AppArmor profiles managed internally by snapd.
systemd[1]: Started Load AppArmor profiles managed internally by snapd.

- snapd.core-fixup.service is restarted,

systemd[1]: Stopped Automatically repair incorrect owner/permissions on core devices.
systemd[1]: Started Automatically repair incorrect owner/permissions on core devices.

- the snapd.socket unit is started (not clear by whom),

systemd[1]: Starting Socket activation for snappy daemon.
systemd[1]: Listening on Socket activation for snappy daemon.

- and, more worryingly, snapd is started as well:

systemd[1]: Starting Snap Daemon...
snapd[3092]: AppArm...

Read more...

tags: added: flaky-tests
Alberto Mardegan (mardy)
Changed in snapd:
assignee: nobody → Alberto Mardegan (mardy)
status: New → In Progress
Revision history for this message
Alberto Mardegan (mardy) wrote :
Changed in snapd:
status: In Progress → Fix Committed
Michael Vogt (mvo)
Changed in snapd:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.