Sophisticated snap remove retries a lot and becomes very slow

Bug #1806447 reported by Paweł Stołowski
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
snapd
Fix Released
High
Paweł Stołowski

Bug Description

This issue can be reproduced with the following sequence of ops (requires experimental snap instances feature ON):

$ sudo snap install brave && sudo snap install brave_test && sudo snap remove brave brave_test

(this snap has ~12 interfaces connected to core which greatly increases chances of conflicts).

The above scenario often leads to the situation where "auto-disconnect" finds a conflict, returns Retry with 5 seconds timeout, but instead of trying to run any of the tasks that could unblock it, snapd waits another 5 seconds before re-considering all tasks.
After 5 second pass, task manager (randomly) picks another task - if it happens to be one of the "disconnect" tasks, then we're good and its run, but if it's "auto-disconnect" again, tough luck, we waste another 5 seconds, slowing down snap removal substantially (with "this task needs to be retried because of ... conflict appearing in the logs a couple of times).

Changed in snapd:
assignee: nobody → Paweł Stołowski (stolowski)
importance: Undecided → High
status: New → Triaged
description: updated
Revision history for this message
Paweł Stołowski (stolowski) wrote :

I think the problem is around scheduling of next ensure. After Retry we do EnsureBefore(0) because we found blocked tasks, which is expected; but after ConsiderTask loop we also call EnsureBefore with nextTaskTime (which happens to be taken from the auto-disconnect task). So, the call to EnsureBefore after the big loop resets EnsureTime to +5s from now.

The ensureBefore() code in overlord has a condition to prevent this:

now := time.Now()
next := now.Add(d)
if next.Before(o.ensureNext) || o.ensureNext.Before(now) {
 o.ensureTimer.Reset(d)
 o.ensureNext = next
}

I suspect the second part of the `if` conditional is to handle waking up from sleep, but I think it's allowing this bad scenario to happen, as it sets ensureNext to +5s. I think we need to split the above conditionals into two and waking from sleep should simply set ensureNext to ~now(), not next.

summary: - ifacemgr "block" predicate is not aware of retry
+ Sophisticated snap remove retries a lot and becomes very slow
description: updated
Changed in snapd:
status: Triaged → Confirmed
Changed in snapd:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.