Sophisticated snap remove retries a lot and becomes very slow
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
snapd |
Fix Released
|
High
|
Paweł Stołowski |
Bug Description
This issue can be reproduced with the following sequence of ops (requires experimental snap instances feature ON):
$ sudo snap install brave && sudo snap install brave_test && sudo snap remove brave brave_test
(this snap has ~12 interfaces connected to core which greatly increases chances of conflicts).
The above scenario often leads to the situation where "auto-disconnect" finds a conflict, returns Retry with 5 seconds timeout, but instead of trying to run any of the tasks that could unblock it, snapd waits another 5 seconds before re-considering all tasks.
After 5 second pass, task manager (randomly) picks another task - if it happens to be one of the "disconnect" tasks, then we're good and its run, but if it's "auto-disconnect" again, tough luck, we waste another 5 seconds, slowing down snap removal substantially (with "this task needs to be retried because of ... conflict appearing in the logs a couple of times).
Changed in snapd: | |
assignee: | nobody → Paweł Stołowski (stolowski) |
importance: | Undecided → High |
status: | New → Triaged |
description: | updated |
Changed in snapd: | |
status: | Confirmed → Fix Released |
I think the problem is around scheduling of next ensure. After Retry we do EnsureBefore(0) because we found blocked tasks, which is expected; but after ConsiderTask loop we also call EnsureBefore with nextTaskTime (which happens to be taken from the auto-disconnect task). So, the call to EnsureBefore after the big loop resets EnsureTime to +5s from now.
The ensureBefore() code in overlord has a condition to prevent this:
now := time.Now() o.ensureNext) || o.ensureNext. Before( now) { Reset(d)
next := now.Add(d)
if next.Before(
o.ensureTimer.
o.ensureNext = next
}
I suspect the second part of the `if` conditional is to handle waking up from sleep, but I think it's allowing this bad scenario to happen, as it sets ensureNext to +5s. I think we need to split the above conditionals into two and waking from sleep should simply set ensureNext to ~now(), not next.