Snapd refresh doesn't recover from connection interruption

Bug #1912615 reported by Onno Steenbergen
This bug affects 2 people
Affects: snapd
Status: Fix Committed
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

As discussed with Ondrej Kubik:

Multiple devices in our production fleet have been failing to upgrade. After we issued the refresh command, snapd started downloading the kernel, core and other application snaps. The devices are connected over 4G links of varying quality, with intermittent interruptions. After 24 hours we established a remote connection to check the state of the machine and found the download operations stuck. The only way to recover from this state was to issue "snap abort", which caused the system to revert all updates/downloads and reboot. As rebooting hasn't been reliable, this isn't desired behavior.

System:
- Dell Gateway 3002
- Ubuntu Core 16 with recent updates

Steps:
- `snap refresh`
- Interrupt connection during download
- Re-establish connection
- Snapd doesn't continue/restart download

Expected solution:
- Keep retrying downloads on interruptions
- Optional: Timeout to automatically abort tasks if resources can't be downloaded in X minutes/hours
- Optional: System applying partial refresh of packages.
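
The first two points of the expected solution could look roughly like the sketch below: retry a failed download with exponential backoff, and give up once an overall deadline has passed. This is only an illustration of the requested behavior, not snapd's implementation; `fetch`, `download_with_retry`, `DownloadTimeout` and all parameter names are hypothetical.

```python
import time


class DownloadTimeout(Exception):
    """Raised when the overall deadline passes without a successful download."""


def download_with_retry(fetch, deadline_s, base_delay_s=1.0, max_delay_s=60.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Call fetch() until it succeeds or deadline_s seconds have elapsed.

    fetch is a hypothetical callable that raises OSError on a network
    failure and returns the downloaded payload on success.
    """
    start = clock()
    delay = base_delay_s
    while True:
        try:
            return fetch()
        except OSError as err:
            elapsed = clock() - start
            # Stop if the next wait would take us past the overall deadline.
            if elapsed + delay > deadline_s:
                raise DownloadTimeout(f"gave up after {elapsed:.0f}s") from err
            sleep(delay)
            # Double the wait after each failure, up to max_delay_s.
            delay = min(delay * 2, max_delay_s)
```

The `clock` and `sleep` parameters exist only so the retry loop can be exercised without real waiting; a caller would normally leave them at their defaults.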

Snap task:
Doing yesterday at 15:05 UTC - Download snap "caracalla-kernel" (144) from channel "latest/stable" (11.63%)
Doing yesterday at 15:05 UTC - Download snap "uefi-fw-tools" (22) from channel "latest/stable" (61.64%)

Snap task error:
2020-12-17T15:12:59Z ERROR Get https://api.snapcraft.io/api/v1/snaps/assertions/snap-revision/erhlpB3PgxPjMmFMSTgbhSXpNj7AwsTXS9i3kRVCNNnZIN97L_Ac_wlW_69tok0n?max-format=0: dial tcp: lookup api.snapcraft.io on [::1]:53: read udp [::1]:55637->[::1]:53: read: connection refused

Paweł Stołowski (stolowski) wrote :

Hello Onno,

Thank you for the report. This has been observed before and we hope to have a fix with https://github.com/snapcore/snapd/pull/9580 ; it was merged into the development tree and should become available with snapd 2.49. Snapd does in fact implement various strategies for retrying (and bailing out of) network operations, but in this case it got stuck at a very low level (as you can see from the description of the PR I linked).

Changed in snapd:
importance: Undecided → High
status: New → Confirmed
Onno Steenbergen (osteenbergen) wrote :

Hi Paweł,

Thanks for the quick response. A quick read of that PR gives the impression that it cancels the download operation when the data rate over a 5-minute window is below a threshold of 4KB/s.
These seem like reasonable defaults for us, though other customers might like some configuration options.
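
To make that reading concrete, here is a minimal sketch of a minimum-rate check over a sliding window. It is not taken from the PR; it only illustrates the behavior as described in this comment. The class and method names are hypothetical, and treating 4KB/s as 4096 bytes per second is an assumption.

```python
from collections import deque


class RateWatchdog:
    """Flags a transfer whose average rate over a sliding window is too low."""

    def __init__(self, window_s=300.0, min_rate_bps=4096.0):
        self.window_s = window_s          # 5 minutes, per the comment above
        self.min_rate_bps = min_rate_bps  # assumed meaning of "4KB/s"
        self._samples = deque()           # (timestamp, total_bytes_so_far)

    def record(self, timestamp, total_bytes):
        """Store a progress sample and drop samples that fell out of the window."""
        self._samples.append((timestamp, total_bytes))
        # Keep one sample at or before the window start so the measured
        # span still covers the whole window.
        while (len(self._samples) >= 2
               and self._samples[1][0] <= timestamp - self.window_s):
            self._samples.popleft()

    def too_slow(self):
        """Return True once a full window was observed below the minimum rate."""
        if len(self._samples) < 2:
            return False
        (t0, b0), (t1, b1) = self._samples[0], self._samples[-1]
        span = t1 - t0
        if span < self.window_s:
            return False  # not enough history yet
        return (b1 - b0) / span < self.min_rate_bps
```

A download loop would call `record()` with the running byte count after each chunk (or on a timer while the read is blocked) and cancel the transfer once `too_slow()` returns True.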

Can you tell me what happens if a device hits this timeout? When it's updating 10 snaps and one fails this new threshold, will it abort, roll back and reboot (if needed), or just retry the download?
In case of a rollback, does the device store the successfully downloaded snaps so that the next attempt will only download the failed snap?

Thanks in advance,

Onno

Paweł Stołowski (stolowski) wrote :

With a multi-snap update/install, the failure of a single snap doesn't revert the entire operation, i.e. snaps that were successfully downloaded and installed remain installed.
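
As a rough illustration of that semantics only (snapd's real change/task machinery is not shown, and `refresh_all` and `refresh_one` are hypothetical names), each snap can be treated as an independent unit whose failure is recorded without undoing the others:

```python
def refresh_all(snaps, refresh_one):
    """Refresh each snap independently and report per-snap outcomes.

    refresh_one is a hypothetical callable that raises on failure.
    """
    done, failed = [], {}
    for name in snaps:
        try:
            refresh_one(name)
            done.append(name)
        except Exception as err:
            # A failure is recorded for this snap only; snaps that were
            # already refreshed stay refreshed.
            failed[name] = str(err)
    return done, failed
```

A later attempt would then only need to cover the names left in `failed`.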

Changed in snapd:
status: Confirmed → Fix Committed
no longer affects: snappy