LXD snap auto refresh stuck at copy snap data phase for 16+ hours without making progress

Bug #1943419 reported by Trent Lloyd
This bug affects 5 people
Affects: snapd | Status: Triaged | Importance: High | Assigned to: Unassigned

Bug Description

Automatic "snap refresh" was stuck at the 'copy snap "lxd" data' phase for 16 hours and still had not completed. This happened on one host out of a cluster of 4 and appears likely to be some kind of race and/or system state causing it to get stuck.

After a host reboot, the refresh completed and upgraded to the new version within a few minutes, before we even got back into the host. So it appears it was genuinely stuck in some way (rather than slowed by a large amount of data or similar).

I could not find any evidence that it was still actually copying data: neither lsof nor strace showed any activity in this directory. lsof from the sosreport shows that the only files open in /snap/lxd/20330/ are the binaries and libraries (/snap/lxd/20330/bin/*, /snap/lxd/20330/lib/*) used by the [lxc monitor] processes, which keep running for active containers even when lxd itself is stopped. Nothing was open in the new revision path (/snap/lxd/21032).

There is also no obvious logging in syslog, the snapd logs, etc. that indicates why it was stuck, as best I can tell from a fairly comprehensive review; snapd logs very little about its activity in general. The host was otherwise working fine: no hardware, disk/filesystem, or kernel errors.
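For reference, the kind of check described above (is anything actually holding files open under a revision path?) can be done with /proc alone, even on hosts where lsof is unavailable. This is a generic sketch, not part of the report; the PIDs and paths in the usage comment are examples taken from this bug.

```shell
# Sketch: list files a process holds open under a given path prefix,
# by reading the symlinks in /proc/<pid>/fd. No lsof required.
open_under() {  # usage: open_under <pid> <path-prefix>
  local fd target
  for fd in /proc/"$1"/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue
    case $target in
      "$2"*) echo "$target" ;;
    esac
  done
  return 0
}

# e.g. check whether snapd still has anything open in the new revision:
# open_under "$(pidof snapd)" /snap/lxd/21032
```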

This issue, or something similar, appears to have been reported at least twice before: once with LXD and once with the core snap:
https://bugs.launchpad.net/snappy/+bug/1843589
https://github.com/lxc/lxd/issues/6098

The following debug information was collected and is available, but I do not wish to post it all publicly to the bug since it is from a production environment. Feel free to request specific info or otherwise reach out so I can provide the full data directly.
(1) 'sosreport' from when the faulty snapd change was still running (includes a lot of different outputs such as: syslog, lsof, ps, netstat, dmesg, etc that sosreport collects by default)
(2) strace of snapd for a small period of time
(3) 'gcore' of the snapd process that can be analysed with gdb as if it were a crashdump
(4) some extra info collected by juju-crashdump

Application: Canonical Anbox Cloud (https://anbox-cloud.io/) - LXD cluster node
Ubuntu: 18.04.5
Kernel: 5.4.0-1047-aws #49~18.04.1-Ubuntu
snapd: snapd 12397 on Arm64/aarch64 (Amazon Graviton)

[Log Excerpts]

root@ip-172-31-39-228:~# snap changes
ID Status Spawn Ready Summary
28 Doing yesterday at 20:26 UTC - Auto-refresh snap "lxd"

root@ip-172-31-39-228:~# snap tasks 28
Status Spawn Ready Summary
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Ensure prerequisites for "lxd" are available
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Download snap "lxd" (21032) from channel "4.0/stable"
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Fetch and check assertions for snap "lxd" (21032)
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Mount snap "lxd" (21032)
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Run pre-refresh hook of "lxd" snap if present
Done yesterday at 20:26 UTC yesterday at 20:31 UTC Stop snap "lxd" services
Done yesterday at 20:26 UTC yesterday at 20:31 UTC Remove aliases for snap "lxd"
Done yesterday at 20:26 UTC yesterday at 20:31 UTC Make current revision for snap "lxd" unavailable
Doing yesterday at 20:26 UTC - Copy snap "lxd" data
Do yesterday at 20:26 UTC - Setup snap "lxd" (21032) security profiles
Do yesterday at 20:26 UTC - Make snap "lxd" (21032) available to the system
Do yesterday at 20:26 UTC - Automatically connect eligible plugs and slots of snap "lxd"
Do yesterday at 20:26 UTC - Set automatic aliases for snap "lxd"
Do yesterday at 20:26 UTC - Setup snap "lxd" aliases
Do yesterday at 20:26 UTC - Run post-refresh hook of "lxd" snap if present
Do yesterday at 20:26 UTC - Start snap "lxd" (21032) services
Do yesterday at 20:26 UTC - Clean up "lxd" (21032) install
Do yesterday at 20:26 UTC - Run configure hook of "lxd" snap if present
Do yesterday at 20:26 UTC - Run health check of "lxd" snap
Doing yesterday at 20:26 UTC - Consider re-refresh of "lxd"

[relevant /var/log/syslog]
Jul 20 20:26:43 ip-172-31-39-228 snapd[9010]: storehelpers.go:551: cannot refresh: snap has no updates available: "amazon-ssm-agent", "core", "core18", "snapd"
Jul 20 20:26:43 ip-172-31-39-228 snapd[9010]: store_download.go:482: Download size for https://api.snapcraft.io/api/v1/snaps/download/J60k4JY0HppjwOjW8dZdYc8obXKxujRu_20330_21032_xdelta3.delta: 39426509
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Starting Daily apt download activities...
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Starting Message of the Day...
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Mounting Mount unit for lxd, revision 21032...
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Mounted Mount unit for lxd, revision 21032.
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Closed Socket unix for snap application lxd.daemon.
Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Started Message of the Day.
Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Stopping Service for snap application lxd.daemon...
Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Started Daily apt download activities.
Jul 20 20:26:47 ip-172-31-39-228 lxd.daemon[7346]: => Stop reason is: snap refresh
Jul 20 20:26:47 ip-172-31-39-228 lxd.daemon[7346]: => Stopping LXD
Jul 20 20:31:49 ip-172-31-39-228 lxd.daemon[27985]: => LXD exited cleanly
Jul 20 20:31:50 ip-172-31-39-228 lxd.daemon[7346]: ==> Stopped LXD
Jul 20 20:31:50 ip-172-31-39-228 systemd[1]: Stopped Service for snap application lxd.daemon.
Jul 20 20:31:50 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading.

Tags: sts
Trent Lloyd (lathiat)
tags: added: sts
summary: - LXD snap refresh stuck at copy snap data phase for 16+ hours without
- making progress
+ LXD snap auto refresh stuck at copy snap data phase for 16+ hours
+ without making progress
description: updated
Changed in snappy:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Paweł Stołowski (stolowski) wrote :

I looked at snapd's data copying code but don't see anything obvious. For the snap refresh case it's actually rather simple: it executes "cp -av .." followed by "sync" for the snap's data dirs (/var/snap/lxd/<revision>, /home/root/snap/lxd/<revision>, /home/<user>/snap/lxd/<revision>; in the case of lxd only the first two have content on my system), copying them to the new-revision directories. Any errors would appear either in the system log or as task errors (resulting in an overall Error state for the task and the refresh).

I wonder if there is anything in the /var/snap/lxd/<revision> directory that would confuse "cp". Could you try running "cp -av /var/snap/lxd/<revision> /a/temporary/directory" manually? (It may make sense to stop the lxd services first, leaving only the monitor running, to mimic the real refresh.) How much data do you have there?
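As a rough illustration of the sequence described above (a sketch, not snapd's actual code; the revision paths in the usage comment are the ones from this report):

```shell
# Approximation of the refresh copy step: duplicate the old revision's
# data directory with "cp -av", then "sync" to flush dirty pages to disk.
copy_snap_data() {  # usage: copy_snap_data <old-dir> <new-dir>
  cp -av "$1" "$2" && sync   # -a: recursive, preserves ownership/times
}

# e.g. (revisions from this report):
# copy_snap_data /var/snap/lxd/20330 /var/snap/lxd/21032
```

Note that "sync" flushes *all* dirty data system-wide, not just the copied files, so it can block on unrelated slow I/O.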

Changed in snapd:
status: New → Triaged
importance: Undecided → High
no longer affects: snappy
Revision history for this message
Trent Lloyd (lathiat) wrote :

Pawel: Of note, I have the process list for the system, and "cp" was not running. Does that change the diagnostic thoughts?

Since the systems are now fixed (after a reboot the refresh worked fine) I'm not sure I can really replicate the cp here. But as I know cp definitely wasn't running (nor any other process under snapd, such as sync), maybe we can rule this part out?

Also, this has since happened again in a different environment.

Revision history for this message
AkitoLiu (akitoliu) wrote :

Hi all, here is another correlated phenomenon I observed:

I restarted the snapd service several times, which did not solve the problem. Eventually I found the root cause: the operating system's sync is stuck, and the snap refresh is blocked by that sync operation.

root@dev-physical-0-19:~# systemctl status snapd
× snapd.service - Snap Daemon
     Loaded: loaded (/lib/systemd/system/snapd.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2022-06-01 18:09:47 CST; 36min ago
TriggeredBy: ○ snapd.socket
    Process: 951327 ExecStart=/usr/lib/snapd/snapd (code=exited, status=1/FAILURE)
   Main PID: 951327 (code=exited, status=1/FAILURE)
      Tasks: 7 (limit: 462584)
     Memory: 2.0M
     CGroup: /system.slice/snapd.service
             ├─ 732061 sync
             ├─ 783222 sync
             ├─ 868088 sync
             ├─1102884 sync
             ├─3440815 sync
             ├─3447357 sync
             └─4164420 sync

And sync is an I/O operation; it cannot be killed:

root@dev-physical-0-19:~# kill -9 732061 783222 868088 1102884 3440815 3447357 4164420
root@dev-physical-0-19:~# ps aux | grep sync
systemd+ 107947 0.0 0.0 88004 6736 ? Ssl May10 0:05 /lib/systemd/systemd-timesyncd
guest 153721 0.6 0.0 1343048 22784 ? Sl May31 12:35 ./bin/sync_center
root 732061 0.0 0.0 5752 972 ? D 11:17 0:00 sync
root 767675 0.0 0.0 5752 948 ? D 11:19 0:00 sync
root 783222 0.0 0.0 5752 972 ? D 11:20 0:00 sync
root 849466 0.0 0.0 5752 976 ? D 18:04 0:00 sync
root 868088 0.0 0.0 5752 928 ? D 11:24 0:00 sync
root 1102884 0.0 0.0 5752 932 ? D May31 0:00 sync
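The "D" in the ps STAT column above means uninterruptible sleep: the process is stuck in kernel I/O and ignores all signals, including SIGKILL, until the I/O completes. A quick way to spot such processes using /proc alone (a generic sketch, not from this report; comm names containing spaces will mis-parse, which is fine for a quick check):

```shell
# Scan /proc/<pid>/stat for processes whose state field is "D"
# (uninterruptible sleep). An empty result means none are stuck.
list_dstate() {
  local stat pid comm state
  for stat in /proc/[0-9]*/stat; do
    read -r pid comm state _ < "$stat" 2>/dev/null || continue
    if [ "$state" = "D" ]; then
      echo "uninterruptible: $pid $comm"
    fi
  done
  return 0
}
```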

I manually dropped the caches, but the sync processes still exist:

root@dev-physical-0-19:~# echo 1 > /proc/sys/vm/drop_caches
root@dev-physical-0-19:~# echo 2 > /proc/sys/vm/drop_caches
root@dev-physical-0-19:~# echo 3 > /proc/sys/vm/drop_caches
root@dev-physical-0-19:~#
root@dev-physical-0-19:~# systemctl status snapd
× snapd.service - Snap Daemon
     Loaded: loaded (/lib/systemd/system/snapd.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2022-06-01 18:09:47 CST; 35min ago
TriggeredBy: ○ snapd.socket
    Process: 951327 ExecStart=/usr/lib/snapd/snapd (code=exited, status=1/FAILURE)
   Main PID: 951327 (code=exited, status=1/FAILURE)
      Tasks: 7 (limit: 462584)
     Memory: 2.0M
     CGroup: /system.slice/snapd.service
             ├─ 732061 sync
             ├─ 783222 sync
             ├─ 868088 sync
             ├─1102884 sync
             ├─3440815 sync
             ├─3447357 sync
             └─4164420 sync

Jun 01 18:09:47 dev-physical-0-19 systemd[1]: snapd.service: Failed with result 'exit-code'.
Jun 01 18:09:47 dev-physical-0-19 systemd[1]: snapd.service: Unit process 1102884 (sync) remains running after unit stopped.
Jun 01 18:09:47 dev-physical-0-19 systemd[1]: snapd.service: Unit process 3440815 (sync) remains running after unit stopped.
Jun 01 18:09:47 dev-physical-0-19 systemd[1]: snapd.service: Unit process 3447357 (sync) remains runni...

