LXD snap auto refresh stuck at copy snap data phase for 16+ hours without making progress

Bug #1943419 reported by Trent Lloyd
This bug affects 5 people
Affects: snapd | Status: Triaged | Importance: High | Assigned to: Unassigned

Bug Description

Automatic "snap refresh" was stuck at the 'copy snap "lxd" data' phase for 16 hours and still had not completed. This happened on one host out of a cluster of 4 and appears likely to be some kind of race and/or system state causing it to get stuck.

After a host reboot, the refresh completed and upgraded to the new version within a few minutes, before we even got back into the host. So it appears it was genuinely stuck in some way (rather than slowed by a large amount of data or similar).

I could not find any evidence that it was still actually copying data: neither lsof nor strace showed any activity in this directory. lsof from the sosreport shows that the only files open in /snap/lxd/20330/ are the binaries and libraries (/snap/lxd/20330/bin/*, /snap/lxd/20330/lib/*) used by the [lxc monitor] processes, which keep running for active containers even when lxd itself is stopped. Nothing was open in the new revision path (/snap/lxd/21032).

There is also no obvious logging in syslog, the snapd logs, etc. that indicates why it was stuck, as best I can tell from a fairly comprehensive review; snapd logs very little about its activity in general. The host was otherwise working fine: no hardware, disk/filesystem, or kernel errors.
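For reference, the kind of check described above (is anything actually holding files open under a revision path?) can be done with /proc alone, even on hosts where lsof is unavailable. This is a generic sketch, not part of the report; the PIDs and paths in the usage comment are examples taken from this bug.

```shell
# Sketch: list files a process holds open under a given path prefix,
# by reading the symlinks in /proc/<pid>/fd. No lsof required.
open_under() {  # usage: open_under <pid> <path-prefix>
  local fd target
  for fd in /proc/"$1"/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue
    case $target in
      "$2"*) echo "$target" ;;
    esac
  done
  return 0
}

# e.g. check whether snapd still has anything open in the new revision:
# open_under "$(pidof snapd)" /snap/lxd/21032
```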

This issue, or something similar, appears to have been reported at least twice before: once with LXD and once with the core snap:
https://bugs.launchpad.net/snappy/+bug/1843589
https://github.com/lxc/lxd/issues/6098

The following debug information was collected and is available, but I do not wish to post it all publicly to the bug since it is from a production environment. Feel free to request specific info or otherwise reach out so I can provide the full data directly.
(1) 'sosreport' from when the faulty snapd change was still running (includes a lot of different outputs such as: syslog, lsof, ps, netstat, dmesg, etc that sosreport collects by default)
(2) strace of snapd for a small period of time
(3) 'gcore' of the snapd process that can be analysed with gdb as if it were a crashdump
(4) some extra info collected by juju-crashdump

Application: Canonical Anbox Cloud (https://anbox-cloud.io/) - LXD cluster node
Ubuntu: 18.04.5
Kernel: 5.4.0-1047-aws #49~18.04.1-Ubuntu
snapd: snapd 12397 on Arm64/aarch64 (Amazon Graviton)

[Log Excerpts]

root@ip-172-31-39-228:~# snap changes
ID Status Spawn Ready Summary
28 Doing yesterday at 20:26 UTC - Auto-refresh snap "lxd"

root@ip-172-31-39-228:~# snap tasks 28
Status Spawn Ready Summary
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Ensure prerequisites for "lxd" are available
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Download snap "lxd" (21032) from channel "4.0/stable"
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Fetch and check assertions for snap "lxd" (21032)
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Mount snap "lxd" (21032)
Done yesterday at 20:26 UTC yesterday at 20:26 UTC Run pre-refresh hook of "lxd" snap if present
Done yesterday at 20:26 UTC yesterday at 20:31 UTC Stop snap "lxd" services
Done yesterday at 20:26 UTC yesterday at 20:31 UTC Remove aliases for snap "lxd"
Done yesterday at 20:26 UTC yesterday at 20:31 UTC Make current revision for snap "lxd" unavailable
Doing yesterday at 20:26 UTC - Copy snap "lxd" data
Do yesterday at 20:26 UTC - Setup snap "lxd" (21032) security profiles
Do yesterday at 20:26 UTC - Make snap "lxd" (21032) available to the system
Do yesterday at 20:26 UTC - Automatically connect eligible plugs and slots of snap "lxd"
Do yesterday at 20:26 UTC - Set automatic aliases for snap "lxd"
Do yesterday at 20:26 UTC - Setup snap "lxd" aliases
Do yesterday at 20:26 UTC - Run post-refresh hook of "lxd" snap if present
Do yesterday at 20:26 UTC - Start snap "lxd" (21032) services
Do yesterday at 20:26 UTC - Clean up "lxd" (21032) install
Do yesterday at 20:26 UTC - Run configure hook of "lxd" snap if present
Do yesterday at 20:26 UTC - Run health check of "lxd" snap
Doing yesterday at 20:26 UTC - Consider re-refresh of "lxd"

[relevant /var/log/syslog]
Jul 20 20:26:43 ip-172-31-39-228 snapd[9010]: storehelpers.go:551: cannot refresh: snap has no updates available: "amazon-ssm-agent", "core", "core18", "snapd"
Jul 20 20:26:43 ip-172-31-39-228 snapd[9010]: store_download.go:482: Download size for https://api.snapcraft.io/api/v1/snaps/download/J60k4JY0HppjwOjW8dZdYc8obXKxujRu_20330_21032_xdelta3.delta: 39426509
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Starting Daily apt download activities...
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Starting Message of the Day...
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Mounting Mount unit for lxd, revision 21032...
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Mounted Mount unit for lxd, revision 21032.
Jul 20 20:26:46 ip-172-31-39-228 systemd[1]: Closed Socket unix for snap application lxd.daemon.
Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Started Message of the Day.
Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Stopping Service for snap application lxd.daemon...
Jul 20 20:26:47 ip-172-31-39-228 systemd[1]: Started Daily apt download activities.
Jul 20 20:26:47 ip-172-31-39-228 lxd.daemon[7346]: => Stop reason is: snap refresh
Jul 20 20:26:47 ip-172-31-39-228 lxd.daemon[7346]: => Stopping LXD
Jul 20 20:31:49 ip-172-31-39-228 lxd.daemon[27985]: => LXD exited cleanly
Jul 20 20:31:50 ip-172-31-39-228 lxd.daemon[7346]: ==> Stopped LXD
Jul 20 20:31:50 ip-172-31-39-228 systemd[1]: Stopped Service for snap application lxd.daemon.
Jul 20 20:31:50 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading.
Jul 20 20:31:51 ip-172-31-39-228 systemd[1]: Reloading.

Tags: sts
Trent Lloyd (lathiat)
tags: added: sts
summary: - LXD snap refresh stuck at copy snap data phase for 16+ hours without
- making progress
+ LXD snap auto refresh stuck at copy snap data phase for 16+ hours
+ without making progress
description: updated
Changed in snappy:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Paweł Stołowski (stolowski) wrote :

I looked at snapd's data copying code but don't see anything obvious. For the snap refresh case it's actually rather simple: it executes "cp -av .." followed by "sync" for the snap's data dirs (/var/snap/lxd/<revision>, /home/root/snap/lxd/<revision>, /home/<user>/snap/lxd/<revision>; in the case of lxd only the first two have content on my system), copying them to the new-revision directories. Any errors would appear either in the system log or as task errors (resulting in an overall Error state for the task and the refresh).

I wonder if there is anything in the /var/snap/lxd/<revision> directory that would confuse "cp". Could you try running "cp -av /var/snap/lxd/<revision> /a/temporary/directory" manually? (It may make sense to stop the lxd services first, leaving only the monitor running, to mimic the real refresh.) How much data do you have there?
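As a rough illustration of the sequence described above (a sketch, not snapd's actual code; the revision paths in the usage comment are the ones from this report):

```shell
# Approximation of the refresh copy step: duplicate the old revision's
# data directory with "cp -av", then "sync" to flush dirty pages to disk.
copy_snap_data() {  # usage: copy_snap_data <old-dir> <new-dir>
  cp -av "$1" "$2" && sync   # -a: recursive, preserves ownership/times
}

# e.g. (revisions from this report):
# copy_snap_data /var/snap/lxd/20330 /var/snap/lxd/21032
```

Note that "sync" flushes *all* dirty data system-wide, not just the copied files, so it can block on unrelated slow I/O.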

Changed in snapd:
status: New → Triaged
importance: Undecided → High
no longer affects: snappy
Revision history for this message
Trent Lloyd (lathiat) wrote :

Pawel: Of note, I have the process list for the system, and "cp" was not running. Does that change the diagnostic thoughts?

Since the systems are now fixed (after a reboot the refresh worked fine) I'm not sure I can really replicate the cp here. But as I know cp definitely wasn't running (nor any other process under snapd, such as sync), maybe we can rule this part out?

Also, this has since happened again in a different environment.

Revision history for this message
AkitoLiu (akitoliu) wrote :

Hi all, here is another correlated phenomenon I observed:

I restarted the snapd service several times, which did not solve the problem. Eventually I found the root cause: the operating system's sync is stuck, and the snap refresh is blocked by that sync operation.

root@dev-physical-0-19:~# systemctl status snapd
× snapd.service - Snap Daemon
     Loaded: loaded (/lib/systemd/system/snapd.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2022-06-01 18:09:47 CST; 36min ago
TriggeredBy: ○ snapd.socket
    Process: 951327 ExecStart=/usr/lib/snapd/snapd (code=exited, status=1/FAILURE)
   Main PID: 951327 (code=exited, status=1/FAILURE)
      Tasks: 7 (limit: 462584)
     Memory: 2.0M
     CGroup: /system.slice/snapd.service
             ├─ 732061 sync
             ├─ 783222 sync
             ├─ 868088 sync
             ├─1102884 sync
             ├─3440815 sync
             ├─3447357 sync
             └─4164420 sync

And sync is an I/O operation; it cannot be killed:

root@dev-physical-0-19:~# kill -9 732061 783222 868088 1102884 3440815 3447357 4164420
root@dev-physical-0-19:~# ps aux | grep sync
systemd+ 107947 0.0 0.0 88004 6736 ? Ssl May10 0:05 /lib/systemd/systemd-timesyncd
guest 153721 0.6 0.0 1343048 22784 ? Sl May31 12:35 ./bin/sync_center
root 732061 0.0 0.0 5752 972 ? D 11:17 0:00 sync
root 767675 0.0 0.0 5752 948 ? D 11:19 0:00 sync
root 783222 0.0 0.0 5752 972 ? D 11:20 0:00 sync
root 849466 0.0 0.0 5752 976 ? D 18:04 0:00 sync
root 868088 0.0 0.0 5752 928 ? D 11:24 0:00 sync
root 1102884 0.0 0.0 5752 932 ? D May31 0:00 sync
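The "D" in the ps STAT column above means uninterruptible sleep: the process is stuck in kernel I/O and ignores all signals, including SIGKILL, until the I/O completes. A quick way to spot such processes using /proc alone (a generic sketch, not from this report; comm names containing spaces will mis-parse, which is fine for a quick check):

```shell
# Scan /proc/<pid>/stat for processes whose state field is "D"
# (uninterruptible sleep). An empty result means none are stuck.
list_dstate() {
  local stat pid comm state
  for stat in /proc/[0-9]*/stat; do
    read -r pid comm state _ < "$stat" 2>/dev/null || continue
    if [ "$state" = "D" ]; then
      echo "uninterruptible: $pid $comm"
    fi
  done
  return 0
}
```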

I manually dropped the caches, but the sync processes still exist:

root@dev-physical-0-19:~# echo 1 > /proc/sys/vm/drop_caches
root@dev-physical-0-19:~# echo 2 > /proc/sys/vm/drop_caches
root@dev-physical-0-19:~# echo 3 > /proc/sys/vm/drop_caches
root@dev-physical-0-19:~#
root@dev-physical-0-19:~# systemctl status snapd
× snapd.service - Snap Daemon
     Loaded: loaded (/lib/systemd/system/snapd.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2022-06-01 18:09:47 CST; 35min ago
TriggeredBy: ○ snapd.socket
    Process: 951327 ExecStart=/usr/lib/snapd/snapd (code=exited, status=1/FAILURE)
   Main PID: 951327 (code=exited, status=1/FAILURE)
      Tasks: 7 (limit: 462584)
     Memory: 2.0M
     CGroup: /system.slice/snapd.service
             ├─ 732061 sync
             ├─ 783222 sync
             ├─ 868088 sync
             ├─1102884 sync
             ├─3440815 sync
             ├─3447357 sync
             └─4164420 sync

Jun 01 18:09:47 dev-physical-0-19 systemd[1]: snapd.service: Failed with result 'exit-code'.
Jun 01 18:09:47 dev-physical-0-19 systemd[1]: snapd.service: Unit process 1102884 (sync) remains running after unit stopped.
Jun 01 18:09:47 dev-physical-0-19 systemd[1]: snapd.service: Unit process 3440815 (sync) remains running after unit stopped.
Jun 01 18:09:47 dev-physical-0-19 systemd[1]: snapd.service: Unit process 3447357 (sync) remains runni...

