snap run hangs on system-key mismatch due to reexec and shutdown
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
snapd | Fix Released | Critical | Zygmunt Krynicki |
snapd (Ubuntu) | Fix Released | Critical | Zygmunt Krynicki |
Bug Description
This bug leads to data loss and corrupted filesystems!
Over the past few months, and even years, we've had occasional reports of users seeing their systems stuck for 10 minutes on a "stopping LXD snap" type of message from systemd.
We always assumed it was our fault, so we added a lot of fallback logic and logging to our shutdown path to figure out what was going on. However, in those cases no sign of our shutdown logic was ever reported, which was very odd.
Because this usually happens once no shells or SSH connections are left running, figuring out what was going on took a LONG time.
As mentioned in https:/
The process list looked like:
```
root 1 0.1 0.0 161464 8680 ? Ss 17:30 0:18 /sbin/init
root 262 0.0 0.0 26296 8384 ? S<s 17:30 0:00 /lib/systemd/
root 279 0.0 0.0 14476 3644 ? Ss 17:30 0:02 /lib/systemd/
systemd+ 300 0.0 0.0 11636 5176 ? Ss 17:30 0:00 /lib/systemd/
systemd+ 410 0.0 0.0 10436 4228 ? Ss 17:30 0:00 /lib/systemd/
systemd+ 411 0.0 0.0 83752 2512 ? Ssl 17:30 0:00 /lib/systemd/
root 412 0.0 0.1 1823860 17352 ? Ssl 17:30 0:09 /run/lxd_
root 10716 0.0 0.0 8468 3400 pts/0 Ss 20:28 0:00 \_ bash
root 10891 0.0 0.0 10420 3236 pts/0 R+ 20:34 0:00 \_ ps fauxww
message+ 419 0.0 0.0 6980 4076 ? Ss 17:30 0:03 /usr/bin/
root 7104 0.0 0.0 7680 4448 ? Ss 18:01 0:00 /usr/sbin/haveged --Foreground --verbose=1 -w 1024
root 7777 0.0 0.0 77944 1096 ? Ss 18:02 0:00 /sbin/lvmetad -f
root 7707 0.0 0.0 1872 1272 ? Ss 18:36 0:00 /bin/sh /snap/lxd/
root 7863 0.0 0.3 1631048 47756 ? Sl 18:36 0:03 \_ lxd --logfile /var/snap/
lxd 8189 0.0 0.0 6940 3028 ? Ss 18:36 0:00 \_ dnsmasq --keep-
root 7858 0.0 0.0 85040 1232 ? Sl 18:36 0:00 lxcfs /var/snap/
root 10729 0.4 0.1 1282204 20464 ? Ssl 20:28 0:01 /usr/bin/snap run --command=stop lxd.daemon
```
Note that LXD is still running and that "snap run --command=stop lxd.daemon" has been invoked.
Also note that no "snapd" processes are running.
That "snap run" will hang there for 10 minutes until systemd kills everything, including any running containers. Any unflushed data is lost and, in some cases, the entire LXD partition is corrupted.
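The signature noted above (a pending `snap run --command=stop` with no snapd process alive) can be checked mechanically against a saved process listing. This is a rough Python sketch, assuming the standard 11-column `ps aux` layout and assuming that a running snapd appears as a command whose executable is named `snapd`; the function name `looks_like_stuck_stop` is hypothetical.

```python
import os


def looks_like_stuck_stop(ps_output: str) -> bool:
    """Return True if a snap stop command is pending and no snapd process is listed."""
    commands = []
    for line in ps_output.splitlines():
        # ps aux: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        # Drop the tree-drawing prefix ("\_ ", "|  \_ ") that `ps f` adds.
        cmd = fields[10].lstrip(" \\_|").strip()
        if cmd:
            commands.append(cmd)

    stop_pending = any("snap run --command=stop" in c for c in commands)
    # Assumption: snapd's executable is literally named "snapd".
    snapd_running = any(os.path.basename(c.split()[0]) == "snapd" for c in commands)
    return stop_pending and not snapd_running
```

Fed the listing from this report, the function returns `True`, since the stop command is present and no command resolves to a `snapd` executable.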
Tracing the "snap run" process shows that it is attempting to connect to snapd through /run/snapd.socket.
This is simply impossible and will never succeed as snapd isn't running anymore, so it just hangs there indefinitely.
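The failure mode described above, a client waiting forever on a socket that nothing will answer, can be bounded by putting a deadline on both the connect and the read. The following is a minimal Python sketch of that idea, not snapd's actual fix; the function name `snapd_responds`, the 2-second default, and the use of `GET /v2/system-info` as a probe request are illustrative assumptions.

```python
import socket


def snapd_responds(path="/run/snapd.socket", timeout=2.0):
    """Return True if anything answers on the Unix socket at `path` within `timeout` seconds."""
    request = (
        b"GET /v2/system-info HTTP/1.1\r\n"
        b"Host: localhost\r\n"
        b"Connection: close\r\n\r\n"
    )
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            # The timeout applies to connect, send and recv alike.
            s.settimeout(timeout)
            s.connect(path)
            s.sendall(request)
            # A single byte of reply is enough to know a server is alive.
            return len(s.recv(1)) > 0
    except OSError:
        # Covers a missing socket, a refused connection, and a timeout.
        return False
```

A caller on the shutdown path could use a check like this to give up quickly and fall back to another strategy, rather than blocking until systemd's stop timeout expires.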
Reproducing this is what took us ages: it's very hard to get a system into the right conditions AND still have a shell when it happens.
LXD VMs are the way to retain that shell, thanks to our agent not going away on shutdown until the kernel kills it. This appears to need to be combined with another condition though; zyga on IRC suggests that this code path is hit if the system-key changed and/or the kernel got updated.
In my case, our arm64 test VMs seem to show this behavior every time. This may be due to an issue with the system-key in those VMs, but it does make reproducing and debugging the issue a fair bit easier.
Changed in snapd (Ubuntu):
assignee: nobody → Zygmunt Krynicki (zyga)
summary:
- Daemon snaps not properly stopped in some cases
+ snap run hangs on sysystem-key mismatch due to reexec and shutdown
summary:
- snap run hangs on sysystem-key mismatch due to reexec and shutdown
+ snap run hangs on system-key mismatch due to reexec and shutdown
Changed in snapd:
status: Confirmed → In Progress
Changed in snapd:
milestone: none → 2.43.3
Changed in snapd:
status: In Progress → Fix Committed
Changed in snapd:
milestone: 2.43.3 → 2.44.3
Changed in snapd:
status: Fix Committed → Fix Released
Changed in snapd (Ubuntu):
status: Confirmed → Fix Released
```
root@buildd08:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.4 LTS
Release: 18.04
Codename: bionic
root@buildd08:~# dpkg -l | grep snapd
ii snapd 2.42.1+18.04 arm64 Daemon and tooling that enable snap packages
root@buildd08:~# snap list
Name Version Rev Tracking Publisher Notes
core 16-2.44.1 8937 latest/stable canonical✓ core
lxd 4.0.0 14364 latest/candidate canonical✓ -
root@buildd08:~# snap version
snap 2.44.1
snapd 2.44.1
series 16
ubuntu 18.04
kernel 5.3.0-46-generic
root@buildd08:~#
```