snap run hangs on system-key mismatch due to reexec and shutdown

Bug #1871652 reported by Stéphane Graber on 2020-04-08
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
snapd
Critical
Zygmunt Krynicki
snapd (Ubuntu)
Critical
Zygmunt Krynicki

Bug Description

This bug leads to data loss and corrupted filesystems!

Over the past few months/years, we've had occasional reports that users see their systems stuck on a 10min "stopping LXD snap" type message from systemd.

We always thought it was our fault so added a lot of fallback logic and logging to our shutdown path to figure out what's going on. However in those cases, no sign of the shutdown logic was really reported, making it very odd.

Because that usually happens when no more shells or SSH connections are running, figuring out what was going on took a LONG time.

As mentioned in https://github.com/lxc/lxd-pkg-snap/issues/39, I've finally succeeded yesterday, having a system where this is reproducible AND where I can keep a shell.

The process list looked like:

```
root 1 0.1 0.0 161464 8680 ? Ss 17:30 0:18 /sbin/init
root 262 0.0 0.0 26296 8384 ? S<s 17:30 0:00 /lib/systemd/systemd-journald
root 279 0.0 0.0 14476 3644 ? Ss 17:30 0:02 /lib/systemd/systemd-udevd
systemd+ 300 0.0 0.0 11636 5176 ? Ss 17:30 0:00 /lib/systemd/systemd-networkd
systemd+ 410 0.0 0.0 10436 4228 ? Ss 17:30 0:00 /lib/systemd/systemd-resolved
systemd+ 411 0.0 0.0 83752 2512 ? Ssl 17:30 0:00 /lib/systemd/systemd-timesyncd
root 412 0.0 0.1 1823860 17352 ? Ssl 17:30 0:09 /run/lxd_config/9p/lxd-agent
root 10716 0.0 0.0 8468 3400 pts/0 Ss 20:28 0:00 \_ bash
root 10891 0.0 0.0 10420 3236 pts/0 R+ 20:34 0:00 \_ ps fauxww
message+ 419 0.0 0.0 6980 4076 ? Ss 17:30 0:03 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root 7104 0.0 0.0 7680 4448 ? Ss 18:01 0:00 /usr/sbin/haveged --Foreground --verbose=1 -w 1024
root 7777 0.0 0.0 77944 1096 ? Ss 18:02 0:00 /sbin/lvmetad -f
root 7707 0.0 0.0 1872 1272 ? Ss 18:36 0:00 /bin/sh /snap/lxd/14335/commands/daemon.start
root 7863 0.0 0.3 1631048 47756 ? Sl 18:36 0:03 \_ lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd
lxd 8189 0.0 0.0 6940 3028 ? Ss 18:36 0:00 \_ dnsmasq --keep-in-foreground --strict-order --bind-interfaces --except-interface=lo --no-ping --interface=lxdbr0 --quiet-dhcp --quiet-dhcp6 --quiet-ra --listen-address=10.0.3.1 --dhcp-no-override --dhcp-authoritative --dhcp-leasefile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases --dhcp-hostsfile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts --dhcp-range 10.0.3.2,10.0.3.254,1h --listen-address=2001:470:b368:4242::1 --enable-ra --dhcp-range ::,constructor:lxdbr0,ra-stateless,ra-names -s lxd -S /lxd/ --conf-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw -u lxd
root 7858 0.0 0.0 85040 1232 ? Sl 18:36 0:00 lxcfs /var/snap/lxd/common/var/lib/lxcfs -p /var/snap/lxd/common/lxcfs.pid
root 10729 0.4 0.1 1282204 20464 ? Ssl 20:28 0:01 /usr/bin/snap run --command=stop lxd.daemon
```

Note that LXD is still running and that "snap run --command=stop lxd.daemon" has been invoked.
Also note that no "snapd" processes are running.

That "snap run" will hang there for 10min until systemd kills everything, including any running containers, causing any unflushed data to be lost and in some cases the entire LXD partition to be corrupted.

Tracing the "snap run" process, it's attempting to connect to snapd through /run/snapd.socket.
This is simply impossible and will never succeed as snapd isn't running anymore, so it just hangs there indefinitely.

Now on the reproducing side of things, that's what took us ages, it's very hard to get a system in the right conditions AND have a shell when it happens.

LXD VMs are the way to retain that shell, thanks to our agent not going away on shutdown until the kernel kills it. This appears to need to be combined with another condition though, zyga on IRC suggests that this code path would hit if system-id changed and/or the kernel got updated.

In my case, our arm64 test VMs seem to be showing this behavior every time. This may be due to an issue with system-id in those, but it does make reproducing the issue and debugging it a fair bit easier.

Stéphane Graber (stgraber) wrote :

```
root@buildd08:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.4 LTS
Release: 18.04
Codename: bionic
root@buildd08:~# dpkg -l | grep snapd
ii snapd 2.42.1+18.04 arm64 Daemon and tooling that enable snap packages
root@buildd08:~# snap list
Name Version Rev Tracking Publisher Notes
core 16-2.44.1 8937 latest/stable canonical✓ core
lxd 4.0.0 14364 latest/candidate canonical✓ -
root@buildd08:~# snap version
snap 2.44.1
snapd 2.44.1
series 16
ubuntu 18.04
kernel 5.3.0-46-generic
root@buildd08:~#
```

Stéphane Graber (stgraber) wrote :
Download full text (62.8 KiB)

Reproducer on arm64 at least:

```
root@c2400:~# lxc launch images:ubuntu/bionic test-snapd --vm -c security.secureboot=false -c limits.memory=16GB -c limits.cpu=5-12
Creating test-snapd
Starting test-snapd
root@c2400:~# lxc exec test-snapd bash
root@test-snapd:~# sed "s/ports.ubuntu/us.ports.ubuntu/g" -i /etc/apt/sources.list
root@test-snapd:~# apt update
Get:1 http://us.ports.ubuntu.com/ubuntu-ports bionic InRelease [242 kB]
Get:2 http://us.ports.ubuntu.com/ubuntu-ports bionic-updates InRelease [88.7 kB]
Get:3 http://us.ports.ubuntu.com/ubuntu-ports bionic-security InRelease [88.7 kB]
Get:4 http://us.ports.ubuntu.com/ubuntu-ports bionic/main arm64 Packages [975 kB]
Get:5 http://us.ports.ubuntu.com/ubuntu-ports bionic/main Translation-en [516 kB]
Get:6 http://us.ports.ubuntu.com/ubuntu-ports bionic/restricted arm64 Packages [664 B]
Get:7 http://us.ports.ubuntu.com/ubuntu-ports bionic/restricted Translation-en [3584 B]
Get:8 http://us.ports.ubuntu.com/ubuntu-ports bionic/universe arm64 Packages [8316 kB]
Get:9 http://us.ports.ubuntu.com/ubuntu-ports bionic/universe Translation-en [4941 kB]
Get:10 http://us.ports.ubuntu.com/ubuntu-ports bionic/multiverse arm64 Packages [126 kB]
Get:11 http://us.ports.ubuntu.com/ubuntu-ports bionic/multiverse Translation-en [108 kB]
Get:12 http://us.ports.ubuntu.com/ubuntu-ports bionic-updates/main arm64 Packages [650 kB]
Get:13 http://us.ports.ubuntu.com/ubuntu-ports bionic-updates/main Translation-en [314 kB]
Get:14 http://us.ports.ubuntu.com/ubuntu-ports bionic-updates/restricted arm64 Packages [1064 B]
Get:15 http://us.ports.ubuntu.com/ubuntu-ports bionic-updates/restricted Translation-en [10.9 kB]
Get:16 http://us.ports.ubuntu.com/ubuntu-ports bionic-updates/universe arm64 Packages [949 kB]
Get:17 http://us.ports.ubuntu.com/ubuntu-ports bionic-updates/universe Translation-en [330 kB]
Get:18 http://us.ports.ubuntu.com/ubuntu-ports bionic-updates/multiverse arm64 Packages [3304 B]
Get:19 http://us.ports.ubuntu.com/ubuntu-ports bionic-updates/multiverse Translation-en [4728 B]
Get:20 http://us.ports.ubuntu.com/ubuntu-ports bionic-security/main arm64 Packages [444 kB]
Get:21 http://us.ports.ubuntu.com/ubuntu-ports bionic-security/main Translation-en [220 kB]
Get:22 http://us.ports.ubuntu.com/ubuntu-ports bionic-security/restricted arm64 Packages [668 B]
Get:23 http://us.ports.ubuntu.com/ubuntu-ports bionic-security/restricted Translation-en [8804 B]
Get:24 http://us.ports.ubuntu.com/ubuntu-ports bionic-security/universe arm64 Packages [583 kB]
Get:25 http://us.ports.ubuntu.com/ubuntu-ports bionic-security/universe Translation-en [218 kB]
Get:26 http://us.ports.ubuntu.com/ubuntu-ports bionic-security/multiverse arm64 Packages [1848 B]
Get:27 http://us.ports.ubuntu.com/ubuntu-ports bionic-security/multiverse Translation-en [2764 B]
Fetched 19.1 MB in 7s (2803 kB/s)
Reading package lists... Done
Building dependency tree
Reading state information... Done
4 packages can be upgraded. Run 'apt list --upgradable' to see them.
root@test-snapd:~# apt dist-upgrade
Reading package lists... Do...

Zygmunt Krynicki (zyga) on 2020-04-08
Changed in snapd (Ubuntu):
assignee: nobody → Zygmunt Krynicki (zyga)
Zygmunt Krynicki (zyga) wrote :
Download full text (15.2 KiB)

I reproduced the problem and collected some interesting information:

backtrace from the hanging "snap run ..."

Apr 08 18:45:22 test-snapd lxd.daemon[1254]: SIGABRT: abort
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: PC=0xaaaad3c56de4 m=0 sigcode=0
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: goroutine 0 [idle]:
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: runtime.futex(0xaaaad47faa88, 0x0, 0x0, 0x0, 0xaaaa00000000, 0xaaaad3c1de38, 0xaaaad47fa5e8, 0x0, 0x0, 0xaaaad3c326e8, ...)
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: /usr/lib/go-1.10/src/runtime/sys_linux_arm64.s:321 +0x1c
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: runtime.futexsleep(0xaaaad47faa88, 0xaaaa00000000, 0xffffffffffffffff)
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: /usr/lib/go-1.10/src/runtime/os_linux.go:45 +0x3c
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: runtime.notesleep(0xaaaad47faa88)
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: /usr/lib/go-1.10/src/runtime/lock_futex.go:151 +0x84
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: runtime.stopm()
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: /usr/lib/go-1.10/src/runtime/proc.go:1952 +0xa0
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: runtime.findrunnable(0x442003c000, 0x0)
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: /usr/lib/go-1.10/src/runtime/proc.go:2415 +0x514
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: runtime.schedule()
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: /usr/lib/go-1.10/src/runtime/proc.go:2541 +0xe8
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: runtime.park_m(0x44203c8300)
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: /usr/lib/go-1.10/src/runtime/proc.go:2604 +0x90
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: runtime.mcall(0x0)
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: /usr/lib/go-1.10/src/runtime/asm_arm64.s:169 +0x50
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: goroutine 1 [select]:
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: github.com/snapcore/snapd/client.(*Client).do(0x44201149a0, 0xaaaad4116c66, 0x3, 0xaaaad41220ce, 0xf, 0x0, 0x0, 0x0, 0x0, 0xaaaad43dd140, ...)
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: /build/snapd-C2KffO/snapd-2.44.1/_build/src/github.com/snapcore/snapd/client/client.go:334 +0x1f8
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: github.com/snapcore/snapd/client.(*Client).doSync(0x44201149a0, 0xaaaad4116c66, 0x3, 0xaaaad41220ce, 0xf, 0x0, 0x0, 0x0, 0x0, 0xaaaad438ffa0, ...)
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: /build/snapd-C2KffO/snapd-2.44.1/_build/src/github.com/snapcore/snapd/client/client.go:374 +0x8c
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: github.com/snapcore/snapd/client.(*Client).SysInfo(0x44201149a0, 0x0, 0xaaaad44c1ee0, 0x44203e4430)
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: /build/snapd-C2KffO/snapd-2.44.1/_build/src/github.com/snapcore/snapd/client/client.go:655 +0x80
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: main.maybeWaitForSecurityProfileRegeneration(0x44201149a0, 0x442027b9c8, 0xaaaad3cf2768)
Apr 08 18:45:22 test-snapd lxd.daemon[1254]: /build/snapd-C2KffO/snapd-2.44.1/_build/src/github.com/snapcore/snapd/cmd/snap/cmd_run.go:149 +0xd4
Apr 08 18:45:22...

Zygmunt Krynicki (zyga) wrote :

My hunch is that main.maybeWaitForSecurityProfileRegeneration() is called due to system key mismatch because apparmorfs got unmounted.

Zygmunt Krynicki (zyga) wrote :

Apr 08 19:10:43 test-snapd lxd.daemon[1015]: system key on disk and in memory are different version
Apr 08 19:10:43 test-snapd lxd.daemon[1015]: memory:
Apr 08 19:10:43 test-snapd lxd.daemon[1015]: {"version":10,"build-id":"8129f62e70d6d67812fd8d615a202ce66c7038c4","apparmor-features":["caps","dbus","domain","file","mount","namespaces","network","network_v8","policy","ptrace","query","rlimit","signal"],"apparmor-parser-mtime":1538072454,"apparmor-parser-features":null,"nfs-home":false,"overlay-root":"","seccomp-features":["allow","errno","kill_process","kill_thread","log","trace","trap","user_notif"],"seccomp-compiler-version":"51685c91c8913d94bead557182d74cff597ecfeb 2.4.1 8c73f36d3de1f71977107bf6687514f16787f639058b4db4c67b28dfdb2fd3af bpf-actlog","cgroup-version":"1"}
Apr 08 19:10:43 test-snapd lxd.daemon[1015]: disk:
Apr 08 19:10:43 test-snapd lxd.daemon[1015]: {"version":10,"build-id":"799a88b406b245795da51b18f6224003020c6fb9","apparmor-features":["caps","dbus","domain","file","mount","namespaces","network","network_v8","policy","ptrace","query","rlimit","signal"],"apparmor-parser-mtime":1538072454,"apparmor-parser-features":null,"nfs-home":false,"overlay-root":"","seccomp-features":["allow","errno","kill_process","kill_thread","log","trace","trap","user_notif"],"seccomp-compiler-version":"66988dd2c3fb0abf9b1fb29be212771d7c38ae85 2.4.1 8c73f36d3de1f71977107bf6687514f16787f639058b4db4c67b28dfdb2fd3af bpf-actlog","cgroup-version":"1"}

Zygmunt Krynicki (zyga) wrote :

As we can see build IDs are different, in addition, because seccomp compiler versions are different we are going to wait indefinitely.

I repackaged the core snap and updated /usr/bin/snap with a custom build that has additional diagnostic messages. Due to system key mismatch the shutdown / reboot process hangs.

Disabling reexec is one way to avoid it. I'm looking at options.

Changed in snapd:
status: New → Confirmed
importance: Undecided → Critical
assignee: nobody → Zygmunt Krynicki (zyga)
Changed in snapd (Ubuntu):
status: Triaged → Confirmed
Zygmunt Krynicki (zyga) wrote :

The build ID of snapd from the deb is: 8129f62e70d6d67812fd8d615a202ce66c7038c4
The build ID of snapd from the core snap revision 8937 is: 799a88b406b245795da51b18f6224003020c6fb9

As we can see in the log in comment #5, the disk format was written by snapd after re-exec but the measurement was performed by snap that runs before reexec (the core snap must have been unmounted already).

As such we have a perpetual mismatch. In this state the system cannot start snapd.socket or snapd.service as systemd is shutting down (nor should it).

We need to detect this case (shutdown, core/snapd unmounted) and ignore system key mismatch.

Zygmunt Krynicki (zyga) wrote :

The state of snapd.{socket,service} at the time this happens:

root@test-snapd:~# systemctl status snapd.socket
● snapd.socket - Socket activation for snappy daemon
   Loaded: loaded (/lib/systemd/system/snapd.socket; enabled; vendor preset: enabled)
   Active: active (listening) since Wed 2020-04-08 19:17:36 UTC; 9min ago
   Listen: /run/snapd.socket (Stream)
           /run/snapd-snap.socket (Stream)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/snapd.socket

Apr 08 19:17:36 test-snapd systemd[1]: Starting Socket activation for snappy daemon.
Apr 08 19:17:36 test-snapd systemd[1]: Listening on Socket activation for snappy daemon.
root@test-snapd:~# systemctl status snapd.service
● snapd.service - Snappy daemon
   Loaded: loaded (/lib/systemd/system/snapd.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Wed 2020-04-08 19:26:28 UTC; 52s ago
  Process: 623 ExecStart=/usr/lib/snapd/snapd (code=exited, status=0/SUCCESS)
 Main PID: 623 (code=exited, status=0/SUCCESS)

Apr 08 19:17:36 test-snapd systemd[1]: Starting Snappy daemon...
Apr 08 19:17:36 test-snapd snapd[623]: AppArmor status: apparmor is enabled and all features are available
Apr 08 19:17:36 test-snapd snapd[623]: AppArmor status: apparmor is enabled and all features are available
Apr 08 19:17:37 test-snapd snapd[623]: daemon.go:343: started snapd/2.44.1 (series 16; classic) ubuntu/18.04 (arm64) linux/5.3.0-46-generic.
Apr 08 19:17:37 test-snapd snapd[623]: daemon.go:436: adjusting startup timeout by 40s (pessimistic estimate of 30s plus 5s per snap)
Apr 08 19:17:37 test-snapd systemd[1]: Started Snappy daemon.
Apr 08 19:26:28 test-snapd snapd[623]: main.go:155: Exiting on terminated signal.
Apr 08 19:26:28 test-snapd systemd[1]: Stopping Snappy daemon...
Apr 08 19:26:28 test-snapd systemd[1]: Stopped Snappy daemon.

Zygmunt Krynicki (zyga) on 2020-04-08
summary: - Daemon snaps not properly stopped in some cases
+ snap run hangs on sysystem-key mismatch due to reexec and shutdown
summary: - snap run hangs on sysystem-key mismatch due to reexec and shutdown
+ snap run hangs on system-key mismatch due to reexec and shutdown
Zygmunt Krynicki (zyga) wrote :

I've pushed a tentative fix for the symptom: https://github.com/snapcore/snapd/pull/8462

Zygmunt Krynicki (zyga) on 2020-04-08
Changed in snapd:
status: Confirmed → In Progress
Zygmunt Krynicki (zyga) on 2020-04-09
Changed in snapd:
milestone: none → 2.43.3
Zygmunt Krynicki (zyga) on 2020-04-09
Changed in snapd:
status: In Progress → Fix Committed
Zygmunt Krynicki (zyga) on 2020-04-09
Changed in snapd:
milestone: 2.43.3 → 2.44.3
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers