[daily impish-live-server] snap stuck in the installer system

Bug #1946656 reported by Paride Legovini
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Ubuntu CD Images
New
Undecided
Unassigned
snapd
Fix Released
Undecided
Alberto Mardegan
subiquity
Fix Committed
Undecided
Unassigned

Bug Description

With the latest ISOs, that is with:

$ bsdtar -x -f impish-live-server-amd64.iso -O \./.disk/info
Ubuntu-Server 21.10 "Impish Indri" - Daily amd64 (20211011)

the 'snap' command hangs when called after the subiquity "proxy setup" screen, waiting forever on snapd, which appears to be dead.

The simplest reproducer is:

1. qemu-img create -f qcow2 disk1.qcow2 10G
2. kvm -m 2048 -boot d -cdrom ./impish-live-server-amd64.iso -drive file=disk1.qcow2,if=virtio
3. At the very fist screen do: 'F1 -> Enter shell'.
   Verify that `snap list` returns a list of snaps.
4. Exit the shell and proceed up to the storage config.
   Accept all defaults. No need to config storage.
5. Jump again to a shell and do `snap list`.
   Now the command hangs indefinitely, and
   `systemctl status snapd` will report is as inactive (dead).

I think this can potentially block refreshes of subiquity.

I'm pretty sure images up to 2021-10-08 included were good.

Tags: fr-1794
Paride Legovini (paride)
summary: - [daily impish-live-server] snapd stuck in the installer system
+ [daily impish-live-server] snap stuck in the installer system
Revision history for this message
Brian Murray (brian-murray) wrote :

I was able to confirm this and have attached a screenshot with messages from snapd.

Dan Bungert (dbungert)
tags: added: fr-1794
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

It might be a transient snapstore issue, as I have been told some endpoints have been rate-limited temporarily. Let's check back on this later.

Revision history for this message
Dan Bungert (dbungert) wrote :

I was able to see a dead snapd using the procedure above.

Revision history for this message
Dan Bungert (dbungert) wrote :

output of `systemctl status snapd`

Revision history for this message
Dan Bungert (dbungert) wrote :

output of `journalctl --no-pager -u snapd`

Revision history for this message
Dan Bungert (dbungert) wrote :

output of `sudo snap debug state /var/lib/snapd/state.json`

Revision history for this message
Dan Bungert (dbungert) wrote :
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

So some investigation reveals:

The issue is that snapd is not exiting on receiving the TERM signal. I can reproduce this by launching the installer and running pkill snapd.

After 5 minutes of being stuck like this, the systemd watchdog kills snapd and from that point on restarting it works normally. I'll attach the journal output from this session which includes goroutine tracebacks from when the watchdog timer killed snapd, but I don't find them very illuminating myself.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Staring at snapd source, I think we must have got as far as this line:

https://github.com/snapcore/snapd/blob/master/daemon/daemon.go#L471

because runWatchdog appears to have stopped, but not got as far as this line:

https://github.com/snapcore/snapd/blob/master/daemon/daemon.go#L494

because there is no maintenance.json file. The only think in between that looks like it could hang is the d.mu.Lock() call. But maybe I'm misreading everything completely :)

Alberto Mardegan (mardy)
Changed in snapd:
assignee: nobody → Alberto Mardegan (mardy)
Revision history for this message
Alberto Mardegan (mardy) wrote :

I think the log with the backtraces you attached before is very illuminating, Michael:

...
Oct 11 20:57:55 ubuntu-server snapd[1052]: net/http.(*Server).Shutdown(0xc0000f62a0, {0x5555d540f4d0, 0xc00012a180})
Oct 11 20:57:55 ubuntu-server snapd[1052]: net/http/server.go:2734 +0x1d6
Oct 11 20:57:55 ubuntu-server snapd[1052]: github.com/snapcore/snapd/daemon.(*Daemon).Stop(0xc0000f6540, 0xc00060fe24)
Oct 11 20:57:55 ubuntu-server snapd[1052]: github.com/snapcore/snapd/daemon/daemon.go:519 +0x2aa
...

It clearly shows that we are stuck here:

https://github.com/snapcore/snapd/blob/master/daemon/daemon.go#L519

The questions I have (both to you and to more experienced snapd team members) are:

1) Is snapd correct in determining the shutdown reason as a "system restart"? This is happening here: https://github.com/snapcore/snapd/blob/master/daemon/daemon.go#L479-L482

2) Why is snapd waiting for the clients to disconnect? Can't we just abruptly close the connection and be confident that they'll handle the error properly?

3) That line 519 is just a `time.Sleep(rebootNoticeWait)`, where rebootNoticeWait is 3 seconds only. Something doesn't sum up. Also, we see in the backtrace that the stack frame below time.Sleep() is a call to net/http.(*Server).Shutdown(); I'm relatively inexperience with go; could it be that time.Sleep is actually doing a select(), and that this caused the same thread to handle the http shutdown (which then got stuck)? I would find it normal in a Glib or Qt application where we have an event loop, but afaik golang doesn't have one, which makes me a bit confused on how this could happen.

Revision history for this message
Maciej Borzecki (maciek-borzecki) wrote :

I got a bit more insight, subiquity calls `systemctl stop snapd.service`. For some reason the snapd socket is still in use and whatever connection is there, does not become idle, so graceful shutdown eventually fails to close the http server. Eventually snapd is killed by systemd, since it was in a deactivating state and the snapd.service unit state is 'failed'. At this point snap-failure kicks in, stops the snapd.socket unit, and tries to fix things, but since there was no previous revision of the snapd snap or core, it blindly tries to execute /snap/core/current/usr/lib/snapd/snapd. However, the image only has core20 and snapd snaps, there is no 'core'. Thus snap-failure fails, and snapd remain inaccessible.

Open questions is what keeps accessing the snapd socket, so that graceful shutdown fails.

Revision history for this message
Paride Legovini (paride) wrote (last edit ):

The issue can indeed be reproduced by jumping to a shell from the very first subiquity screen and trying to `systemctl restart snapd`.

Doing a `systemctl stop snap.subiquity.subiquity-service` appears to unblocks snapd: it restarts reliably. Interestingly restarting snap.subiquity.subiquity-service doesn't reproduce the issue: subiquity starts correctly and snapd doesn't get hung anymore.

description: updated
Revision history for this message
Maciej Borzecki (maciek-borzecki) wrote :
Download full text (9.3 KiB)

More in depth look.

root@ubuntu-server:/home/ubuntu-server# ss -x -p -a |grep snapd
u_str LISTEN 0 4096 /run/user/999/snapd-session-agent.socket 21532 * 0 users:(("systemd",pid=1979,fd=31))
u_str LISTEN 0 4096 /run/snapd.socket 16890 * 0 users:(("snapd",pid=1255,fd=5),("systemd",pid=1,fd=67))
u_str LISTEN 0 4096 /run/snapd-snap.socket 16892 * 0 users:(("snapd",pid=1255,fd=9),("systemd",pid=1,fd=69))
u_str ESTAB 0 0 * 17034 * 17038 users:(("snapd",pid=1255,fd=2),("snapd",pid=1255,fd=1))
u_str ESTAB 0 0 /run/snapd.socket 21249 * 21248 users:(("snapd",pid=1255,fd=33))
u_str ESTAB 0 0 /run/snapd.socket 21258 * 21257 users:(("snapd",pid=1255,fd=14))
u_str ESTAB 0 0 /run/snapd.socket 21252 * 21251 users:(("snapd",pid=1255,fd=34))
u_str ESTAB 0 0 /run/snapd.socket 21255 * 21254 users:(("snapd",pid=1255,fd=35))
u_str ESTAB 0 0 /run/snapd.socket 21099 * 21098 users:(("snapd",pid=1255,fd=17))
u_str ESTAB 0 0 /run/snapd.socket 21207 * 21206 users:(("snapd",pid=1255,fd=19))
u_str ESTAB 0 0 /run/snapd.socket 21210 * 21209 users:(("snapd",pid=1255,fd=20))
u_str ESTAB 0 0 /run/snapd.socket 21204 * 21203 users:(("snapd",pid=1255,fd=16))
u_str ESTAB 0 0 /run/snapd.socket 21237 * 21236 users:(("snapd",pid=1255,fd=29))
u_str ESTAB 0 0 /run/snapd.socket 21240 * 21239 users:(("snapd",pid=1255,fd=30)) ...

Read more...

Revision history for this message
Alberto Mardegan (mardy) wrote :

The issue with snapd not honouring the TERM signal when there are pending connections can easily be reproduced locally like this:

1. Download the attached test.py script
2. sudo systemctl stop snapd.{service,socket}
3. Run your local build of snap (or /usr/lib/snapd/snapd)
4. python3 test.py # in another terminal
5. pkill snapd # in yet another terminal

snapd will terminate only when you press ENTER in the terminal where test.py is running.

Revision history for this message
Paride Legovini (paride) wrote :

I'm able to reproduce the issue by following those steps.

On the subiquity side this is being fixed (or at least worked around) by:

https://github.com/canonical/subiquity/pull/1094

but AIUI this is still a valid snapd bug.

Revision history for this message
Paride Legovini (paride) wrote :

Speaking of the subiquity side of this, I can confirm that version 21.10.3 (currently in latest/candidate) doesn't cause snapd to hang on restart.

Changed in subiquity:
status: New → Fix Committed
Revision history for this message
Alberto Mardegan (mardy) wrote :

https://github.com/snapcore/snapd/pull/10911 fixes the issue on the snapd side.

Changed in snapd:
status: New → In Progress
Michael Vogt (mvo)
Changed in snapd:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers