[daily impish-live-server] snap stuck in the installer system

Bug #1946656 reported by Paride Legovini
This bug affects 2 people
Affects           Status         Importance  Assigned to       Milestone
Ubuntu CD Images  New            Undecided   Unassigned
snapd             Fix Released   Undecided   Alberto Mardegan
subiquity         Fix Committed  Undecided   Unassigned

Bug Description

With the latest ISOs, that is with:

$ bsdtar -x -f impish-live-server-amd64.iso -O ./.disk/info
Ubuntu-Server 21.10 "Impish Indri" - Daily amd64 (20211011)

the 'snap' command hangs when called after the subiquity "proxy setup" screen, waiting forever on snapd, which appears to be dead.

The simplest reproducer is:

1. qemu-img create -f qcow2 disk1.qcow2 10G
2. kvm -m 2048 -boot d -cdrom ./impish-live-server-amd64.iso -drive file=disk1.qcow2,if=virtio
3. At the very first screen do: 'F1 -> Enter shell'.
   Verify that `snap list` returns a list of snaps.
4. Exit the shell and proceed up to the storage config.
   Accept all defaults. No need to configure storage.
5. Jump again to a shell and do `snap list`.
   Now the command hangs indefinitely, and
   `systemctl status snapd` will report it as inactive (dead).

I think this can potentially block refreshes of subiquity.

I'm pretty sure images up to and including 2021-10-08 were good.

Tags: fr-1794
Paride Legovini (paride)
summary: - [daily impish-live-server] snapd stuck in the installer system
+ [daily impish-live-server] snap stuck in the installer system
Brian Murray (brian-murray) wrote :

I was able to confirm this and have attached a screenshot with messages from snapd.

Dan Bungert (dbungert)
tags: added: fr-1794
Łukasz Zemczak (sil2100) wrote :

It might be a transient snapstore issue, as I have been told some endpoints have been rate-limited temporarily. Let's check back on this later.

Dan Bungert (dbungert) wrote :

I was able to see a dead snapd using the procedure above.

Dan Bungert (dbungert) wrote :

output of `systemctl status snapd`

Dan Bungert (dbungert) wrote :

output of `journalctl --no-pager -u snapd`

Dan Bungert (dbungert) wrote :

output of `sudo snap debug state /var/lib/snapd/state.json`

Dan Bungert (dbungert) wrote :
Michael Hudson-Doyle (mwhudson) wrote :

So some investigation reveals:

The issue is that snapd is not exiting on receiving the TERM signal. I can reproduce this by launching the installer and running pkill snapd.

After 5 minutes of being stuck like this, the systemd watchdog kills snapd and from that point on restarting it works normally. I'll attach the journal output from this session which includes goroutine tracebacks from when the watchdog timer killed snapd, but I don't find them very illuminating myself.
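
To make that failure mode concrete for readers who don't know the snapd internals, here is a minimal, self-contained Go sketch, not snapd's actual code (the address, handler and timings are made up), of a daemon that reacts to TERM by shutting its HTTP server down gracefully with no deadline. While any request is still in flight, Shutdown() never returns and the process never exits until something external, such as the systemd watchdog mentioned above, kills it:

package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    srv := &http.Server{
        Addr: "127.0.0.1:8080", // made-up address for the sketch
        Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            time.Sleep(time.Hour) // stands in for a client request that never completes
        }),
    }
    go srv.ListenAndServe()

    term := make(chan os.Signal, 1)
    signal.Notify(term, syscall.SIGTERM)
    <-term

    log.Println("TERM received, shutting down gracefully")
    // No deadline on the context: Shutdown blocks until every connection is
    // idle, so with a request in flight this call never returns.
    srv.Shutdown(context.Background())
    log.Println("not reached while a request is still being handled")
}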

Michael Hudson-Doyle (mwhudson) wrote :

Staring at snapd source, I think we must have got as far as this line:

https://github.com/snapcore/snapd/blob/master/daemon/daemon.go#L471

because runWatchdog appears to have stopped, but not got as far as this line:

https://github.com/snapcore/snapd/blob/master/daemon/daemon.go#L494

because there is no maintenance.json file. The only thing in between that looks like it could hang is the d.mu.Lock() call. But maybe I'm misreading everything completely :)
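
For readers less familiar with Go: sync.Mutex.Lock() blocks the calling goroutine until whoever holds the mutex releases it, with no error and no timeout, so a Stop() path waiting on d.mu.Lock() would simply park there. A tiny standalone illustration of that kind of hang (made-up names, not snapd code):

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    var mu sync.Mutex

    // A goroutine takes the lock and then blocks on something slow,
    // standing in for whatever else might be holding d.mu.
    go func() {
        mu.Lock()
        fmt.Println("worker: holding the lock")
        time.Sleep(time.Hour)
    }()

    time.Sleep(100 * time.Millisecond) // let the worker grab the lock first

    fmt.Println("stop: waiting for the lock")
    mu.Lock() // parks here indefinitely, with nothing logged to explain why
    fmt.Println("stop: not reached")
}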

Alberto Mardegan (mardy)
Changed in snapd:
assignee: nobody → Alberto Mardegan (mardy)
Alberto Mardegan (mardy) wrote :

I think the log with the backtraces you attached before is very illuminating, Michael:

...
Oct 11 20:57:55 ubuntu-server snapd[1052]: net/http.(*Server).Shutdown(0xc0000f62a0, {0x5555d540f4d0, 0xc00012a180})
Oct 11 20:57:55 ubuntu-server snapd[1052]: net/http/server.go:2734 +0x1d6
Oct 11 20:57:55 ubuntu-server snapd[1052]: github.com/snapcore/snapd/daemon.(*Daemon).Stop(0xc0000f6540, 0xc00060fe24)
Oct 11 20:57:55 ubuntu-server snapd[1052]: github.com/snapcore/snapd/daemon/daemon.go:519 +0x2aa
...

It clearly shows that we are stuck here:

https://github.com/snapcore/snapd/blob/master/daemon/daemon.go#L519

The questions I have (both to you and to more experienced snapd team members) are:

1) Is snapd correct in determining the shutdown reason as a "system restart"? This is happening here: https://github.com/snapcore/snapd/blob/master/daemon/daemon.go#L479-L482

2) Why is snapd waiting for the clients to disconnect? Can't we just abruptly close the connection and be confident that they'll handle the error properly?

3) That line 519 is just a `time.Sleep(rebootNoticeWait)`, where rebootNoticeWait is only 3 seconds. Something doesn't add up. Also, we see in the backtrace that the stack frame below time.Sleep() is a call to net/http.(*Server).Shutdown(); I'm relatively inexperienced with Go: could it be that time.Sleep is actually doing a select(), and that this caused the same thread to handle the http shutdown (which then got stuck)? I would find that normal in a GLib or Qt application where we have an event loop, but AFAIK golang doesn't have one, which leaves me a bit confused about how this could happen.
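
As a side note on reading those journal tracebacks: they are per-goroutine stack dumps, much like what runtime.Stack produces, so the frames listed under one goroutine are just that goroutine's own call chain (in the quoted snippet, Daemon.Stop calling net/http Shutdown), not callbacks dispatched by an event loop. A tiny standalone example that prints such a dump:

package main

import (
    "fmt"
    "runtime"
    "time"
)

func sleepy() {
    time.Sleep(time.Hour) // park this goroutine so it shows up blocked in the dump
}

func main() {
    go sleepy()
    time.Sleep(100 * time.Millisecond)

    // Dump the stacks of all goroutines; each one shows only its own call chain.
    buf := make([]byte, 1<<16)
    n := runtime.Stack(buf, true)
    fmt.Printf("%s\n", buf[:n])
}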

Maciej Borzecki (maciek-borzecki) wrote :

I got a bit more insight: subiquity calls `systemctl stop snapd.service`. For some reason the snapd socket is still in use and whatever connection is there does not become idle, so the graceful shutdown fails to close the HTTP server. Eventually snapd is killed by systemd, since it was stuck in the deactivating state, and the snapd.service unit ends up 'failed'. At this point snap-failure kicks in, stops the snapd.socket unit, and tries to fix things, but since there was no previous revision of the snapd snap or of core, it blindly tries to execute /snap/core/current/usr/lib/snapd/snapd. However, the image only ships the core20 and snapd snaps; there is no 'core'. Thus snap-failure fails too, and snapd remains inaccessible.

The open question is what keeps accessing the snapd socket, so that the graceful shutdown fails.
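
One common way to keep a single stuck client from wedging the whole daemon, sketched here purely as an illustration and not as a description of what snapd or the eventual fix actually does, is to give the graceful shutdown a deadline and then close whatever connections are left:

package main

import (
    "context"
    "log"
    "net/http"
    "time"
)

// stopServer gives in-flight requests a short grace period and then closes
// the remaining connections, so one non-idle client cannot block the
// shutdown forever.
func stopServer(srv *http.Server) {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    if err := srv.Shutdown(ctx); err != nil {
        // Shutdown returns the context's error if connections are still
        // active when the deadline expires; close them forcibly instead.
        log.Printf("graceful shutdown timed out (%v), closing connections", err)
        srv.Close()
    }
}

func main() {
    srv := &http.Server{Addr: "127.0.0.1:8080"} // made-up address for the sketch
    go srv.ListenAndServe()
    time.Sleep(time.Second) // give the toy listener a moment to start
    stopServer(srv)
}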

Paride Legovini (paride) wrote (last edit ):

The issue can indeed be reproduced by jumping to a shell from the very first subiquity screen and trying to `systemctl restart snapd`.

Doing a `systemctl stop snap.subiquity.subiquity-service` appears to unblock snapd: it then restarts reliably. Interestingly, restarting snap.subiquity.subiquity-service doesn't reproduce the issue: subiquity starts correctly and snapd no longer hangs.

description: updated
Maciej Borzecki (maciek-borzecki) wrote :

A more in-depth look:

root@ubuntu-server:/home/ubuntu-server# ss -x -p -a |grep snapd
u_str LISTEN 0 4096 /run/user/999/snapd-session-agent.socket 21532 * 0 users:(("systemd",pid=1979,fd=31))
u_str LISTEN 0 4096 /run/snapd.socket 16890 * 0 users:(("snapd",pid=1255,fd=5),("systemd",pid=1,fd=67))
u_str LISTEN 0 4096 /run/snapd-snap.socket 16892 * 0 users:(("snapd",pid=1255,fd=9),("systemd",pid=1,fd=69))
u_str ESTAB 0 0 * 17034 * 17038 users:(("snapd",pid=1255,fd=2),("snapd",pid=1255,fd=1))
u_str ESTAB 0 0 /run/snapd.socket 21249 * 21248 users:(("snapd",pid=1255,fd=33))
u_str ESTAB 0 0 /run/snapd.socket 21258 * 21257 users:(("snapd",pid=1255,fd=14))
u_str ESTAB 0 0 /run/snapd.socket 21252 * 21251 users:(("snapd",pid=1255,fd=34))
u_str ESTAB 0 0 /run/snapd.socket 21255 * 21254 users:(("snapd",pid=1255,fd=35))
u_str ESTAB 0 0 /run/snapd.socket 21099 * 21098 users:(("snapd",pid=1255,fd=17))
u_str ESTAB 0 0 /run/snapd.socket 21207 * 21206 users:(("snapd",pid=1255,fd=19))
u_str ESTAB 0 0 /run/snapd.socket 21210 * 21209 users:(("snapd",pid=1255,fd=20))
u_str ESTAB 0 0 /run/snapd.socket 21204 * 21203 users:(("snapd",pid=1255,fd=16))
u_str ESTAB 0 0 /run/snapd.socket 21237 * 21236 users:(("snapd",pid=1255,fd=29))
u_str ESTAB 0 0 /run/snapd.socket 21240 * 21239 users:(("snapd",pid=1255,fd=30)) ...


Alberto Mardegan (mardy) wrote :

The issue with snapd not honouring the TERM signal when there are pending connections can easily be reproduced locally like this:

1. Download the attached test.py script
2. sudo systemctl stop snapd.{service,socket}
3. Run your local build of snapd (or /usr/lib/snapd/snapd)
4. python3 test.py # in another terminal
5. pkill snapd # in yet another terminal

snapd will terminate only when you press ENTER in the terminal where test.py is running.
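
The attached test.py is not reproduced in this excerpt, but the idea as described is simply to keep a non-idle connection to the snapd socket open until ENTER is pressed. A rough Go equivalent of that idea (the endpoint and the half-sent request are made up for illustration; whether a given request keeps the connection non-idle depends on how the server reads it) would be:

package main

import (
    "bufio"
    "fmt"
    "net"
    "os"
)

func main() {
    conn, err := net.Dial("unix", "/run/snapd.socket")
    if err != nil {
        fmt.Fprintln(os.Stderr, "connect failed:", err)
        os.Exit(1)
    }
    defer conn.Close()

    // Start a request but never finish sending its body, so the server never
    // sees this connection return to idle and a graceful shutdown waits on it.
    fmt.Fprint(conn, "POST /v2/snaps/hello HTTP/1.1\r\n"+
        "Host: localhost\r\nContent-Type: application/json\r\n"+
        "Content-Length: 100\r\n\r\n{")

    fmt.Println("holding the connection open; press ENTER to drop it")
    bufio.NewReader(os.Stdin).ReadString('\n')
}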

Paride Legovini (paride) wrote :

I'm able to reproduce the issue by following those steps.

On the subiquity side this is being fixed (or at least worked around) by:

https://github.com/canonical/subiquity/pull/1094

but AIUI this is still a valid snapd bug.

Paride Legovini (paride) wrote :

Speaking of the subiquity side of this, I can confirm that version 21.10.3 (currently in latest/candidate) doesn't cause snapd to hang on restart.

Changed in subiquity:
status: New → Fix Committed
Alberto Mardegan (mardy) wrote :

https://github.com/snapcore/snapd/pull/10911 fixes the issue on the snapd side.

Changed in snapd:
status: New → In Progress
Michael Vogt (mvo)
Changed in snapd:
status: In Progress → Fix Released