snapd eats 100% CPU for about 5 minutes on first boot causing a load of >2.0

Bug #1638537 reported by Oliver Grawert
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
snapd
Triaged
High
Unassigned
snapd (Ubuntu)
Triaged
High
Unassigned

Bug Description

during first boot it takes very long for the "please press eneter" message to come up on screen.
booting a pi image with systemd.debug-shell added to /boot/cmdline.txt one can watch top and syslog during that time. obviously snapd is doing the initial snap setup at this point and eats 100% CPU and causes extreme I/O for about 5 minutes where the board nearly does not react at all.

http://paste.ubuntu.com/23415807/ has the output of
grep -E '(snapd|mount)' /var/log/syslog

Revision history for this message
Oliver Grawert (ogra) wrote :

seems the last lines are most interesting here:

Nov 2 11:52:26 localhost /usr/lib/snapd/snapd[1128]: taskrunner.go:353: DEBUG: Running task 21 on Do: Run configure hook of "pi3" snap if present
Nov 2 11:56:19 localhost /usr/lib/snapd/snapd[1128]: taskrunner.go:353: DEBUG: Running task 24 on Do: Request device serial

there are 4minutes between these two steps ...

also:

trying to watch snap changes takes very long to return anything at that point, eventually it shows "initialize system" ... then on a second run "initialize device", both take minutes to return.

Changed in snappy:
importance: Undecided → High
Changed in snapd (Ubuntu):
importance: Undecided → High
summary: snapd eats 100% CPU for about 5 minutes on first boot causing a load of
- >2.0c
+ >2.0
Revision history for this message
Simon Fels (morphis) wrote :

I see this as well on various arm based devices. Any more ideas what could be the reason or a possible fix?

Revision history for this message
Kyle Fazzari (kyrofa) wrote :

From the moment I entered my SSO email and hit "go" to the moment it returned telling me how to SSH in was about 7 minutes on the dragonboard.

Revision history for this message
Selene ToyKeeper (toykeeper) wrote :

I'm not sure if it's the same issue, but I tend to get a delay of ~8 minutes on a dragonboard on first boot. I'm booting an image with a system-user assertion, not using the console at all.

On first boot, the system creates a new user, I confirm that ssh works, then log in and try to run "snap version" to detect the version of snapd... and sometimes it works, while sometimes it hangs for ~8 minutes until snapd starts responding.

I haven't been able to investigate very deeply since a default image doesn't include many debugging tools, but this seems to be the last entry in syslog before it hangs:

Mar 1 05:55:33 localhost /usr/lib/snapd/snapd[1371]: taskrunner.go:353: DEBUG: Running task 29 on Do: Generate device key

The 100% CPU load isn't a problem for me, but it's an issue when snapd doesn't respond to new requests. Could it be made to spin off tasks into separate threads so the service won't block?

Revision history for this message
John Lenton (chipaca) wrote :

Until the device key is generated there is nothing it _can_ do (in fact it might not even be listening to requests yet; not sure offhand). That said, on the pi2 it takes a little over a minute to do the keygen (and that's too long, but not much we can do about it without better hardware support for vector multiplication), so you should be seeing times significantly shorter than that.

Would you be able to build a profiling snapd and do a first boot with that?

Revision history for this message
Seth Arnold (seth-arnold) wrote :

What kind of device key is being constructed? For what purpose? Do the machines have suitable randomness at boot? Are there hardware sources of entropy that aren't feeding into the kernel's randomness pool? Where's the slowdown?

Thanks

Revision history for this message
Selene ToyKeeper (toykeeper) wrote :
Download full text (5.9 KiB)

A default image on firstboot doesn't have a lot of tools available, and I'm not sure how to modify that at image-build time... so I have only very limited information about what exactly it's doing. It might not be key generation after all though...

I don't recall seeing it for a while, but today I caught it hanging for 10 minutes on rpi3.

Mar 22 16:08:14 localhost /usr/lib/snapd/snapd[1179]: taskrunner.go:353: DEBUG: Running task 26 on Do: Make snap "pi3" (6) available to the system
Mar 22 16:08:14 localhost /usr/lib/snapd/snapd[1179]: taskrunner.go:353: DEBUG: Running task 27 on Do: Setup snap "pi3" (6) security profiles (phase 2)
Mar 22 16:08:14 localhost /usr/lib/snapd/snapd[1179]: taskrunner.go:353: DEBUG: Running task 32 on Do: Generate device key
Mar 22 16:08:14 localhost /usr/lib/snapd/snapd[1179]: taskrunner.go:353: DEBUG: Running task 28 on Do: Set automatic aliases for snap "pi3"
--- blocked here ---
Mar 22 16:18:18 localhost /usr/lib/snapd/snapd[1179]: daemon.go:176: DEBUG: uid=1000;@ GET /v2/system-info 9m49.567409367s 200
Mar 22 16:18:18 localhost /usr/lib/snapd/snapd[1179]: taskrunner.go:353: DEBUG: Running task 29 on Do: Setup snap "pi3" aliases
Mar 22 16:18:18 localhost /usr/lib/snapd/snapd[1179]: taskrunner.go:353: DEBUG: Running task 33 on Do: Request device serial

At 16:08:28 a script logged in via ssh and tried to run "snap version", which blocked for 10 minutes. But in this case it looks like maybe the key generation was already complete.

Anyway, here's the rest of syslog from the time between snapd entries:

Mar 22 16:08:24 localhost rsyslogd-2007: action 'action 11' suspended, next retry is Wed Mar 22 16:08:54 2017 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
Mar 22 16:08:25 localhost systemd[1]: Created slice User Slice of qa.
Mar 22 16:08:25 localhost systemd[1]: Starting User Manager for UID 1000...
Mar 22 16:08:25 localhost systemd[1]: Started Session 1 of user qa.
Mar 22 16:08:25 localhost systemd[1721]: Reached target Sockets.
Mar 22 16:08:25 localhost systemd[1721]: Reached target Paths.
Mar 22 16:08:25 localhost systemd[1721]: Reached target Timers.
Mar 22 16:08:25 localhost systemd[1721]: Reached target Basic System.
Mar 22 16:08:25 localhost systemd[1721]: Reached target Default.
Mar 22 16:08:25 localhost systemd[1721]: Startup finished in 691ms.
Mar 22 16:08:25 localhost systemd[1]: Started User Manager for UID 1000.
Mar 22 16:08:26 localhost systemd[1]: Stopping User Manager for UID 1000...
Mar 22 16:08:26 localhost systemd[1721]: Reached target Shutdown.
Mar 22 16:08:26 localhost systemd[1721]: Starting Exit the Session...
Mar 22 16:08:26 localhost systemd[1721]: Stopped target Default.
Mar 22 16:08:26 localhost systemd[1721]: Stopped target Basic System.
Mar 22 16:08:26 localhost systemd[1721]: Stopped target Paths.
Mar 22 16:08:26 localhost systemd[1721]: Stopped target Sockets.
Mar 22 16:08:26 localhost systemd[1721]: Stopped target Timers.
Mar 22 16:08:26 localhost systemd[1721]: Received SIGRTMIN+24 from PID 1737 (kill).
Mar 22 16:08:26 localhost systemd[1]: Stopped User Manager for UID 1000.
Mar 22 16:08:26 localhost systemd[1]: Removed slice User Slice of qa.
Mar 22 16:08:26 localhost systemd[...

Read more...

Revision history for this message
Oliver Grawert (ogra) wrote :

we have been doing various test runs ... adding rng_core.default_quality=700 to teh commandline on pi and dragonboard gives a much greater entropy on boot:

$ cat /proc/sys/kernel/random/entropy_avail
958

but the user.creation step in console-conf still hangs for 3min doing the device key generation. this is a 4096bit rsa key ...

for comparison trying to run something like:
ssh-keygen -N "" -b 4096 -t rsa -f foo.key

does finish in abouit 30sec

so the suspicion is that rsa.GenerateKey() from go has some issues with arm here ...
the respective code in snapd is at:

https://github.com/snapcore/snapd/blob/master/overlord/devicestate/handlers.go#L88

Revision history for this message
Oliver Grawert (ogra) wrote :

one obvious workaround would indeed be to make snapd call out to ssh-keygen to have it handle the key generation ... if the security team does not see any drawbacks with this

Revision history for this message
Seth Arnold (seth-arnold) wrote :

How much time had elapsed between the user.creation step and the manual run of ssh-keygen? Perhaps the ssh-keygen just ran quickly because the kernel had already accumulated enough entropy.

Do these keys need to be 4096 bit RSA keys? curve25519 ecc keys are vastly smaller, faster to generate, and should provide security roughly on par with 2048 bit RSA keys. (There are longer curves in case RSA 4096 security level is important for this use; see Ed448-goldilocks, m-511, e-521, for examples.)

Thanks

Revision history for this message
Oliver Grawert (ogra) wrote :

@seth: this is with an entropy of above 900 using the rng_core.default_quality=700 kernel boot option. even if i let the board sit for a while before hitting enter to run console-conf the builtin key generation never goes below 3min while the ssh-keygen run is always fast (it goes down to 15-20sec if there is more entropy or if i run it multiple times in succession (that latter one might indeed be more I/O than entropy related))

the key type might be based on store requirements, not sure why else we picked 4096/rsa here.

Revision history for this message
Oliver Grawert (ogra) wrote :

using https://github.com/tmc/genkey/blob/master/genkey.go i see similar slowness ...

right after boot a key generation (with 4096 bits like used in snapd) takes between 3-5 min.

doing a 25 run loop like in http://paste.ubuntu.com/24260311/ gets me runs between 12sec and 7min ... with an average around 2-3min.

using the rng_core.default_quality=700 boot option seems to help the run right after boot a little, but it never seems to run as fast as ssh-keygen can where the slowest i saw was around 50sec. so it seems to clearly be the rsa.GenerateKey() function of go that does cause the slowness on arm.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

ssh-keygen is about 5.5 times faster than that genkey.go program on my laptop too (it's just that the times are much smaller, obviously). AIUI, generating an RSA key ends up benchmarking montgomery multiplication, and I know there are SIMD tricks you can use for that. Go doesn't seem to be using anything in this area but it looks like ssh-keygen is using openssh's routines for this and I bet they are optimized nearly as much as is possible. So, shelling out to ssh-keygen to make a key is probably going to be much faster until someone goes to the effort of improving Go's implementation...

Revision history for this message
Oliver Grawert (ogra) wrote :

in case anyone wants to test:

http://people.canonical.com/~ogra/snappy/all-snaps/ssh-keygen-test/

has images for pi2, pi3 and beaglebone black with
https://github.com/snapcore/snapd/pull/3130 included in snapd

Revision history for this message
Oliver Grawert (ogra) wrote :

http://paste.ubuntu.com/24307672/ has the log output of a test of the above image on a pi3 ...
(with manually added rng_core.default_quality=700 to cmdline.txt in system-boot before the first boot)

Revision history for this message
Oliver Grawert (ogra) wrote :

to gather some data i did 16 fresh installs in a row today (freshly flashed SD for each) and captured the log between the start of the key generation ("Generate device key") and the finalization ("System seeded") ... http://paste.ubuntu.com/24312862/ has the captured raw data ...

in average the key generation is taking between 10 and 20seconds in these tests.

now the slowest bit of a first boot is the wait for the "press enter to configure" message to come up (30sec up to 1min) this should be researched in a new bug, but i suspect this is caused by copying the snaps in place before we allow interaction with the system, so perhaps unavoidable.

Michael Vogt (mvo)
affects: snappy → snapd
Changed in snapd (Ubuntu):
status: New → Triaged
Changed in snapd:
status: New → Triaged
Revision history for this message
Oliver Grawert (ogra) wrote :

with https://github.com/snapcore/snapd/pull/3130 landed, should this bug be fix-released or is there desire to actually fix the Go implementation that snapd uses at some point ? (in which case we probably should let it open)

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.