snapd eats 100% CPU for about 5 minutes on first boot causing a load of >2.0

Bug #1638537 reported by Oliver Grawert on 2016-11-02
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Snappy
High
Unassigned
snapd (Ubuntu)
High
Unassigned

Bug Description

during first boot it takes very long for the "please press eneter" message to come up on screen.
booting a pi image with systemd.debug-shell added to /boot/cmdline.txt one can watch top and syslog during that time. obviously snapd is doing the initial snap setup at this point and eats 100% CPU and causes extreme I/O for about 5 minutes where the board nearly does not react at all.

http://paste.ubuntu.com/23415807/ has the output of
grep -E '(snapd|mount)' /var/log/syslog

Oliver Grawert (ogra) wrote :

seems the last lines are most interesting here:

Nov 2 11:52:26 localhost /usr/lib/snapd/snapd[1128]: taskrunner.go:353: DEBUG: Running task 21 on Do: Run configure hook of "pi3" snap if present
Nov 2 11:56:19 localhost /usr/lib/snapd/snapd[1128]: taskrunner.go:353: DEBUG: Running task 24 on Do: Request device serial

there are 4minutes between these two steps ...

also:

trying to watch snap changes takes very long to return anything at that point, eventually it shows "initialize system" ... then on a second run "initialize device", both take minutes to return.

Changed in snappy:
importance: Undecided → High
Changed in snapd (Ubuntu):
importance: Undecided → High
summary: snapd eats 100% CPU for about 5 minutes on first boot causing a load of
- >2.0c
+ >2.0
Simon Fels (morphis) wrote :

I see this as well on various arm based devices. Any more ideas what could be the reason or a possible fix?

Kyle Fazzari (kyrofa) wrote :

From the moment I entered my SSO email and hit "go" to the moment it returned telling me how to SSH in was about 7 minutes on the dragonboard.

Selene Scriven (toykeeper) wrote :

I'm not sure if it's the same issue, but I tend to get a delay of ~8 minutes on a dragonboard on first boot. I'm booting an image with a system-user assertion, not using the console at all.

On first boot, the system creates a new user, I confirm that ssh works, then log in and try to run "snap version" to detect the version of snapd... and sometimes it works, while sometimes it hangs for ~8 minutes until snapd starts responding.

I haven't been able to investigate very deeply since a default image doesn't include many debugging tools, but this seems to be the last entry in syslog before it hangs:

Mar 1 05:55:33 localhost /usr/lib/snapd/snapd[1371]: taskrunner.go:353: DEBUG: Running task 29 on Do: Generate device key

The 100% CPU load isn't a problem for me, but it's an issue when snapd doesn't respond to new requests. Could it be made to spin off tasks into separate threads so the service won't block?

John Lenton (chipaca) wrote :

Until the device key is generated there is nothing it _can_ do (in fact it might not even be listening to requests yet; not sure offhand). That said, on the pi2 it takes a little over a minute to do the keygen (and that's too long, but not much we can do about it without better hardware support for vector multiplication), so you should be seeing times significantly shorter than that.

Would you be able to build a profiling snapd and do a first boot with that?

Seth Arnold (seth-arnold) wrote :

What kind of device key is being constructed? For what purpose? Do the machines have suitable randomness at boot? Are there hardware sources of entropy that aren't feeding into the kernel's randomness pool? Where's the slowdown?

Thanks

Selene Scriven (toykeeper) wrote :
Download full text (5.9 KiB)

A default image on firstboot doesn't have a lot of tools available, and I'm not sure how to modify that at image-build time... so I have only very limited information about what exactly it's doing. It might not be key generation after all though...

I don't recall seeing it for a while, but today I caught it hanging for 10 minutes on rpi3.

Mar 22 16:08:14 localhost /usr/lib/snapd/snapd[1179]: taskrunner.go:353: DEBUG: Running task 26 on Do: Make snap "pi3" (6) available to the system
Mar 22 16:08:14 localhost /usr/lib/snapd/snapd[1179]: taskrunner.go:353: DEBUG: Running task 27 on Do: Setup snap "pi3" (6) security profiles (phase 2)
Mar 22 16:08:14 localhost /usr/lib/snapd/snapd[1179]: taskrunner.go:353: DEBUG: Running task 32 on Do: Generate device key
Mar 22 16:08:14 localhost /usr/lib/snapd/snapd[1179]: taskrunner.go:353: DEBUG: Running task 28 on Do: Set automatic aliases for snap "pi3"
--- blocked here ---
Mar 22 16:18:18 localhost /usr/lib/snapd/snapd[1179]: daemon.go:176: DEBUG: uid=1000;@ GET /v2/system-info 9m49.567409367s 200
Mar 22 16:18:18 localhost /usr/lib/snapd/snapd[1179]: taskrunner.go:353: DEBUG: Running task 29 on Do: Setup snap "pi3" aliases
Mar 22 16:18:18 localhost /usr/lib/snapd/snapd[1179]: taskrunner.go:353: DEBUG: Running task 33 on Do: Request device serial

At 16:08:28 a script logged in via ssh and tried to run "snap version", which blocked for 10 minutes. But in this case it looks like maybe the key generation was already complete.

Anyway, here's the rest of syslog from the time between snapd entries:

Mar 22 16:08:24 localhost rsyslogd-2007: action 'action 11' suspended, next retry is Wed Mar 22 16:08:54 2017 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
Mar 22 16:08:25 localhost systemd[1]: Created slice User Slice of qa.
Mar 22 16:08:25 localhost systemd[1]: Starting User Manager for UID 1000...
Mar 22 16:08:25 localhost systemd[1]: Started Session 1 of user qa.
Mar 22 16:08:25 localhost systemd[1721]: Reached target Sockets.
Mar 22 16:08:25 localhost systemd[1721]: Reached target Paths.
Mar 22 16:08:25 localhost systemd[1721]: Reached target Timers.
Mar 22 16:08:25 localhost systemd[1721]: Reached target Basic System.
Mar 22 16:08:25 localhost systemd[1721]: Reached target Default.
Mar 22 16:08:25 localhost systemd[1721]: Startup finished in 691ms.
Mar 22 16:08:25 localhost systemd[1]: Started User Manager for UID 1000.
Mar 22 16:08:26 localhost systemd[1]: Stopping User Manager for UID 1000...
Mar 22 16:08:26 localhost systemd[1721]: Reached target Shutdown.
Mar 22 16:08:26 localhost systemd[1721]: Starting Exit the Session...
Mar 22 16:08:26 localhost systemd[1721]: Stopped target Default.
Mar 22 16:08:26 localhost systemd[1721]: Stopped target Basic System.
Mar 22 16:08:26 localhost systemd[1721]: Stopped target Paths.
Mar 22 16:08:26 localhost systemd[1721]: Stopped target Sockets.
Mar 22 16:08:26 localhost systemd[1721]: Stopped target Timers.
Mar 22 16:08:26 localhost systemd[1721]: Received SIGRTMIN+24 from PID 1737 (kill).
Mar 22 16:08:26 localhost systemd[1]: Stopped User Manager for UID 1000.
Mar 22 16:08:26 localhost systemd[1]: Removed slice User Slice of qa.
Mar 22 16:08:26 localhost systemd[...

Read more...

Oliver Grawert (ogra) wrote :

we have been doing various test runs ... adding rng_core.default_quality=700 to teh commandline on pi and dragonboard gives a much greater entropy on boot:

$ cat /proc/sys/kernel/random/entropy_avail
958

but the user.creation step in console-conf still hangs for 3min doing the device key generation. this is a 4096bit rsa key ...

for comparison trying to run something like:
ssh-keygen -N "" -b 4096 -t rsa -f foo.key

does finish in abouit 30sec

so the suspicion is that rsa.GenerateKey() from go has some issues with arm here ...
the respective code in snapd is at:

https://github.com/snapcore/snapd/blob/master/overlord/devicestate/handlers.go#L88

Oliver Grawert (ogra) wrote :

one obvious workaround would indeed be to make snapd call out to ssh-keygen to have it handle the key generation ... if the security team does not see any drawbacks with this

Seth Arnold (seth-arnold) wrote :

How much time had elapsed between the user.creation step and the manual run of ssh-keygen? Perhaps the ssh-keygen just ran quickly because the kernel had already accumulated enough entropy.

Do these keys need to be 4096 bit RSA keys? curve25519 ecc keys are vastly smaller, faster to generate, and should provide security roughly on par with 2048 bit RSA keys. (There are longer curves in case RSA 4096 security level is important for this use; see Ed448-goldilocks, m-511, e-521, for examples.)

Thanks

Oliver Grawert (ogra) wrote :

@seth: this is with an entropy of above 900 using the rng_core.default_quality=700 kernel boot option. even if i let the board sit for a while before hitting enter to run console-conf the builtin key generation never goes below 3min while the ssh-keygen run is always fast (it goes down to 15-20sec if there is more entropy or if i run it multiple times in succession (that latter one might indeed be more I/O than entropy related))

the key type might be based on store requirements, not sure why else we picked 4096/rsa here.

Oliver Grawert (ogra) wrote :

using https://github.com/tmc/genkey/blob/master/genkey.go i see similar slowness ...

right after boot a key generation (with 4096 bits like used in snapd) takes between 3-5 min.

doing a 25 run loop like in http://paste.ubuntu.com/24260311/ gets me runs between 12sec and 7min ... with an average around 2-3min.

using the rng_core.default_quality=700 boot option seems to help the run right after boot a little, but it never seems to run as fast as ssh-keygen can where the slowest i saw was around 50sec. so it seems to clearly be the rsa.GenerateKey() function of go that does cause the slowness on arm.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers