snap install seems to be polling at 3000 nanoseconds in a pselect call

Bug #1792959 reported by Colin Ian King
Affects: snapd
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

I observed that snap install was eating a lot of CPU across threads. Attaching strace -f to a snap install, I observed *lots* of polling pselect calls:

[pid 27055] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=3000}, NULL <unfinished ...>
[pid 27055] <... pselect6 resumed> ) = 0 (Timeout)
[pid 27055] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=3000}, NULL <unfinished ...>
[pid 27055] <... pselect6 resumed> ) = 0 (Timeout)
[pid 27049] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=3000}, NULL <unfinished ...>
[pid 27049] <... pselect6 resumed> ) = 0 (Timeout)
[pid 27049] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=3000}, NULL <unfinished ...>
[pid 27049] <... pselect6 resumed> ) = 0 (Timeout)

Honestly, not much can happen in 3000 nanoseconds; light travels less than 900 meters in that kind of duration. Polling on pselect generates a lot of work for the scheduler. Can this pselect timeout be increased to something sensible, like a tenth of a second, so we don't eat up so much CPU? Is this some kind of broken spin lock?
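For comparison, the same pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=3000}, ...) pattern appears under strace for a plain Go program that churns goroutines, which suggests the sleep comes from the Go runtime's scheduler rather than from snapd's own code. A minimal sketch (illustrative only, not snapd code); run it under strace -f -e trace=pselect6 to compare:

package main

import "sync"

func main() {
	// Constant goroutine churn keeps the scheduler's per-P run queues
	// in motion; its work-stealing path apparently sleeps ~3us between
	// steal attempts, and strace renders that sleep as pselect6 with
	// tv_nsec=3000 on this Go runtime.
	for {
		var wg sync.WaitGroup
		for i := 0; i < 64; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				_ = make([]byte, 64) // a tiny amount of work per goroutine
			}()
		}
		wg.Wait()
	}
}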

Michael Vogt (mvo)
Changed in snapd:
importance: Undecided → High
Revision history for this message
Michael Vogt (mvo) wrote :

Thanks for your bug report.

The 3000 nanoseconds apparently come from a usleep(3) in the Go runtime (runtime/proc.go:runqgrab, to be precise). I have not looked into the details of why this is happening yet; the internet (https://stackoverflow.com/questions/35155119/how-to-optimize-golang-program-that-spends-most-time-in-runtime-osyield-and-runt) indicates that we might be triggering too much GC for some reason. But it might also be a red herring. The next step here is probably to run pprof to see what triggers the usleep(3).
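For anyone picking this up: the quickest way to get that profile from a Go program is the stock net/http/pprof handler. A generic sketch (this is not snapd's actual debug wiring; snapd may need its own hook):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Serve the profiling endpoints on localhost only.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... workload under investigation runs here ...
	select {}
}

With that in place, "go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30" collects 30 seconds of CPU samples; if most of the time lands in runtime.usleep/runtime.runqgrab, the spinning is in the scheduler rather than in user code.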

Changed in snapd:
status: New → Triaged
Revision history for this message
Michael Vogt (mvo) wrote :

We fixed an overly aggressive progress bar, which resulted in far fewer of these calls AFAICT. Moving the importance to Medium for this reason.
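For reference, the usual shape of such a fix is to rate-limit terminal redraws instead of repainting on every progress notification. A minimal sketch of the idea (names invented; this is not the actual snapd progress code):

package main

import (
	"fmt"
	"time"
)

// throttledBar redraws at most once per interval, however often the
// download loop reports new byte counts.
type throttledBar struct {
	interval time.Duration
	lastDraw time.Time
}

func (b *throttledBar) Set(current, total int64) {
	now := time.Now()
	if now.Sub(b.lastDraw) < b.interval {
		return // drop this update instead of redrawing
	}
	b.lastDraw = now
	fmt.Printf("\r%d/%d bytes", current, total)
}

func main() {
	bar := &throttledBar{interval: 100 * time.Millisecond}
	for i := int64(0); i <= 1000; i++ {
		bar.Set(i*1024, 1000*1024)
		time.Sleep(time.Millisecond) // stand-in for network reads
	}
	fmt.Println()
}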

Changed in snapd:
importance: High → Medium
Revision history for this message
Robert Collins (lifeless) wrote :

Seeing the same behaviour currently.
pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=20000}, NULL) = 0 (Timeout)

robertc@robertc-xps:~$ sudo snap debug stacktraces
goroutine 788 [running]:
github.com/snapcore/snapd/daemon.getStacktraces(0xc42042a1d0, 0xb)
 /build/snapd/parts/snapd-deb/build/_build/src/github.com/snapcore/snapd/daemon/api_debug_stacktrace.go:29 +0x76
github.com/snapcore/snapd/daemon.postDebug(0x562917885b60, 0xc4206e4c00, 0x0, 0x0, 0x0)
 /build/snapd/parts/snapd-deb/build/_build/src/github.com/snapcore/snapd/daemon/api_debug.go:344 +0x37a
github.com/snapcore/snapd/daemon.(*Command).ServeHTTP(0x562917885b60, 0x56291742f140, 0xc4205ee420, 0xc4206e4c00)
 /build/snapd/parts/snapd-deb/build/_build/src/github.com/snapcore/snapd/daemon/daemon.go:167 +0x3fd
github.com/snapcore/snapd/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc42014e000, 0x56291742f140, 0xc4205ee420, 0xc4206e4900)
 /build/snapd/parts/snapd-deb/build/_build/src/github.com/snapcore/snapd/vendor/github.com/gorilla/mux/mux.go:212 +0xcf
github.com/snapcore/snapd/daemon.logit.func1(0x56291742f980, 0xc4201536c0, 0xc4206e4900)
 /build/snapd/parts/snapd-deb/build/_build/src/github.com/snapcore/snapd/daemon/daemon.go:214 +0xdf
net/http.HandlerFunc.ServeHTTP(0xc420594580, 0x56291742f980, 0xc4201536c0, 0xc4206e4900)
 /usr/lib/go-1.10/src/net/http/server.go:1947 +0x46
net/http.serverHandler.ServeHTTP(0xc420435930, 0x56291742f980, 0xc4201536c0, 0xc4206e4900)
 /usr/lib/go-1.10/src/net/http/server.go:2697 +0xbe
net/http.(*conn).serve(0xc4209ac320, 0x562917430880, 0xc4204046c0)
 /usr/lib/go-1.10/src/net/http/server.go:1830 +0x653
created by net/http.(*Server).Serve
 /usr/lib/go-1.10/src/net/http/server.go:2798 +0x27d

goroutine 1 [select, 317 minutes]:
main.run(0xc4202a96e0, 0x0, 0x0)
 /build/snapd/parts/snapd-deb/build/_build/src/github.com/snapcore/snapd/cmd/snapd/main.go:153 +0x501
main.main()
 /build/snapd/parts/snapd-deb/build/_build/src/github.com/snapcore/snapd/cmd/snapd/main.go:64 +0x10d

goroutine 5 [syscall, 317 minutes]:
os/signal.signal_recv(0x0)
 /usr/lib/go-1.10/src/runtime/sigqueue.go:139 +0xa8
os/signal.loop()
 /usr/lib/go-1.10/src/os/signal/signal_unix.go:22 +0x24
created by os/signal.init.0
 /usr/lib/go-1.10/src/os/signal/signal_unix.go:28 +0x43

goroutine 10 [select, 317 minutes, locked to thread]:
runtime.gopark(0x56291741e780, 0x0, 0x562916d20415, 0x6, 0x18, 0x1)
 /usr/lib/go-1.10/src/runtime/proc.go:291 +0x120
runtime.selectgo(0xc42005a750, 0xc42039efc0)
 /usr/lib/go-1.10/src/runtime/select.go:392 +0xe56
runtime.ensureSigM.func1()
 /usr/lib/go-1.10/src/runtime/signal_unix.go:549 +0x1f6
runtime.goexit()
 /usr/lib/go-1.10/src/runtime/asm_amd64.s:2361 +0x1

goroutine 53 [IO wait, 316 minutes]:
internal/poll.runtime_pollWait(0x7fa74c12bf00, 0x72, 0x0)
 /usr/lib/go-1.10/src/runtime/netpoll.go:173 +0x59
internal/poll.(*pollDesc).wait(0xc420136318, 0x72, 0xc4202de000, 0x0, 0x0)
 /usr/lib/go-1.10/src/internal/poll/fd_poll_runtime.go:85 +0x9d
internal/poll.(*pollDesc).waitRead(0xc420136318, 0xffffffffffffff00, 0x0, 0x0)
 /usr/lib/go-1.10/src/internal/poll/fd_poll_runtime.go:90 +0x3f
internal/poll.(*FD).Accept(0xc420136300, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
 /u...

Revision history for this message
Paweł Stołowski (stolowski) wrote :

@Robert I'm not sure it's the same problem. The original one was related to 'snap install ...' activity (which involved I/O related to our progress bar, among other things, and we tuned that, as mvo said), while in your case snapd seems to be idle and not doing anything; is that right? On Twitter you mentioned that powertop reported lots of wakeups; did you collect 'snap debug stacktraces' while powertop was still reporting wakeups?

There is a similar problem for containerd (https://bugs.launchpad.net/ubuntu/+source/containerd/+bug/1826684), which might suggest a Go runtime issue.

Revision history for this message
Colin Ian King (colin-king) wrote :

One can also run health-check on the process and observe the activity, e.g.

sudo health-check -p snapd -d 60

This will profile the snapd process for 60 seconds to get an overall idea of the kinds of system calls that are producing excessive wakeups.
