200 models (no applications) results in a fair bit of CPU/memory overhead
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Canonical Juju | Triaged | Low | Unassigned | |
Bug Description
I bootstrapped a 2.5.0 LXD controller, and just did:
juju controller-config controller-
juju enable-ha -n3
for i in `seq -f%03g 0 199`; do juju add-model --no-switch m$i; done
First off, it took a bit longer than I expected to bring up 200 models.
Afterward, `lsof | grep 17017 | wc -l` reports 419 sockets (probably about 209 connections, each counted once for the source end and once for the destination end).
top with nothing special going on says:
USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
100000 20 1967592 569424 33368 S 14.2 3.5 63:28.40 /var/lib/
100000 20 2041580 568484 33884 S 36.8 3.5 77:07.14 /var/lib/
100000 20 1921152 463688 50172 S 13.9 2.9 5:50.20 /var/lib/
so across the 3 instances, that is roughly 65% of a CPU. This is an older machine, so not super fast, but given that the only things running are the model workers (and 5 units of ubuntu), this seems a bit high.
I had run some log-sending tests earlier, but AFAICT there aren't any broken goroutines left running.
'juju_cpu_profile 60' reports:
Showing nodes accounting for 3.58s, 53.67% of 6.67s total
Dropped 831 nodes (cum <= 0.03s)
Showing top 10 nodes out of 274
flat flat% sum% cum cum%
1950ms 29.24% 29.24% 1950ms 29.24% runtime.futex
610ms 9.15% 38.38% 610ms 9.15% runtime.usleep
460ms 6.90% 45.28% 500ms 7.50% syscall.Syscall
140ms 2.10% 47.38% 140ms 2.10% crypto/
100ms 1.50% 48.88% 130ms 1.95% runtime.
100ms 1.50% 50.37% 100ms 1.50% runtime.
80ms 1.20% 51.57% 80ms 1.20% runtime.aeshashbody
80ms 1.20% 52.77% 80ms 1.20% runtime.epollwait
80ms 1.20% 53.97% 1150ms 17.24% runtime.
80ms 1.20% 55.17% 600ms 9.00% runtime.
futex doesn't seem particularly interesting. The sha1 time, though, suggests something might be going on there.
If I sort by cumulative time and show more entries, I get:
Showing top 50 nodes out of 274
flat flat% sum% cum cum%
1.95s 29.24% 29.24% 1.95s 29.24% runtime.futex
0 0% 29.24% 1.33s 19.94% runtime.mcall
0.04s 0.6% 29.84% 1.33s 19.94% runtime.schedule
0 0% 29.84% 1.25s 18.74% runtime.park_m
0.02s 0.3% 30.13% 1.24s 18.59% runtime.futexsleep
0.03s 0.45% 30.58% 1.16s 17.39% runtime.timerproc
0.08s 1.20% 31.78% 1.15s 17.24% runtime.
0.07s 1.05% 32.83% 1.12s 16.79% runtime.systemstack
0 0% 32.83% 0.85s 12.74% github.
0.04s 0.6% 33.43% 0.85s 12.74% runtime.stopm
0.01s 0.15% 33.58% 0.83s 12.44% runtime.sysmon
0 0% 33.58% 0.82s 12.29% runtime.mstart
0 0% 33.58% 0.82s 12.29% runtime.mstart1
0.02s 0.3% 33.88% 0.77s 11.54% runtime.notesleep
0 0% 33.88% 0.75s 11.24% runtime.notetsleepg
0.01s 0.15% 34.03% 0.74s 11.09% runtime.notewakeup
0 0% 34.03% 0.73s 10.94% runtime.futexwakeup
0.61s 9.15% 43.18% 0.61s 9.15% runtime.usleep
0.08s 1.20% 44.38% 0.60s 9.00% runtime.
0 0% 44.38% 0.59s 8.85% github.
0.01s 0.15% 44.53% 0.58s 8.70% runtime.startm
0.46s 6.90% 51.42% 0.50s 7.50% syscall.Syscall
0 0% 51.42% 0.48s 7.20% github.
0 0% 51.42% 0.48s 7.20% github.
0.02s 0.3% 51.72% 0.48s 7.20% github.
0 0% 51.72% 0.46s 6.90% github.
0.01s 0.15% 51.87% 0.45s 6.75% github.
0.05s 0.75% 52.62% 0.45s 6.75% runtime.mallocgc
Maybe it's just a lot of goroutines hung in sleep?
goroutine profile: total 29150
2170 @ 0x430f2b 0x440fa6 0xa54361 0x45f401
# 0xa54360 github.
1972 @ 0x430f2b 0x440fa6 0xa5409b 0x45f401
# 0xa5409a github.
1811 @ 0x430f2b 0x430fd3 0x40867e 0x40836b 0xa00e89 0xac0d90 0xa5ab5e 0xa5aed4 0x45f401
# 0xa00e88 github.
# 0xac0d8f github.
# 0xa5ab5d github.
1811 @ 0x430f2b 0x430fd3 0x40867e 0x40836b 0xa5ac22 0x45f401
# 0xa5ac21 github.
1811 @ 0x430f2b 0x430fd3 0x441b4c 0x441789 0x468f54 0xa57198 0x45f401
# 0x441788 sync.runtime_
# 0x468f53 sync.(*
# 0xa57197 github.
1770 @ 0x430f2b 0x430fd3 0x40867e 0x40836b 0xa4b3ea 0xa5724d 0xa54227 0x45f401
# 0xa4b3e9 github.
# 0xa5724c github.
# 0xa54226 github.
1250 @ 0x430f2b 0x430fd3 0x40867e 0x4083ab 0x1f86f60 0x45d6eb 0x4c4074 0x4c3b04 0x9ff676 0x1f7fd96 0xa034db 0x45f401
# 0x1f86f5f github.
# 0x45d6ea runtime.call32+0x3a /snap/go/
# 0x4c4073 reflect.
# 0x4c3b03 reflect.
# 0x9ff675 github.
# 0x1f7fd95 github.
That is 1250 goroutines waiting in Next. Also, catacomb.Add() creates 2 goroutines per worker: one that propagates catacomb shutdown to stop the worker, and one that propagates the worker stopping to shut down the catacomb. We could fairly easily turn that into a single catacomb-shutdown goroutine that tells all pending workers to stop, plus one goroutine watching each worker.
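To make that bookkeeping concrete, here is a rough sketch of the proposed shape. Worker and MiniCatacomb are simplified stand-ins for illustration only, not the real juju/worker/catacomb code:

```go
// Sketch only: Worker and MiniCatacomb are simplified stand-ins,
// not the real juju/worker/catacomb implementation.
package main

import (
	"fmt"
	"sync"
)

// Worker is a stand-in for Juju's worker.Worker interface.
type Worker interface {
	Kill()
	Wait() error
}

// MiniCatacomb illustrates the proposed shape: ONE shared goroutine
// handles "catacomb dying -> kill every worker", and Add spawns only
// ONE goroutine per worker ("worker died -> kill catacomb"), instead
// of two goroutines per Add as described above.
type MiniCatacomb struct {
	dying   chan struct{}
	once    sync.Once
	mu      sync.Mutex
	workers []Worker
	wg      sync.WaitGroup
}

func NewMiniCatacomb() *MiniCatacomb {
	c := &MiniCatacomb{dying: make(chan struct{})}
	go func() { // the single shared shutdown goroutine
		<-c.dying
		c.mu.Lock()
		defer c.mu.Unlock()
		for _, w := range c.workers {
			w.Kill()
		}
	}()
	return c
}

func (c *MiniCatacomb) Kill() { c.once.Do(func() { close(c.dying) }) }

func (c *MiniCatacomb) Add(w Worker) {
	c.mu.Lock()
	select {
	case <-c.dying:
		// Already dying: don't register, just stop the worker.
		c.mu.Unlock()
		w.Kill()
	default:
		c.workers = append(c.workers, w)
		c.mu.Unlock()
	}
	c.wg.Add(1)
	go func() { // the one goroutine per worker
		defer c.wg.Done()
		w.Wait()
		c.Kill() // any worker dying brings down the whole catacomb
	}()
}

func (c *MiniCatacomb) Wait() { c.wg.Wait() }

// dummyWorker blocks until killed; Kill must be idempotent because the
// shared shutdown goroutine may re-kill an already-dead worker.
type dummyWorker struct {
	once sync.Once
	done chan struct{}
}

func newDummyWorker() *dummyWorker { return &dummyWorker{done: make(chan struct{})} }
func (d *dummyWorker) Kill()       { d.once.Do(func() { close(d.done) }) }
func (d *dummyWorker) Wait() error { <-d.done; return nil }

func main() {
	c := NewMiniCatacomb()
	for i := 0; i < 3; i++ {
		c.Add(newDummyWorker())
	}
	c.Kill() // one kill stops all three workers
	c.Wait()
	fmt.Println("all workers stopped")
}
```

With 1811 workers in a catacomb, that change would cut the catacomb's goroutine count roughly in half (one per worker plus one shared, instead of two per worker).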
tags: added: performance; removed: perfor
Part of this overhead is that every controller is creating a dependency engine, and all of its workers, for every model. Not all of those workers are running, but some, like api-caller, are.
The model workers should be reworked so the controller only creates the nested dependency engine for the models it currently holds leases on. That would make a significant difference in the idle state, at least.
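As a rough illustration of that direction (hypothetical names throughout; LeaseChecker, startModelEngine, and reconcile are not the actual Juju dependency-engine or lease APIs):

```go
// Sketch only: LeaseChecker, modelEngine, and reconcile are made-up
// names illustrating the idea, not the real Juju dependency-engine API.
package main

import "fmt"

// LeaseChecker reports whether this controller currently holds the
// lease for a given model.
type LeaseChecker interface {
	HoldsLease(modelUUID string) bool
}

// modelEngine stands in for a model's nested dependency engine and
// all the workers it runs.
type modelEngine struct{ model string }

func startModelEngine(model string) *modelEngine {
	fmt.Println("starting nested engine for", model)
	return &modelEngine{model: model}
}

func (e *modelEngine) Stop() { fmt.Println("stopping nested engine for", e.model) }

// reconcile starts nested engines only for the models this controller
// holds leases on, and stops engines for models it no longer leases.
// An idle HA controller would then run model workers for only its
// share of the 200 models, instead of all of them.
func reconcile(models []string, leases LeaseChecker, running map[string]*modelEngine) {
	for _, m := range models {
		engine, up := running[m]
		switch {
		case leases.HoldsLease(m) && !up:
			running[m] = startModelEngine(m)
		case !leases.HoldsLease(m) && up:
			engine.Stop()
			delete(running, m)
		}
	}
}

// staticLeases is a toy LeaseChecker for the example below.
type staticLeases map[string]bool

func (s staticLeases) HoldsLease(m string) bool { return s[m] }

func main() {
	models := []string{"m000", "m001", "m002"}
	running := map[string]*modelEngine{}
	// Only m001 is leased to this controller; m000 and m002 stay idle.
	reconcile(models, staticLeases{"m001": true}, running)
}
```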
Tim