Massive goroutine leak (logsink 2.5.0)

Bug #1813104 reported by Christopher Lee on 2019-01-24
Affects     Status    Importance    Assigned to
juju                  Critical      John A Meinel
juju 2.5              Critical      John A Meinel

Bug Description

We noticed a large leak of goroutines while monitoring a 2.5.0 environment.

An extract of a goroutine dump[1] points us to logsink/logsink.go:234 (full details[2]).

It appears that the logsink ServeHTTP handler is starting goroutines that never get stopped (perhaps related to the mux issue xtian has seen in the past?)

[1] https://pastebin.canonical.com/p/292ksKsgGX/
[2] https://private-fileshare.canonical.com/~barryprice/2019-01-24-juju-dumps/

John A Meinel (jameinel) on 2019-01-24
Changed in juju:
status: Confirmed → Triaged
importance: High → Critical
John A Meinel (jameinel) wrote :

(note that the original heap dumps have 1 line of stderr output at the top that has to be trimmed; I've attached a tar of the cleaned-up dumps to this bug).

If you run:
$ go tool pprof -top -base juju_heap_profile-2019-01-24.00 jujud juju_heap_profile-2019-01-24.01
File: jujud
Type: inuse_space
Showing nodes accounting for 1620.64MB, 76.19% of 2127.01MB total
Dropped 743 nodes (cum <= 10.64MB)
      flat flat% sum% cum cum%
 1330.01MB 62.53% 62.53% 1330.01MB 62.53% github.com/juju/juju/vendor/github.com/gorilla/websocket.newConnBRW
  246.83MB 11.60% 74.13% 248.85MB 11.70% github.com/juju/juju/state/watcher.(*HubWatcher).queueChange
   80.63MB 3.79% 77.92% 80.63MB 3.79% crypto/tls.(*block).reserve

Dump 1 vs dump 2:
      flat flat% sum% cum cum%
1136454738B 84.39% 84.39% 1136454738B 84.39% github.com/juju/juju/vendor/github.com/gorilla/websocket.newConnBRW
 65547996B 4.87% 89.25% 65547996B 4.87% crypto/tls.(*block).reserve

Dump 2 vs dump 3:
 1007.44MB 85.21% 85.21% 1007.44MB 85.21% github.com/juju/juju/vendor/github.com/gorilla/websocket.newConnBRW
   48.33MB 4.09% 89.29% 48.33MB 4.09% crypto/tls.(*block).reserve

It is fairly clear that the memory is all in gorilla websocket caches.

From the goroutine information, you can see that the logSinkHandler is likely holding on to websockets somehow.
Note that
https://grafana.admin.canonical.com/d/sR1-JkYmz/juju2-controllers-thumpers?orgId=1&from=1548230449490&to=1548287973681&var-controller=prodstack-45-bootstack-ps45-prodstack-is&var-host=All&var-node=All
shows the clear memory growth, but "ConnectionCount" is perfectly stable.

It does also show the goroutine growth.

John A Meinel (jameinel) on 2019-01-24
Changed in juju:
milestone: 2.5.1 → 2.6-beta1
John A Meinel (jameinel) wrote :

I wrote a Go script that connects to the Juju log endpoint and sends messages (as fast as possible) for 5s (see attached).

With that, I was able to spawn 100 machine agents, all running concurrently.
Note that the logsender explicitly doesn't call Writer.Close(), though I haven't tested extensively whether that matters.
After running 100 at a time multiple times, I do see memory growth. I can't quite tell if it is only peak memory based on the max number of concurrent loggers. Still investigating.

John A Meinel (jameinel) wrote :

After running 200 concurrent loggers repeated 10x, and letting the machine sit for a bit, the memory seems to have gone back down.
So it doesn't seem to be a reproducer.

Maybe we have a different bug where clients keep connecting to the controller and not closing the conn. (possible 2.4 vs 2.5 bug?)

John A Meinel (jameinel) wrote :

Note that ConnectionCount as we have it today counts the number of connections to '/api' but doesn't notice '/logsink' connections.

I looked into adding it, but it isn't trivial because of how the various layers interact. It would certainly be important to add that to the system.

John A Meinel (jameinel) wrote :

I also set up a 'badlogsender.go' (attached). It randomly corrupts a packet so as to introduce a write error into the system. However, I still don't see any leaking goroutines.

John A Meinel (jameinel) wrote :

logsink.go needs a fair bit more logging added, so that you can set DEBUG and see the level of granularity that we see on /api. It doesn't have to be *every* log message, but you should see a message every time an agent connects (which agent, for which model) and when it disconnects.

We also need to get the metrics around logsink into the engine report and prometheus metrics.
Unfortunately, the Prometheus metrics we have today are set up during NewServer, but the logsink endpoint doesn't exist until server.Endpoints() is called, and the server doesn't have a reference to the metricsAdapter/PrometheusConfig at that point. Probably we should change the Server to forward metrics from other handlers, and change the metricsAdapter to forward from the Server.

Christian Muirhead (2-xtian) wrote :

From the goroutine dump we could see that the leaked goroutines were blocked sending on logCh, so it seems like the ServeHTTP handler has finished (so nothing is reading on logCh). This PR adds a channel that gets closed when the handler finishes to avoid that situation.
https://github.com/juju/juju/pull/9669

Tim Penhey (thumper) on 2019-02-08
Changed in juju:
status: Triaged → Fix Committed