Massive goroutine leak (logsink 2.5.0)

Bug #1813104 reported by Christopher Lee on 2019-01-24
Affects     Status    Importance    Assigned to
juju                  Critical      John A Meinel
juju 2.5              Critical      John A Meinel

Bug Description

We noticed a large leak of goroutines while monitoring a 2.5.0 environment.

An extract of a goroutine dump[1] points us to logsink/logsink.go:234 (full details[2]).

It appears that the logsink ServeHTTP handler is starting goroutines that never get stopped (perhaps related to the mux issue xtian has seen in the past?)

[1] https://pastebin.canonical.com/p/292ksKsgGX/
[2] https://private-fileshare.canonical.com/~barryprice/2019-01-24-juju-dumps/

John A Meinel (jameinel) on 2019-01-24
Changed in juju:
status: Confirmed → Triaged
importance: High → Critical
John A Meinel (jameinel) wrote :

(note that the original heap dumps have 1 line of stderr output at the top that has to be trimmed; I've attached a tar of the cleaned-up dumps to this bug).

If you run:
$ go tool pprof -top -base juju_heap_profile-2019-01-24.00 jujud juju_heap_profile-2019-01-24.01
File: jujud
Type: inuse_space
Showing nodes accounting for 1620.64MB, 76.19% of 2127.01MB total
Dropped 743 nodes (cum <= 10.64MB)
      flat flat% sum% cum cum%
 1330.01MB 62.53% 62.53% 1330.01MB 62.53% github.com/juju/juju/vendor/github.com/gorilla/websocket.newConnBRW
  246.83MB 11.60% 74.13% 248.85MB 11.70% github.com/juju/juju/state/watcher.(*HubWatcher).queueChange
   80.63MB 3.79% 77.92% 80.63MB 3.79% crypto/tls.(*block).reserve

Dump 1 vs dump 2:
      flat flat% sum% cum cum%
1136454738B 84.39% 84.39% 1136454738B 84.39% github.com/juju/juju/vendor/github.com/gorilla/websocket.newConnBRW
 65547996B 4.87% 89.25% 65547996B 4.87% crypto/tls.(*block).reserve

Dump 2 vs dump 3:
 1007.44MB 85.21% 85.21% 1007.44MB 85.21% github.com/juju/juju/vendor/github.com/gorilla/websocket.newConnBRW
   48.33MB 4.09% 89.29% 48.33MB 4.09% crypto/tls.(*block).reserve

It is fairly clear that the memory is all in gorilla websocket caches.

From the goroutine information, you can see that the logSinkHandler is likely holding on to websockets somehow.
Note that
https://grafana.admin.canonical.com/d/sR1-JkYmz/juju2-controllers-thumpers?orgId=1&from=1548230449490&to=1548287973681&var-controller=prodstack-45-bootstack-ps45-prodstack-is&var-host=All&var-node=All
shows the clear memory growth, but "ConnectionCount" is perfectly stable.

It does also show the goroutine growth.

John A Meinel (jameinel) on 2019-01-24
Changed in juju:
milestone: 2.5.1 → 2.6-beta1
John A Meinel (jameinel) wrote :

I wrote a Go script that connects to the Juju log endpoint and sends messages (as fast as possible) for 5s (see attached).

With that, I was able to spawn 100 machine agents, all running concurrently.
Note that the logsender explicitly doesn't call Writer.Close(), though I haven't tested extensively whether that matters.
After running 100 at a time multiple times, I do see memory growth. I can't quite tell if it is only peak memory based on the max number of concurrent loggers. Still investigating.

John A Meinel (jameinel) wrote :

After running 200 concurrent loggers repeated 10x, and letting the machine sit for a bit, the memory seems to have gone back down.
So it doesn't seem to be a reproducer.

Maybe we have a different bug where clients keep connecting to the controller and not closing the conn. (possible 2.4 vs 2.5 bug?)

John A Meinel (jameinel) wrote :

Note that ConnectionCount as we have it today counts the number of connections to '/api' but doesn't notice '/logsink' connections.

I looked into adding it, but it isn't trivial because of how the various layers interact. It would certainly be important to add that to the system.

John A Meinel (jameinel) wrote :

I also set up a 'badlogsender.go' (attached). It randomly corrupts a packet so as to introduce a write error into the system. However, I still don't see any leaking goroutines.

John A Meinel (jameinel) wrote :

logsink.go needs a fair bit more logging added, so that you can set DEBUG and see the level of granularity that we see on /api. It doesn't have to be *every* log message, but you should see a message every time an agent connects (which agent, for which model) and when it disconnects.

We also need to get the metrics around logsink into the engine report and prometheus metrics.
Unfortunately, the Prometheus metrics we have today are set up during NewServer, but the logsink endpoint doesn't exist until server.Endpoints() is called, and the server doesn't have a reference to the metricsAdapter/PrometheusConfig at that point. Probably we should change the Server to forward metrics from other handlers, and change the metricsAdapter to forward from the Server.

Christian Muirhead (2-xtian) wrote :

From the goroutine dump we could see that the leaked goroutines were blocked sending on logCh, so it seems like the ServeHTTP handler has finished (so nothing is reading on logCh). This PR adds a channel that gets closed when the handler finishes to avoid that situation.
https://github.com/juju/juju/pull/9669

Tim Penhey (thumper) on 2019-02-08
Changed in juju:
status: Triaged → Fix Committed