memory leak after repeated model creation/destruction

Bug #1625774 reported by Jason Hobbs
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Christian Muirhead

Bug Description

I'm using a simple script to repeatedly create and destroy models on my local lxd provider:
http://paste.ubuntu.com/23208327/

After about 1500 models, jujud's memory usage has gone from ~500M at the start to about ~8GB (RSS as shown by top). If I continue the test, juju starts getting errors for failure to allocate memory:
http://paste.ubuntu.com/23208503/

Here it is using about 10G:
18126 165536 20 0 11.335g 9.978g 13916 R 650.7 63.9 153:23.76 jujud

Also for some reason I can't destroy the models I have left:
http://paste.ubuntu.com/23208338/

This should be easy to reproduce - it takes about 30 minutes to get up to about 8GB on my laptop, but you can see the memory use growing much sooner than that.

This is with beta18.

summary: - memory leak and can't destroy models after repeated model
- creation/destruction
+ memory leak after repeated model creation/destruction
description: updated
description: updated
tags: added: eda
Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.0.0
Changed in juju:
assignee: nobody → Alexis Bruemmer (alexis-bruemmer)
Changed in juju:
assignee: Alexis Bruemmer (alexis-bruemmer) → Christian Muirhead (2-xtian)
Changed in juju:
status: Triaged → In Progress
Revision history for this message
Christian Muirhead (2-xtian) wrote :

I can reproduce both the leak and the error destroying models for this. I've added an endpoint to the introspection worker to produce a heap dump so I can work out what it is that's being leaked. The tooling in this area of Go is a bit rough - I'm trying to use the hview tool linked to from this thread https://github.com/golang/go/issues/16410.

It doesn't work with Go 1.7 yet - I tried hacking it to ignore the version reported in the dump, but got errors about unrecognised record kind 101, so I'm trying again with a binary built with Go 1.6.

Ryan Beisner (1chb1n)
tags: added: uosci
Revision history for this message
Christian Muirhead (2-xtian) wrote :

Ok, I'm not having much luck with that heap viewer - running it with the binary produces a crash that I don't understand:
parser.go:1272: pointer without base pointer ""

Running it without a binary works (at least, it starts a web server), but the type and attribute names are missing so it's impossible to work out what anything is.

There's another prototype linked in that discussion (https://github.com/tombergan/goheapdump), I'm going to try that.

Revision history for this message
Christian Muirhead (2-xtian) wrote :

Dumping the goroutines shows that it's leaking State instances, with the attendant watchers and mongo connections. That makes sense because the mongo memory also goes up as we create and destroy models.

Tracking down state open/close pairs shows that the "leaked" States are stored in the API server's StatePool. There's nothing that will close and remove States from the pool when the model is destroyed. At the moment the pool doesn't keep track of which States are in use - there's no call made by the server to release them back to the pool. I'll need to add something like that so that we don't try to close it in destroy-model while a different request is trying to use it.

tags: added: ateam
Revision history for this message
John A Meinel (jameinel) wrote :

As an aside, I'd say that the real bug is that we have StatePool at all. AIUI, it was created because we had a bunch of places that were creating on-demand State instances with new mongo connections on the fly. But those should be tracked instances that only get created when they are really needed. Off-the-cuff the problem was that to answer the question "what IP addresses does the controller have" we needed to create a State instance, which then created new connections to mongo, etc.

While we could go down the route of ref-counting StatePool, the better fix is to get rid of the need for it in the first place. That may not be doable in the RC timeframe. I thought Andrew had been planning on it, but maybe he didn't get to it yet?

Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-rc3 → 2.0.0
Revision history for this message
Christopher Lee (veebers) wrote :

I'm in the process of doing some long-run/perf testing and I've seen this bug occur during those tests.

I have an example report here: http://people.canonical.com/~leecj2/perfscalemem/ that shows the memory usage creep (shows both system and mongodb) that might be of interest.

The test is similar to the one mentioned in the bug description, adding and removing models as well as some charm deployments, etc.

I'll be continuing to run these tests, so once this bug is fixed we will see that reflected in these generated reports.

(I currently have a run going with juju 2.0.0, I can report those details when it's done if there is interest.)

Revision history for this message
Christian Muirhead (2-xtian) wrote :

I'll send an email to Andrew asking about that.

PR here for refcounting in the StatePool (and using that from the API server): https://github.com/juju/juju/pull/6408

Unfortunately this doesn't account for all of the goroutines leaked from adding/destroying a model - there are still 10 new ones for each iteration.

All of the new goroutines seem to be caused by leaked net/http.Transports - they're the read and write loops created on lines 1062-3: https://golang.org/src/net/http/transport.go#L1057

Revision history for this message
Christian Muirhead (2-xtian) wrote :

The leaked goroutines are all coming from the LXD provider - they're idle keepalive connections created from various places in the code that create their own Environ instance (often as a local variable):

ModelManager
Firewaller
EnvironTracker
Provisioner
StorageProvisioner

The easiest way to have the connections close is to set IdleConnTimeout when the http transport is created in lxc/lxd/client.go, so that when the env is dropped on the floor the leaked goroutines eventually clean themselves up. (This might also be useful to other users of the library if they create more than one lxd.Client - at the moment they'll also leak goroutines.) This is reasonable because keepalive is purely an optimisation - the connections will be re-established if another request is made.

An alternative would be to add a Close method to the Environ interface and change all of the places that create Environs to close them afterwards, then thread that through the lxd provider to lxd.Client which could call Transport.CloseIdleConnections. The benefit of this is that the goroutines would stop immediately when the environ is closed rather than timing out eventually, but it's a lot more work.

I haven't looked at other providers to see if the same thing happens with them yet.

Revision history for this message
Christian Muirhead (2-xtian) wrote :

It doesn't happen for the maas provider, at least. :)

Revision history for this message
Christopher Lee (veebers) wrote :

I noticed Christian's fixes landed, so I fired off another long-running test; there is quite a difference in overall memory usage!

You can see the results here: http://people.canonical.com/~leecj2/perfscalemem/

Revision history for this message
Christian Muirhead (2-xtian) wrote :

Hi Chris - thanks for running the test! I can only see the results of the first test there, though. Is there another URL for the new one?

Should be even better with this PR: https://github.com/juju/juju/pull/6439
It updates the LXD client to the new version which fixes the goroutine leak.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Christian,

The other URL for your joy and comparison is http://people.canonical.com/~leecj2/perfscalemem2/

Both Chris and I really love the differences in the memory and mongo graphs \o/

Well done :D

Revision history for this message
Christopher Lee (veebers) wrote :

Rats, sorry, I pasted the wrong URL. Anastasia is right; the correct URL for the second run is: http://people.canonical.com/~leecj2/perfscalemem2/

Changed in juju:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released