juju 1.25 leaks memory (1.25.11+)

Bug #1701481 reported by Mario Splivalo

Affects: juju-core
Status: Fix Released
Importance: High
Assigned to: Andrew Wilkins
Milestone: 1.25.13

Bug Description

Juju 1.25.12 (current latest) leaks memory over time.

I have deployed a test environment against a local (test) MAAS cluster - there are three physical nodes, each hosting 48 containers (144 units in total).

I have deployed 16 3-node MongoDB replica sets - each replica set is standalone and not related to anything (apart from its 'internal' peer relations).

Then I have 16 3-node percona-cluster deployments with hacluster subordinates, each of those 3-node clusters related to a 3-node keystone, also with an hacluster subordinate.

I have attached juju status output to this bug report.

I deployed munin and munin-node on the bootstrap node and configured the meminfo plugin to track memory for the jujud and mongod processes. The munin graphs reside in my test environment, but I will copy them to a public www server for convenience.
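
For reference, this is roughly how per-process memory graphs like these can be set up in munin (a sketch using the stock multips_memory plugin; the 'meminfo' plugin mentioned above may be a different or custom plugin, so treat the names and paths below as illustrative):

  # enable the plugin on the monitored node
  sudo ln -s /usr/share/munin/plugins/multips_memory /etc/munin/plugins/multips_memory

  # /etc/munin/plugin-conf.d/juju-memory -- which process names to track
  [multips_memory]
  env.names jujud mongod

  # restart munin-node so it picks up the new plugin
  sudo service munin-node restart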

Revision history for this message
Mario Splivalo (mariosplivalo) wrote :

The munin graphs for memory usage can be seen here:

http://people.canonical.com/~mario/lp1701481/www/localdomain/localhost.localdomain/index.html#memory

The environment was deployed on the 22nd (see the graph on the right side), and it has been left untouched since.

On the 28th there is an interruption in the graphs because I ran out of disk space - I had misconfigured the 'juju heap profile collector' (a simple script that periodically collects juju heap profiles: https://code.launchpad.net/~mariosplivalo/+junk/juju-profiles-collector).
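
The idea behind the collector is just to snapshot jujud's heap and goroutine profiles on a timer. A simplified sketch of that loop (the pprof HTTP endpoint below is an assumption on my part - the linked script may obtain the profiles through a different mechanism):

  #!/bin/bash
  # Hypothetical endpoint - substitute whatever mechanism your jujud exposes for pprof data.
  PPROF_URL="http://localhost:6060/debug/pprof"
  OUTDIR=/var/tmp/juju-profiles
  mkdir -p "$OUTDIR"
  while true; do
      ts=$(date +%Y-%m-%d_%H:%M:%S)
      curl -s "$PPROF_URL/heap" > "$OUTDIR/goprof.$ts.heap"
      curl -s "$PPROF_URL/goroutine?debug=2" > "$OUTDIR/goprof.$ts.goroutine"
      sleep 600   # collect every 10 minutes
  done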

On the 29th I restarted jujud.

I am attaching the machine-0.log file, as well as the compressed profiles collected, to this bug.

Revision history for this message
Mario Splivalo (mariosplivalo) wrote :

This bug is not a duplicate of bug #1587644, which covers a memory leak and CPU spikes in Juju 1.25.4, even though #1587644 states that the issue was fixed in 1.25.11.

I *think* the last version that didn't have memory issues was 1.25.6.
I am deploying another environment with 1.25.6 (I will try to use the local provider, as it is much simpler to deploy), and will configure munin and profile collection there. I will leave that environment running for a few days, then upgrade it to 1.25.12 - to try to confirm that 1.25.6 was, indeed, ok.

The deployment mentioned in this bug (MAAS provider) is still active, and I will post updated munin graphs as memory usage increases.

tags: added: canonical-bootstack
Revision history for this message
Nicholas Skaggs (nskaggs) wrote :

Mario, what did you discover?

Revision history for this message
Mario Splivalo (mariosplivalo) wrote :

So, it took some time to consolidate all the graphs; here they are:

http://people.canonical.com/~mario/lp1701481/2017-07-25/

These are three environments deployed against the OpenStack provider (1.25.5 was started last, so I have barely a day of collected data for it).

Unfortunately I made a mistake, so heap profiles were not collected properly. Now that I have fixed that, I restarted jujud on the state servers of all three environments, and I'm gathering heap profiles every 10 minutes.

I will update this bug within a few hours if I see a jump in jujud memory usage in any of the environments.

I've used this bundle to deploy the environments: http://bazaar.launchpad.net/~ost-maintainers/openstack-charm-testing/trunk/view/head:/bundles/sparse/default.yaml

Revision history for this message
Mario Splivalo (mariosplivalo) wrote :

I have abandoned the approach with the MAAS environment, as I realized I don't have the 100+ units needed to get juju to start 'leaking' memory.

I also tried using the local provider to deploy 500+ containers on a single machine, but I kept hitting various limits, so I abandoned that approach too.

Also, what can be seen from the graphs is that I was wrong about my 1.25.6 assumption - it looks like 1.25.6 was leaking connections to mongodb, which doesn't seem to be the case with 1.25.12. Still, the memory growth with 1.25.12 is noticeable.

Felipe Reyes (freyes)
tags: added: sts
Revision history for this message
Mario Splivalo (mariosplivalo) wrote :

I have uploaded new log files, as well as munin graphs, along with the heap/goroutine profiles from the running jujud.

Those reside here:

http://people.canonical.com/~mario/lp1701481/2017-07-31/

There you will find three directories, named after the juju versions of the running environments. This is the bundle used to deploy the environments (the same one for each of the listed juju versions):
http://bazaar.launchpad.net/~ost-maintainers/openstack-charm-testing/trunk/view/head:/bundles/sparse/default.yaml

As stated above, it seems that juju 1.25.12 is no longer leaking connections to mongodb, but it is still leaking memory.

Inside each of the 1.25.xx directories there is a file called profiles-xxx.tar.lrz; for 1.25.12 it is this one:

http://people.canonical.com/~mario/lp1701481/2017-07-31/1.25.12/profiles-1.25.12.tar.lrz

Now, that file is only 12 MB, but decompressed (using lrzip) it is around 12 GB - those are the juju heap and goroutine profiles collected every 10 minutes.
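
To unpack one of those archives (lrunzip is part of the lrzip package on Ubuntu):

  # decompress the .lrz, then unpack the resulting tarball (~12 GB)
  lrunzip profiles-1.25.12.tar.lrz
  tar -xf profiles-1.25.12.tar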

I have restarted jujud in each environment on Jul 26th, which can be seen from the graphs.

Tim Penhey (thumper)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.25.13
Revision history for this message
Andrew Wilkins (axwalk) wrote :

Mario, I'm going to take a look at the profiles. The log files for 1.25.12 aren't world-readable; could you please fix that?

Revision history for this message
Andrew Wilkins (axwalk) wrote :

So I've computed the delta of goprof.2017-07-31_00:20:02.heap from goprof.2017-07-30_00:20:01.heap. In that time, the "inuse_space" grew by approx. 178MB, of which ~90% is due to repeated calls to set the block devices.
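
For reference, one way to compute a delta like this is pprof's -base option (the jujud binary path below is illustrative - point it at the actual binary on the state server):

  go tool pprof -inuse_space \
      -base goprof.2017-07-30_00:20:01.heap \
      /var/lib/juju/tools/machine-0/jujud \
      goprof.2017-07-31_00:20:02.heap
  # then, at the (pprof) prompt: top 20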

The repeated calls are due to https://bugs.launchpad.net/juju-core/+bug/1600539, which was fixed in 2.0, but never back-ported to 1.25. Easy enough to do that.

Changed in juju-core:
status: Triaged → In Progress
assignee: nobody → Andrew Wilkins (axwalk)
Revision history for this message
Andrew Wilkins (axwalk) wrote :

Backport of block device fixes from 2.0: https://github.com/juju/juju/pull/7750.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Mario, are you familiar with building Juju from source? My PR landed on the 1.25 branch -- would you be able to test it in your environment, to see if it makes things better for you?
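
Roughly, building the 1.25 branch from source looks like this (a sketch assuming a standard GOPATH setup; note that 1.25 pins its dependencies via dependencies.tsv/godeps, so those need to be fetched first or the build will fail):

  mkdir -p $GOPATH/src/github.com/juju
  cd $GOPATH/src/github.com/juju
  git clone https://github.com/juju/juju.git
  cd juju
  git checkout 1.25
  # fetch the pinned dependencies here (dependencies.tsv), then:
  go install ./cmd/juju ./cmd/jujud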

Revision history for this message
Mario Splivalo (mariosplivalo) wrote :

Hello, Andrew!

At first glance it looks like the issue is resolved.

I deployed a test environment with 1.25.13 (locally compiled from the github branch, as per your suggestion), and the memory usage graph line for jujud looks flat:

http://people.canonical.com/~mario/lp1701481/2017-08-21

I failed to include the mongodb plugins that monitor connection counts and operations against the database (because of SSL issues with the plugin), but I will add those later in the day.

I'll leave my environment running for a few more days to make sure memory is not being leaked, but comparing the current data with the 1.25.12 graphs posted earlier, it seems that jujud is no longer leaking memory.

I'll keep an eye on my test environment and will update the bug accordingly.

Revision history for this message
Mario Splivalo (mariosplivalo) wrote :

I restarted jujud (actually, I rebooted the state server), and two things happened:

- the number of TCP connections to port 37017 (mongod) went up, from 50 to 220
- the memory usage for jujud went down

(The memory usage for mongod went up, but I'm assuming this is because of the increased number of connections to mongod.)
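
For anyone wanting to spot-check the connection count I'm graphing, something like this on the state server does the job:

  # count established TCP connections involving the juju mongod port (37017)
  ss -tn state established '( sport = :37017 or dport = :37017 )' | tail -n +2 | wc -l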

I uploaded new 'snapshot' of the munin graphs here:

http://people.canonical.com/~mario/lp1701481/2017-08-22/

The state server was restarted around 11AM (see the 'uptime' graph at the bottom of the page).

I will leave the environment running for a few more days to see how jujud behaves over time - but it still looks like the memory leak was fixed by Andrew's commit.

Revision history for this message
Mario Splivalo (mariosplivalo) wrote :

I have uploaded a fresh set of graphs; you can find them here:

http://people.canonical.com/~mario/lp1701481/2017-08-28

As can be seen, the memory usage for jujud remains the same.

I have just restarted jujud and will post a new 'snapshot' of the graphs within a day or so, but from the looks of it, the jujud memory leak is fixed! :)

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Thanks for the testing, Mario - I'm glad to hear it!

Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Mario Splivalo (mariosplivalo) wrote :

Hi, Andrew.

No problem! :) I just uploaded the latest graphs:

http://people.canonical.com/~mario/lp1701481/2017-08-29

It seems that this issue can be closed now.

Changed in juju-core:
status: Fix Committed → Fix Released