juju state server mongod uses too much disk space

Bug #1492237 reported by JuanJo Ciarlante
This bug affects 14 people
Affects: Canonical Juju | Status: Won't Fix | Importance: High | Assigned to: Unassigned
Affects: juju-core | Status: Won't Fix | Importance: Critical | Assigned to: Unassigned

Bug Description

E.g. in our non-HA OpenStack deployment with 49 principal units
and 232 subordinates, /var/lib/juju/db uses 4.2G, which seems
rather excessive for such a small deployment.

JuanJo Ciarlante (jjo)
tags: added: canonical-bootstack
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Which version of Juju is this environment running? There were problems in 1.23 and early 1.24 releases that would cause unreasonable mongodb disk usage. Upgrading to 1.24 should automatically free up the unnecessary disk space over time.

Upgrading to 1.24.5 or later is highly recommended.

See this thread for more information: https://<email address hidden>/msg02783.html
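
As a quick sanity check before and after the upgrade, something like the following shows the running agent version and the current db size. This is only a minimal sketch; it assumes the default 1.x state server layout, so adjust paths as needed.

# From a client with the environment configured, show the agent versions in use:
juju status --format=yaml | grep -i agent-version | sort -u
# On the state server itself, show the current db disk usage:
sudo du -sh /var/lib/juju/db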

Revision history for this message
JuanJo Ciarlante (jjo) wrote :

Apologies, I forgot to provide the Juju version - it was 1.23.3, which got upgraded
to 1.24.5 (we found out about the disk space usage while waiting for 'juju backups'
to complete). Unfortunately, upgrading to 1.24.5 didn't fix the issue:

* 1.23.3 (2015-09-03)
# du -sch /var/lib/juju/db
4.2G /var/lib/juju/db
4.2G total

* 1.24.5 (2015-09-07)
# du -sch /var/lib/juju/db/
5.3G /var/lib/juju/db/
5.3G total

Revision history for this message
JuanJo Ciarlante (jjo) wrote :

hm ... now I actually read your 'over time'.
FTR, detailed per-file disk usage atm:
http://paste.ubuntu.com/12303015/

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I've just reviewed this part of the code. The txn pruning functionality runs every 2 hours, so there should be a reduction in disk space 2 hours after the state servers were upgraded to 1.24.

If the environment is set to log at INFO or DEBUG you should see logs related to the pruning activity. Search for "prun" in the logs (to catch both "prune" and "pruning"). If logging is set to ERROR, any serious problems with the pruning functionality should still show up in the logs.

Revision history for this message
JuanJo Ciarlante (jjo) wrote :

grep prun /var/log/juju/machine-0.log -> http://paste.ubuntu.com/12307044/
shows 'starting worker "txnpruner"', but no 'pruning' message, fwiw.

Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.26.0
tags: added: mongodb
Revision history for this message
Michael Foley (foli) wrote :

I was able to reduce the size of the juju DB on a bootstrap node by following the instructions from http://stackoverflow.com/questions/2966687/reducing-mongodb-database-file-size

This was with juju version 1.24.5 with a 4GB ephemeral disk mounted on /mnt.

Example of commands I used:
ps aux | grep mongo
stop juju-db
# Copy the mongod command line from the ps output and modify it to add --nojournal --repair --repairpath /mnt
/usr/lib/juju/bin/mongod --auth --dbpath /var/lib/juju/db --sslOnNormalPorts --sslPEMKeyFile /var/lib/juju/server.pem --sslPEMKeyPassword xxxxxxx --port 37017 --noprealloc --syslog --smallfiles --nojournal --keyFile /var/lib/juju/shared-secret --replSet juju --ipv6 --oplogSize 512 --repair --repairpath /mnt
start juju-db

I then restarted all jujud processes for good measure.
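
For convenience, here is a minimal sketch that wraps the steps above into a single script. It is only a sketch: it assumes the same mongod flags shown in the comment above (copy the real ones from `ps aux | grep mongo`, including the actual --sslPEMKeyPassword value) and that /mnt has enough free space for the repaired copy of the database. Run it as root on the state server.

#!/bin/bash
# Hypothetical wrapper for the manual repair procedure described above (Juju 1.x, upstart).
set -e

du -sh /var/lib/juju/db      # record the size before the repair

stop juju-db                 # mongod must be stopped so --repair can take the datafiles

# Re-run mongod with the flags it normally uses, plus the repair options.
# The flags below are placeholders copied from the comment above; use your own.
/usr/lib/juju/bin/mongod --auth --dbpath /var/lib/juju/db \
  --sslOnNormalPorts --sslPEMKeyFile /var/lib/juju/server.pem \
  --sslPEMKeyPassword xxxxxxx --port 37017 --noprealloc --syslog \
  --smallfiles --nojournal --keyFile /var/lib/juju/shared-secret \
  --replSet juju --ipv6 --oplogSize 512 \
  --repair --repairpath /mnt

start juju-db                # bring the regular mongod back up

du -sh /var/lib/juju/db      # confirm the space was reclaimed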

Changed in juju-core:
milestone: 1.26.0 → 2.0-beta5
Changed in juju-core:
milestone: 2.0-beta5 → 2.0-beta4
Changed in juju-core:
milestone: 2.0-beta4 → 2.1.0
affects: juju-core → juju
Changed in juju:
milestone: 2.1.0 → none
milestone: none → 2.1.0
Revision history for this message
Greg Lutostanski (lutostag) wrote :

Hit this. We have 5.2 GB in /var/lib/juju/db and another 1.6 GB in /var/lib/juju/charm-get-cache. We bootstrapped to a virtual node with a 10GB disk, so it often runs out of space and we need to find random folders to delete. Either reducing disk usage overall or providing a way to vacuum unused disk space would be necessary for long-term deployments.
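
For anyone hitting the same space pressure, a quick way to see which Juju directories are the big consumers (a minimal sketch, assuming the default 1.x paths on the bootstrap node):

# Show the usual large Juju directories and the remaining free space.
sudo du -sh /var/lib/juju/db /var/lib/juju/charm-get-cache /var/log/juju
df -h /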

Revision history for this message
Ryan Beisner (1chb1n) wrote :

+1 Ran into exactly lutostag's situation in OpenStack Charm CI.

tags: added: uosci
tags: added: oil-2.0
Revision history for this message
Jacek Nykis (jacekn) wrote :

I've just hit this problem in a production environment.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

FYI, I've recorded my experience in working around this more permanently in: https://bugs.launchpad.net/juju-core/+bug/1634390

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Added juju-core as this impacts long-running production workloads, which at this point in time are predominantly on 1.25.x, with no upgrade path to 2.x.

Changed in juju-core:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 1.25.10
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.10 → none
Changed in juju-core:
milestone: none → 1.25.11
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Ryan, @Jacek,
Since this bug was originally filed, we have addressed many performance-related issues in both 1.25 (1.25.10 most recently) and in 2.0 (2.0.3). Are you still seeing this in later versions of Juju?

Since bug #1634390 has a workaround for this specific issue, can this one be considered its duplicate?

If the answers to the above questions make this a standalone current bug, could you please clarify your Juju versions?

Changed in juju:
status: Triaged → Incomplete
Changed in juju-core:
status: Triaged → Incomplete
Changed in juju:
milestone: 2.1.0 → none
Changed in juju-core:
milestone: 1.25.11 → none
Revision history for this message
Ryan Beisner (1chb1n) wrote :

1634390 is not a duplicate of this. 1634390 is specifically about Juju not working when /var is on a separate partition, and that should be considered a common and valid enterprise use case.

The one long-running Juju 1.x deployment I have isn't on 1.25.10 yet, so I cannot confirm whether or not the juju db size is reasonable for such a small deployment. The best I can offer is a Juju upgrade at some point, on a deployment where the juju db is already huge.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Ryan,
I'll take this offer :)
Please re-test with 1.25.10 when you can.
Feel free to re-open the bug if mongod still misbehaves for you.

Revision history for this message
Paul Gear (paulgear) wrote :

Are we to expect that existing long-running juju 1.25.x installations will have their disk usage automatically reduced?

Revision history for this message
Ryan Beisner (1chb1n) wrote :

@anastasia - apologies, afaik there isn't really a test, other than using Juju for the long haul (perhaps a soak test, if juju core has that?).

I don't and likely won't have a new long-running 1.25.10 model from which to give you a clean behaviour report. The long-running 1.25.x models I have will be upgraded to 1.25.10 at some point. However, they've already been worked around by giving the controller a ton of disk space.

Haw Loeung (hloeung)
Changed in juju:
status: Incomplete → Confirmed
Changed in juju-core:
status: Incomplete → Confirmed
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Haw Loeung (hloeung),
We have addressed a lot of mongo-related issues in 2.2, including moving to a more efficient and robust version of Mongo, as well as updated algorithms for pruning and cleaning transactions and caches.

We have also updated the mgopurge tool to work with installations that run previous versions of Juju. Please take advantage of the new mgopurge functionality to resolve the db space growth you are currently observing:

A. `mgopurge -password "...." -yes -stages prune`

 - This can be run on a live system. In fact, we'd recommend running it periodically as a cron job (see the sketch below). From 2.2, similar functionality is run every couple of hours by Juju itself.
 - In an HA setup, this can be run on any node, preferably the primary.
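
A minimal sketch of such a cron entry, assuming mgopurge is installed at /usr/local/bin/mgopurge; the install path, schedule, and log file are illustrative only, and the password placeholder must be replaced with the controller's mongo password:

# /etc/cron.d/mgopurge-prune (hypothetical): prune leftover transactions nightly.
30 2 * * * root /usr/local/bin/mgopurge -password "...." -yes -stages prune >> /var/log/mgopurge-prune.log 2>&1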

B. `mgopurge -password "...." -yes -stages compact`

This will give space back to the operating system. However:
 - It does lock the database and takes time proportional to the database size.
 - It is best to shut down the controllers prior to running this command.
 - It requires free space equal to the current size of the database plus another 2GB.
 - In an HA setup, you may need to run this on each node (a sketch of this workflow is shown below).
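
Putting B together, a minimal sketch of the compact workflow on a single (non-HA) Juju 1.25.x controller. It assumes the upstart service name jujud-machine-0 for the controller agent and that juju-db itself stays running so mgopurge can connect; check the free-space requirement above before starting.

# Stop the controller agent before compacting; leave juju-db running.
sudo stop jujud-machine-0

# Check there is enough free space: current db size plus roughly another 2GB.
sudo du -sh /var/lib/juju/db
df -h /var/lib/juju

# Compact; this locks the database and takes time proportional to its size.
mgopurge -password "...." -yes -stages compact

# Bring the controller agent back up once compaction has finished.
sudo start jujud-machine-0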

Note that this functionality is compatible with Mongo 2.4 (for Juju 1.25.x) and 3.2 (for Juju 2.0.x and 2.1.x).

I am marking this report as Won't Fix - we are not planning any further releases for Juju series prior to 2.2.

Changed in juju:
status: Confirmed → Won't Fix
Changed in juju-core:
status: Confirmed → Won't Fix
Revision history for this message
Haw Loeung (hloeung) wrote :

mgopurge 1.4 does indeed help.

Before (du output):

| 3.9G ./db

Before (mongo show databases etc.):

| juju 1.999267578125GB
| juju.txns, storage = 1404 MB, index = 0 MB, total = 1404 MB

mgopurge output snippet:

| 2017-04-13 05:52:22 0 txns are still referenced and will be kept
| 2017-04-13 05:52:22 pruning completed: removed 1637 txns
| 2017-04-13 05:52:22 Running stage "compact": Compact database to release disk space (does not compact replicas)
| 2017-04-13 05:52:22 detected mmapv1 storage engine: using repairDatabase to compact

Now (du output):

| 1.4G ./db

Now (mongo show databases etc.):

| juju 0.0625GB
| juju.txns.log, storage = 10 MB, index = 1 MB, total = 11 MB

Thanks.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Haw Loeung (hloeung),

Thank you for the update \o/ It's great to know, and we appreciate the feedback!
