Poor "juju create-backup" performance

Bug #1680683 reported by Paul Gear
This bug affects 3 people
Affects: Canonical Juju
Status: Expired
Importance: High
Assigned to: Unassigned

Bug Description

Backing up shared Juju controllers with any significant number of models performs very poorly. On a controller with around 40 models [1], "juju create-backup" took 1 hour and 19 minutes to generate and download, and the resulting backup tarball was 6 GB in size.

This is going to be exacerbated with JAAS, and we need to find a way to confidently upgrade controllers without waiting over an hour to get a viable backup. Are there parts of the database which could be legitimately excluded from the backup?

[1] http://pastebin.ubuntu.com/24331813/

Paul Gear (paulgear)
tags: added: canonical-is
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1680683] Re: Poor "juju create-backup" performance

Do you have numbers on where the time is spent? (And what version the
controller is running.) I wonder if you're running into extra data from
some of the collections that aren't being cleaned up. Or whether it is just
that you have lots of large charms or if it is logs/statuseshistory/etc.

John
=:->


Revision history for this message
Paul Gear (paulgear) wrote :

@jameinel: I'm not sure how I would go about determining where the time was spent - juju create-backup is a black box from the user perspective: we ask for a backup and get a tarball.

The controller was running 2.0.3, shortly before being upgraded to 2.1.2. Because the environment is a hosted JAAS, the models are opaque to us.

Please let us know what additional information is required in order to optimise the juju create-backup UX.

Revision history for this message
John A Meinel (jameinel) wrote :

You should be able to run the attached script as an admin user to find the sizes of the various collections involved:

Something like:
   $ agent=$(cd /var/lib/juju/agents; echo machine-*)   # the machine agent's name, e.g. machine-0
   $ pw=$(sudo cat /var/lib/juju/agents/${agent}/agent.conf | grep statepassword | awk '{ print $2 }')   # the agent's mongo password
   $ /usr/lib/juju/mongo3.2/bin/mongo --ssl --sslAllowInvalidCertificates -u ${agent} -p $pw localhost:37017/juju --authenticationDatabase admin sizes.js

should report the largest collections in the database.
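
(If you don't have the sizes.js attachment handy, a rough equivalent can be run inline with --eval. This is only a sketch using the standard collStats fields (storageSize and totalIndexSize), not a copy of the attached script:)

   $ /usr/lib/juju/mongo3.2/bin/mongo --ssl --sslAllowInvalidCertificates \
       -u ${agent} -p $pw localhost:37017/juju --authenticationDatabase admin --quiet \
       --eval 'db.getCollectionNames().forEach(function (c) {
           var s = db[c].stats();   // storageSize and totalIndexSize are reported in bytes
           print(c + "\t" + ((s.storageSize + s.totalIndexSize) / (1024 * 1024)).toFixed(1) + " MB");
       })'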

Changed in juju:
status: New → Incomplete
Revision history for this message
Paul Gear (paulgear) wrote :

sizes.js output for the environment in question: https://pastebin.canonical.com/185462/
(Note for future travelers: the above script must run on the primary node in an HA environment.)

@jameinel: Those sizes seem rather incongruent with a backup that results in a 6 GB compressed tarball.

Changed in juju:
status: Incomplete → New
Revision history for this message
John A Meinel (jameinel) wrote :

I think this change would let you run it on a secondary:
$ diff -u sizes.js sizes2.js
--- sizes.js 2017-04-11 07:34:48.000000000 +0400
+++ sizes2.js 2017-04-11 07:34:37.000000000 +0400
@@ -17,7 +17,7 @@
 var collStats = [];
 for (i = 0; i < collectionNames.length; i++) {
   coll = collectionNames[i];
- s = db[coll].stats();
+ s = db[coll].stats({"slaveOk": 1});
   var storageSizeMB = s['storageSize'] / bytesInMB;
   var indexSizeMB = s['totalIndexSize'] / bytesInMB;
   var totalSizeMB = storageSizeMB + indexSizeMB;

That is the primary Juju database, and while you're experiencing a bit of bloat (256 MB for txns), which should get better, that's clearly not where the bulk of your data is.

Another command you can run in the mongo shell is:

juju:PRIMARY> show databases
admin 0.000GB
backups 4.279GB
blobstore 1.821GB
juju 6.124GB
local 0.577GB
logs 3.434GB
presence 0.325GB

(I think it can be run on a secondary.)

"backups" shouldn't be interesting, as we shouldn't be grabbing data from there. 'logs' and 'blobstore' in particular are the ones that I think could be contributing to your 6GB backup.
You should be able to run 'sizes2.js' (with the patch is attached which hopefully runs on a secondary), but run it against specific databases:

for db in blobstore juju logs presence; do echo $db; /usr/lib/juju/mongo3.2/bin/mongo --ssl --sslAllowInvalidCertificates -u ${agent} -p $pw localhost:37017/$db --authenticationDatabase admin sizes2.js; done

Revision history for this message
Paul Gear (paulgear) wrote :

@jameinel: The updated sizes script didn't work on a slave; the problem is earlier in the code than your change - it fails on calling db.getCollectionNames().

Here are the "show databases" results for this environment:

juju:PRIMARY> show databases
admin 0.000GB
backups 13.161GB
blobstore 6.017GB
juju 0.541GB
local 0.545GB
logs 0.344GB
presence 0.199GB

Here are the sizes2 results for the individual databases: https://pastebin.canonical.com/185602/

Revision history for this message
John A Meinel (jameinel) wrote :

So the bulk of the backup is just that you have a large 'blobstore', which
holds all of the charms, resources, etc. that you have deployed.

Offhand I can't say whether that is active charms (currently in use) or
historical. I know we've looked at auto-pruning charms that are no longer
in use (as long as they would still be available from somewhere like the
charm store).

One major caveat is that if these are "local" charms, we really don't know
if you have a copy if you ever had to go back to a version that isn't your
current version. So I don't think we could prune automatically. That said,
we could provide commands to manage/cleanup your cached charm revisions
manually.

You might have a feel for what you have currently running, and how large
they are.

John
=:->
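
(To get a feel for what the blobstore holds, something like the following may help. It assumes the blobstore database stores blobs GridFS-style; the "blobstore.files" collection name and its "length" field are assumptions to verify against db.getCollectionNames() first, not something confirmed in this bug.)

   $ /usr/lib/juju/mongo3.2/bin/mongo --ssl --sslAllowInvalidCertificates \
       -u ${agent} -p $pw localhost:37017/blobstore --authenticationDatabase admin --quiet \
       --eval 'db.getCollectionNames().forEach(function (c) { print(c); });  // confirm the *.files collection name
           var files = db.getCollection("blobstore.files");                  // assumed name, adjust as needed
           printjson(files.aggregate([{ $group: { _id: null,
               blobs: { $sum: 1 }, totalBytes: { $sum: "$length" } } }]).toArray());'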


Revision history for this message
Stuart Bishop (stub) wrote :

On 12 April 2017 at 13:32, John A Meinel <email address hidden> wrote:

> One major caveat is that if these are "local" charms, we really don't know
> if you have a copy if you ever had to go back to a version that isn't your
> current version. So I don't think we could prune automatically. That said,
> we could provide commands to manage/cleanup your cached charm revisions
> manually.

I would not call this a major caveat. I don't think anyone has ever
wanted to revert a charm this way; it is always done by retrieving a
known good version and re-uploading it to the controller. Reverting to
'local-14' is never done, because there is no way of knowing what
'local-14' actually is without reverting to it, and at that point it is
too late and your system is potentially destroyed.

I would recommend that cleanup happen automatically, perhaps with some
tuning ('keep the last N revisions'), rather than requiring people to
cron this manually for each controller or model and then dealing with
support requests from people who didn't follow best practice and let
their mongodb grow unbounded.

--
Stuart Bishop <email address hidden>

Revision history for this message
Paul Gear (paulgear) wrote :

@jameinel: This is a JAAS instance, and the charms in the models are entirely opaque to us.

I think 6 GB/hr (1.7 MB/s) is an unreasonably slow backup rate regardless of the content, and if there is any data we can collect to work out where the bottleneck lies, I would be happy to assist.
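
(If it helps narrow down where the time goes: one low-tech approach, assuming shell access to the controller machine, is to sample system activity while a backup runs and see which phase dominates. These are generic sysstat/iostat commands, nothing Juju-specific.)

   # On the controller machine (needs the sysstat package); start samplers in the background:
   $ iostat -x 5 > /tmp/backup-iostat.log &
   $ pidstat -u 5 > /tmp/backup-pidstat.log &
   # From the usual client, time the backup:
   $ time juju create-backup
   # Stop the samplers when it finishes:
   $ kill %1 %2
   # Sustained disk I/O points at the dump/archive phase, high CPU in a gzip or mongod
   # process points at compression, and long periods with neither point at the download.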

Revision history for this message
Tim Penhey (thumper) wrote :

I think one thing that could be done here is to fix the backup/restore process so that it does not rely exclusively on the mongo dump.

Excluding the blobstore would probably make a big difference here, but it would require quite a bit of work around the backup/restore process.

I do think that the recent work that trims the presence and transaction collections will help this.
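
(For a rough sense of how much skipping the blobstore would save, one measurement-only experiment is to time a mongodump of just the non-blob databases. This assumes mongodump is shipped alongside the mongo client under /usr/lib/juju/mongo3.2/bin, reuses the ${agent}/$pw variables from the earlier commands, and only approximates the dump step, not the compression or download.)

   $ for d in juju logs presence; do
         time /usr/lib/juju/mongo3.2/bin/mongodump --ssl --sslAllowInvalidCertificates \
             -u ${agent} -p $pw --host localhost --port 37017 \
             --authenticationDatabase admin --db $d --out /tmp/dump-no-blobstore
     done
   $ du -sh /tmp/dump-no-blobstore   # uncompressed size of everything except blobstore/backups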

Changed in juju:
importance: Undecided → High
status: New → Triaged
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 5 years, so we're marking it Expired. If you believe this is incorrect, please update the status.

Changed in juju:
status: Triaged → Expired
tags: added: expirebugs-bot