Comment 16 for bug 1852502

John A Meinel (jameinel) wrote :

So if they are running into CappedPositionLost, there are a few things to investigate.

We do have two configuration settings for how large the various capped collections are. There are two types in play:
1) txns.log, which tracks recent transactions against the database. This can be tweaked with:
  `juju controller-config max-txn-log-size`
Essentially, it needs to be large enough to record any active transactions while the backup is being taken. It defaults to 10MB. That is an older setting, so I'm not sure whether we support changing it on a running controller (vs setting it at initial bootstrap). We can work with you to ensure both that it gets set correctly and that the txns.log collection itself is resized to match.
https://docs.mongodb.com/manual/reference/command/convertToCapped/
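
A rough sketch of what checking and adjusting that could look like (the 100M value is purely illustrative, and the set may be rejected if the key turns out to be bootstrap-only):

```
# Check the current cap on the transaction log (defaults to 10M)
juju controller-config max-txn-log-size

# Attempt to raise it -- 100M is an arbitrary example value
juju controller-config max-txn-log-size=100M

# If the existing txns.log collection also needs resizing, the convertToCapped
# command linked above can be run from a mongo shell (size is in bytes;
# whether this is appropriate depends on your MongoDB version):
#   db.runCommand({ convertToCapped: "txns.log", size: 104857600 })
```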

2) The log collections for each model, which also have a setting:
  `juju controller-config model-log-size`

The default here is 20MB. This one does appear to be properly handled at startup. So if it is changed, restarting the controllers should apply the new collection size.
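
Since the new size is picked up at startup, the flow would roughly be (50M is just an illustrative value, assuming the key can be changed on a live controller):

```
# Check and raise the per-model log collection cap
juju controller-config model-log-size
juju controller-config model-log-size=50M

# Restart the controller agents on each controller machine so the new
# capped-collection size is applied
sudo systemctl restart jujud-machine-*
```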

Some other thoughts:
a) I'm a bit concerned that with only 1GB of juju data and 200MB of log data, we're running into CappedPositionLost. Something is happening at a very high churn rate, for a fairly small amount of data.

b) The backups collection isn't empty; it is at ~500MB. With a 2.8 client and a 2.9 controller I would expect it to be essentially empty, since we aren't saving anything into the database.
In fact, when I run 'juju create-backup' against a test 2.9 controller, I don't even have a 'backups' collection.

c) The 'juju' database being 1GB seems larger than I would expect given the other collection sizes. It is possible that there is significant content in the database (lots of models/units/etc), but that doesn't fit with the 'blobstore' database, which holds the binaries for all deployed charms, being only 500MB.

It would be good to get some information on the size breakdown. Is it possible to run the following?

```
// Print each collection's logical size (and on-disk storage size), largest first
var collectionNames = db.getCollectionNames(), stats = [];
collectionNames.forEach(function (n) { stats.push(db[n].stats()); });
stats = stats.sort(function (a, b) { return b['size'] - a['size']; });
for (var c in stats) { print(stats[c]['ns'] + ": " + stats[c]['size'] + " (" + stats[c]['storageSize'] + ")"); }
```
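
(That can be run from a mongo shell attached to the database of interest, e.g. 'juju'; it prints each collection's logical size with its on-disk storage size in parentheses, largest first.)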

One idea is that we might have a broken transaction document, causing us to spin trying to apply a transaction; that extra churn could be what trips things up during backups.

I would expect to see log messages (available from `juju debug-log -m controller`, or by inspecting /var/log/juju/machine-* on a controller machine) complaining that it is failing to apply a transaction.
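
A rough way to look for those, assuming the relevant messages mention transactions (the exact wording and the grep pattern are guesses):

```
# Replay the controller model's logs without tailing and filter for
# transaction-related messages
juju debug-log -m controller --replay --no-tail | grep -i txn
```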

If that is an issue, it would be possible to stop the juju controllers (`systemctl stop jujud-machine-*` on each controller machine), and then run mgopurge to fix any obviously broken transactions.
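
A rough outline of that recovery path (mgopurge lives at https://github.com/juju/mgopurge; its exact invocation can vary by release, so treat this as a sketch):

```
# On each controller machine, stop the agent first
sudo systemctl stop jujud-machine-*

# Run mgopurge to repair incomplete or broken transactions
# (check its --help output for the options your release expects)
sudo mgopurge

# Bring the controller agents back up
sudo systemctl start jujud-machine-*
```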