Juju backups failing Executor error: CappedPositionLost: CollectionScan died due to position in capped collection being deleted.

Bug #1852502 reported by Haw Loeung
This bug affects 6 people
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

Hi,

When working on upgrading a controller from 2.5.1 to 2.6.10, we create a backup beforehand. Unfortunately, this kept failing with:

| 2019-11-08T04:32:46.699+0000 done dumping logs.logs.82922af8-1adf-4425-8a5e-5ebdd26fbd03 (70806 documents); 2019-11-08T04:32:46.699+0000 writing juju.txns.log to ;
| 2019-11-08T04:32:46.963+0000 Failed: error reading collection: Executor error: CappedPositionLost: CollectionScan died due to position in capped collection being deleted. Last seen record id: RecordId(1197166037);

More details in https://pastebin.canonical.com/p/gCymqhdHbQ/ (Company private, sorry).

Thanks,

Haw

tags: added: backup-restore
Changed in juju:
status: New → Triaged
importance: Undecided → Medium
tags: added: sts
Changed in juju:
assignee: nobody → Erlon R. Cruz (sombrafam)
Revision history for this message
Erlon R. Cruz (sombrafam) wrote :

FYI, this is a mongodump bug that is still open[1], so the approach I'm working on is just to remove the capped collection from the dump. There's actually one used by juju: 'txns.log'.

[1] https://jira.mongodb.org/browse/TOOLS-1636

Revision history for this message
Felipe Reyes (freyes) wrote :

> so the approach I'm working on is just to remove the capped collection from the dump. There's actually one used by juju: 'txns.log'

The txns collection will most likely contain in-flight transactions referenced by documents in their txn-queue property. When restoring from the backup, jujud will get confused, and my understanding is that it won't be able to commit changes until that out-of-sync state gets sorted out (something that usually makes us rely on mgopurge).

Changed in juju:
assignee: Erlon R. Cruz (sombrafam) → nobody
Revision history for this message
Pen Gale (pengale) wrote :

Dropping in some notes on voice and chat conversations so that they don't get lost down the line:

- We are working on migrating to Mongo 4 in the Juju 3.0 release. Per upstream, Mongo 4 probably doesn't have this issue.

- For Mongo 3, upstream has recommended that we "Consider using a TTL index to limit data size instead of a capped collection". We can look into that option if we aren't able to do the mongo upgrade for some reason.

- There do not seem to be workarounds in the meantime, other than doing the mongodb dump separately, without the collection that's causing trouble (see the sketch below). Everything *should* work when you restore this mongodb to a fresh controller, but you won't be able to use the Juju backup tools to orchestrate the restore, and we're not 100% certain that there won't be ill effects from losing the collection.
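
A minimal sketch of such a separate dump, assuming the troublesome collections are the capped juju.txns.log and per-model logs.logs.* collections (auth/ssl connection flags would match the juju-driven mongodump quoted elsewhere in this bug):

```
# Dump the juju database but leave out the capped txns.log collection.
# Note: --excludeCollection requires --db, and --oplog cannot be combined
# with --db, so this dump is not point-in-time consistent the way Juju's
# own backup is.
mongodump --db juju --excludeCollection txns.log --out /tmp/juju-dump

# Dump the logs database separately, or skip it entirely if historical
# logs are not needed in the backup (its per-model collections are also capped):
mongodump --db logs --out /tmp/juju-dump
```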

Revision history for this message
Joseph Phillips (manadart) wrote :

Observed this for a client cloud with the logs.logs collection.

Changed in juju:
importance: Medium → High
Revision history for this message
Soumya (trsoumi88) wrote :

I have a periodic juju backup script which sometimes fails with a similar error. Pasting it here for reference.

----------------------------
ERROR while creating backup archive: while dumping juju state database: error dumping databases: error executing "/usr/bin/mongodump": 2021-04-19T07:05:31.665+0000 writing admin.system.users to ; 2021-04-19T07:05:31.667+0000 done dumping admin.system.users (2 documents); 2021-04-19T07:05:31.667+0000 writing admin.system.version to ; 2021-04-19T07:05:31.668+0000 done dumping admin.system.version (2 documents); 2021-04-19T07:05:31.669+0000 writing logs.logs.bac50024-0ebc-4409-8261-2cf17197e703 to ; 2021-04-19T07:05:31.669+0000 writing logs.logs.f11ecb98-49f4-4bbe-8990-b9ad8fcbf316 to ; 2021-04-19T07:05:31.669+0000 writing juju.txns.log to ; 2021-04-19T07:05:31.670+0000 writing juju.statuseshistory to ; 2021-04-19T07:05:31.828+0000 done dumping juju.statuseshistory (21524 documents); 2021-04-19T07:05:31.828+0000 writing juju.actions to ; 2021-04-19T07:05:31.832+0000 Failed: error writing data for collection `logs.logs.bac50024-0ebc-4409-8261-2cf17197e703` to disk: error reading collection: Executor error during find command: CappedPositionLost: CollectionScan died due to position in capped collection being deleted. Last seen record id: RecordId(95589716);
----------------------------

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1852502] Re: Juju backups failing Executor error: CappedPositionLost: CollectionScan died due to position in capped collection being deleted.

One plausible fix would be to have 'juju create-backup --no-logs', as they aren't usually super relevant to restore and usually consume a significant amount of data.

Revision history for this message
Haw Loeung (hloeung) wrote :

On Tue, Apr 20, 2021 at 05:47:54PM -0000, John A Meinel wrote:
> One plausible fix would be to have 'juju create-backup --no-logs', as they
> aren't usually super relevant to restore and usually consume a
> significant amount of data.

Yes please. Perhaps --no-logs should be the default instead? So 'juju
create-backup' excludes logs by default, with an option such as
'--with-logs' or '--include-logs' for those wanting to also back up
logs.

Reducing the time it takes to perform Juju controller backups would
also be super useful to IS since we do this before performing any juju
upgrades (we run quite a lot of controllers and try to be as
up-to-date as possible).

Revision history for this message
Haw Loeung (hloeung) wrote :

I guess that falls under LP:1680683 - 'Poor "juju create-backup" performance'

Revision history for this message
Erlon R. Cruz (sombrafam) wrote :

> One plausible fix would be to have 'juju create-backup --no-logs', as they
> aren't usually super relevant to restore and usually consume a
> significant amount of data.

So, that would only reduce the set of problematic collections, as there are other collections known to fail (e.g. txns.log).

Revision history for this message
John A Meinel (jameinel) wrote :

It's fundamentally a race against how many changes happen while you are doing
the rest of the backup. If you can skip more of the actual content, then
the backup runs faster, giving less time for the data to overflow.

Revision history for this message
Erlon R. Cruz (sombrafam) wrote :

So, adding '--no-logs' would not be possible. Once you call mongodump with '--oplog' (which is what juju does[1]), mongo only dumps the whole deployment, including all databases and collections. Can we revisit the need for --oplog? It was not clear to me how this works and why juju needs it.

_________________
[1] mongodump -h 127.0.0.1 --port 37017 --ssl --sslAllowInvalidCertificates -u $agent -p $dbpass --authenticationDatabase admin --oplog

Revision history for this message
John A Meinel (jameinel) wrote :

Without --oplog you would have to take the Juju controllers offline to create a stable backup. Oplog allows you to start dumping the collections and include the write-ahead log, which means that if you start dumping collection A, then by the time you get to collection Z you have a consistent view across all of the collections.

It is interesting that with capped collections, we fundamentally have data that is fairly 'ephemeral' and would be ok if it wasn't consistent in the output. Certainly the option being discussed is to discard it entirely with '--no-logs'.
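
For reference, a rough sketch of how that oplog-based consistency works with the stock mongo tools (connection/auth flags elided; the full juju-driven invocation is quoted in the comment above):

```
# Dump all databases plus the oplog entries written while the dump runs:
mongodump --oplog --out /tmp/juju-dump

# On restore, replay that captured oplog so every collection ends up at the
# same single point in time:
mongorestore --oplogReplay /tmp/juju-dump
```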

Revision history for this message
nikhil kshirsagar (nkshirsagar) wrote (last edit ):

Adding some information on the bug as advised by John (@jam). A user who is hitting this has the following environment:

$ juju --version
2.8.13-bionic-amd64

$ juju status
Model Controller Cloud/Region Version SLA Timestamp Notes
controller sj1-prod-maas-01-juju-controller maas_cloud 2.9.16 unsupported 09:49:29-08:00 upgrade available: 2.9.22

Machine State DNS Inst id Series AZ Message
0 started 10.100.40.233 kq34k7 bionic default Deployed
1 started 10.100.40.234 juju-2 bionic default Deployed
2 started 10.100.40.235 juju-3 bionic default Deployed

juju:PRIMARY> show databases
admin 0.000GB
backups 0.498GB
blobstore 0.454GB
config 0.000GB
juju 1.084GB
local 1.311GB
logs 0.205GB
juju:PRIMARY> show collections
system.keys
system.users
system.version
juju:PRIMARY>

They run a script to take daily backups of the juju controller, followed by a command to keep only the latest backup, but the backups are filling up the disk on all 3 juju controller nodes. I believe the backup failures cause backups to keep piling up, because the cleanup command only runs after a successful backup.

Even if they run this command manually, they get a message that there is only the most current backup:

# juju remove-backup -m sj1-prod-maas-01-juju-controller:admin/controller --keep-latest
WARNING no backups to remove, 20210921-190247.c06182d1-ffc0-454f-8418-dac55ad882a8 most current

But when they log into the controllers, they see juju backup directories for every backup run in /tmp:
:/tmp$ ls -ld jujuBackup*
drwx------ 3 root root 4096 Dec 20 10:58 jujuBackup-016549046
drwx------ 3 root root 4096 Dec 20 11:37 jujuBackup-042799675
drwx------ 3 root root 4096 Dec 20 10:58 jujuBackup-063462045
drwx------ 3 root root 4096 Dec 20 10:58 jujuBackup-270640115

These are the commands used:

/snap/bin/juju create-backup -m sj1-prod-maas-01-juju-controller:admin/controller --filename $DEST_DIR/juju-backup-sj1-prod-maas-01-juju-controller-$TIME.tar.gz "sj1-prod-maas-01-juju-controller" 2>&1

juju remove-backup -m sj1-prod-maas-01-juju-controller:admin/controller --keep-latest

Revision history for this message
John A Meinel (jameinel) wrote :

@nikhil are they running into "CappedPositionLost" or is it just that the temp directories are not being cleaned up? It sounds like a different bug, and I want to make sure we're trying to fix the right thing.

Revision history for this message
John A Meinel (jameinel) wrote :

So if they are running into CappedPositionLost there are a few things to investigate.

We do have 2 configuration settings for how large the various capped collections are. There are 2 types in play:
1) txns.log, which tracks recent transactions against the database; this can be tweaked with:
  `juju controller-config max-txn-log-size`
Essentially, it needs to be large enough that we can record any active transactions while the backup is being taken. It defaults to 10MB. That is an older setting, so I'm not sure if we support changing it (vs setting it on initial bootstrap). We can work with you to ensure that it both gets set correctly and that the database collection is the right size.
https://docs.mongodb.com/manual/reference/command/convertToCapped/

2) The log collections for each model also have a setting:
  `juju controller-config model-log-size`

The default here is 20MB. This one does appear to be properly handled at startup. So if it is changed, restarting the controllers should apply the new collection size.
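
A sketch of inspecting and adjusting those settings, using the key names above (the '100M' value format is an assumption based on the '10M'-style default, and as noted, max-txn-log-size may only be honoured at bootstrap):

```
# Read the current values:
juju controller-config max-txn-log-size
juju controller-config model-log-size

# Example: grow the per-model log collections, then restart the controller
# agents so the new capped-collection size is applied:
juju controller-config model-log-size=100M
```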

Some other thoughts:
a) I'm a bit concerned that with only 1GB of juju data and 200MB of log data, we're running into capped position lost. Something is happening at a very high churn rate, for a fairly small amount of data.

b) The backups collection isn't empty; it is at ~500MB. With a 2.8 client and a 2.9 controller I would expect it to be essentially empty, since we aren't saving anything into the database.
In fact, when I run 'juju create-backup' on a test 2.9 controller, I don't even have a 'backups' collection.

c) The 'juju' database being 1GB seems larger than I would expect given the other collection sizes. It is possible that there is significant content in the database (lots of models/units/etc), but that doesn't fit with 'blobstore', which holds all of the binaries for all deployed charms, being only ~500MB.

It would be good to get some information on the size breakdown. Is it possible to use

```
// Print each collection's namespace, data size, and storage size, largest first.
var collectionNames = db.getCollectionNames(), stats = [];
collectionNames.forEach(function (n) { stats.push(db[n].stats()); });
stats = stats.sort(function(a, b) { return b['size'] - a['size']; });
for (var c in stats) { print(stats[c]['ns'] + ": " + stats[c]['size'] + " (" + stats[c]['storageSize'] + ")"); }
```

One idea is that we might have a broken transaction document, causing us to spin trying to apply a new transaction, which creates more churn and causes a hiccup during backups.

I would expect to see log messages (available from `juju debug-log -m controller`, or by inspecting /var/log/juju/machine-* on a controller machine) complaining that it is failing to apply a transaction.

If that is an issue, it would be possible to stop the juju controllers (systemctl stop jujud-machine-* for any controller machine), and then run mgopurge to fix any obviously broken transactions.
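
A hedged sketch of that procedure; the mgopurge invocation below is a placeholder, since its connection options vary by release (check its --help):

```
# On every controller machine (all of them for an HA controller), stop the agents:
sudo systemctl stop jujud-machine-*

# On one controller machine, run mgopurge to clean up broken/incomplete
# transactions (placeholder invocation):
./mgopurge

# Start the agents again:
sudo systemctl start jujud-machine-*
```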

Revision history for this message
nikhil kshirsagar (nkshirsagar) wrote (last edit ):

The user has provided the data collected by:

```
var collectionNames = db.getCollectionNames(), stats = [];
collectionNames.forEach(function (n) { stats.push(db[n].stats()); });
stats = stats.sort(function(a, b) { return b['size'] - a['size']; });
for (var c in stats) { print(stats[c]['ns'] + ": " + stats[c]['size'] + " (" + stats[c]['storageSize'] + ")"); }
```

I'll upload the file to this bug.

I've also asked them for the output of `juju debug-log -m controller` and for the /var/log/juju/machine-* logs from a controller machine.

Thank you for your help with this issue so far.

Regards,
Nikhil.

Revision history for this message
nikhil kshirsagar (nkshirsagar) wrote :
Revision history for this message
John A Meinel (jameinel) wrote :

Top entries from mongo output:
juju.txns 134MB, 189k entries
juju.metrics 126MB, 322k entries
juju.statuseshistory 84MB, 400k entries
juju.settings 22MB, 11k entries
juju.txns.log 10MB, 71k entries
juju.resources 46MB, 6k entries
juju.unitstates 3.1MB, 2.6k entries
juju.statuses 2.8MB, 11k entries

Revision history for this message
John A Meinel (jameinel) wrote :

juju.txns and juju.metrics seem a bit oversized. It isn't inherently a problem (Juju tries to be reasonable about how often it prunes vs how much space is consumed), but it might hint at an issue if it is unable to prune entries.

Revision history for this message
nikhil kshirsagar (nkshirsagar) wrote (last edit ):

Hi John,

thank you for the comments and information. I've received the controller logs and the /var/log/juju/machine-* logs.

The controller debug logs are short, so I am pasting them here; this is all there is in that file:

------------------------------------------------------------------------------------

machine-2: 11:26:59 INFO juju.apiserver.common.networkingcommon processing link-layer devices for machine "298"
machine-2: 11:26:59 INFO juju.apiserver.common.networkingcommon processing link-layer devices for machine "15"
machine-2: 11:26:59 INFO juju.apiserver.common.networkingcommon processing link-layer devices for machine "18"
machine-2: 11:26:59 INFO juju.apiserver.common.networkingcommon processing link-layer devices for machine "200"
machine-2: 11:26:59 INFO juju.apiserver.common.networkingcommon processing link-layer devices for machine "40"
machine-2: 11:26:59 INFO juju.apiserver.common.networkingcommon processing link-layer devices for machine "388"
machine-2: 11:26:59 INFO juju.apiserver.common.networkingcommon processing link-layer devices for machine "267"
machine-2: 11:26:59 INFO juju.apiserver.common.networkingcommon processing link-layer devices for machine "300"
machine-2: 11:26:59 INFO juju.apiserver.common.networkingcommon processing link-layer devices for machine "75"
machine-2: 11:27:02 INFO juju.apiserver.common.networkingcommon processing link-layer devices for machine "410"

------------------------------------------------------------------------------------

I am attaching the other file collected - machine-0.log

Regards,
Nikhil.

Revision history for this message
nikhil kshirsagar (nkshirsagar) wrote :
Revision history for this message
nikhil kshirsagar (nkshirsagar) wrote (last edit ):

Another user ran into this. We've received the requested data from them too; I will attach it to this bug.

In the machine logs, I see some link layer device merge attempts failing, e.g.:

2022-01-25 14:23:59 ERROR juju.apiserver.instancepoller instancepoller.go:176 link layer device merge attempt for machine 121 failed due to error: provider IDs not unique: 507364; waiting until next instance-poller run to retry
2022-01-25 14:23:59 ERROR juju.apiserver.instancepoller instancepoller.go:176 link layer device merge attempt for machine 120 failed due to error: provider IDs not unique: 507379; waiting until next instance-poller run to retry

-Nikhil.

Revision history for this message
nikhil kshirsagar (nkshirsagar) wrote :
Revision history for this message
Simon Richardson (simonrichardson) wrote :

If we're seeing a lot of churn in logs.logs, maybe we could up the log level to CRITICAL before the backup and then set it back to the previous level afterwards?

I'll speak with @jameinel today about this possible solution.
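
A sketch of that workaround done by hand today, assuming the churn is coming from the controller model (busy hosted models may need the same treatment); the final 'WARNING' is only an example of the previous level to restore:

```
# Note the current level so it can be restored afterwards:
juju model-config -m controller logging-config

# Quieten logging before the backup:
juju model-config -m controller logging-config="<root>=CRITICAL"

# Take the backup:
juju create-backup -m controller

# Put the previous logging configuration back, e.g.:
juju model-config -m controller logging-config="<root>=WARNING"
```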

Revision history for this message
Erlon R. Cruz (sombrafam) wrote :

Hi folks, an important piece of information I forgot to post long ago when I hit this problem.
I was able to get rid of the errors by manually upgrading mongo from 3.6 to 4. This was a manual upgrade, and not recommended, but AFAIK juju 3 will bring mongo 4, so the problem should be fixed there.

For the tests, I first set up an environment to reproduce the problem and measure the rate at which it recurred. I used a juju HA controller on 4 vCPU / 8GB VMs and had FIO running on the 3 controllers while I continuously took backups with mongodump. With mongo 3.6 I got the following failure rate:

Runs Failures
20 3
40 5
64 7
139 19

After I upgraded mongo to 4.x, there were no failures anymore even after 300 iterations, and backups are faster as well.

Revision history for this message
nikhil kshirsagar (nkshirsagar) wrote :

Hi Simon,

thanks for the suggestion. Upping the log level to critical and then back after taking the backup has helped avoid the issue. While the workaround does work, can we please get some idea of the long-term fix, if any, or whether we simply need to wait for juju 3.0, which would have mongo 4 and wouldn't have this issue?

Regards,
Nikhil.

Revision history for this message
Haw Loeung (hloeung) wrote :

On Thu, Apr 14, 2022 at 04:29:37AM -0000, nikhil kshirsagar wrote:
> thanks for the suggestion. Upping log level to critical and then back
> after taking the backup has helped avoid the issue. While the workaround
> does work, can we please get some idea of the long term fix, if any, or
> if we simply need to wait for juju 3.0 which would have mongo 4 which
> wouldnt have this issue?

Can Juju do that for us? As in, `juju create-backup` will
automatically up the log level to only log critical issues and then
back down once it's done creating a backup.

Revision history for this message
John A Meinel (jameinel) wrote :

I don't think it is something that makes sense for Juju to do by default, as the goal of backup is to not interrupt your normal flow.
Likely we also need to understand why a given model is generating that many logs in the first place. It sounds like something is in error and spinning and that error is just being ignored.

There are also some tools around rate limiting for logs that could be tweaked here, though I don't think they are exposed in a clean fashion.
Inside agent.conf there is a 'values:' section where you can set:
 LOGSINK_DBLOGGER_BUFFER_SIZE: 1000
 LOGSINK_DBLOGGER_FLUSH_INTERVAL: 2s
 LOGSINK_RATELIMIT_BURST: 1000
 LOGSINK_RATELIMIT_REFILL: 1ms

(those are the defaults).

The interesting ones are probably BURST and REFILL.
The general design is a token bucket, where every log message grabs a token to decide whether it is allowed to be logged, and we refill that bucket at one token per time period.
So in the default config, any given agent can log up to 1000 messages as fast as they want, but tokens only come back at one per 1ms, which isn't very much rate limiting (it essentially means you can stream 1000 log messages per second indefinitely).
Changing it to either keep the burst fixed but drop the refill to 2ms or even 10ms could be interesting; a sketch of that follows below.
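
A sketch of what that tweak could look like, assuming the standard machine-agent paths; these keys are not normally present in agent.conf, so they would be added by hand under 'values:':

```
# On each controller machine, add/adjust the rate-limit keys under the
# existing 'values:' section of /var/lib/juju/agents/machine-*/agent.conf,
# for example:
#   values:
#     LOGSINK_RATELIMIT_BURST: "1000"
#     LOGSINK_RATELIMIT_REFILL: "10ms"
# then restart the agent so the new settings are picked up:
sudo systemctl restart jujud-machine-*
```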
There are a few other things:

a) *We* don't have a great feel for what is an appropriate stream of log messages that doesn't lose things you care about but does avoid overloading, so please work with us to help find reasonable defaults.

b) We really should expose these as `juju controller-config` rather than being hidden in agent.conf. (The code to handle rate limiting predates our support for controller-config.)

c) We *might* want to also have a total rate limit, or a per-model rate limit (those get trickier because you have to share the rate-limiting bucket between threads).

Revision history for this message
Joseph Phillips (manadart) wrote (last edit ):

Something that might be worth entertaining for these scenarios is the logging-output configuration option.

You can direct logging to syslog instead of Mongo by using the config:
"logging-output=syslog"

On 2.9 this is behind a feature flag, so you will also need to configure:
"features=[logging-output]"

The flag won't be required for 3.0.

Just be aware that depending on the specific environment, using this feature can cause a lot of logging via the serial console. It caused issues on one particularly large controller.

Revision history for this message
Erlon R. Cruz (sombrafam) wrote :

Hey Joseph, thanks for the hint. Do you have the full CLI you used? I tried first on juju 2.9.27, but this was added in 2.9.29 (I can see from the code). Still, even after upgrading, it seems that there's no 'features' config in the allowed configs:

juju model-config "features=[logging-output]"
WARNING key "features" is not defined in the current model configuration: possible misspelling
ERROR cannot set controller attribute "features" on a model

Revision history for this message
Haw Loeung (hloeung) wrote :

I believe that's `juju controller-config` rather than `model-config`.

`juju help controller-config` says:

| features:
| type: list
| description: A list of runtime changeable features to be updated

Revision history for this message
Erlon R. Cruz (sombrafam) wrote (last edit ):

Thanks Haw,

So, for future travelers, here is how I managed to make juju log to syslog:

juju controller-config "features=[logging-output]"
juju model-config -m controller logging-output=syslog
# restart the juju services in the controller

Logs can be seen in /var/log/syslog and via journalctl -afu juju*
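
For the restart step, on a systemd-based controller machine that is typically (assuming the standard jujud-machine-* unit names used elsewhere in this bug):

```
sudo systemctl restart jujud-machine-*
```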

Revision history for this message
Haw Loeung (hloeung) wrote :

Thanks for that.

Out of interest, do you have metrics for the change to syslog for logging? Wondering if it helps reduce MongoDB load (in particular I/O), and if so, by roughly what percentage?

Revision history for this message
Erlon R. Cruz (sombrafam) wrote :

I don't have any metrics for that, but removing the load from MongoDB and moving it to syslog will not make a huge difference, since you are still doing the IO on the same device. The nice thing is that you could use another partition/disk for the log data; then you would have one disk dedicated to log operations.

But if you want real gains there, you could just decrease the logging level to WARNING or ERROR only. That alone will immensely decrease your controller load.

Revision history for this message
Aliaksandr Vasiuk (valexby) wrote :

Hi,

We faced this issue again today, and it appears that there is a more straightforward workaround available: just rerun the command a couple of times and it works. The Bootstack team has used this approach successfully for some time. We got this output today:
```
$juju create-backup
ERROR while creating backup archive: while dumping juju state database: error dumping databases: error executing "/usr/bin/mongodump": 2022-11-16T13:50:06.601+0000 writing admin.system.users to ; 2022-11-16T13:50:06.602+0000 done dumping admin.system.users (4 documents); 2022-11-16T13:50:06.603+0000 writing admin.system.version to ; 2022-11-16T13:50:06.604+0000 done dumping admin.system.version (2 documents); 2022-11-16T13:50:06.605+0000 writing juju.statuseshistory to ; 2022-11-16T13:50:06.605+0000 writing logs.logs.d8f110a4-7bdd-48b8-8231-72af1152f2a8 to ; 2022-11-16T13:50:06.606+0000 writing logs.logs.667d1c03-b40a-4836-8a09-645f5b83da44 to ; 2022-11-16T13:50:06.606+0000 writing juju.txns.log to ; 2022-11-16T13:50:07.465+0000 done dumping logs.logs.d8f110a4-7bdd-48b8-8231-72af1152f2a8 (78230 documents); 2022-11-16T13:50:07.465+0000 writing logs.logs.1d5c3328-ea4a-4fda-830c-6d25724e8140 to ; 2022-11-16T13:50:07.697+0000 Failed: error writing data for collection `juju.txns.log` to disk: error reading collection: Executor error during find command: CappedPositionLost: CollectionScan died due to position in capped collection being deleted. Last seen record id: RecordId(39255325);
```
After that we tried `juju create-backup` a couple more times and it worked.
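
For anyone scripting this, the rerun approach can be as simple as the following sketch (the retry count is arbitrary):

```
# Retry the backup a few times; the failure is a race against churn in the
# capped collections, so a rerun frequently succeeds:
for attempt in 1 2 3; do
    juju create-backup && break
    echo "juju create-backup attempt ${attempt} failed, retrying..." >&2
done
```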
