restore-backup ends up with a broken controller: ERROR could not clean up after failed restore attempt: cannot complete restore: <nil>: juju restore is in progress - API is disabled to prevent data loss

Bug #1818831 reported by Nobuto Murata
16
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Invalid
High
Simon Richardson

Bug Description

I'm following the procedure written in:
https://docs.jujucharms.com/2.5/en/controllers-backup#high-availability-considerations

A customer environment is on MAAS provider, but this specific issue can be reproduced in LXD provider.

How to reproduce:

$ juju bootstrap localhost
$ juju enable-ha

$ juju show-controller | grep ha-status
      ha-status: ha-enabled
      ha-status: ha-enabled
      ha-status: ha-enabled

$ juju create-backup -m controller
Remote backup was not created.
Downloaded to juju-backup-20190306-130043.tar.gz.

$ juju restore-backup -m controller --file juju-backup-20190306-130043.tar.gz
ERROR unable to restore backup in HA configuration. For help see https://docs.jujucharms.com/stable/controllers-backup

-> it's a documented behavior, so remove two of three controller.

$ juju remove-machine -m controller 1 2

$ juju restore-backup -m controller --file juju-backup-20190306-130043.tar.gz

-> then, the error below happens and the controller won't accept API request (full output will be attached).

ERROR could not clean up after failed restore attempt: cannot complete restore: <nil>: juju restore is in progress - API is disabled to prevent data loss
ERROR cannot perform restore: <nil>: restore failed: error restoring state from backup: error restoring database: error executing "/usr/bin/mongorestore": 2019-03-06T13:02:32.001+0000 preparing collections to restore from; 2019-03-06T13:02:32.008+0000 reading metadata for blobstore.blobstore.chunks from /tmp/juju-backups-003269483/juju-backup/dump/blobstore/blobstore.chunks.metadata.json; 2019-03-06T13:02:32.013+0000 restoring blobstore.blobstore.chunks from /tmp/juju-backups-003269483/juju-backup/dump/blobstore/blobstore.chunks.bson; 2019-03-06T13:02:32.033+0000 reading metadata for juju.txns from /tmp/juju-backups-003269483/juju-backup/dump/juju/txns.metadata.json;

...

finished restoring juju.generations (0 documents); 2019-03-06T13:02:32.758+0000 finished restoring juju.migrations (0 documents); 2019-03-06T13:02:32.762+0000 finished restoring juju.machineUpgradeSeriesLocks (0 documents); 2019-03-06T13:02:32.764+0000 reading metadata for juju.actions from /tmp/juju-backups-003269483/juju-backup/dump/juju/actions.metadata.json; 2019-03-06T13:02:32.770+0000 restoring juju.actions from /tmp/juju-backups-003269483/juju-backup/dump/juju/actions.bson; 2019-03-06T13:02:32.772+0000 restoring indexes for collection juju.actions from metadata; 2019-03-06T13:02:32.776+0000 finished restoring juju.actions (0 documents); 2019-03-06T13:02:32.776+0000 replaying oplog; 2019-03-06T13:02:32.776+0000 Failed: restore error: error applying oplog: applyOps: applyOps had one or more errors applying ops;

Tags: cpe-onsite
Revision history for this message
Nobuto Murata (nobuto) wrote :
Revision history for this message
Nobuto Murata (nobuto) wrote :
Revision history for this message
Nobuto Murata (nobuto) wrote :
Revision history for this message
Nobuto Murata (nobuto) wrote :

I cannot attach juju-crashdump as Juju API is unavailable.

Revision history for this message
Nobuto Murata (nobuto) wrote :

Subscribing ~field-high as it's blocking one of the user acceptance tests in the field.

Revision history for this message
Nobuto Murata (nobuto) wrote :

Hmm, having HA controllers may not be necessary to trigger this issue. I've reproduced it with AWS provider as well just with `juju bootstrap && juju create-backup -m controller && juju restore-backup -m controller --file <backup>`.

Revision history for this message
Richard Harding (rharding) wrote :

Is this the edge snap? I note the juju.generations in the log output which is a feature for 2.6 in the edge/develop branch.

Thanks for clarifying.

Revision history for this message
Richard Harding (rharding) wrote :

Please confirm the issue with 2.5 vs 2.6.

Changed in juju:
status: New → Incomplete
Changed in juju:
assignee: nobody → Richard Harding (rharding)
Revision history for this message
Nobuto Murata (nobuto) wrote :

I tested it with 2.5.1.

Changed in juju:
status: Incomplete → New
Changed in juju:
status: New → Triaged
importance: Undecided → High
Changed in juju:
assignee: Richard Harding (rharding) → Simon Richardson (simonrichardson)
Revision history for this message
Simon Richardson (simonrichardson) wrote :

So this does work in 2.5.0, so investigating what changed between 2.5.0 and 2.5.1

Revision history for this message
Simon Richardson (simonrichardson) wrote :

I've gone through and done a git bisect to identify the commit that broke the restoration of a dump. https://github.com/juju/juju/commit/31f4f5526d710578511df3799cdeea8a1f8fa42c

Revision history for this message
Tim Penhey (thumper) wrote :

Unfortunately this commit is a merge of 2.5 into another branch, and includes a whole bunch of things. We should manually bisect the mainline merge commits on the 2.5 branch between 2.5.0 and 2.5.1.

Revision history for this message
Simon Richardson (simonrichardson) wrote :

So unfortunately my initial thoughts for the commit found by git bisect was wrong and in fact was caused by the intermittent nature of the underlying bug. Depending on what is found inside the mongodump oplog.bson file will cause the restore to fail and what's found inside the bson file will depending on timings of items being processed.

To understand when this occurred with in the code base, I ran both 2.4.(5,6,7), 2.5.(0,1,2) on bionic with the packaged version of mongo 3.6.3. Running `juju bootstrap && juju create-backup -m controller && juju restore-backup -m controller --file <backup>` failed with "Failed: restore error: error applying oplog: applyOps: applyOps had one or more errors applying ops;".

Running mongorestore with the supplied tar.gz from `juju create-backup` and using a lxc container `lxc launch ubuntu:18.04 bionic`, I installed mongo 3.6.3 and ran the command by hand. This also failed with the same error. Checking `/var/log/syslog` revealed why this failed showed that there was an error:

applyOps error applying: InvalidUUID: uuid must be a 16-byte binary field with UUID (4) subtype

To understand if this was an issue with just bionic, I also tested against xenial. Xenial didn't experience this bug, even running numerous times (over 8 times, between me and others).

----

On a hunch checking mongo issue tracker, also revealed this defect[1], which is really similar to what we're experienced.

Creating a new lxc container for bionic `lxc launch ubuntu:18.04 bionic` and then installing mongo version 3.6.9 I then took the same tar.gz file and ran mongorestore by hand in the same way as 3.6.3, it was able to successfully restore the backup dump. I also tested against 4.0.6 as well.

---

Notes:
 - Scripts to install mongodb 3.6.9[2]
 - Output from mongorestore 3.6.3[3]
 - Output from mongorestore 3.6.9[4]
 - Output from mongorestore 4.0.6[5]

 - 1: https://jira.mongodb.org/browse/SERVER-36951
 - 2: https://paste.ubuntu.com/p/R2FnzZNpzS/
 - 3: https://paste.ubuntu.com/p/b3ntq5p6DG/
 - 4: https://paste.ubuntu.com/p/YRCfTjbPyw/
 - 5: https://paste.ubuntu.com/p/w8TNKwxPHh/

Revision history for this message
Andreas Hasenack (ahasenack) wrote :
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Uploads done, awaiting SRU team review. Would you like to mark this bug as invalid, or something else? It won't autoclose, since the bug is in mongodb, not juju.

Revision history for this message
Richard Harding (rharding) wrote :

ok, let's mark this as invalid for Juju with the note the packages are being tracked for the update.

Changed in juju:
status: Triaged → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.