Deja Dup keeps making fresh backups

Bug #490188 reported by MFeif
This bug affects 14 people
Affects: Déjà Dup
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

Deja Dup is not failing, and I see no errors, but it keeps making FRESH backups. I have a stable backup target (a network drive) and have not done anything unusual or special, yet in the 2 weeks or so that I have had DD running, it has "started over" with fresh backups about 3-4 times. My 25G of only subtly changing data seems to be eating 80G on the target share. Also, for the 2h+ that each backup takes, my machine is pretty tied up. In short, it's frustrating that my machine keeps backing up 25G of data that is ALREADY BACKED UP.

The instructions for submitting bugs (running with DEBUG=1, etc.) don't apply, as it's not failing; it's just behaving badly. If there's something else I can provide, I'd be happy to.

(gconf attached.)

deja-dup 11.1-0ubuntu0karmic1
duplicity 0.6.06-0ubuntu0karmic1
Description: Ubuntu 9.10

Thanks!

Revision history for this message
MFeif (matt-feifarek) wrote :
Revision history for this message
Michael Terry (mterry) wrote :

So, just to be clear, you're saying that DD keeps making fresh backups each time you back up, with no incremental backups? How often do you back up now? Do you let the fresh backup finish?

For background, a feature added in 11.0 was that DD would occasionally make fresh backups to provide data security (if it didn't, and there was a problem with any of the incremental backups, you'd have trouble restoring). So it's a 'better safe than sorry' feature, even though it does take some extra space (which can be alleviated somewhat by setting the 'keep backups' preference to something besides 'forever'). The default period between fresh backups should be about 3 months.

Also, you say your machine is tied up during a backup? How bad does it get? DD is supposed to lower its priority for CPU and disk, so as not to block usage.

Changed in deja-dup:
status: New → Incomplete
Revision history for this message
Michael Terry (mterry) wrote :

Oh, sorry, I missed the bit in your gconf settings where you have it back up daily. This will cause DD to make fresh backups every 2 weeks (it scales according to how often you back up, so as to avoid overly long chains of incremental backups).
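[Editor's note: for readers curious how such scaling might work in principle, here is a minimal sketch under the assumption that the interval is simply proportional to the backup frequency; the function name and the factor of 14 are hypothetical, not taken from Deja Dup's code.]

```python
from datetime import timedelta

def full_backup_interval(backup_every: timedelta,
                         max_incrementals: int = 14) -> timedelta:
    """Hypothetical sketch of the scaling described above: the more often
    you back up, the sooner a fresh (full) backup is forced, so that the
    chain of incrementals between two full backups stays bounded."""
    return backup_every * max_incrementals

print(full_backup_interval(timedelta(days=1)))  # daily backups  -> 14 days
print(full_backup_interval(timedelta(days=7)))  # weekly backups -> 98 days
```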

But the rest of my questions remain. In particular, do you let the fresh backups finish? Because DD will keep trying/resuming a fresh backup if it gets interrupted.

Thanks for the bug report, BTW! :)

Michael Terry (mterry)
Changed in deja-dup:
importance: Undecided → Medium
Revision history for this message
MFeif (matt-feifarek) wrote :

Thanks for the follow-up.

So, to your questions and comments:

1. I understand; full backups are a feature. I wish I had known that before... it seemed like a failure that DD was trying to fix; I didn't get a message or anything.

2. Not understanding how the backup archives work, it seems odd to me for 25G to take 2.5x the data. I've been using rsnapshot for years, so I assumed DD worked like that. So a "fresh backup" is not a diff, it seems... it's a full set. It seems odd to keep a dupe of old backup data if the system knows that nothing is wrong.

3. Thanks for the "keep forever" tip. I'll look into that.

4. Yes, all three times I saw this, I let it finish (I was afraid that it would bork the archives if I stopped it... again, not understanding how this works makes it kinda scary to use).

DD seems to take about 100% of one CPU, and about 10% (ish) of the second. The machine isn't locked up, but doing anything is slow. Especially anything with disk. It's not unusable, but it's certainly a drag. I could try and find out where the hit is (sftp? network? disk?) if you want. Maybe it's gvfs and not duplicity or DD at all.

Seems like fresh backups every 2 weeks is a little extreme. Can this be configured?

I understand and like the philosophy of DD; but it seems that the preferences are too coarse. If "daily" is requested, it really gets in the way. Next option to lighten the load is "weekly"... but a lot can happen to a PC in a week. I suppose I can configure via gconf for a different number of days.

If there's anything I can do to help, let me know.

Seems like this is not a "bug"... I guess we should close it.

Thanks!

Revision history for this message
Michael Terry (mterry) wrote :

Eh, there's at least one bug here.
 * Lack of feedback. The UI currently says "making a fresh backup" but it should probably say "making a fresh backup; this is normal and your old backups are still available" or something.
 * 2 weeks not being appropriate. I'm not excited about making the 'fresh backup interval' user-customizable, as I think of it as an implementation detail, but I'm certainly open to changing it. I picked 2 weeks myself, without any data -- maybe 1 month if you use daily backups would be better? Thinking about it more, I'm not sure I'd want a full backup every 2 weeks either.
 * Possibly too much resource usage.
 * Plus, the original issue that we may be making fresh backups too often. You said it started over 3-4 times in roughly 2 weeks. If that's accurate, DD is definitely backing up too often.

Revision history for this message
Andrew Fister (andrewfister) wrote :

You could build two additional factors into the determination of how often to make a full backup:

1) How large was the previous full backup? If it's large, we don't want to be doing full backups as often, because of storage constraints on the target media. This could also be calculated as a ratio like (size of full backup)/(free space in target media).

2) How much time did it take to complete the previous full backup? If it takes a long time to complete a full backup, we should make full backups less frequently, because the longer duplicity is running, the more likely it is to fail in some way, either through user fault or duplicity's fault. Backups also take up the machine's resources while you're working. A full backup over a network to an external drive attached to a server can take a seriously long time, especially if you don't keep your client machine running all the time (suspending a laptop when you go to work, for example). So I think we should use not the run time but the real time elapsed from the start of the backup until its completion. (A rough sketch of combining these two factors follows below.)
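[Editor's note: a rough sketch of how these two factors could be folded into the decision; every name, threshold, and multiplier here is hypothetical and only illustrates the proposal, not Deja Dup's actual behaviour.]

```python
from datetime import timedelta

def should_do_full_backup(days_since_last_full: int,
                          last_full_size_bytes: int,
                          target_free_bytes: int,
                          last_full_duration: timedelta,
                          base_interval_days: int = 90) -> bool:
    """Hypothetical heuristic sketching the proposal above: stretch the
    full-backup interval when the previous full backup was large relative
    to free space on the target, or when it took a long time to finish
    (wall-clock time, not process run time)."""
    interval = base_interval_days

    # Factor 1: size of the last full backup vs. free space on the target.
    size_ratio = last_full_size_bytes / max(target_free_bytes, 1)
    if size_ratio > 0.5:                    # full backups eat most of the free space
        interval *= 2

    # Factor 2: how long the last full backup took, start to finish.
    if last_full_duration > timedelta(days=1):
        interval *= 2

    return days_since_last_full >= interval
```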

Michael Terry (mterry)
Changed in deja-dup:
status: Incomplete → Confirmed
Revision history for this message
norrist (norrist) wrote :

Since I use Deja Dup as my third backup (offsite to S3), I do not need any fresh backups. Is there a way to completely disable the fresh backup feature?

Revision history for this message
Michael Terry (mterry) wrote :

No. The fresh backups are used for other purposes, like deleting old backups when space gets tight. If only incremental backups were used, Deja Dup couldn't do that.

Revision history for this message
Dan Le (its-elnad) wrote :

Not sure if there's any correlation, but everything was working fine until I changed the retention from 6 months to 3 months.

Revision history for this message
Joshua Jensen (joshua-joshuajensen) wrote :

If we don't trust the old archive, shouldn't we be doing backups to another site, or verifying the backup... not simply making another entire full backup (and then not trusting it later either)?

From my "end user" standpoint, if we can't trust the archives we've written, I want to verify them. If they're good, fine... if not, delete them and re-make a full backup. Simply assuming they are bad and duplicating them "every so often", while using 2x the storage, isn't what I need a backup program to do.

Revision history for this message
floflooo (florent-angly-gmail) wrote :

1/ I like the idea of doing tests on the backup instead of making "fresh backups"
2/ The fact that a fresh backup cannot be resumed is problematic: if the amount of data to back up is too large, I cannot let it all be backed up without interruption, and it has to be started all over again the next time

Revision history for this message
Michael Terry (mterry) wrote :

In terms of whether we trust the old archive, there are two reasons we do fresh backups:
1) It's difficult to determine automatically if the backup is somehow corrupted without actually doing a full restore. Better safe than sorry.
2) Full backups also let us delete old backups. If we kept long incremental chains forever, we could never delete any of them.

Revision history for this message
Florian Kauer (koalo) wrote :

The fact that a fresh backup occurs every two weeks is VERY problematic:
I use deja dup to back up my netbook, but my backup is large (the whole backup, not just the changes) and my internet connection is slow. Therefore my backup keeps getting interrupted every two weeks until I let my netbook run all night long.
This makes deja dup practically useless for me.

Revision history for this message
Christoph Buchner (bilderbuchi) wrote :

Concerning 2) - long backup chains preclude deleting old versions:
You could do reverse diffs, the rdiff-backup way. That way, the "full version" is always the newest one, with older versions preserved as diffs, which can be deleted at leisure (e.g. on an expiry interval).
http://www.nongnu.org/rdiff-backup/features.html
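[Editor's note: a toy sketch of the reverse-diff layout being described here. It illustrates the general rdiff-backup idea only; it is not Duplicity's or rdiff-backup's actual format, and the "diff" here is per-file rather than per-block.]

```python
def reverse_delta(newer: dict, older: dict) -> dict:
    """Changes needed to turn the newer snapshot back into the older one.
    A value of None marks a file that must be deleted when going backwards."""
    delta = {}
    for name, content in older.items():
        if newer.get(name) != content:
            delta[name] = content
    for name in newer:
        if name not in older:
            delta[name] = None
    return delta

def apply_delta(snapshot: dict, delta: dict) -> dict:
    """Walk one step backwards in time by applying a reverse delta."""
    restored = dict(snapshot)
    for name, content in delta.items():
        if content is None:
            restored.pop(name, None)
        else:
            restored[name] = content
    return restored

# The store keeps only the newest snapshot in full, plus reverse deltas.
v1 = {"a.txt": "hello", "b.txt": "old"}
v2 = {"a.txt": "hello", "b.txt": "new", "c.txt": "added"}
store = {"latest": v2, "deltas": [reverse_delta(v2, v1)]}

# Restoring the previous version applies one reverse delta...
assert apply_delta(store["latest"], store["deltas"][0]) == v1
# ...and expiring the oldest version is just dropping the last delta.
store["deltas"].pop()
```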

Revision history for this message
Michael Terry (mterry) wrote :

Florian, a fresh backup should happen at most every month (though I think the old minimum was 2 weeks -- which version are you on?).

Christoph, that would be a neat feature, but Deja Dup relies on Duplicity, which does not have that feature. And it would be impossible/very-difficult to add, since Duplicity does not assume that it can run code on the backend (which allows it to support cloud backends). So it would have to download, repack, and re-upload the chain to support that feature.

Revision history for this message
Christoph Buchner (bilderbuchi) wrote :

Yeah, I already suspected that.
I hadn't thought about the fact that it would need to run code on the backend, though. Although (and I don't want to derail the bug report, so stop me if needed): how are the "traditional" diffs made? You would have to compare (and possibly download) data too, wouldn't you? And then, what's the difference between diffing towards the old version or the new one?

Revision history for this message
BigJules (julianstockwin) wrote :

As a front-line user who earns his crust with what's on his machine, I have a very definite interest in backup. And I have to say DD is clearly the way to go. Since getting full satisfaction from S3, including the occasional restore, I've ditched all intermediate non-archive methods (DVD-RAM etc.) with relief.

BUT: I back up twice a day on a 15GB data set (< 5 mins), which I don't think is excessive for a working machine, but this has the result that there's a fresh backup happening every week or ten days, which is insanely annoying, tying up the system for 22 hours or so on my pitiful broadband each time.

I fully see the need for one and don't resent it, but can we not have some way of choosing when it happens? e.g. a deferrable button or 'do it now' thing?

- big thanks to Michael T however for bringing us an awesome product!

Revision history for this message
emilio (emiliomaggio) wrote :

I have used deja-dup to back up my home folder, containing about 8GB of data, daily. After one year of use, the target folder on my local server holds 100GB of backup files, which seems excessive. Reverse backups would be the ideal solution. Another possible approach is a logarithmic discard of old backups: for example, one could retain all daily backups from the last 2 weeks, then only weekly backups for the last 2 months, and monthly backups for the last two years. This probably requires more bandwidth, since discarding some full backups might require generating new differential backups. However, the situation as it is does not seem to scale very well over time.
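[Editor's note: a sketch of the kind of thinning schedule described above. The cut-offs mirror the example in the comment (dailies for 2 weeks, weeklies for 2 months, monthlies for 2 years), but the rule itself is hypothetical and not part of Deja Dup.]

```python
from datetime import date, timedelta

def keep_backup(backup_date: date, today: date) -> bool:
    """Hypothetical 'logarithmic discard' rule: keep every daily backup for
    2 weeks, one weekly backup for 2 months, one monthly backup for 2 years,
    and drop everything older."""
    age = today - backup_date
    if age <= timedelta(days=14):
        return True                              # keep all recent dailies
    if age <= timedelta(days=60):
        return backup_date.weekday() == 6        # keep one per week (Sundays)
    if age <= timedelta(days=730):
        return backup_date.day == 1              # keep one per month (the 1st)
    return False
```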

Revision history for this message
Michael Terry (mterry) wrote :

Emilio, smarter deletion logic is wishlist bug 546575. But if you want simple deletion logic, you can do that today by changing the "Keep backups" to "At least 6 months" or something to save on space.

Revision history for this message
Jerry (priegog) wrote :

I came here looking for possible solutions to the same problem, and I found a very interesting discussion about the fundamental problems and limitations of Deja Dup going into the future. After all, it was picked up by Canonical to be the default Ubuntu backup solution, so it should be held to a higher standard.

I think we can all agree that when it comes to user interface, ease of use, and abstraction of difficult tasks (like setting up backups to a networked drive), Deja Dup is by far the best backup program currently available on Linux. It's probably also the reason it was pi

Revision history for this message
Jerry (priegog) wrote :

(dammit, I hit send by accident. I will continue here)

...picked by Canonical.

But then we have all the problems that just came up in this discussion:

- Need for periodical new backups
- Not a smart usage of space or bandwidth
- No post hoc hash checking of backups (which is related to the first and to the next problems), so no ability to truly trust what's actually on the backup (and this has burned me personally in the past)
- Not using potentially available resources, like the ability to run a little code on the server side
- Keeping anterograde instead of retrograde differential backups (as explained, a-la rdiff-backup)

Coupled with a couple of other problems that are not mentioned here...

- Files in the backup are not directly accessible (which is a HUGE problem when duplicity decides it doesn't want to cooperate and open them)
- (related to the last one) the fact that, even when not encrypted, the files are packed up, causing a lot of hurdles and inefficiencies in the whole process

...makes me increasingly think that the main problem here is the back-end, duplicity. Most of these are problems that, thanks to decisions taken when the software was created, are probably unsolvable without a radical 180 in the back-end code and in the way duplicity works.

The ideal for me of course would be if Deja Dup became "simply" the front end for more powerful backup solutions, like rdiff-backup, rsync, etc; giving those sorts of programs the usability (in a "set it and forget it" kind of way) they currently lack.

Anyways, I think this turned into mindless rambling. I was a little disappointed that the particular problem this bug is talking about doesn't really have a solution.

Revision history for this message
Charlie DeTar (cfd+lp) wrote :

I'd very much like to see the 2-week threshold for full backups reconsidered. I want daily (or more frequent) backups to ensure that my latest changes are safe. I'm backing up a reasonable amount of slowly-changing data (50GB or so) over a typical US cable provider connection (1.5 Mbit up; I usually get uploads of about 150 kBytes/s) to a cloud provider. This means a full backup takes 4-5 days of continuous uploading.
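[Editor's note: as a rough check of that estimate, assuming the full 50 GB has to cross the link at a sustained 150 kB/s.]

```python
data_bytes = 50e9          # ~50 GB of data to upload in a full backup
upload_rate = 150e3        # ~150 kB/s sustained upload speed
print(data_bytes / upload_rate / 86400)   # ~3.9 days, before any protocol overhead
```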

This is problematic for two reasons:
1. I'm tied up doing full backups for 4-5 days out of every 14.
2. During the 4-5 day full uploads, I have no current (daily) backups.

I understand the aversion to making this configurable, but the defaults chosen work so poorly in my case that I'm looking at other backup solutions, or rolling my own with duplicity. It would be great to have data on what people's needs and transfer rates are, but in the absence of that data, letting users change this (even in a config file) would help. If I had to choose, I'd do a full backup no more often than once every 3 months.

(I'm on deja-dup v22.0, current for Ubuntu Precise LTS; please correct me if this has changed in a more recent version -- the changelog doesn't suggest it has.)

Revision history for this message
Michael Terry (mterry) wrote :

So, to follow up on some previous conversations here: in Deja Dup 24.0 (first used in Ubuntu 12.10), we added a couple of features that should help with this bug.

Deja Dup does now verify the backup a little bit. It tries to restore a tiny test file that it inserts into every backup. And every two months, it does the same thing, but without any local cache or saved passwords (i.e. it prompts the user again -- largely just to remind them of what it is before they need it in a restore situation). So the verification story is slightly better. It's not foolproof, but we do something.

It also now defaults to 3 months between full backups, regardless of your backup frequency. This is customizable in gsettings (change 'full-backup-period' to the desired number of days between full backups).
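[Editor's note: for example, to stretch the period to 180 days from a script. The key name comes from the comment above; the schema ID org.gnome.DejaDup is an assumption and may differ between releases, so verify it with `gsettings list-recursively` first.]

```python
import subprocess

# Assumes the schema is installed as org.gnome.DejaDup; adjust if needed.
subprocess.run(
    ["gsettings", "set", "org.gnome.DejaDup", "full-backup-period", "180"],
    check=True,
)

# Read the value back to confirm the change took effect.
out = subprocess.run(
    ["gsettings", "get", "org.gnome.DejaDup", "full-backup-period"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())
```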

With these changes, this bug could probably be considered fixed.

Revision history for this message
Michael Terry (mterry) wrote :

Marking as incomplete. If people using 24+ still have problems (i.e. bugs in implementation), feel free to comment and I can re-open. But I think the changes that went into 24 should solve this issue.

Changed in deja-dup:
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Déjà Dup because there has been no activity for 60 days.]

Changed in deja-dup:
status: Incomplete → Expired
Revision history for this message
Marcel Oliver (oliver-member) wrote :

Please reconsider this issue:

While I am grateful that the dconf trick has been pointed out above, it is really only a temporary workaround for the underlying issue: the case where the set of essentially static files is so large that a full backup needs to be planned ahead, because it does not complete within a reasonable time frame. Right now, my only resort is to set the full-backup frequency to essentially infinity and, when I really want to do a full backup, to essentially zero. I would suggest two measures:

1. A GUI-exposed option to do a full backup only manually (maybe with a warning like "you should think about doing your next full backup soon"), so that the user can control and schedule where and when large amounts of data go over the network.

2. The concept of a known-good full backup (e.g. after an explicit full verification of that backup). Subsequent backups should then be relative to the known-good backup, rather than creating a long chain of incremental backups. A more elaborate scheme could also rebase a chain of incremental backups onto the known-good backup when the chain gets too long. If you want to be fancy, there could be a binary-tree-like rebasing scheme so that the length of the backup chain is logarithmic in the number of backups (a sketch of one such scheme follows below); but this would again make the size of the data transfer non-transparent to the user, and large backups could be initiated at unpredictable times, so some form of user control would be necessary in this scheme, too.
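[Editor's note: one way to realize the binary-tree-like rebasing suggested in point 2; this sketches the general technique only, not anything Deja Dup or Duplicity implements. Store backup number n as a diff against backup n minus the largest power of two dividing n. The chain needed to restore any backup then has a number of links equal to the count of one-bits in n, i.e. at most about log2(n) + 1.]

```python
def rebase_target(n: int) -> int:
    """Backup n is stored as a diff against this earlier backup
    (backup 0 is the initial full backup)."""
    return n - (n & -n)   # subtract the largest power of two dividing n

def restore_chain(n: int) -> list:
    """Backups that must be read, newest to oldest, to restore backup n."""
    chain = [n]
    while n > 0:
        n = rebase_target(n)
        chain.append(n)
    return chain

# The chain grows logarithmically: restoring backup 1000 touches 7 backups.
print(restore_chain(1000))   # [1000, 992, 960, 896, 768, 512, 0]
```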
