Support for Amazon Glacier

Bug #1039511 reported by someone1 on 2012-08-21
This bug affects 69 people
Affects: Duplicity | Importance: Medium | Assigned to: Unassigned

Bug Description

Amazon just announced Amazon Glacier: http://aws.amazon.com/glacier/?utm_source=AWS&utm_medium=website&utm_campaign=LP_glacier_launch

This seems like a better fit for Duplicity in terms of use-case and pricing! I'm sure the Python Boto project will be adding support soon and since Amazon will give the ability to transfer from S3 buckets to Glacier in the coming months, this would be a great feature to keep on the roadmap so a migration to it for S3 users would be seamless.

-Prateek

Daniel Schudel (daniel-schudel) wrote :

I'd like to see a way to direct "full" backups to Glacier. I don't know how useful it would be to store incremental backups there.

This is what I'm thinking:

Use duplicity and S3 to store full/incremental backups on a frequent schedule (daily, etc.).

Use duplicity and Glacier to store full backups that I schedule and run on an infrequent schedule (weekly, monthly, etc.).

I don't see any problems with storing incremental backups at Glacier. Why not? Duplicity caches metadata locally, so it doesn't need to read anything from Glacier. I would definitely store incremental backups there.

Although, I'm not sure what to do if local metadata is unavailable.

someone1 (prateek) wrote :

I see no reason to keep incremental backups off of Glacier. What advantage does this grant you?

Glacier would be a great backend to implement: lower pricing than RRS (~11% of the cost per GB for the first TB) and the durability of standard S3 storage. Also, default automatic AES encryption further helps to keep our data secure (although to be fair, you can do this in S3 as well).

I wonder if it will take longer to upload, though. It doesn't return a "SUCCESS" on upload until it replicates out to all data centers. Might this be a reason to revisit the threading in duplicity and raise it to a configurable number of upload threads? Amazon's best practice recommends uploading multiple objects to its data centers at once, since a single connection may be rate limited depending on how often you use it and which server you connect to. I would think 3-6 upload threads could best saturate most small business upload connections.
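
To make the upload-thread idea concrete, here is a minimal sketch using Python's standard `concurrent.futures`. It is not duplicity code: `upload_volume` is a hypothetical callable standing in for whatever a backend does to send one file, and the default of 4 workers is simply a value inside the 3-6 range suggested above.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_all(volumes, upload_volume, max_workers=4):
    """Upload every volume using a small pool of worker threads.

    `upload_volume` is a hypothetical callable that uploads one file and
    returns once the remote side has confirmed it; it is not a duplicity API.
    Returns a dict mapping each volume name to the value it returned.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps input order and re-raises the first upload error.
        results = list(pool.map(upload_volume, volumes))
    return dict(zip(volumes, results))


if __name__ == "__main__":
    def fake_upload(name):
        # Stand-in for a real network transfer.
        return "uploaded " + name

    print(upload_all(["vol1.difftar.gpg", "vol2.difftar.gpg"], fake_upload))
```

Because each upload blocks until the remote side answers, threads (rather than processes) are enough to keep several transfers in flight at once.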

Nick Welch (mackstann) wrote :

My understanding is that files can already be manually moved to Glacier from S3. Assuming that I'm okay with just doing this manually right now, what files should I keep readily accessible via S3? The man page seems to suggest that duplicity just needs access to the .sigtar files to create new incremental backups. So if I move all of the non-.sigtar files into Glacier, and keep the .sigtars in S3, should that work?

At current prices, keeping incrementals on S3 instead of Glacier only pays off if you want to keep them for no more than 7 days: 8 days on S3 already cost more than the 3-month minimum billed on Glacier. I therefore think it's not necessary to support split storage, but there might be other use cases I'm missing.
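
The break-even claim can be checked with a few lines of arithmetic. The sketch below is only an illustration: the prices passed in the example ($0.125 and $0.01 per GB-month) are assumed 2012-era list prices and do not come from this thread, and a month is treated as 30 days.

```python
def s3_breakeven_days(s3_price_per_gb_month, glacier_price_per_gb_month,
                      glacier_min_months=3, days_per_month=30):
    """Days of S3 storage that cost as much as Glacier's minimum billed period."""
    glacier_min_cost = glacier_price_per_gb_month * glacier_min_months
    s3_cost_per_day = s3_price_per_gb_month / days_per_month
    return glacier_min_cost / s3_cost_per_day


if __name__ == "__main__":
    # Assumed prices in USD per GB-month; substitute the current ones.
    days = s3_breakeven_days(0.125, 0.01)
    print("S3 is cheaper only below %.1f days" % days)
```

With those assumed prices the break-even point falls between 7 and 8 days, which agrees with the comment above.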

I'd like to see this too, please.

In my imagination, the backup files would be in Glacier and the signatures in S3. If duplicity needs to check whether the vol files exist, then S3 can hold a dummy file for each.

someone1 (prateek) wrote :

I think the biggest limitation (and probably hesitation) to implement this would be downloading files. You must submit a "job" which can take 4 hours to fulfill before you are sent a link to download from the batch. Duplicity would have to handle this alternative retrieval process just for Glacier. Not saying it can't be done, but might not be as seamless as dropping in a new backend class.

Maybe an alternative would be to store Full/Partial backups on S3 and have an option similar to "remove-all-but-n-full" but instead "move-all-but-n-full" to move all backups before a certain date to Glacier for archiving purposes.

Is it necessary at all to support direct restoration from Glacier? Since that's a long term backup solution, data restoration shouldn't happen very often, and if it happens at all then I don't think it would be a problem to download the data from Glacier manually, and then use duplicity to decrypt and restore it from local copy.

Ezod (puredoze) wrote :

To follow up @Andrew's imagination:

Amazon announced "an option that will allow customers to seamlessly move data between Amazon S3 and Amazon Glacier" (http://aws.amazon.com/en/glacier/faqs/#How_should_I_choose_between_Amazon_Glacier_and_Amazon_S3)

In case there is an API for it, duplicity could alternatively store the complete backup in S3 and then move the data to Glacier.

Uwe L. Korn (uwelk) on 2012-09-30
security vulnerability: no → yes
security vulnerability: yes → no
nukul (nukulorrr) wrote :

To those who consider using glacier for backups, keep in mind the retrieval fees, which could make glacier look very unattractive very suddenly... http://brandt.github.com/amazon-glacier-calc/
I do think the author made a mistake in his calculation by assuming that "retrieval time" means the time it takes Amazon to fulfill the request. In my opinion, it is the time it takes you to download the data. Still, retrieving 100GB spread over 24h would cost some $40 if you have a couple of hundred GBs stored in Glacier.

To me, it doesn't look very attractive as a personal backup solution anymore.

Tim (ww3ib0sg9wt-tim) wrote :

It still looks extremely attractive to me as a secondary backup.

I agree that Glacier is probably not the ideal primary backup, but if you already back things up locally and just want an offsite backup of that data for peace of mind, it's perfect.

Is there much interest in getting Glacier support into duplicity or is it just me?

As I understand it, at $0.12 per GB, 100GB retrieval should cost $12, which is roughly the same as 1 month's S3 storage cost, plus $0.05 per 1000 files. 100GB is going to be about 4000 25MB volumes, so that's about $0.20.

That's a grand total of $12.20. What am I missing?
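
For reference, the arithmetic in this comment can be written out as a short sketch. All figures are the ones quoted above ($0.12 per GB transferred, $0.05 per 1000 requests, 25MB volumes); the function name is made up for illustration, and it deliberately covers only the transfer and request fees considered here, not any fee tied to the retrieval rate.

```python
def naive_retrieval_cost(total_gb, volume_mb=25, transfer_per_gb=0.12,
                         request_fee_per_1000=0.05):
    """Transfer plus per-request cost, ignoring any peak-rate retrieval fee."""
    # 1 GB is treated as 1000 MB, matching the "about 4000 volumes" estimate.
    volumes = total_gb * 1000 / volume_mb
    transfer = total_gb * transfer_per_gb
    requests = volumes / 1000 * request_fee_per_1000
    return transfer + requests


if __name__ == "__main__":
    print("Estimated cost: $%.2f" % naive_retrieval_cost(100))
```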

Anyway, if it is $40 for retrieval, you'd save that much in 4 months, so it's not a problem. I use AWS to hold data I will only ever recover in the case my house burns down, or all my computers and backup go pop at once, so a one-time higher fee I never expect to pay is not a problem.

-------------------

As for the 4 hour download delay, I'd be happy to have a command that moves an existing backup between S3 and Glacier, and I'd be happy to be only able to restore from S3, if that's easier.

Once the web console grows the right buttons, I'll be able to move whole backup sets back and forth manually, but it'd be nice to be able to do it at a finer granularity. I.e. to 'glaciate' a full backup, but keep the last incremental one 'active', or backup directly to glacier whilst keeping the metadata in S3.

Ok, I see what I missed. I've taken into account the data transfer and request fee, but not the excess-retrieval fee which is calculated based on the peak download rate.

That can be mitigated by a) using an EC2 server as a go between (possibly free in the first year, but may not be a cheaper option after that), and b) rate-limiting the download.

It might be nice if a duplicity implementation had some kind of smart rate-limiter to solve the problem, but my point above about storage savings still stands.

On Thu, Oct 18, 2012 at 2:32 PM, Andrew Stubbs <email address hidden> wrote:

> Ok, I see what I missed. I've taken into account the data transfer and
> request fee, but not the excess-retrieval fee which is calculated based
> on the peak download rate.
> That can be mitigated by a) using an EC2 server as a go between
> (possibly free in the first year, but may not be a cheaper option after
> that), and b) rate-limiting the download.
>

Free-tier is limited to a small number of GB of transfer and a single small instance of the EC2 server.

You'll pay for it either way (assuming you store more than 10GB of data).

> It might be nice if a duplicity implementation had some kind of smart
> rate-limiter to solve the problem, but my point above about storage
> savings still stands.

Actually, the retrieval part is tricky. We might come to the conclusion that it's out of duplicity's scope at this moment. I'd be happy to see duplicity manage uploading to Glacier plus metadata upload to S3 and handling archives. As for retrieval, it could even be an additional script/tool (bundled or not) that might be able to determine current retrieval rates, calculate the speed ratios, and select the optimal (cheapest) speed to download the data given its size. There are a few variables to take into account, and to make it worth our time the calculator must be smart enough to really account for everything (including day of billing cycle, region, max link throughput etc.).

Duplicity as a cost-calculator and optimizer? :-) I don't believe so.

Arthur

someone1 (prateek) wrote :

I think an option like "--move-all-but-n-full-glacier" to move full backups to glacier for archival purposes would be sufficient. I guess this would break the chain of backups duplicity is able to see with S3 as a primary location for backups, but we can leave it up to the user (or have an option in duplicity to request a backup job) to move any backups found in glacier back into S3 for retrieval purposes.

I guess my use-case is to store anything more than 3 months old in Glacier, as we rarely, if ever, need to retrieve backups that old. We just need to hold on to data for up to a year or so for compliance purposes.

Jose Riha (jose1711) wrote :

Please note that rate-limiting your download will not save you anything: the retrieval fee is computed solely on Amazon's side. If you want, you can use this for planning/calculating: https://docs.google.com/spreadsheet/ccc?key=0Al87cCkTI-7adFVxd213UFNpcXo5RzNoVlFRbTdoVGc

Ezod (puredoze) wrote :

"Amazon S3 Now Supports Archiving Data to Amazon Glacier"
https://forums.aws.amazon.com/ann.jspa?annID=1713

someone1 (prateek) wrote :

Actually with the ability to create rules for expiration/transfer to and from Glacier... I've no longer any interest in direct Glacier support.

It makes sense for my use-case to keep data from the previous 3 months on S3 and have rules to auto-transfer to Glacier beyond that point. If a restore to an earlier time is required, I guess I'd have to manually select the files to transfer/restore to S3, but this is something I'm willing to do. Maybe duplicity can detect which files have been moved to Glacier and initiate a transfer request using the S3 API? Not sure if it works that way, but maybe it's worth looking into.

Dwayne Litzenberger (dlitz) wrote :

I'm definitely still interested in direct Glacier support. I have a few TB of data that would have cost me about $300/month to store in S3, which made it infeasible (I do backups to local media instead). At $30/month, it's reasonable, but only if I'm not also paying the cost of having a large chunk of that data duplicated in S3.

I understand that using Amazon's feature of moving data back and forth between S3 and Glacier, it would be possible to backup data to S3 and Glacier following the general principles below:

1.) We would backup the data to S3
2.) Most (>90%) of the data would reside in Glacier (will be moved there using some rules)
3.) A few files (<10%) will reside in S3 so as to not confuse duplicity when backing up
4.) If a restore is ever needed, data is restored from Glacier to S3 and we'll then use duplicity to restore from S3 (possibly using a different restore path from the original one used for backing up to S3)

I'm confused mostly about points 2 and 3, that is:

* What rules should we use to move what files to Glacier?
* What are the files required to reside on S3 so as to not confuse duplicity when doing new backups (as in duplicity must believe the full backup is there, even though >90% of the files have actually been moved to Glacier)?

Alternatively, if what I'm talking about above cannot actually work with duplicity as it is right now, perhaps it would be much easier to adapt duplicity to work as per my general principles above instead of making duplicity work straight with Glacier (if that's even at all possible).

someone1 (prateek) wrote :

You set the rules in S3. Duplicity won't be confused for doing new backups as the file list information for your bucket is the same whether the files reside in S3 or Glacier.

The only issue arises when you want to do a restore and the files required by duplicity reside on Glacier. You must first restore the necessary files into S3. It would be nice if duplicity could detect that the files needed are on Glacier instead of S3 and initiate a restore, but then there's no reliable way for duplicity to know when the restore is complete (it could take 4 hours). You'd just have to retry the restore at a later time.

As far as how to make the rules? If you do a backup daily and a full backup every X days, you could move files X days old and older to Glacier, so that the files needed to restore to any point in the past X days stay readily available. This is how I've been doing it. I keep the last 2 full backups and all the incremental backups in S3, and anything older than that is moved to Glacier automatically. Very rarely do we need a backup more than 60 days old, so this works out great for us. I do full backups every 28 days and incremental backups in between, so I put a rule in to move files 60 days and older to Glacier.
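
As an illustration of such a rule, the sketch below builds a lifecycle document with Python's standard `xml.etree.ElementTree`. The element names follow the S3 lifecycle configuration XML as I understand it; the rule ID and the helper name are made up, and how the document is submitted (web console, API call, or a library) is left out.

```python
import xml.etree.ElementTree as ET


def glacier_lifecycle_xml(days, prefix="", rule_id="archive-to-glacier"):
    """Build an S3 lifecycle document that transitions objects to Glacier.

    `rule_id` is an arbitrary label chosen for this example.
    """
    root = ET.Element("LifecycleConfiguration")
    rule = ET.SubElement(root, "Rule")
    ET.SubElement(rule, "ID").text = rule_id
    # An empty prefix applies the rule to every key in the bucket.
    ET.SubElement(rule, "Prefix").text = prefix
    ET.SubElement(rule, "Status").text = "Enabled"
    transition = ET.SubElement(rule, "Transition")
    ET.SubElement(transition, "Days").text = str(days)
    ET.SubElement(transition, "StorageClass").text = "GLACIER"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 60 days, as in the setup described above.
    print(glacier_lifecycle_xml(60))
```

Since the rule applies to the whole bucket when the prefix is empty, it is safest to give the backup a bucket of its own.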

I hope this answers your questions!

@someone1, thanks, very helpful.

Devon (devoncrouse) wrote :

I'm thinking all that would be needed to support Glacier is a change to file_naming.py to add a different prefix to both full and incremental manifests.

Currently, you can put a lifecycle on an S3 bucket, and provide a prefix/age of files to migrate to Glacier. The problem is that it's a very basic prefix match, and there's no way to prevent it from also moving the manifest. If it had a different prefix, the actual data files could be moved, and the manifest left in S3 for future backups.
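
The prefix problem is easy to see with a few file names. The sketch below is illustrative only: the names imitate duplicity's usual naming pattern, and the two helper functions are hypothetical, not part of duplicity or of file_naming.py.

```python
import os


def must_stay_in_s3(filename):
    """Return True for manifest and signature files, False for data volumes."""
    return ".manifest" in filename or ".sigtar" in filename


def split_for_archiving(filenames):
    """Partition names into (keep_in_s3, may_go_to_glacier)."""
    keep = [f for f in filenames if must_stay_in_s3(f)]
    archive = [f for f in filenames if not must_stay_in_s3(f)]
    return keep, archive


# Illustrative names only; real backups have other timestamps and suffixes.
EXAMPLE_NAMES = [
    "duplicity-full.20180914T140006Z.manifest.gpg",
    "duplicity-full.20180914T140006Z.vol1.difftar.gpg",
    "duplicity-full-signatures.20180914T140006Z.sigtar.gpg",
]

# The manifest and the data volume share everything up to the timestamp,
# so a lifecycle rule matching on that prefix would catch both.
SHARED_PREFIX = os.path.commonprefix(EXAMPLE_NAMES[:2])
```

Telling the two kinds of file apart requires looking at the suffix, which a prefix-only lifecycle rule cannot do; hence the suggestion to give manifests a prefix of their own.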

Restores would first have to take place in AWS (Glacier file request).

Thoughts?

Zach Musgrave (zach-musgrave) wrote :

Devon, that would be a fantastic fix! I'm looking forward to it!

In the meantime, I wrote this kludge to automate renaming the manifest files back and forth every time duplicity runs:

https://github.com/zachmu/glacierplicity

I had been thinking along similar lines about a new naming scheme to support S3/Glacier. I just entered this feature request on Launchpad as bug 1170161.

Note that duplicity should work with S3/Glacier without this feature, because it's supposed to read the manifest and sigtars from the local cache. So it's not supposed to need to read anything from the remote. But unfortunately sometimes it does anyway.

Found this first: http://blog.epsilontik.de/?page_id=68

Maybe the last paragraph about archiving and timestamps will help, specifically.

someone1 (prateek) wrote :

I've written a patch for Duplicity that should add support for handling files in S3 that are stored on Glacier. As of now, Amazon does NOT allow uploading directly to Glacier. This should be handled through lifecycle rules you specify on your duplicity bucket. See my post above for an example on how to handle this.

What my patch does is check whether the files required are stored on S3 or Glacier. It will then initiate a restore operation for every file that is in Glacier and wait for the files to finish restoring before continuing (thus duplicity won't break; it will just check every 60 seconds to see if the files have been moved). If your local cache is out of sync with the remote bucket, it will make sure the manifests are transferred to S3; then, if you chose to do a restore, it will make sure the required files are transferred to S3. This means it could potentially take up to 10 hours of Glacier -> S3 time, which is not ideal, but at least it's something.

This patch also merges parts of the _boto_multi.py and _boto_single.py code since they largely overlap. Not sure if there was a reason to keep the two separate.

Tests I plan to run:
1. Clear my local cache and try to do a restore from a point in time that is being stored in Glacier (Done and tested: PASSED)
2. Do a restore from a point in time that is not in Glacier (Done and tested: PASSED)
3. Do a backup using multi-upload processing (Will do tonight)
4. Do a backup using normal upload processing (Will try tomorrow night)

I see there's also a branch on launchpad that adds Glacier as a backend. My approach is adding compatibility with the current S3 backend instead of making a new one, you could "mimic" a Glacier backend by using lifecycle rules in S3 to immediately transfer to Glacier.

someone1 (prateek) wrote :

Sorry, previous post should have the following edit:

As of now, Amazon does NOT allow uploading directly to S3 with a storage class of "glacier" - it only supports "standard" and "reduced_redundancy".

Guido Serra (zeph1ro) wrote :

@prateek I'd like to give you a hand. Is your patch posted somewhere?

P.S. @eric-friedrich84 already has a patch for it; has anyone evaluated it?

 - https://code.launchpad.net/~eric-friedrich84/duplicity/glacier

Edgar Solin, one of the main devs for duplicity, provided some comments on
the patch a few months back:

http://lists.nongnu.org/archive/html/duplicity-talk/2013-01/msg00008.html

TL;DR: Glacier's method of archive retrieval doesn't match up very well
with how duplicity works.

I'm still interested in doing more work on this, just looking for the time
:-)

--Eric

someone1 (prateek) wrote :

I've posted my branch here: https://code.launchpad.net/~prateek/duplicity/s3-glacier

I did not know that Google Storage support was coming, so I just tried fixing merge conflicts and hope it doesn't break anything. I'll be going through my tests over the next few days for the S3 side of things. Please let me know of any bugs anyone may find!

Thank you,

jurov (rini17) wrote :

someone1, does your branch upload manifest files, too? IMHO there is no reason why they should ever be in Glacier, as they can cause the count of paid Glacier UPLOAD and RETRIEVAL requests to double (worst case). They cost almost nothing when kept on S3.

Or if ever, all manifests for particular set of backups should be tarred together and moved to glacier as one object.

someone1 (prateek) wrote :

I don't upload to Glacier, all this branch does is check to see if a file we are trying to download is on S3 or on Glacier. Because of the way S3 works, we can create rules such as: If file in bucket A is older than X days, move to Glacier. There would have to be a specific modification with how duplicity names files on S3 in order to prevent moving manifests to glacier. I'm not entirely sure of the future implications of changing such a requirement in duplicity for just one backend.

I do agree though, manifests are better kept on S3. What do you mean "paid glacier UPLOAD"? Moving from S3 to glacier comes at no cost. And storing on glacier is significantly cheaper than S3. When the rule "moves" a file from S3 to glacier, it still appears to be in a S3 bucket, but needs to be "restored" before it can be downloaded, so you only pay for storage on glacier and the restoration cost of glacier.

jurov (rini17) wrote :

It seems they *always* charge per request, regardless of whether files are moved from S3 to Glacier or uploaded to Glacier directly: http://alestic.com/2012/12/s3-glacier-costs Even if automatic archiving using the S3 rules were free, copying many small files out of Glacier later can incur considerable fees (when restoring a backup).

So it seems worthwhile to think about minimizing the number of files going to Glacier, even if that would mean complications in the duplicity source. I think the simplest approach would be to add a "rename all files smaller than X with some prefix Y" option; I'll look into it.

At least users should be clearly advised about this pitfall, so that they can take precautions.

someone1 (prateek) wrote :

I don't see it as duplicity's job to educate users on which backend to use and associated cost. Additionally, if you're doing a full restore, a few manifest files will be the least of your associated costs depending on the size of your backup. You can always increase the volume size so that you minimize the number of volumes duplicity creates. Play around with various options to help you find a solution that makes sense to you. Also make sure your backup is in its own bucket so you don't have to worry about other files being affected by your backup.

Personally, I went from nearly $250/month using Reduced Redundancy Storage to just under $100/month using Standard storage with archive rules for Glacier. I only move items into Glacier after 90 days, and it is extremely rare for me to have to go that far back when doing a restore.

Changed in duplicity:
milestone: none → 0.6.24
status: New → Fix Committed
importance: Undecided → Medium
Changed in duplicity:
status: Fix Committed → Fix Released
Martin (martin3000) wrote :

If you backup to S3, you can define automatic migration rules to glacier.
But if you want to restore, duplicity wants the manifest and signature files, and for every file it waits for the restore from Glacier, then moves on to the next file, and so on. If it has to wait 6 hours per file, this takes days or weeks.
So it would be better if duplicity triggers the restore for all files AT ONCE and waits then. This way it needs to wait only 6 hours.
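
A minimal sketch of that "request everything, then wait once" flow is shown below. It is not taken from duplicity: `needs_restore`, `request_restore` and `is_ready` are hypothetical hooks standing in for the real backend calls, and `sleep` is injectable only so the loop can be exercised without actually waiting.

```python
import time


def restore_all_then_wait(names, needs_restore, request_restore, is_ready,
                          poll_seconds=60, sleep=time.sleep):
    """Request every archived file first, then poll until all are readable.

    The three callables are hypothetical hooks standing in for backend calls.
    Returns the list of files for which a restore was requested.
    """
    pending = [n for n in names if needs_restore(n)]
    for name in pending:
        # Issue all restore requests up front so they run in parallel.
        request_restore(name)
    waiting = list(pending)
    while waiting:
        waiting = [n for n in waiting if not is_ready(n)]
        if waiting:
            sleep(poll_seconds)
    return pending
```

With this shape the total wait is roughly that of the slowest single restore, rather than the sum over all files.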

Martin (martin3000) wrote :

What I see in my bucket: If an S3 file went to Glacier after a while and you want the file contents back, amazonaws makes the file available but the storage class is still "glacier".
To check if the file is available:
key = bucket.get_key('dejadup/duplicity-full-signatures.20180914T140006Z.sigtar.gz')

If the file is not available, key.ongoing_restore is "None". If it is available, key.ongoing_restore is "False" and the key.expiry_date is set.
In both cases, key.storage_class == "GLACIER".
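
The observations above can be summarized as a small decision function. This is only a sketch: the state names are made up, and the "restoring" case (ongoing_restore being true while a restore is in progress) is my assumption about the attribute's meaning rather than something observed above.

```python
def glacier_state(storage_class, ongoing_restore, expiry_date):
    """Classify a key from the three attributes discussed above.

    The returned labels are invented for this example.
    """
    if storage_class != "GLACIER":
        return "in_s3"
    if expiry_date:
        # A temporary copy exists and can be downloaded until it expires.
        return "restored"
    if ongoing_restore:
        # Assumption: a true value means a restore request is in progress.
        return "restoring"
    return "archived"
```

Note that the storage class alone cannot be used to decide whether a download will succeed, since it stays "GLACIER" even while a restored copy is available.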

Martin (martin3000) wrote :

My code for this for _boto_single.py:

    # Relies on `import time` and duplicity's `log` module, both assumed to be
    # imported at the top of _boto_single.py.
    def pre_process_download(self, remote_filename, wait=False):
        # Used primarily to restore files in Glacier
        print("jms debug: _boto_single: pre_process_download")
        print(remote_filename)
        key_name = self.key_prefix + remote_filename
        if not self._listed_keys.get(key_name, False):
            self._listed_keys[key_name] = list(self.bucket.list(key_name))[0]
        key = self._listed_keys[key_name]
        # Fetch the key again so that ongoing_restore and expiry_date
        # reflect its current state.
        key2 = self.bucket.get_key(key.key)
        print(key)
        print(key2)

        print("storage_class:", key2.storage_class)
        print("ongoing_restore:", key2.ongoing_restore)
        print("expiry_date:", key2.expiry_date)

        if key2.storage_class == "GLACIER":
            if not key2.expiry_date:  # no temporary copy available
                if not key2.ongoing_restore:
                    log.Info("File %s is in Glacier storage, restoring" % remote_filename)
                    key.restore(days=2)  # we need the temporary copy for 2 days
                if wait:
                    log.Info("Waiting for file %s to restore in Glacier" % remote_filename)
                    # Poll once a minute until the temporary copy shows up.
                    while not self.bucket.get_key(key.key).expiry_date:
                        time.sleep(60)
                        self.resetConnection()
                    log.Info("File %s was successfully restored in Glacier" % remote_filename)
