Diminishing performance on large files

Bug #582962 reported by someone1
This bug affects 2 people
Affects      Status         Importance   Assigned to   Milestone
Duplicity    Fix Released   Medium       Unassigned
0.6          Fix Released   Medium       Unassigned

Bug Description

I am testing duplicity for backing up my VMs. Using version 0.6.08b, it took 14 hours to transfer a 28.8 GB VM to Amazon's S3 on my FiOS connection, doing a FULL backup. I have sustained well over 2 MB/s on our business line for over 1 hour, so I'm sure the delay/lag is coming from duplicity. The server I am running on is a brand new box with 4 GB RAM, 500 GB of disk, and a quad-core Core i5, freshly installed with Arch Linux for testing duplicity and nothing more.

I then tried again, but applied this patch here: https://bugs.launchpad.net/duplicity/+bug/550455. This patch GREATLY helped... initially. I went from uploading a volume once every minute to about 4.5 per minute (a HUGE improvement). I was excited, as this kept my network utilization over 1 MB/s throughout the first 20 minutes I was monitoring the computer. I came back 2 hours later to see that as soon as I left (I don't have any power saving options on or anything), the upload rate went berserk. It dropped from 2-3 volumes every minute to 1 volume every 2-3 minutes, and it is still uploading in random bursts. Network utilization stays between 40 KB/s and 250 KB/s with random bursts up to 680 KB/s. CPU utilization went down from spiking to 100% every few seconds to once every 1-2 minutes, idling otherwise.

I am using the --asynchronous-upload option. Ideally I'd like CPU utilization to be kept at 100% (or close enough) across all cores and network utilization over 1 MB/s throughout the sync (as it was doing initially). I am not sure why the process started to lag so badly after the 20 minute mark. There is nothing else utilizing our upload bandwidth (I am monitoring that) and the speed is not being throttled (I still achieve high scores at speedtest.net after multiple tests). The computer only has about ~15% of its RAM utilized and lots of free CPU time and disk space at its disposal. I even granted duplicity high priority (lowered its nice value).

I am using Python 2.6 on Arch x86-64 (Fully updated and freshly installed).

Revision history for this message
someone1 (prateek) wrote :

I tried restarting the upload (fresh, after deleting my S3 bucket) and the drop in performance came MUCH quicker, only after 10 minutes while I was monitoring it. I've attached the last 70 lines or so of a -v9 output. It keeps hanging at "AsyncScheduler: scheduling task for asynchronous execution" for about a minute before continuing (started at volume 43).

I also notice that creation of the next volume waits until the previous one is done uploading. It won't say "Processed <next volume>" until after the upload is finished. Not sure if that's just coincidental.

Revision history for this message
someone1 (prateek) wrote :

So I did some digging around. I was able to modify the asynchronous code to create volumes instead of waiting for my upload to finish. This achieved the high CPU usage I wanted, but started to flood my tmp directory. I can see this feature being useful for small volume sizes on high-bandwidth connections with slow computers; if it were ever implemented, a limit on how far ahead to make volumes would be a good idea (see the sketch below). For now I just reverted to waiting for the previous upload to finish before going ahead (keeping only 1 step ahead).
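
A minimal sketch of that bounded look-ahead idea, assuming a hypothetical producer/uploader split (the names here are illustrative and are not taken from duplicity's AsyncScheduler):

    import threading

    MAX_PENDING_VOLUMES = 4  # assumed cap on how far ahead volumes may be built

    pending = threading.BoundedSemaphore(MAX_PENDING_VOLUMES)

    def produce_volumes(make_volume, schedule_upload, num_volumes):
        # Build volumes ahead of the uploads, but never more than
        # MAX_PENDING_VOLUMES ahead, so the tmp directory is not flooded.
        for vol_num in range(1, num_volumes + 1):
            pending.acquire()            # blocks once the look-ahead is full
            path = make_volume(vol_num)  # CPU-bound: tar/gpg the next volume
            # the uploader is expected to release the slot once the volume is sent
            schedule_upload(path, done=pending.release)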

I traced the code all the way down to the Boto implementation on S3. It seems the issue is with Amazon S3. Recent posts on their forum indicate inconsistent speeds.

For instance, using a volume size of 25 OR 10, after ~40 volumes my speeds dropped dramatically. This may be due to some rate limiter that I'm hitting (possibly a check could be put into duplicity to wait 10-15 seconds if this limit is hit?). Keeping to 50 MB volume sizes I have been able to maintain my upload speed well past 40 volumes and it's continuing; hopefully this is the sweet spot for me to avoid any S3 rate limiting on my specific system.

If there is a rate limiter, maybe duplicity can account for it and sleep for some amount of time so as to avoid being rate limited.

Revision history for this message
someone1 (prateek) wrote :

Another update!

Adding this to line 316 of duplicity seems to remedy the issue (it has been running for over 1 hour with over 100 volumes uploaded and no drop in speed):

if vol_num % 40 == 0:
    time.sleep(15)

Although technically this does not force any worker threads that are uploading to abide by this rule, it just delays the creation of the next volume by 15 seconds...

A solution I was offered online was to close the connection after every upload to S3. I will modify the boto backend to do so and report my findings...

Revision history for this message
Trevor Hemsley (trevor-hemsley) wrote :

I came here to report a similar issue, but I have a different fix for it. I am backing up my home directory with duplicity 0.6.08b on CentOS 5 and my original backup has been running for a week. Yes, a _week_. I'm backing up ~116 GB to a locally attached FireWire removable disk. One of the files in my home directory is a VirtualBox disk image that is a sparse file of 160 GB with 73 GB allocated. I ran the following command:

duplicity --volsize=256 -v5 /home/trevor file:///mnt/firewire_data/backup

That copy of the backup is on volume 256 and is currently writing one new 256 MB volume approximately every 18 hours. Understandably I got a bit bored waiting for this one to end, so I had a quick dig into the code: a fix made a long time ago to reduce the size of incremental backups is what causes this. In get_block_size() in diffdir.py the block size is coerced to a maximum of 2048 bytes. I applied the following patch and kicked off a new copy of this same backup to the same external disk, alongside the original week-old backup. This copy has been running for about 1 hour 10 mins and is already on volume 136 (the other one took ~24 hours to get to this point).

--- diffdir.old 2010-05-20 00:19:29.000000000 +0100
+++ diffdir.py 2010-05-19 23:04:50.000000000 +0100
@@ -713,6 +713,8 @@
     """
     if file_len < 1024000:
         return 512 # set minimum of 512 bytes
+    elif file_len > 16000000000:
+        return 8388608 # use 8 MB blocks for files larger than 16 GB
     else:
         # Split file into about 2000 pieces, rounding to 512
         file_blocksize = long((file_len / (2000 * 512)) * 512)

So, for files of > 16GB, I set the block size to 8MB.
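
For reference, this is roughly what the patched function ends up looking like; it is a paraphrase based on the patch context above and the 2048-byte cap mentioned earlier, not a verbatim copy of diffdir.py (Python 2, as duplicity used at the time):

    def get_block_size(file_len):
        # Return the signature block size (in bytes) for a file of length file_len.
        if file_len < 1024000:
            return 512  # minimum of 512 bytes
        elif file_len > 16000000000:
            return 8388608  # 8 MB blocks for files larger than 16 GB
        else:
            # Split file into about 2000 pieces, rounding to 512,
            # but never exceed the historical 2048-byte cap.
            file_blocksize = long((file_len / (2000 * 512)) * 512)
            return min(file_blocksize, 2048)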

Revision history for this message
someone1 (prateek) wrote :

I have gotten confirmation from Amazon as to what the error is:

From one of the forum moderators:
#########
The #1 thing folks do wrong in both Java and Python is not create fresh HTTPConnection objects for each thread or worker...this causes the app to 'stick' to a single IP for relatively forever. S3 has some protection against this level of stickiness (it looks like your job stuck to this single IP for >24 hours on a single connection) to ensure S3 continues to be load-balanced cleanly. If you simply ensure the code path re-negotiates a new HTTPConnection object every time, I think your problem will go away.
#########

Can we get duplicity to create a new "fresh" connection for every volume it tries to upload?

Revision history for this message
someone1 (prateek) wrote :

I should add that the moderator confirmed that during the 27 hour period I was uploading to Amazon S3, it kept trying to use the same IP on every connection. I have edited the duplicity code to force a new connection on every put. I will post the patch if it works (will find out tonight!)

Revision history for this message
someone1 (prateek) wrote :

There exists a bug! The duplicity code does NOT start a new connection for every new object it puts onto S3. S3 has over 20 servers it load balances across. Using one connection actually hinders performance and they WILL throttle you after a while. This becomes hugely evident on large uploads.

I have edited the code to fix this, but it's not exactly optimal (HINT: set self.conn and self.bucket to None and reinitialize both variables; the self.conn.close() function does NOT reset the connection). I assume a little rewrite of the backend handling would be the best approach to solve this (among other) bug(s). If you create a new instance of the "backend" for each thread uploading, this would result in a new "fresh" connection as well as solve the threading problem with multiple threads uploading to Amazon S3. I haven't looked at any other backend implementation, but if the logic is the same across the board, this should be a straightforward and easy fix to implement.

Also, the connection should be reset during a retry attempt. This is recommended by Amazon and is not handled within Boto.
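
A rough sketch of the reconnect approach described above, against a boto-style S3 backend; the class, method, and attribute names here (BotoBackendSketch, reset_connection, bucket_name) are illustrative assumptions, not the exact duplicity backend code:

    from boto.s3.connection import S3Connection

    class BotoBackendSketch(object):
        def __init__(self, bucket_name):
            self.bucket_name = bucket_name
            self.conn = None
            self.bucket = None
            self.reset_connection()

        def reset_connection(self):
            # Dropping the references and rebuilding them forces boto to
            # negotiate brand-new HTTP connections; calling
            # self.conn.close() alone does NOT reset the connection.
            self.conn = None
            self.bucket = None
            self.conn = S3Connection()  # credentials taken from the environment
            self.bucket = self.conn.lookup(self.bucket_name)

        def put(self, source_path, remote_filename):
            # Fresh connection for every volume (and again on every retry),
            # so the upload is not pinned to a single S3 front-end IP.
            self.reset_connection()
            key = self.bucket.new_key(remote_filename)
            key.set_contents_from_filename(source_path)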

@Developers: If you'd like, I can submit my code and assist with any rewrite if your resources are limited. I will be using this as a backend for my company and will rely heavily on it working at an optimal level. I have invested a lot of time tweaking the code on my local installation in order to do so, and it has improved my backup time by nearly 40%.

Revision history for this message
Kenneth Loafman (kenneth-loafman) wrote : Re: [Bug 582962] Re: Diminishing performance on large files

Sure, if you can produce a patch (diff -au), then we can take the code,
test it and release it.

All contributions are welcome!

...Ken


Revision history for this message
someone1 (prateek) wrote :

I've attached the patch. It includes the following patches as well:

http://bazaar.launchpad.net/~duplicity-team/duplicity/0.6-series/revision/653#duplicity/dup_time.py
http://bazaar.launchpad.net/~duplicity-team/duplicity/0.6-series/revision/652

IT ALSO INCLUDES A FIX FOR THE FOLLOWING: https://bugs.launchpad.net/duplicity/+bug/582711
I added a "--s3-use-rrs" option to enable the usage of RRS with the current released version of boto (not the SVN version with support for RRS built-in).

I did a diff against my distribution's files (Arch), so some cleanup might be needed (I think the environment tag at the top of the duplicity file can be reverted/ignored).

I hope this helps! I also hope to see a new release soon ;)

I will be modifying my source for multi-threading of the upload process, so I'll be sure to upload that patch when it's done.

Please let me know if this was the right way to submit the patch. I have never done anything like this before...

Revision history for this message
someone1 (prateek) wrote :

Is this going to be ignored? The findings I've made were verified by Amazon's staff, and the changes I've made add support for RRS.

Changed in duplicity:
importance: Undecided → Medium
assignee: nobody → Kenneth Loafman (kenneth-loafman)
milestone: none → 0.6.09
Revision history for this message
Kenneth Loafman (kenneth-loafman) wrote :

I have tried to apply the multiple patches and there are a fair number of conflicts. Please submit each patch separately and make sure to include the entire patch. Some of these were truncated, and I'm not going to second guess you on what you intended.

To generate the patch,
   diff -aur orig changed > patch1.diff
and don't screen scrape.

Changed in duplicity:
status: New → In Progress
Changed in duplicity:
status: In Progress → Incomplete
Revision history for this message
Kenneth Loafman (kenneth-loafman) wrote :

I manually applied all of the patches. Some were old, already applied, and caused conflicts. Others just needed an extra blank line at the end to match the context.

Changed in duplicity:
assignee: Kenneth Loafman (kenneth-loafman) → nobody
status: Incomplete → Fix Committed
no longer affects: duplicity/0.7-series
Changed in duplicity:
status: In Progress → Fix Released
Revision history for this message
Fjodor (sune-molgaard) wrote :

It would seem that the patch in comment #4 didn't make it.

I have manually applied it, and am in the process of determining its impact, given that diff'ing one of my larger VM VHDs is not currently possible within the time frame of a normal work day(!)

From what I can see with iotop, there are still bouts of slowness, which I assume correspond to portions of the file that are actually different, but when it goes fast, it seems to go significantly faster than, e.g., yesterday without the patch.

If it doesn't succeed today, where I have to leave early, I might try implementing some sort of progress reporting to ascertain how far it gets.

Revision history for this message
Kenneth Loafman (kenneth-loafman) wrote :

There is an option called --max-blocksize which will do the same thing,
adjustable per run.
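
For example (illustrative values, reusing the command from the earlier comment):

    duplicity --max-blocksize=8388608 --volsize=256 -v5 /home/trevor file:///mnt/firewire_data/backup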


Revision history for this message
Fjodor (sune-molgaard) wrote :

That is true (and thank you for it), but I see no reason, really, to have a default behaviour with such low performance for files even the size of modest VM disk images...
