cloudfiles backend slow
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Duplicity | New | Undecided | Unassigned |
Bug Description
I noticed that the cloudfiles backend was incredibly slow. After poking around for a bit, I've realised that the culprit is the call to socket.setdefaulttimeout().
I created a simple test script to upload a 100 MB file to Cloud Files. It does pretty much exactly what the cloudfiles backend does:
import os
import socket
from cloudfiles import Connection
from cloudfiles.errors import ResponseError
from cloudfiles import consts

# Credentials come from the environment, mirroring the duplicity backend.
conn_kwargs = {}
conn_kwargs['username'] = os.environ['CLOUDFILES_USERNAME']
conn_kwargs['api_key'] = os.environ['CLOUDFILES_APIKEY']
conn_kwargs['authurl'] = consts.default_authurl

conn = Connection(**conn_kwargs)
# Container and file names are placeholders.
container = conn.create_container('testcontainer')
sobject = container.create_object('100mb.bin')
sobject.load_from_filename('100mb.bin')
If I run it like that, it takes around 15 seconds to upload 100 MB. If I add a call to socket.setdefaulttimeout() before the upload, the transfer slows down dramatically.
If I strace the two runs, I see a call to poll() before each write(). This gets added by Python's socketmodule.c due to the default timeout.
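Here's a minimal sketch to observe the effect locally without Cloud Files (the loopback pair, block size, and iteration count are my own choices, not from the backend): strace it with and without the setdefaulttimeout() line and compare the poll() counts.

import socket

socket.setdefaulttimeout(5)  # comment this line out for the fast case

# Loopback pair: one socket writes 4 kB blocks, the other drains them.
srv = socket.socket()
srv.bind(('127.0.0.1', 0))
srv.listen(1)

cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()

for _ in range(1000):
    cli.sendall(b'x' * 4096)  # with a default timeout set, each send is preceded by poll()
    conn.recv(4096)           # drain so the kernel buffer never fills

cli.close()
conn.close()
srv.close()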
I tried adding timestamps to the strace log and counted how many system calls each of the two runs makes over the course of a single second while transferring the data. With the default socket timeout, I got just over 100 system calls (poll, write, read, poll, write, read, etc.). It's dealing with a block size of 4 kB, so that's 4 kB * (100/3) ≈ 133 kB/s. Without the default socket timeout, I got around 1120 system calls (read, write, read, write, etc.). That translates to 4 kB * (1120/2) = 2240 kB/s. That's a pretty hefty difference.
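The same back-of-the-envelope arithmetic as a quick sanity check (block size and syscall counts are the strace observations above):

block_kb = 4.0  # 4 kB per write(), as seen in strace

# With the default timeout: poll, write, read -> 3 syscalls per block.
print(block_kb * 100 / 3)    # ~133 kB/s

# Without it: write, read -> 2 syscalls per block.
print(block_kb * 1120 / 2)   # 2240 kB/s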
What confuses me, though, is that cloudfiles.Connection.__init__() seems to set the default socket timeout itself, so my test script should have been just as slow.
Heh! No, cloudfiles.Connection.__init__() doesn't call socket.setdefaulttimeout() after all. Here's the code snippet:
class Connection(object):
    [...]
    def __init__(self, username=None, api_key=None, **kwargs):
        [...]
        socket.setdefaulttimeout = int(kwargs.get('timeout', 5))
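The giveaway is that this line is an assignment, not a call: it replaces the socket.setdefaulttimeout function with an int, so no timeout is ever installed. A minimal sketch of the difference (kwargs here is a stand-in for the keyword arguments passed to Connection()):

import socket

kwargs = {}  # stand-in for the **kwargs passed to Connection()

_real = socket.setdefaulttimeout  # keep a reference for this demo

# Buggy form from the snippet above: clobbers the function with an int.
socket.setdefaulttimeout = int(kwargs.get('timeout', 5))
print(type(socket.setdefaulttimeout))  # an int -- no timeout was set

# Intended form: actually call the function.
socket.setdefaulttimeout = _real
socket.setdefaulttimeout(int(kwargs.get('timeout', 5)))
print(socket.getdefaulttimeout())  # 5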
Later versions of python-cloudfiles have this problem fixed and exhibit this slowness regardless of the socket.setdefaulttimeout() call in duplicity. I suppose I should file a bug against python-cloudfiles upstream instead.