ftp access inefficiency

Bug #622566 reported by Alexander Belchenko
This bug affects 2 people
Affects   Status      Importance   Assigned to   Milestone
Bazaar    Confirmed   High         Unassigned
Breezy    Won't Fix   Low          Unassigned

Bug Description

Doing a pull of a couple of revisions over FTP is very slow. Today I spent about 4 minutes pulling 1 revision with changes in 2 files which are not very big. After the pull, the new pack in my 2a repo is only 6KB.

Looking at the .bzr.log I see that bzr downloads the same pack file from the FTP server again and again and again:

12.234 Using fetch logic to copy between
CHKInventoryRepository('ftp://<email address hidden>/xxx/.bzr/repository/')(RepositoryFormat2a())
and CHKInventoryRepository('file:///C:/work/Bazaar/xxx/.bzr/repository/')(RepositoryFormat2a())
12.234 fetch up to rev {<email address hidden>}
12.234 FTP get: /xxx/.bzr/repository/indices/670812de3ad30ee5a426cb15af4ec05f.rix
12.922 FTP get: /xxx/.bzr/repository/indices/3a9ac6d908537c328ad3b65becc869ec.rix
13.750 FTP get: /xxx/.bzr/repository/indices/1eb19f45fbb44be0b59b4bf6120f0b5b.rix
14.672 FTP get: /xxx/.bzr/repository/indices/670812de3ad30ee5a426cb15af4ec05f.six
15.312 FTP get: /xxx/.bzr/repository/indices/3a9ac6d908537c328ad3b65becc869ec.six
16.031 FTP get: /xxx/.bzr/repository/indices/1eb19f45fbb44be0b59b4bf6120f0b5b.six
16.781 FTP get: /xxx/.bzr/repository/packs/3a9ac6d908537c328ad3b65becc869ec.pack
53.312 FTP get: /xxx/.bzr/repository/indices/670812de3ad30ee5a426cb15af4ec05f.iix
54.219 FTP get: /xxx/.bzr/repository/indices/3a9ac6d908537c328ad3b65becc869ec.iix
54.937 FTP get: /xxx/.bzr/repository/packs/3a9ac6d908537c328ad3b65becc869ec.pack
84.844 FTP get: /xxx/.bzr/repository/packs/3a9ac6d908537c328ad3b65becc869ec.pack
124.609 FTP get: /xxx/.bzr/repository/indices/670812de3ad30ee5a426cb15af4ec05f.cix
125.359 FTP get: /xxx/.bzr/repository/indices/3a9ac6d908537c328ad3b65becc869ec.cix
126.250 FTP get: /xxx/.bzr/repository/packs/3a9ac6d908537c328ad3b65becc869ec.pack
166.953 FTP get: /xxx/.bzr/repository/packs/3a9ac6d908537c328ad3b65becc869ec.pack
203.562 FTP get: /xxx/.bzr/repository/indices/670812de3ad30ee5a426cb15af4ec05f.tix
204.281 FTP get: /xxx/.bzr/repository/indices/3a9ac6d908537c328ad3b65becc869ec.tix
205.000 FTP get: /xxx/.bzr/repository/packs/3a9ac6d908537c328ad3b65becc869ec.pack
243.531 FTP get: /xxx/ddaa/.bzr/branch/tags

As you can see, the pack file 3a9ac6d908537c328ad3b65becc869ec.pack has been transferred 6 times. The bad thing is that this pack file is 2.5 MB big.

How could this be improved/fixed? Maybe some simple cache to keep downloaded files in memory, or on disk in a TEMP location? I need help to fix this.

Revision history for this message
Alexander Belchenko (bialix) wrote :

Every download of that file via FTP costs 27 seconds. So it's 162 seconds (2.5+ minutes) just to re-download the same file again and again.

Changed in bzr:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Alexander Belchenko (bialix) wrote :

The difference between sftp and ftp access to the same branch, for getting the full branch (initial branching), is 82s vs 542s.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 622566] [NEW] ftp access inefficiency

On 8/23/2010 1:58 AM, Alexander Belchenko wrote:
> Public bug reported:
>
> Doing a pull of couple revisions over FTP is very slow. Today I've spent
> about 4 minutes to pull 1 revision with changes in 2 files which are not
> very big. After pull new pack in my 2a repo is only 6KB big.
>
> Looking at the .bzr.log I see that bzr downloading the same pack file
> from the FTP server again and again and again:

Just to mention we've known about this for quite some time (I'm a bit
surprised there isn't already a bug).

Basically, we don't support partial requests from an existing file
(readv), and we don't cache the file content between requests. So 10
separate readv requests for 10 bytes each on a 100MB file will download
10*100MB = 1GB just to offer up the 100 bytes requested.
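
In code, that fallback amounts to something like this illustrative toy
(hypothetical class, not the actual bzrlib implementation):

    import io

    class NaiveReadvTransport(object):
        """Toy model of a transport with no native readv() support."""

        def __init__(self, remote_bytes):
            self.remote_bytes = remote_bytes   # stands in for the remote file
            self.bytes_transferred = 0

        def get(self, relpath):
            # Models a full FTP RETR: the whole file crosses the wire.
            self.bytes_transferred += len(self.remote_bytes)
            return io.BytesIO(self.remote_bytes)

        def readv(self, relpath, offsets):
            # Fallback readv: get + seek + read, once per requested range.
            for offset, length in offsets:
                f = self.get(relpath)          # re-downloads everything
                f.seek(offset)
                yield offset, f.read(length)

    t = NaiveReadvTransport(b"x" * (1024 * 1024))            # a 1MB "file"
    list(t.readv("a.pack", [(i * 10, 10) for i in range(10)]))
    assert t.bytes_transferred == 10 * 1024 * 1024           # ~10MB for 100 bytes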

I think the fix is to figure out how to do partial reads, since some ftp
clients do support 'resume', which must rely on some way of requesting
content that doesn't begin at the start of the file.

That said, our FTP support is 'functional' but *far* from optimal. If
you have any chance to use sftp instead, do so. Even better if you just
use bzr+ssh...

John
=:->

Revision history for this message
Alexander Belchenko (bialix) wrote :

John A Meinel writes:
> That said, our FTP support is 'functional' but *far* from optimal. If
> you have any chance to use sftp instead, do so. Even better if you just
> use bzr+ssh...

In this particular case I need FTP for collaboration with others.

Revision history for this message
Alexander Belchenko (bialix) wrote :

I've coded a simple cached FTP transport on top of the standard FTP transport in bzrlib; the plugin is available at lp:~bialix/+junk/kftp,
and you need to use a kftp://host/path URL instead of ftp://host/path.
It works well and indeed reduces the time for pull and even push. I know it's ad-hoc, but it's better than nothing at all. I think it should be safe to cache packs/indices because they're supposed to be write-once data.
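
The core idea is roughly this (a simplified sketch, not the actual plugin
code; the class name and Transport-style method names are illustrative):

    class CachedFtpTransport(object):
        """Sketch of a caching wrapper around a real FTP transport."""

        def __init__(self, ftp_transport):
            self._ftp = ftp_transport
            self._cache = {}                 # relpath -> full file content

        def get_bytes(self, relpath):
            if relpath not in self._cache:
                self._cache[relpath] = self._ftp.get_bytes(relpath)
            return self._cache[relpath]

        def readv(self, relpath, offsets):
            data = self.get_bytes(relpath)   # one download, many reads
            for offset, length in offsets:
                yield offset, data[offset:offset + length]

        def put_bytes(self, relpath, raw_bytes):
            # Writing invalidates any stale cached copy.
            self._cache.pop(relpath, None)
            return self._ftp.put_bytes(relpath, raw_bytes)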

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 622566] Re: ftp access inefficiency

On 8/24/2010 4:02 PM, Alexander Belchenko wrote:
> I've coded a simple cached FTP transport on top of the standard FTP transport in bzrlib; the plugin is available at lp:~bialix/+junk/kftp,
> and you need to use a kftp://host/path URL instead of ftp://host/path.
> It works well and indeed reduces the time for pull and even push. I know it's ad-hoc, but it's better than nothing at all. I think it should be safe to cache packs/indices because they're supposed to be write-once data.
>

I think you can arguably cache any downloaded content for the lifetime
of the lock in the outer scope. As long as you take care of invalidating
cached data that you just uploaded. (So if you have something cached by
'get' or 'readv()' then it should be invalidated/updated by a 'put' call.)

Note that this will still be significantly more inefficient than sftp,
because you aren't actually requesting a subset. You are just avoiding
requesting the same content twice. (So if you request 10 bytes 10 times
in a 1MB file, you'll still download 1MB, but that is certainly better
than 10MB.)

If you do the cache invalidation, I think we would be fine bringing that
into bzr core.

John
=:->

Revision history for this message
John A Meinel (jameinel) wrote :

On 8/24/2010 4:02 PM, Alexander Belchenko wrote:
> I've coded a simple cached FTP transport on top of the standard FTP transport in bzrlib; the plugin is available at lp:~bialix/+junk/kftp,
> and you need to use a kftp://host/path URL instead of ftp://host/path.
> It works well and indeed reduces the time for pull and even push. I know it's ad-hoc, but it's better than nothing at all. I think it should be safe to cache packs/indices because they're supposed to be write-once data.
>

Note a link like this:
  http://cr.yp.to/ftp/retr.html

It specifies the REST verb. The good is that it lets you
start in the middle of the content. The bad is:

1) There doesn't seem to be a way to 'seek' after reading, or stop the
   existing download without closing the connection. So if you want to
   read 100 bytes at offset 10,000, you can seek to 10,000, but it will
   try to stream the remaining 990,000 bytes back to you.
2) Most of our pack-based 'readv' requests actually request the first
   bytes of the file, because we want to make sure that the pack magic
   header is present.
   There certainly can be stuff we can do here, especially with caching.
3) Some of this might be decent to do if we got fancier about handling
   the individual ftp connections. You connect for the next data
   transfer, grab some bytes, and then disconnect on that socket only,
   and not the control socket.
   That is about the extent of my knowledge of ftp, though.
4) I did check the docs, and ftplib does support REST:
   http://docs.python.org/library/ftplib.html#ftplib.FTP.retrbinary

   It has a 'rest' parameter. It may cause an exception to be raised if
   REST is not supported, but we can trap that and retry.

   They also allow you to set the 'maxblocksize', which will mean more
   bytes are probably on the wire, but we can close the socket once we've
   got enough bytes, etc.
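
A rough, untested sketch of what a ranged read via REST might look like
with ftplib (it assumes the server accepts REST, omits the trap-and-retry
for servers that don't, and the host and path are made up):

    import ftplib

    def read_range(ftp, path, offset, length):
        """Fetch `length` bytes starting at `offset` using REST + RETR."""
        ftp.voidcmd('TYPE I')                       # binary mode
        conn = ftp.transfercmd('RETR %s' % path, rest=offset)
        chunks, remaining = [], length
        while remaining > 0:
            data = conn.recv(min(remaining, 8192))
            if not data:
                break
            chunks.append(data)
            remaining -= len(data)
        # The server is still streaming the rest of the file; ABOR asks it
        # to stop so the control connection stays usable afterwards.
        ftp.abort()
        conn.close()
        return b''.join(chunks)

    ftp = ftplib.FTP('ftp.example.com')             # made-up host
    ftp.login('user', 'password')
    magic = read_range(ftp, '/repo/packs/example.pack', 0, 12)   # made-up path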

John
=:->

Revision history for this message
Alexander Belchenko (bialix) wrote :

John Arbash Meinel writes:
> On 8/24/2010 4:02 PM, Alexander Belchenko wrote:
>> I've coded a simple cached FTP transport on top of the standard FTP transport in bzrlib; the plugin is available at lp:~bialix/+junk/kftp,
>> and you need to use a kftp://host/path URL instead of ftp://host/path.
>> It works well and indeed reduces the time for pull and even push. I know it's ad-hoc, but it's better than nothing at all. I think it should be safe to cache packs/indices because they're supposed to be write-once data.
>
>
> I think you can arguably cache any downloaded content for the lifetime
> of the lock in the outer scope. As long as you take care of invalidating
> cached data that you just uploaded. (So if you have something cached by
> 'get' or 'readv()' then it should be invalidated/updated by a 'put' call.)

Yes, I'm invalidating the cache on put, and on rename/move calls. But
looking at the logs it seems only put matters. Also looking at the logs it
seems all I need is to cache pack files, because bzr doesn't seem to
request indices more than once.

> Note that this will still be significantly more inefficient than sftp,
> because you aren't actually requesting a subset. You are just avoiding
> requesting the same content twice. (So if you request 10 bytes 10 times
> in a 1MB file, you'll still download 1MB, but that is certainly better
> than 10MB.)

Based on my measurements, cached ftp and sftp now take roughly the same
time to make the full pull. A partial pull of 1 revision (as in the
initial bug report) now took 53 seconds instead of 4 minutes. And an
sftp partial pull of the same data took 62 seconds for me. So they're
actually very close. I think it depends on how the data is actually
packed into the pack file. Also there is a noticeable delay in the sftp
case to establish the connection at the beginning of the transaction
(up to 5 seconds, if not more).

> If you do the cache invalidation, I think we would be fine bringing that
> into bzr core.

I'll prepare a patch then.

One question: do you think it's OK to explicitly cache only pack files
(I'm checking the file extension now)? It seems like bad practice to
bring the knowledge about packs into the transport layer. Another way may
be to cache only big files, greater than some threshold, e.g. 64KiB or
so, so we won't cache lock info files, branch.conf, last-revision and
pack-names.

--
All the dude wanted was his rug back

Revision history for this message
John A Meinel (jameinel) wrote :

On 8/25/2010 12:49 AM, Alexander Belchenko wrote:
> John Arbash Meinel writes:
>> On 8/24/2010 4:02 PM, Alexander Belchenko wrote:
>>> I've coded a simple cached FTP transport on top of the standard FTP transport in bzrlib; the plugin is available at lp:~bialix/+junk/kftp,
>>> and you need to use a kftp://host/path URL instead of ftp://host/path.
>>> It works well and indeed reduces the time for pull and even push. I know it's ad-hoc, but it's better than nothing at all. I think it should be safe to cache packs/indices because they're supposed to be write-once data.
>>
>>
>> I think you can arguably cache any downloaded content for the lifetime
>> of the lock in the outer scope. As long as you take care of invalidating
>> cached data that you just uploaded. (So if you have something cached by
>> 'get' or 'readv()' then it should be invalidated/updated by a 'put' call.)
>
> Yes, I'm invalidating the cache on put, and on rename/move calls. But
> looking at the logs it seems only put matters. Also looking at the logs it
> seems all I need is to cache pack files, because bzr doesn't seem to
> request indices more than once.

It will really depend on the action being performed. "bzr log" over the
whole history will make repeated passes to the .rix files. Etc.

>
>> Note that this will still be significantly more inefficient than sftp,
>> because you aren't actually requesting a subset. You are just avoiding
>> requesting the same content twice. (So if you request 10 bytes 10 times
>> in a 1MB file, you'll still download 1MB, but that is certainly better
>> than 10MB.)
>
> Based on my measurements, cached ftp and sftp now take roughly the same
> time to make the full pull. A partial pull of 1 revision (as in the
> initial bug report) now took 53 seconds instead of 4 minutes. And an
> sftp partial pull of the same data took 62 seconds for me. So they're
> actually very close. I think it depends on how the data is actually
> packed into the pack file. Also there is a noticeable delay in the sftp
> case to establish the connection at the beginning of the transaction
> (up to 5 seconds, if not more).

The numbers will depend a lot on your latency versus bandwidth. I also
haven't actually looked at your plugin yet.

>
>> If you do the cache invalidation, I think we would be fine bringing that
>> into bzr core.
>
> I'll prepare a patch then.
>
> One question: do you think it's OK to explicitly cache only pack files
> (I'm checking the file extension now)? It seems like bad practice to
> bring the knowledge about packs into the transport layer. Another way may
> be to cache only big files, greater than some threshold, e.g. 64KiB or
> so, so we won't cache lock info files, branch.conf, last-revision and
> pack-names.
>

The things you want to cache are accessed via 'readv' rather than via
'get' directly. I believe the FTP transport doesn't actually
implement readv, so it falls back to the default behavior, which calls
get + seek + read (which works for the Local version, not so well for
everything else).

If you just implemented readv() and then only cached requests coming
from there, that would work pretty well.
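
A minimal sketch of that approach (hypothetical subclass; the method
names reflect my reading of the bzrlib Transport API and may differ in
detail):

    from bzrlib.transport.ftp import FtpTransport

    class ReadvCachingFtpTransport(FtpTransport):
        """Sketch: cache whole files, but only those touched by readv()."""

        def __init__(self, *args, **kwargs):
            FtpTransport.__init__(self, *args, **kwargs)
            self._readv_cache = {}

        def _readv(self, relpath, offsets):
            # One full download per file; every requested range is then
            # served from the in-memory copy instead of a fresh RETR.
            if relpath not in self._readv_cache:
                self._readv_cache[relpath] = self.get_bytes(relpath)
            data = self._readv_cache[relpath]
            for offset, length in offsets:
                yield offset, data[offset:offset + length]

        def append_file(self, relpath, f, mode=None):
            # We may readv() something we just appended to, so drop any
            # cached copy first.
            self._readv_cache.pop(relpath, None)
            return FtpTransport.append_file(self, relpath, f, mode=mode)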

Note also that you'll want to check into the append code a bit, too.
Sometimes we'll readv() from something we just appended to.

I think that would be a better option than trying to find a heuristic
based on file size.

...

Revision history for this message
Alexander Belchenko (bialix) wrote :

John Arbash Meinel writes:
> On 8/25/2010 12:49 AM, Alexander Belchenko wrote:
>> John Arbash Meinel writes:
>>> On 8/24/2010 4:02 PM, Alexander Belchenko wrote:
>>>> I've coded a simple cached FTP transport on top of the standard FTP transport in bzrlib; the plugin is available at lp:~bialix/+junk/kftp,
>>>> and you need to use a kftp://host/path URL instead of ftp://host/path.
>>>> It works well and indeed reduces the time for pull and even push. I know it's ad-hoc, but it's better than nothing at all. I think it should be safe to cache packs/indices because they're supposed to be write-once data.
>>>
>>> I think you can arguably cache any downloaded content for the lifetime
>>> of the lock in the outer scope. As long as you take care of invalidating
>>> cached data that you just uploaded. (So if you have something cached by
>>> 'get' or 'readv()' then it should be invalidated/updated by a 'put' call.)
>> Yes, I'm invalidating the cache on put, and on rename/move calls. But
>> looking at the logs it seems only put matters. Also looking at the logs it
>> seems all I need is to cache pack files, because bzr doesn't seem to
>> request indices more than once.
>
> It will really depend on the action being performed. "bzr log" over the
> whole history will make repeated passes to the .rix files. Etc.

OK, it makes sense.

>>> Note that this will still be significantly more inefficient than sftp,
>>> because you aren't actually requesting a subset. You are just avoiding
>>> requesting the same content twice. (So if you request 10 bytes 10 times
>>> in a 1MB file, you'll still download 1MB, but that is certainly better
>>> than 10MB.)
>> Based on my measurements, cached ftp and sftp now take roughly the same
>> time to make the full pull. A partial pull of 1 revision (as in the
>> initial bug report) now took 53 seconds instead of 4 minutes. And an
>> sftp partial pull of the same data took 62 seconds for me. So they're
>> actually very close. I think it depends on how the data is actually
>> packed into the pack file. Also there is a noticeable delay in the sftp
>> case to establish the connection at the beginning of the transaction
>> (up to 5 seconds, if not more).
>
> The numbers will depend a lot on your latency versus bandwidth. I also
> haven't actually looked at your plugin yet.

I hope I can assume that my ftp vs sftp access to the same server over
the same network has roughly the same latency/bandwidth.

The main part of my plugin is here:
http://bazaar.launchpad.net/~bialix/%2Bjunk/kftp/annotate/head:/cached_ftp.py
It wraps several methods of FtpTransport to insert the cache.

Actually it seems that creating a protocol decorator à la "cached+ftp://"
would be better, but the current implementation is mostly a quick hack and
proof of concept.

>>> If you do the cache invalidation, I think we would be fine bringing that
>>> into bzr core.
>> I'll prepare a patch then.
>>
>> One question: do you think it's OK to explicitly cache only pack files
>> (I'm checking the file extension now)? It seems like bad practice to
>> bring the knowledge about packs into the transport layer. Another way may
>> be to cache only big files, greater than some threshold, e.g. 64KiB or
>> so, so we won't cache lock info files, branch.conf, last-revision and
>> pack-names.

...

Revision history for this message
John A Meinel (jameinel) wrote :

...
> The main part of my plugin is here:
> http://bazaar.launchpad.net/~bialix/%2Bjunk/kftp/annotate/head:/cached_ftp.py
> It wraps several methods of FtpTransport to insert the cache.
>
> Actually it seems that creating a protocol decorator à la "cached+ftp://"
> would be better, but the current implementation is mostly a quick hack and
> proof of concept.

The only specific issue I have is that it is all cached in memory, and
we often have very large .pack files. (Robert had a repository that was
16GB in size, IIRC.) My largest bzr repo pack is 54MB.

So I would look into using 'tempfile' code to put content into a local
temporary file, and then return the data from that. You could probably
put small content in memory (say <64k or something). The concerns would be:
 a) If you hold the files open (using TemporaryFile, which deletes when
    closed), then you may run out of file descriptors (unlikely, as we
    do try to keep the number of files small, maybe <100)
 b) If you don't, then you have to figure out when to delete the local
    content.
    One option is to actually not delete the local content, considering
    pack files are named by md5sums. However, you'd want some sort of
    invalidation, because when we repack, the repo will delete those old
    files, but you wouldn't know about that locally.

Probably best to stick with TemporaryFile.
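
For example, something along these lines (illustrative sketch only; the
names and the 64k threshold are made up):

    import tempfile

    SMALL_FILE_LIMIT = 64 * 1024   # keep anything under 64KiB in memory

    class SpillingCache(object):
        """Keep small files in memory; spill large ones to anonymous
        temp files, which the OS reclaims when they are closed."""

        def __init__(self):
            self._entries = {}    # relpath -> bytes or open temp file

        def store(self, relpath, data):
            if len(data) < SMALL_FILE_LIMIT:
                self._entries[relpath] = data
            else:
                f = tempfile.TemporaryFile()   # deleted on close/exit
                f.write(data)
                self._entries[relpath] = f

        def read(self, relpath, offset, length):
            entry = self._entries[relpath]
            if isinstance(entry, bytes):
                return entry[offset:offset + length]
            entry.seek(offset)
            return entry.read(length)

        def invalidate(self, relpath):
            entry = self._entries.pop(relpath, None)
            if entry is not None and not isinstance(entry, bytes):
                entry.close()    # frees the file descriptor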

>

> If I understand correctly I should implement the _readv method for
> FtpTransport to override the following default implementation:

Correct.

> Does bzrlib always use readv and never get? Or is readv used only for
> partial reads from pack/indices files? In the latter case indeed I
> should insert caching in _readv(), and not in get().
>
> Do I understand this correctly?

We always use readv() to read anything that we can read a subset of
(.pack and indices). We standardized on '.get_bytes()' for stuff like
meta control files (branch-format, branch.conf, etc.), though the default
implementation of .get_bytes() calls .get().read(). (This is because
.get() existed before .get_bytes().)
>
>> If you just implemented readv() and then only cached requests coming
>> from there, that would work pretty well.
>
>> Note also that you'll want to check into the append code a bit, too.
>> Sometimes we'll readv() from something we just appended to.
>
> The append method invalidates the cache at the moment.
>
>> I think that would be a better option than trying to find a heuristic
>> based on file size.
>
> Does "that" mean "readv"?
>
> Thank you.

Yes.

John
=:->

Revision history for this message
Gary van der Merwe (garyvdm) wrote :

On 25/08/2010 07:49, Alexander Belchenko wrote:
> Yes, I'm invalidating the cache on put, and on rename/move calls. But
> looking at logs it seems only put matters.

I would say you should rename the cache entry when the file is renamed. So if
you rename a to b then you should do:
cache['b'] = cache['a']
del cache['a']

Revision history for this message
Alexander Belchenko (bialix) wrote :

Gary van der Merwe writes:
> On 25/08/2010 07:49, Alexander Belchenko wrote:
>> Yes, I'm invalidating the cache on put, and on rename/move calls. But
>> looking at logs it seems only put matters.
>
> I would say you should rename the cache entry when the file is renamed. So if
> you rename a to b then you should do:
> cache['b'] = cache['a']
> del cache['a']

I'd prefer to invalidate the cache instead of moving content, because at
that point I don't know whether the operation will be successful or not.

Revision history for this message
Vincent Ladeuil (vila) wrote :

>>>>> Gary van der Merwe <email address hidden> writes:

    > On 25/08/2010 07:49, Alexander Belchenko wrote:
    >> Yes, I'm invalidating the cache on put, and on rename/move calls. But
    >> looking at logs it seems only put matters.

    > I would say you should rename the cache entry when the file is renamed. So if
    > you rename a to b then you should do:
    > cache['b'] = cache['a']
    > del cache['a']

Good idea, as it will partly address the repacks. Well, for people using
the plugin, that is ;)

I think using temp files is the safest bet to cover all cases anyway,
even if it's less efficient from one bzr invocation to the next.

Jelmer Vernooij (jelmer)
tags: added: ftp
Vincent Ladeuil (vila)
tags: added: performance
Jelmer Vernooij (jelmer)
tags: added: check-for-breezy
Jelmer Vernooij (jelmer)
tags: removed: check-for-breezy
Changed in brz:
status: New → Triaged
importance: Undecided → Low
Jelmer Vernooij (jelmer)
Changed in brz:
status: Triaged → Won't Fix