Comment 9 for bug 622566

John A Meinel (jameinel) wrote : Re: [Bug 622566] Re: ftp access inefficiency


On 8/25/2010 12:49 AM, Alexander Belchenko wrote:
> John Arbash Meinel wrote:
>> On 8/24/2010 4:02 PM, Alexander Belchenko wrote:
>>> I've coded a simple cached FTP transport on top of the standard FTP transport in bzrlib. The plugin is available at lp:~bialix/+junk/kftp
>>> and you need to use a kftp://host/path URL instead of ftp://host/path.
>>> It works well and does reduce the time for pull and even push. I know it's ad hoc, but it's better than nothing at all. I think it should be safe to cache packs/indices because they're supposed to be write-once data.
>>
>>
>> I think you can arguably cache any downloaded content for the lifetime
>> of the lock in the outer scope, as long as you take care of invalidating
>> cached data that you just uploaded. (So if you have something cached by
>> 'get' or 'readv()' then it should be invalidated/updated by a 'put' call.)
>
> Yes, I'm invalidating the cache on put, and on rename/move calls. But
> looking at the logs, it seems only put matters. The logs also suggest
> that all I need is to cache pack files, because bzr doesn't seem to
> request indices more than once.

It will really depend on the action being performed. For example,
"bzr log" over the whole history will make repeated passes over the
.rix files.
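
The caching rule discussed above (keep what was downloaded, drop it as soon
as the same path is written or renamed) can be sketched roughly as follows.
This is only an illustration: CachingTransport and DictBackend are
hypothetical names, the get/put/rename methods are simplified stand-ins
rather than the real bzrlib Transport signatures, and it is written as
standalone Python 3 rather than as a bzrlib plugin.

```python
class DictBackend:
    """Hypothetical in-memory stand-in for the real (slow) FTP transport."""

    def __init__(self):
        self.files = {}

    def get(self, relpath):
        return self.files[relpath]

    def put(self, relpath, data):
        self.files[relpath] = data

    def rename(self, old, new):
        self.files[new] = self.files.pop(old)


class CachingTransport:
    """Hypothetical wrapper: cache downloads, invalidate on writes."""

    def __init__(self, backend):
        self._backend = backend
        self._cache = {}      # relpath -> bytes already downloaded
        self.downloads = 0    # how many times we really hit the backend

    def get(self, relpath):
        if relpath not in self._cache:
            self._cache[relpath] = self._backend.get(relpath)
            self.downloads += 1
        return self._cache[relpath]

    def put(self, relpath, data):
        self._backend.put(relpath, data)
        # What we cached for this path is now stale.
        self._cache.pop(relpath, None)

    def rename(self, old, new):
        self._backend.rename(old, new)
        # Both names may now refer to different content than we cached.
        self._cache.pop(old, None)
        self._cache.pop(new, None)
```

In a real transport the cache would also need to be dropped when the outer
lock is released, which this sketch does not model.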

>
>> Note that this will still be significantly less efficient than sftp,
>> because you aren't actually requesting a subset. You are just avoiding
>> requesting the same content twice. (So if you request 10 bytes 10 times
>> in a 1MB file, you'll still download 1MB, but that is certainly better
>> than 10MB.)
>
> Based on my measurements, cached ftp and sftp now take roughly the same
> time to make the full pull. A partial pull of 1 revision (as in the
> initial bug report) now took 53 seconds instead of 4 minutes, and an
> sftp partial pull of the same data took 62 seconds for me. So they're
> actually very close. I think it depends on how the data is actually
> packed into the pack file. There is also a noticeable delay in the sftp
> case to establish the connection at the beginning of the transaction
> (up to 5 seconds, if not more).

The numbers will depend a lot on your latency versus bandwidth. I also
haven't actually looked at your plugin yet.
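
To make the latency-versus-bandwidth point concrete, here is a rough
back-of-the-envelope model of the "10 bytes, 10 times, from a 1MB file"
example quoted above. Everything in it is an assumption for illustration:
the function name, the one-round-trip-per-request simplification, and any
latency or bandwidth figures you feed it are not measurements from this
bug.

```python
def transfer_seconds(bytes_moved, round_trips, latency_s, bandwidth_bytes_per_s):
    """Very rough cost: one latency hit per round trip, plus raw transfer time."""
    return round_trips * latency_s + bytes_moved / bandwidth_bytes_per_s


FILE_SIZE = 1000000   # the 1MB file from the example
READS = 10            # number of separate requests
READ_SIZE = 10        # bytes wanted per request

# (bytes moved, round trips) for each strategy
uncached_ftp = (READS * FILE_SIZE, READS)  # whole file fetched every time: 10MB
cached_ftp = (FILE_SIZE, 1)                # whole file fetched once: 1MB
ranged_reads = (READS * READ_SIZE, READS)  # only the requested bytes: 100 bytes


def compare(latency_s, bandwidth_bytes_per_s):
    """Return estimated seconds for each strategy under the given link."""
    return {
        "uncached_ftp": transfer_seconds(*uncached_ftp, latency_s, bandwidth_bytes_per_s),
        "cached_ftp": transfer_seconds(*cached_ftp, latency_s, bandwidth_bytes_per_s),
        "ranged_reads": transfer_seconds(*ranged_reads, latency_s, bandwidth_bytes_per_s),
    }
```

Calling compare() with a high-latency, high-bandwidth link narrows the gap
between the cached and ranged strategies, while a low-bandwidth link widens
it, which is one way the measured numbers could end up close.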

>
>> If you do the cache invalidation, I think we would be fine bringing that
>> into bzr core.
>
> I'll prepare a patch then.
>
> One question: do you think it's OK to explicitly cache only pack files
> (I'm checking the file extension now)? It seems like bad practice to
> bring knowledge about packs into the transport layer. Another way may
> be to cache only big files, greater than some threshold, e.g. 64KiB or
> so, so we won't cache lock info files, branch.conf, last-revision and
> pack-names.
>

The things you want to cache are accessed via 'readv' rather than via
'get' directly. I believe the FTP transport doesn't actually implement
readv, so it falls back to the default behavior, which calls
get + seek + read (that works for the Local version, but not so well for
everything else).

If you just implemented readv() and then only cached requests coming
from there, that would work pretty well.

Note also that you'll want to check into the append code a bit, too.
Sometimes we'll readv() from something we just appended to.

I think that would be a better option than trying to find a heuristic
based on file size.
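
A minimal sketch of that idea, under stated assumptions: the class and
backend names are hypothetical, readv() here takes a list of (offset, size)
pairs and yields (offset, data) pairs, and the other methods are simplified
stand-ins rather than the actual bzrlib Transport signatures. Only content
requested through readv() is cached; plain get() passes straight through,
and put() and append() both drop the cached copy.

```python
class MemoryBackend:
    """Hypothetical in-memory stand-in for a whole-file-only FTP transport."""

    def __init__(self):
        self.files = {}
        self.gets = 0  # counts full-file downloads

    def get(self, relpath):
        self.gets += 1
        return self.files[relpath]

    def put(self, relpath, data):
        self.files[relpath] = data

    def append(self, relpath, data):
        self.files[relpath] = self.files.get(relpath, b"") + data


class ReadvCachingTransport:
    """Hypothetical wrapper that caches only files read via readv()."""

    def __init__(self, backend):
        self._backend = backend
        self._readv_cache = {}  # relpath -> full file content

    def get(self, relpath):
        # Small control files (locks, branch.conf, ...) are read with get
        # and are deliberately not cached.
        return self._backend.get(relpath)

    def readv(self, relpath, offsets):
        """Yield (offset, data) for each (offset, size) pair in offsets."""
        data = self._readv_cache.get(relpath)
        if data is None:
            data = self._backend.get(relpath)  # one full download
            self._readv_cache[relpath] = data
        for offset, size in offsets:
            chunk = data[offset:offset + size]
            if len(chunk) != size:
                raise ValueError("short read at offset %d of %s" % (offset, relpath))
            yield offset, chunk

    def put(self, relpath, data):
        self._backend.put(relpath, data)
        self._readv_cache.pop(relpath, None)

    def append(self, relpath, data):
        self._backend.append(relpath, data)
        # A later readv() may target the bytes just appended, so the
        # cached copy must not be served.
        self._readv_cache.pop(relpath, None)
```

Dropping the entry on append is the simplest safe choice; extending the
cached bytes in place would avoid a re-download, at the cost of trusting
that nobody else appended in between.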

John
=:->