Comment 10 for bug 622566

Alexander Belchenko (bialix) wrote: Re: [Bug 622566] Re: ftp access inefficiency

John Arbash Meinel wrote:
> On 8/25/2010 12:49 AM, Alexander Belchenko wrote:
>> John Arbash Meinel wrote:
>>> On 8/24/2010 4:02 PM, Alexander Belchenko wrote:
>>>> I've coded a simple cached FTP transport on top of the standard FTP transport in bzrlib; the plugin is available at lp:~bialix/+junk/kftp,
>>>> and you need to use a kftp://host/path URL instead of ftp://host/path.
>>>> It works well and indeed reduces the time for pull and even push. I know it's ad hoc, but it's better than nothing at all. I think it should be safe to cache packs/indices because they're supposed to be write-once data.
>>>
>>> I think you can arguably cache any downloaded content for the lifetime
>>> of the lock in the outer scope. As long as you take care of invalidating
>>> cached data that you just uploaded. (So if you have something cached by
>>> 'get' or 'readv()' then it should be invalidated/updated by a 'put' call.)
>> Yes, I'm invalidating the cache on put and on rename/move calls. But
>> looking at the logs, it seems only put matters. It also seems that all
>> I need to cache is pack files, because bzr doesn't seem to request
>> indices more than once.
>
> It will really depend on the action being performed. "bzr log" over the
> whole history will make repeated passes to the .rix files. Etc.

OK, it makes sense.

>>> Note that this will still be significantly more inefficient than sftp,
>>> because you aren't actually requesting a subset. You are just avoiding
>>> requesting the same content twice. (So if you request 10 bytes 10 times
>>> in a 1MB file, you'll still download 1MB, but that is certainly better
>>> than 10MB.)
>> Based on my measurements, cached ftp and sftp now take roughly the
>> same time for a full pull. A partial pull of 1 revision (as in the
>> initial bug report) now takes 53 seconds instead of 4 minutes, and an
>> sftp partial pull of the same data took 62 seconds for me, so they're
>> actually very close. I think it depends on how the data is actually
>> packed into the pack file. Also, there is a noticeable delay in the
>> sftp case to establish the connection at the beginning of the
>> transaction (up to 5 seconds, if not more).
>
> The numbers will depend a lot on your latency versus bandwidth. I also
> haven't actually looked at your plugin yet.

I hope I can assume that my ftp vs sftp access to the same server over
the same network has roughly the same latency/bandwidth.

The main part of my plugin is here:
http://bazaar.launchpad.net/~bialix/%2Bjunk/kftp/annotate/head:/cached_ftp.py
It wraps several methods of FtpTransport to insert a cache.

Actually, it seems that creating a protocol decorator a la "cached+ftp://"
would be better, but the current implementation is mostly a quick hack and
a proof of concept.
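
Roughly, the shape of it is like this (a simplified, untested sketch, not
the exact plugin code; I'm assuming the standard Transport signatures for
get/put_file/rename here):

    from cStringIO import StringIO
    from bzrlib.transport.ftp import FtpTransport

    class CachedFtpTransport(FtpTransport):
        """Cache whole files fetched via get(); forget them on any write."""

        def __init__(self, *args, **kwargs):
            FtpTransport.__init__(self, *args, **kwargs)
            self._cache = {}  # relpath -> downloaded file content

        def get(self, relpath):
            if relpath not in self._cache:
                # download the file once, keep the bytes for later requests
                self._cache[relpath] = FtpTransport.get(self, relpath).read()
            return StringIO(self._cache[relpath])

        def put_file(self, relpath, f, mode=None):
            # the file is being rewritten: drop the stale cached copy
            self._cache.pop(relpath, None)
            return FtpTransport.put_file(self, relpath, f, mode=mode)

        def rename(self, rel_from, rel_to):
            # either name may now refer to different content
            self._cache.pop(rel_from, None)
            self._cache.pop(rel_to, None)
            return FtpTransport.rename(self, rel_from, rel_to)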

>>> If you do the cache invalidation, I think we would be fine bringing that
>>> into bzr core.
>> I'll prepare a patch then.
>>
>> One question: do you think it's OK to explicitly cache only pack files
>> (I'm checking the file extension now)? It seems like bad practice to
>> bring knowledge about packs into the transport layer. Another way may
>> be to cache only big files, larger than some threshold, e.g. 64KiB or
>> so, so we won't cache lock info files, branch.conf, last-revision and
>> pack-names.
>>
>
> The things you want to cache are accessed via 'readv' rather than via
> 'get' directly. I believe in the FTP transport, it doesn't actually
> implement readv, so it falls back to the default behavior which calls
> get + seek + read (which works for the Local version, not so well for
> everything else).

I'll take a look.

If I understand correctly, I should implement a _readv method for
FtpTransport to override the following default implementation:

     def _readv(self, relpath, offsets):
         """Get parts of the file at the given relative path.

         :param relpath: The path to read.
         :param offsets: A list of (offset, size) tuples.
         :return: A list or generator of (offset, data) tuples
         """
         if not offsets:
             return

         fp = self.get(relpath)
         return self._seek_and_read(fp, offsets, relpath)

As I can see, this implementation uses transport.get().

Does bzrlib always use readv and never get? Or is readv used only for
partial reads from pack/index files? In the latter case I should indeed
insert caching into _readv(), and not into get().

Do I understand this correctly?
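
If so, I imagine adding something like this to the wrapper class sketched
above (rough and untested; it reuses the whole-file cache and the
_seek_and_read helper from the default implementation quoted earlier):

    def _readv(self, relpath, offsets):
        """Serve (offset, size) requests from a whole-file cache."""
        if not offsets:
            return
        if relpath not in self._cache:
            # first request for this file: download it once in full
            self._cache[relpath] = FtpTransport.get(self, relpath).read()
        fp = StringIO(self._cache[relpath])
        return self._seek_and_read(fp, offsets, relpath)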

> If you just implemented readv() and then only cached requests coming
> from there, that would work pretty well.

> Note also that you'll want to check into the append code a bit, too.
> Sometimes we'll readv() from something we just appended to.

The append method invalidates the cache at the moment.
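
In the sketch above that would be roughly the following (again assuming
the standard append_file signature):

    def append_file(self, relpath, f, mode=None):
        # appended data makes any cached copy of the file stale;
        # a smarter cache could append the new bytes instead of dropping them
        self._cache.pop(relpath, None)
        return FtpTransport.append_file(self, relpath, f, mode=mode)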

> I think that would be a better option than trying to find a heuristic
> based on file size.

Does "that" mean "readv"?

Thank you.
--
All the dude wanted was his rug back