John Arbash Meinel wrote:
> On 8/25/2010 12:49 AM, Alexander Belchenko wrote:
>> John Arbash Meinel wrote:
>>> On 8/24/2010 4:02 PM, Alexander Belchenko wrote:
>>>> I've coded a simple cached FTP transport on top of standard FTP transport in bzrlib, the plugin is available at lp:~bialix/+junk/kftp
>>>> and you need to use kftp://host/path URL instead of ftp://host/path.
>>>> It works well and indeed reduces the time for pull and even push. I know it's ad hoc, but it's better than nothing at all. I think it should be safe to cache packs/indices because they're supposed to be write-once data.
>>>
>>> I think you can arguably cache any downloaded content for the lifetime
>>> of the lock in the outer scope. As long as you take care of invalidating
>>> cached data that you just uploaded. (So if you have something cached by
>>> 'get' or 'readv()' then it should be invalidated/updated by a 'put' call.)
>> Yes, I'm invalidating the cache on put, and on rename/move calls. But
>> looking at the logs it seems only put matters. Also, looking at the logs
>> it seems all I need is to cache pack files, because bzr doesn't seem to
>> request indices more than once.
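For reference, the invalidate-on-put/rename behavior described above can be sketched like this. This is a minimal illustration with hypothetical class names (not the actual kftp plugin code); MemoryBacking is just an in-memory stand-in for the real FTP transport:

```python
class MemoryBacking(object):
    """Tiny in-memory stand-in for a real FTP transport (illustration only)."""

    def __init__(self):
        self.files = {}    # relpath -> bytes
        self.fetches = 0   # how many real "downloads" happened

    def get_bytes(self, relpath):
        self.fetches += 1
        return self.files[relpath]

    def put_bytes(self, relpath, data):
        self.files[relpath] = data

    def rename(self, rel_from, rel_to):
        self.files[rel_to] = self.files.pop(rel_from)


class CachingTransport(object):
    """Cache whole files per path; drop the cached copy on any write."""

    def __init__(self, backing):
        self._backing = backing
        self._cache = {}   # relpath -> bytes

    def get_bytes(self, relpath):
        # Download each file at most once; serve repeats from the cache.
        if relpath not in self._cache:
            self._cache[relpath] = self._backing.get_bytes(relpath)
        return self._cache[relpath]

    def put_bytes(self, relpath, data):
        # put invalidates, so the next read re-fetches fresh content.
        self._cache.pop(relpath, None)
        self._backing.put_bytes(relpath, data)

    def rename(self, rel_from, rel_to):
        # Both names may be stale after a rename.
        self._cache.pop(rel_from, None)
        self._cache.pop(rel_to, None)
        self._backing.rename(rel_from, rel_to)
```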
>
> It will really depend on the action being performed. "bzr log" over the
> whole history will make repeated passes to the .rix files. Etc.
OK, it makes sense.
>>> Note that this will still be significantly more inefficient than sftp,
>>> because you aren't actually requesting a subset. You are just avoiding
>>> requesting the same content twice. (So if you request 10 bytes 10 times
>>> in a 1MB file, you'll still download 1MB, but that is certainly better
>>> than 10MB.)
>> Based on my measurements, cached ftp and sftp now took roughly the same
>> time for a full pull. A partial pull of 1 revision (as in the initial
>> bug report) now took 53 seconds instead of 4 minutes, and an sftp
>> partial pull of the same data took 62 seconds for me. So they're
>> actually very close. I think it depends on how the data is actually
>> packed into the pack file. There is also a noticeable delay in the sftp
>> case while establishing the connection at the beginning of the
>> transaction (up to 5 seconds, if not more).
>
> The numbers will depend a lot on your latency versus bandwidth. I also
> haven't actually looked at your plugin yet.
I hope I can assume that my ftp vs sftp access to the same server over
the same network has roughly the same latency/bandwidth.
The main part of my plugin is here:
http://bazaar.launchpad.net/~bialix/%2Bjunk/kftp/annotate/head:/cached_ftp.py
It wraps several methods of FtpTransport to insert the cache.
Actually it seems that creating a protocol decorator a la "cached+ftp://"
would be better, but the current implementation is mostly a quick hack and
a proof of concept.
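As an aside, the "cached+ftp://" decorator idea mainly needs a way to peel the decorator name off the URL before dispatching to the underlying transport. A rough sketch of that split (a hypothetical helper, not bzrlib's actual registration API):

```python
def split_decorated_url(url):
    """Split 'cached+ftp://host/path' into ('cached', 'ftp://host/path').

    Returns (None, url) when the URL carries no decorator prefix.
    """
    decorator, sep, rest = url.partition('+')
    # A real scheme like 'ftp://...' contains '://' before any '+',
    # so only treat the prefix as a decorator when it is a bare word.
    if sep and '://' not in decorator:
        return decorator, rest
    return None, url
```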
>>> If you do the cache invalidation, I think we would be fine bringing that
>>> into bzr core.
>> I'll prepare a patch then.
>>
>> One question: do you think it's OK to explicitly cache only pack files
>> (I'm checking the file extension now)? It seems like bad practice to
>> bring knowledge about packs into the transport layer. Another way may
>> be to cache only big files, greater than some threshold, e.g. 64KiB or
>> so, so that we won't cache lock info files, branch.conf, last-revision
>> and pack-names.
>>
>
> The things you want to cache are accessed via 'readv' rather than via
> 'get' directly. I believe the FTP transport doesn't actually implement
> readv, so it falls back to the default behavior, which calls
> get + seek + read (which works for the Local version, but not so well
> for everything else).
I'll take a look.
If I understand correctly, I should implement a _readv method for
FtpTransport to override the following default implementation:

    def _readv(self, relpath, offsets):
        """Get parts of the file at the given relative path.

        :param relpath: The path to read.
        :param offsets: A list of (offset, size) tuples.
        :return: A list or generator of (offset, data) tuples
        """
        if not offsets:
            return
        fp = self.get(relpath)
        return self._seek_and_read(fp, offsets, relpath)
As I can see, this implementation uses transport.get().
Does bzrlib always use readv and never get? Or is readv used only for
partial reads from pack/index files? In the latter case I should indeed
insert caching into _readv(), and not into get().
Do I understand this correctly?
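If readv() turns out to be the right hook, the caching version could look roughly like this. A sketch under that assumption, not bzrlib's real class hierarchy; FakeBacking is just an illustration stand-in for the network transport:

```python
class FakeBacking(object):
    """Stand-in for a real FTP transport; counts full-file downloads."""

    def __init__(self, data):
        self.data = data
        self.downloads = 0

    def get_bytes(self, relpath):
        self.downloads += 1
        return self.data


class CachedReadvTransport(object):
    """Serve readv() requests from a per-path full-file cache."""

    def __init__(self, backing):
        self._backing = backing
        self._cache = {}   # relpath -> bytes

    def readv(self, relpath, offsets):
        """Return a list of (offset, data) for each (offset, size) pair.

        The file is downloaded at most once; repeated readv() calls on
        the same path cost no further network round trips.
        """
        if relpath not in self._cache:
            self._cache[relpath] = self._backing.get_bytes(relpath)
        content = self._cache[relpath]
        return [(offset, content[offset:offset + size])
                for offset, size in offsets]
```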
> If you just implemented readv() and then only cached requests coming
> from there, that would work pretty well.
> Note also that you'll want to check into the append code a bit, too.
> Sometimes we'll readv() from something we just appended to.
The append method invalidates the cache at the moment.
> I think that would be a better option than trying to find a heuristic
> based on file size.
Does "that" mean "readv"?
Thank you.
--
All the dude wanted was his rug back