John Arbash Meinel writes:
> On 8/25/2010 12:49 AM, Alexander Belchenko wrote:
>> John Arbash Meinel writes:
>>> On 8/24/2010 4:02 PM, Alexander Belchenko wrote:
>>>> I've coded a simple cached FTP transport on top of the standard FTP
>>>> transport in bzrlib. The plugin is available at lp:~bialix/+junk/kftp,
>>>> and you need to use a kftp://host/path URL instead of ftp://host/path.
>>>> It works well and indeed reduces the time for pull and even push. I
>>>> know it's ad hoc, but it's better than nothing at all. I think it
>>>> should be safe to cache packs/indices because they're supposed to be
>>>> write-once data.
>>>
>>> I think you can arguably cache any downloaded content for the lifetime
>>> of the lock in the outer scope, as long as you take care of invalidating
>>> cached data that you just uploaded. (So if you have something cached by
>>> 'get' or 'readv()', it should be invalidated/updated by a 'put' call.)
>>
>> Yes, I'm invalidating the cache on put, and on rename/move calls. But
>> looking at the logs it seems only put matters. Also looking at the logs,
>> it seems all I need is to cache pack files, because bzr doesn't seem to
>> request indices more than once.
>
> It will really depend on the action being performed. "bzr log" over the
> whole history will make repeated passes over the .rix files. Etc.

OK, that makes sense.

>>> Note that this will still be significantly more inefficient than sftp,
>>> because you aren't actually requesting a subset. You are just avoiding
>>> requesting the same content twice. (So if you request 10 bytes 10 times
>>> in a 1MB file, you'll still download 1MB, but that is certainly better
>>> than 10MB.)
>>
>> Based on my measurements, cached ftp and sftp now take roughly the same
>> time for a full pull. For a partial pull of 1 revision (as in the
>> initial bug report) it now takes 53 seconds instead of 4 minutes, and
>> an sftp partial pull of the same data took 62 seconds for me. So they're
>> actually very close. I think it depends on how the data is actually
>> packed into the pack file. Also, in the sftp case there is a noticeable
>> delay to establish the connection at the beginning of the transaction
>> (up to 5 seconds, if not more).
>
> The numbers will depend a lot on your latency versus bandwidth. I also
> haven't actually looked at your plugin yet.

I hope I can assume that my ftp vs. sftp access to the same server over
the same network has roughly the same latency/bandwidth.

The main part of my plugin is here:
http://bazaar.launchpad.net/~bialix/%2Bjunk/kftp/annotate/head:/cached_ftp.py

It wraps several methods of FtpTransport to insert a cache. Actually, it
seems that creating a protocol decorator à la "cached+ftp://" would be
better, but the current implementation is mostly a quick hack and a proof
of concept.

>>> If you do the cache invalidation, I think we would be fine bringing that
>>> into bzr core.
>>
>> I'll prepare a patch then.
>>
>> One question: do you think it's OK to explicitly cache only pack files
>> (I'm checking the file extension now)? It seems like bad practice to
>> bring knowledge about packs into the transport layer. Another way may
>> be to cache only big files, greater than some threshold, e.g. 64KiB or
>> so, so we won't cache lock info files, branch.conf, last-revision and
>> pack-names.
>
> The things you want to cache are accessed via 'readv' rather than via
> 'get' directly. I believe the FTP transport doesn't actually implement
> readv, so it falls back to the default behavior, which calls
> get + seek + read (which works for the Local version, not so well for
> everything else).
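For reference, here is a stripped-down sketch of what the plugin does
right now: a whole-file cache in front of get(), invalidated on every
write-type call. The real code is in cached_ftp.py linked above; the
class and helper names here are illustrative, and the signatures assume
bzrlib's usual Transport API rather than copying the plugin exactly:

    from cStringIO import StringIO

    from bzrlib.transport.ftp import FtpTransport


    class CachedFtpTransport(FtpTransport):
        """FtpTransport that remembers whole files it has already downloaded.

        Write-type calls drop the cached copy, so a later get()/readv()
        never sees stale data (the invalidation discussed above).
        """

        def __init__(self, base, _from_transport=None):
            super(CachedFtpTransport, self).__init__(base, _from_transport)
            self._file_cache = {}  # relpath -> raw file content

        def _get_cached_bytes(self, relpath):
            # Download the whole file once, serve later reads from memory.
            if relpath not in self._file_cache:
                self._file_cache[relpath] = FtpTransport.get(self, relpath).read()
            return self._file_cache[relpath]

        def get(self, relpath):
            return StringIO(self._get_cached_bytes(relpath))

        def put_file(self, relpath, f, mode=None):
            self._file_cache.pop(relpath, None)  # invalidate on upload
            return FtpTransport.put_file(self, relpath, f, mode)

        def append_file(self, relpath, f, mode=None):
            self._file_cache.pop(relpath, None)  # readv may follow an append
            return FtpTransport.append_file(self, relpath, f, mode)

        def rename(self, rel_from, rel_to):
            # move() is handled the same way in the real plugin.
            self._file_cache.pop(rel_from, None)
            self._file_cache.pop(rel_to, None)
            return FtpTransport.rename(self, rel_from, rel_to)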
I'll take a look. If I understand correctly, I should implement a _readv
method for FtpTransport to override the following default implementation:

    def _readv(self, relpath, offsets):
        """Get parts of the file at the given relative path.

        :param relpath: The path to read.
        :param offsets: A list of (offset, size) tuples.
        :return: A list or generator of (offset, data) tuples
        """
        if not offsets:
            return
        fp = self.get(relpath)
        return self._seek_and_read(fp, offsets, relpath)

This implementation uses transport.get(), as I can see. Does bzrlib
always use readv and never get? Or is readv used only for partial reads
from pack/index files? In the latter case I should indeed insert the
caching in _readv(), and not in get(). Do I understand this correctly?

> If you just implemented readv() and then only cached requests coming
> from there, that would work pretty well.
>
> Note also that you'll want to check into the append code a bit, too.
> Sometimes we'll readv() from something we just appended to.

The append method invalidates the cache at the moment.

> I think that would be a better option than trying to find a heuristic
> based on file size.

Does "that" mean "readv"?

Thank you.

--
All the dude wanted was his rug back
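P.S. For concreteness, the _readv override I have in mind would look
roughly like this. It reuses the hypothetical whole-file caching helper
(_get_cached_bytes) from the sketch earlier in the thread and is
untested, just to show the direction:

    from cStringIO import StringIO

    def _readv(self, relpath, offsets):
        """Like the default _readv, but read from a locally cached copy.

        FTP still downloads the whole file once, but repeated readv()
        calls against the same pack/index file no longer hit the network.
        """
        if not offsets:
            return
        fp = StringIO(self._get_cached_bytes(relpath))
        return self._seek_and_read(fp, offsets, relpath)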