Robert Collins wrote:
> On Mon, 2009-08-31 at 20:25 +0000, John A Meinel wrote:
>>
>> I think we should explicitly test it again. IIRC it is non trivial. I
>> don't remember how long 'bzr pack' takes in a Launchpad tree, but that
>> is exactly what you would be doing on the initial branch. Which seems
>> very wasteful if the source is already mostly/completely packed.
>
> doing a full pack took 11m35 on a branch of db-devel a couple of weeks
> old.

So you and I seem to have very different thresholds for what is
reasonable.

$ time bzr branch bzr-2a/bzr.dev copy
real    0m59.077s

$ time bzr pack bzr-2a
real    2m1.454s

So it takes 2x longer to repack all the content than it did to just
copy it. By that estimate, it would take 3m+ to do a 'bzr branch'
outside of a shared repo (or 3x longer than it currently does).

Now yes, if you are only doing this sort of thing over a remote
connection, the time is dominated by the network transfer. However, I
can currently get 250KB/s download, and a fully packed bzr is 37MB, so
that would only take 37*1024/250/60 = 2m30s to download. The pack time
is therefore slightly longer than the download time. If the two could
be done perfectly in parallel, the increase in total time would be
small; my guess, though, is that it would be a fair amount longer (at
least 1 minute, possibly the full 2 minutes).

>
>> I *would* probably like to extract all the texts and check their
>> sha1sums. But you can extract texts at about 400MB/s+, and sha1sum at
>> about 60MB/s. Compression is much much slower than that.
>
> yup. Its just that I've looked at the code and the problem is the
> heuristic about whether to make a new group needs the text size, but to
> get the text size we extract the text.

I'm not really sure why that would be necessary. I suppose there are
lots of heuristics one could use. Certainly, if we strictly needed the
size, we could put that data into the Factory object, since we have it
in the index. Probably the big loss is that it would change the
serialization on the wire for the Remote requests.

>
> I'm going to do some measurements: just extracting always would be the
> _simplest_ way forward, also solve the fragmentation thing better than
> unordered does (because its still not combining separate pushes to the
> source repo), and we can land reuse in a more polished form later.
>
> -Rob
>

I'm not very concerned about 'extracting always'. I'm concerned about
'compressing always'... :)

I just got a new laptop, and these are the times that I see:

TIMEIT -s "b = Branch.open('bzr-2a-test/bzr.dev'); r = b.repository;
r.lock_read()
k = r.texts.keys()
r.unlock();
" "r.lock_read()
nbytes = 0
for record in r.texts.get_record_stream(k, 'unordered', True):
    nbytes += len(record.get_bytes_as('fulltext'))
r.unlock()
print nbytes
"

This says I'm decompressing 2,820,077,108 bytes (2.6GB), and it is
taking 4.43s/loop, which is 607MB/s. I believe the above construction
also includes the time to read the indexes, etc., but I won't
guarantee that.

'openssl speed sha1' says:

type             8192 bytes
sha1             362010.20k

or 362MB/s to compute the sha1 of data.

So it is taking roughly 2 minutes to compress everything, but we can
*extract* all of that content in about 4s. (Not an exact comparison,
because 'bzr pack' also has to handle the inventory/chk/etc. content.)

If I change the above loop to use:

  osutils.sha_string(record.get_bytes_as('fulltext'))

the time changes to 12s/loop, or 224MB/s.
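
In case anyone wants to reproduce those numbers outside my TIMEIT
wrapper, the loop above expands to roughly the following standalone
script. This is only a sketch: the branch path is the same local test
branch as above, and the one-shot time.time() timing is cruder than
timeit's best-of-N, so expect slightly noisier numbers.

# Rough standalone version of the timing loop above (Python 2 / bzrlib).
import time

from bzrlib import osutils
from bzrlib.branch import Branch

b = Branch.open('bzr-2a-test/bzr.dev')   # local test branch; adjust to taste
r = b.repository
r.lock_read()
try:
    keys = r.texts.keys()          # every (file_id, revision) text key
    start = time.time()
    nbytes = 0
    for record in r.texts.get_record_stream(keys, 'unordered', True):
        text = record.get_bytes_as('fulltext')
        nbytes += len(text)
        # Uncomment to add the sha1 cost measured above (~3x slower overall):
        # osutils.sha_string(text)
    elapsed = time.time() - start
finally:
    r.unlock()
print '%d bytes in %.2fs (%.1f MB/s)' % (
    nbytes, elapsed, nbytes / elapsed / (1024 * 1024))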
(That 224MB/s is about right when you consider that sha1 is roughly
1/2 the speed of decompression, so you pay for 1 decompression plus
roughly 2 decompressions' worth of sha1, or about 3x the original
time.)

Still, extracting and sha1summing everything is an order of magnitude
faster than the 120s it takes to compress it all.

I would really *like* it if we would extract all the content and make
sure the (file_id, revision) => sha1sum from the inventory matches the
content before it is put into the repository. The reason we don't do
that now is layering: you have to extract the records from the
inventory stream and then keep them around somewhere so that you can
validate the texts stream, and we didn't agree on where that state
should be kept and how it should be passed around. And if we are
already doing that, then you would have your size information, etc.

John
=:->
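
P.S. To make the layering point concrete, the check I'd like amounts to
something like the sketch below, written here as a post-hoc pass over
an existing repository purely for illustration (in real fetch code the
inventory-derived sha1s would have to be threaded through to wherever
the texts stream gets inserted, which is exactly the state-passing
problem above). It uses the normal inventory APIs (iter_inventories,
entry.text_sha1) and the same example branch path as before.

# Sketch: verify each stored text against the sha1 recorded in the
# inventories.  Post-hoc form only; the real check belongs in fetch.
from bzrlib import osutils
from bzrlib.branch import Branch

b = Branch.open('bzr-2a-test/bzr.dev')   # example path
r = b.repository
r.lock_read()
try:
    # Build the (file_id, revision) => sha1 map from the inventories.
    expected = {}
    for inv in r.iter_inventories(r.all_revision_ids()):
        for path, entry in inv.iter_entries():
            if entry.kind == 'file':
                expected[(entry.file_id, entry.revision)] = entry.text_sha1
    # Stream the texts and compare each against the inventory's sha1.
    mismatches = 0
    for record in r.texts.get_record_stream(expected.keys(), 'unordered', True):
        sha1 = osutils.sha_string(record.get_bytes_as('fulltext'))
        if sha1 != expected[record.key]:
            mismatches += 1
            print 'sha1 mismatch for %r' % (record.key,)
finally:
    r.unlock()
print '%d texts checked, %d mismatches' % (len(expected), mismatches)

A pass like this also has the fulltext in hand, so the size information
Rob's heuristic wants would come for free.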