Robert Collins wrote:
> On Mon, 2009-08-31 at 20:25 +0000, John A Meinel wrote:
>>
>> I think we should explicitly test it again. IIRC it is non trivial. I
>> don't remember how long 'bzr pack' takes in a Launchpad tree, but that
>> is exactly what you would be doing on the initial branch. Which seems
>> very wasteful if the source is already mostly/completely packed.
>
> doing a full pack took 11m35 on a branch of db-devel a couple of weeks
> old.

So you and I seem to have very different thresholds for what is
reasonable.

$ time bzr branch bzr-2a/bzr.dev copy
real    0m59.077s

$ time bzr pack bzr-2a
real    2m1.454s

So it takes 2x longer to repack all the content than it did to just
copy it. By that estimate, it would take 3m+ to do a 'bzr branch'
outside of a shared repo (or 3x longer than it currently does).

Now yes, if you are only doing this sort of thing over a remote
connection, the time is dominated by the network transfer. However, I
can currently get 250KB/s download, and a fully packed bzr is 37MB, so
that would only take 37*1024/250/60 = 2m30s to download. The pack time
is therefore slightly longer than the download time. If the two could
be done perfectly in parallel, the increase in total time would be
small; my guess, though, is that it would be a fair amount longer (at
least 1 minute, possibly the full 2 minutes).

>
>> I *would* probably like to extract all the texts and check their
>> sha1sums. But you can extract texts at about 400MB/s+, and sha1sum at
>> about 60MB/s. Compression is much much slower than that.
>
> yup. Its just that I've looked at the code and the problem is the
> heuristic about whether to make a new group needs the text size, but to
> get the text size we extract the text.

I'm not really sure why that would be necessary. I suppose there are
lots of heuristics one could use. Certainly, if we strictly needed the
size, we could put that data into the Factory object, since we have it
in the index. Probably the big loss is that it would change the
serialization on the wire for the Remote requests.

>
> I'm going to do some measurements: just extracting always would be the
> _simplest_ way forward, also solve the fragmentation thing better than
> unordered does (because its still not combining separate pushes to the
> source repo), and we can land reuse in a more polished form later.
>
> -Rob
>

I'm not very concerned about 'extracting always'. I'm concerned about
'compressing always'... :)

I just got a new laptop, and these are the times that I see:

TIMEIT -s "b = Branch.open('bzr-2a-test/bzr.dev'); r = b.repository;
r.lock_read()
k = r.texts.keys()
r.unlock();
" "r.lock_read()
nbytes = 0
for record in r.texts.get_record_stream(k, 'unordered', True):
    nbytes += len(record.get_bytes_as('fulltext'))
r.unlock()
print nbytes
"

This says I'm decompressing 2,820,077,108 bytes (2.6GB), and it is
taking 4.43s/loop, which is 607MB/s. I believe the above construction
also includes the time to read the indexes, etc., but I won't
guarantee that.

'openssl speed sha1' says:

type             8192 bytes
sha1             362010.20k

or 362MB/s to compute the sha1 of data.

So it is taking roughly 2 minutes to compress everything, but we can
*extract* all of that content in about 4s. (Not an exact comparison,
because 'bzr pack' also has to handle the inventory/chk/etc. content.)

If I change the above loop to use:

  osutils.sha_string(record.get_bytes_as('fulltext'))

the time changes to 12s/loop, or 224MB/s.
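
In case anyone wants to reproduce those numbers outside my TIMEIT
wrapper, the loop above expands to roughly the following standalone
script. This is only a sketch: the branch path is the same local test
branch as above, and the one-shot time.time() timing is cruder than
timeit's best-of-N, so expect slightly noisier numbers.

# Rough standalone version of the timing loop above (Python 2 / bzrlib).
import time

from bzrlib import osutils
from bzrlib.branch import Branch

b = Branch.open('bzr-2a-test/bzr.dev')   # local test branch; adjust to taste
r = b.repository
r.lock_read()
try:
    keys = r.texts.keys()          # every (file_id, revision) text key
    start = time.time()
    nbytes = 0
    for record in r.texts.get_record_stream(keys, 'unordered', True):
        text = record.get_bytes_as('fulltext')
        nbytes += len(text)
        # Uncomment to add the sha1 cost measured above (~3x slower overall):
        # osutils.sha_string(text)
    elapsed = time.time() - start
finally:
    r.unlock()
print '%d bytes in %.2fs (%.1f MB/s)' % (
    nbytes, elapsed, nbytes / elapsed / (1024 * 1024))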
(That 224MB/s is about right when you consider that sha1 is roughly
1/2 the speed of decompression, so you pay for 1 decompression plus
roughly 2 decompressions' worth of sha1, or about 3x the original
time.)

Still, extracting and sha1summing everything is an order of magnitude
faster than the 120s it takes to compress it all.

I would really *like* it if we would extract all the content and make
sure the (file_id, revision) => sha1sum from the inventory matches the
content before it is put into the repository. The reason we don't do
that now is layering: you have to extract the records from the
inventory stream and then keep them around somewhere so that you can
validate the texts stream, and we didn't agree on where that state
should be kept and how it should be passed around. And if we are
already doing that, then you would have your size information, etc.

John
=:->
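
P.S. To make the layering point concrete, the check I'd like amounts to
something like the sketch below, written here as a post-hoc pass over
an existing repository purely for illustration (in real fetch code the
inventory-derived sha1s would have to be threaded through to wherever
the texts stream gets inserted, which is exactly the state-passing
problem above). It uses the normal inventory APIs (iter_inventories,
entry.text_sha1) and the same example branch path as before.

# Sketch: verify each stored text against the sha1 recorded in the
# inventories.  Post-hoc form only; the real check belongs in fetch.
from bzrlib import osutils
from bzrlib.branch import Branch

b = Branch.open('bzr-2a-test/bzr.dev')   # example path
r = b.repository
r.lock_read()
try:
    # Build the (file_id, revision) => sha1 map from the inventories.
    expected = {}
    for inv in r.iter_inventories(r.all_revision_ids()):
        for path, entry in inv.iter_entries():
            if entry.kind == 'file':
                expected[(entry.file_id, entry.revision)] = entry.text_sha1
    # Stream the texts and compare each against the inventory's sha1.
    mismatches = 0
    for record in r.texts.get_record_stream(expected.keys(), 'unordered', True):
        sha1 = osutils.sha_string(record.get_bytes_as('fulltext'))
        if sha1 != expected[record.key]:
            mismatches += 1
            print 'sha1 mismatch for %r' % (record.key,)
finally:
    r.unlock()
print '%d texts checked, %d mismatches' % (len(expected), mismatches)

A pass like this also has the fulltext in hand, so the size information
Rob's heuristic wants would come for free.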