Bug #88319 “push/pull operations read too much data” : Bugs : Bazaar

Alexander Belchenko (bialix) on 2007-02-27

Changed in bzr:
importance:	Undecided → High

Revision history for this message

John A Meinel (jameinel) wrote on 2007-02-27: Re: [Bug 88319] push/pull operations read too much data

#1

Download full text (3.6 KiB)

Alexander Belchenko wrote:
> Public bug reported:
...

>
> I think bzr should lazily reads kndx files from the end by small
> portions (0.5KB-10KB) and then construct graphs in memory. Reading data
> from the end allow bzr to be optimal in the case of push/pull, because
> in this case usually only latest revisions fetched. Only in this case
> bzr could reduce amount of data received from the remote server.

I agree that we are inefficient. We are actually addressing it and more
by having a smart server. There are possibilities for only reading part
of knit indexes, but I recall there are problems between this and the
dictionary compression that we do.

Specifically, each line is stored as:

revision_id info ancestor_num :

The problem is that 'ancestor_num' is a pointer to the record that is
the ancestor. And without reading the whole index, you don't know what
you are mapping to.

There are some ways to infer things if you have a local knit with
similar information, and you are sure that the information would be
identical. So you need to read for revisions you don't know, and then
read until you get to their parents, and a little bit farther until you
get to the grand-parents so that you can make sure you understand the
reference.

For example:

rev-1 ... : # no parents
rev-2 ... 0 : # parent == rev-1
rev-3 ... 1 : # parent == rev-2
rev-4 ... 1 : # parent == rev-2
rev-5 ... 2 3 : # parents == rev-3, rev-4

At this point, you might have rev-1, rev-2, and rev-4 locally in your
knit, and then you read rev-5. You don't know what 2 and 3 are yet. So
you read a bit more and find rev-4 and rev-3, but you still don't know
that they are at offset 2 and 3 because you don't know how many lines
the remote file has. So you read back to rev-2. At which point you can
see that "rev-4" (which you know has a parent of rev-2) points to an
offset of "1". So you know that rev-2 is located at offset 1, so rev-3
is offset 2, etc.

This is complicated, though, because the format allows you to append new
records for the same revision, but the pointer continues to point to the
first entry. This is because if you have ghosts, you may fill them in
later, so you need to be able to add a new line indicating the new parents.

...

All this to say, it is pretty difficult to correctly read just from the
end of our current format. There are tricks that you can do, but it is
better to just get the smart server working, which can do things that
you would never be able to do with plain file operations.

There are other formats which would work a little bit better. For
example, you could explicitly list the parent revision id, rather than
using a number. But this adds a lot of redundancy in the file, making it
a lot bigger.

Another possibility is to add line numbers to each record, so that you have:

1 rev-1 .... :
2 rev-2 .... :

And then you have an easy way to read where you are in the file. The
downside of this is when you have to handle a write that was truncated
because of a network disconnection part-way through. So you might have:

1 rev-1 ... :
2 rev-2 ..
2 rev-3 ... :

Note that '2' doesn't end in a ':' which is our end-of-record indicator.
(Our start-of-record is a...

Alexander Belchenko wrote:
> Public bug reported:
...

> 
> I think bzr should lazily reads kndx files from the end by small
> portions (0.5KB-10KB) and then construct graphs in memory. Reading data
> from the end allow bzr to be optimal in the case of push/pull, because
> in this case usually only latest revisions fetched. Only in this case
> bzr could reduce amount of data received from the remote server.

I agree that we are inefficient. We are actually addressing it and more
by having a smart server. There are possibilities for only reading part
of knit indexes, but I recall there are problems between this and the
dictionary compression that we do.

Specifically, each line is stored as:

revision_id info ancestor_num :

The problem is that 'ancestor_num' is a pointer to the record that is
the ancestor. And without reading the whole index, you don't know what
you are mapping to.

There are some ways to infer things if you have a local knit with
similar information, and you are sure that the information would be
identical. So you need to read for revisions you don't know, and then
read until you get to their parents, and a little bit farther until you
get to the grand-parents so that you can make sure you understand the
reference.

For example:

rev-1 ... : # no parents
rev-2 ... 0 : # parent == rev-1
rev-3 ... 1 : # parent == rev-2
rev-4 ... 1 : # parent == rev-2
rev-5 ... 2 3 : # parents == rev-3, rev-4

At this point, you might have rev-1, rev-2, and rev-4 locally in your
knit, and then you read rev-5. You don't know what 2 and 3 are yet. So
you read a bit more and find rev-4 and rev-3, but you still don't know
that they are at offset 2 and 3 because you don't know how many lines
the remote file has. So you read back to rev-2. At which point you can
see that "rev-4" (which you know has a parent of rev-2) points to an
offset of "1". So you know that rev-2 is located at offset 1, so rev-3
is offset 2, etc.

This is complicated, though, because the format allows you to append new
records for the same revision, but the pointer continues to point to the
first entry. This is because if you have ghosts, you may fill them in
later, so you need to be able to add a new line indicating the new parents.

...

All this to say, it is pretty difficult to correctly read just from the
end of our current format. There are tricks that you can do, but it is
better to just get the smart server working, which can do things that
you would never be able to do with plain file operations.

There are other formats which would work a little bit better. For
example, you could explicitly list the parent revision id, rather than
using a number. But this adds a lot of redundancy in the file, making it
a lot bigger.

Another possibility is to add line numbers to each record, so that you have:

1 rev-1 .... :
2 rev-2 .... :

And then you have an easy way to read where you are in the file. The
downside of this is when you have to handle a write that was truncated
because of a network disconnection part-way through. So you might have:

1 rev-1 ... :
2 rev-2 ..
2 rev-3 ... :

Note that '2' doesn't end in a ':' which is our end-of-record indicator.
(Our start-of-record is a newline).

Should that be 2 rev-3 (which it would be in our current algorithm,
because we just ignore the 2 rev-2 line since it is incomplete.).

This would be a little less expensive than storing the full revision id
for parents.

But doing partial reads easily would require a new format. And if I had
to recommend somewhere to spend effort, it would be to get the smart
server running, so that we don't have to read *any* index files.

John
=:->

Revision history for this message

John A Meinel (jameinel) wrote on 2007-04-30:

#2

This is currently being dealt with with the hpss branch.

bzr-0.16 will have some small improvements to this, but also includes all the infrastructure to make large improvements possible.

Changed in bzr:
status:	Unconfirmed → Confirmed

Revision history for this message

Robert Collins (lifeless) wrote on 2007-10-30:

#3

The core fix for this needed a new disk format, and packs provide that. The actual logic for fetch for this is fixed in my fetch branch - merge coming up soon.

Changed in bzr:
assignee:	nobody → lifeless
status:	Confirmed → Fix Committed
description:	updated

Robert Collins (lifeless) on 2007-11-12

Changed in bzr:
milestone:	none → 0.92
status:	Fix Committed → Fix Released

Bazaar

push/pull operations read too much data

Bug Description

Other bug subscribers

Remote bug watches