push/pull operations read too much data
Bug #88319 reported by
Alexander Belchenko
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Bazaar |
Fix Released
|
High
|
Robert Collins |
Bug Description
Push/pull performs set difference operations on the repository revisions list rather than graph-walking only the missing-from-tip data.
Changed in bzr: | |
importance: | Undecided → High |
Changed in bzr: | |
milestone: | none → 0.92 |
status: | Fix Committed → Fix Released |
To post a comment you must log in.
Alexander Belchenko wrote:
> Public bug reported:
...
>
> I think bzr should lazily reads kndx files from the end by small
> portions (0.5KB-10KB) and then construct graphs in memory. Reading data
> from the end allow bzr to be optimal in the case of push/pull, because
> in this case usually only latest revisions fetched. Only in this case
> bzr could reduce amount of data received from the remote server.
I agree that we are inefficient. We are actually addressing it and more
by having a smart server. There are possibilities for only reading part
of knit indexes, but I recall there are problems between this and the
dictionary compression that we do.
Specifically, each line is stored as:
revision_id info ancestor_num :
The problem is that 'ancestor_num' is a pointer to the record that is
the ancestor. And without reading the whole index, you don't know what
you are mapping to.
There are some ways to infer things if you have a local knit with
similar information, and you are sure that the information would be
identical. So you need to read for revisions you don't know, and then
read until you get to their parents, and a little bit farther until you
get to the grand-parents so that you can make sure you understand the
reference.
For example:
rev-1 ... : # no parents
rev-2 ... 0 : # parent == rev-1
rev-3 ... 1 : # parent == rev-2
rev-4 ... 1 : # parent == rev-2
rev-5 ... 2 3 : # parents == rev-3, rev-4
At this point, you might have rev-1, rev-2, and rev-4 locally in your
knit, and then you read rev-5. You don't know what 2 and 3 are yet. So
you read a bit more and find rev-4 and rev-3, but you still don't know
that they are at offset 2 and 3 because you don't know how many lines
the remote file has. So you read back to rev-2. At which point you can
see that "rev-4" (which you know has a parent of rev-2) points to an
offset of "1". So you know that rev-2 is located at offset 1, so rev-3
is offset 2, etc.
This is complicated, though, because the format allows you to append new
records for the same revision, but the pointer continues to point to the
first entry. This is because if you have ghosts, you may fill them in
later, so you need to be able to add a new line indicating the new parents.
...
All this to say, it is pretty difficult to correctly read just from the
end of our current format. There are tricks that you can do, but it is
better to just get the smart server working, which can do things that
you would never be able to do with plain file operations.
There are other formats which would work a little bit better. For
example, you could explicitly list the parent revision id, rather than
using a number. But this adds a lot of redundancy in the file, making it
a lot bigger.
Another possibility is to add line numbers to each record, so that you have:
1 rev-1 .... :
2 rev-2 .... :
And then you have an easy way to read where you are in the file. The
downside of this is when you have to handle a write that was truncated
because of a network disconnection part-way through. So you might have:
1 rev-1 ... :
2 rev-2 ..
2 rev-3 ... :
Note that '2' doesn't end in a ':' which is our end-of-record indicator.
(Our start-of-record is a...