push/pull operations read too much data

Bug #88319 reported by Alexander Belchenko
2
Affects Status Importance Assigned to Milestone
Bazaar
Fix Released
High
Robert Collins

Bug Description

Push/pull performs set difference operations on the repository revisions list rather than graph-walking only the missing-from-tip data.

Changed in bzr:
importance: Undecided → High
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 88319] push/pull operations read too much data
Download full text (3.6 KiB)

Alexander Belchenko wrote:
> Public bug reported:
...

>
> I think bzr should lazily reads kndx files from the end by small
> portions (0.5KB-10KB) and then construct graphs in memory. Reading data
> from the end allow bzr to be optimal in the case of push/pull, because
> in this case usually only latest revisions fetched. Only in this case
> bzr could reduce amount of data received from the remote server.

I agree that we are inefficient. We are actually addressing it and more
by having a smart server. There are possibilities for only reading part
of knit indexes, but I recall there are problems between this and the
dictionary compression that we do.

Specifically, each line is stored as:

revision_id info ancestor_num :

The problem is that 'ancestor_num' is a pointer to the record that is
the ancestor. And without reading the whole index, you don't know what
you are mapping to.

There are some ways to infer things if you have a local knit with
similar information, and you are sure that the information would be
identical. So you need to read for revisions you don't know, and then
read until you get to their parents, and a little bit farther until you
get to the grand-parents so that you can make sure you understand the
reference.

For example:

rev-1 ... : # no parents
rev-2 ... 0 : # parent == rev-1
rev-3 ... 1 : # parent == rev-2
rev-4 ... 1 : # parent == rev-2
rev-5 ... 2 3 : # parents == rev-3, rev-4

At this point, you might have rev-1, rev-2, and rev-4 locally in your
knit, and then you read rev-5. You don't know what 2 and 3 are yet. So
you read a bit more and find rev-4 and rev-3, but you still don't know
that they are at offset 2 and 3 because you don't know how many lines
the remote file has. So you read back to rev-2. At which point you can
see that "rev-4" (which you know has a parent of rev-2) points to an
offset of "1". So you know that rev-2 is located at offset 1, so rev-3
is offset 2, etc.

This is complicated, though, because the format allows you to append new
records for the same revision, but the pointer continues to point to the
first entry. This is because if you have ghosts, you may fill them in
later, so you need to be able to add a new line indicating the new parents.

...

All this to say, it is pretty difficult to correctly read just from the
end of our current format. There are tricks that you can do, but it is
better to just get the smart server working, which can do things that
you would never be able to do with plain file operations.

There are other formats which would work a little bit better. For
example, you could explicitly list the parent revision id, rather than
using a number. But this adds a lot of redundancy in the file, making it
a lot bigger.

Another possibility is to add line numbers to each record, so that you have:

1 rev-1 .... :
2 rev-2 .... :

And then you have an easy way to read where you are in the file. The
downside of this is when you have to handle a write that was truncated
because of a network disconnection part-way through. So you might have:

1 rev-1 ... :
2 rev-2 ..
2 rev-3 ... :

Note that '2' doesn't end in a ':' which is our end-of-record indicator.
(Our start-of-record is a...

Read more...

Revision history for this message
John A Meinel (jameinel) wrote :

This is currently being dealt with with the hpss branch.

bzr-0.16 will have some small improvements to this, but also includes all the infrastructure to make large improvements possible.

Changed in bzr:
status: Unconfirmed → Confirmed
Revision history for this message
Robert Collins (lifeless) wrote :

The core fix for this needed a new disk format, and packs provide that. The actual logic for fetch for this is fixed in my fetch branch - merge coming up soon.

Changed in bzr:
assignee: nobody → lifeless
status: Confirmed → Fix Committed
description: updated
Changed in bzr:
milestone: none → 0.92
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.