Comment 5 for bug 721250

John A Meinel (jameinel) wrote:

I have some experimental code in the associated branch; it doesn't actually touch 'bzr log' yet. The results are pretty promising, but it still needs a fair amount of work to get going.

The experimental code is an initial building block: finding "what paths are children of XXX", and doing so as efficiently as I can think of.

Having this information, you can then do 'iter_changes()' on each rev and filter the result against the set of file_ids. An alternative would be to pass something like this into iter_changes() so that it could do the filtering itself, but given the way the data is stored, I'm not really sure that would be a win.
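
For concreteness, here is a minimal sketch of that filtering loop. It assumes children_ids is the precomputed set of interesting file_ids for this tree shape, and relies on iter_changes() yielding the file_id as the first element of each change tuple:

    def iter_interesting_changes(rev_tree, basis_tree, children_ids):
        """Yield only the changes whose file_id is under the search path."""
        for change in rev_tree.iter_changes(basis_tree):
            file_id = change[0]  # iter_changes yields the file_id first
            if file_id in children_ids:
                yield change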

Anyway, I also wrote a trivial script to test the performance, and the results look pretty good as a starting point. Specifically, to find all file_ids that are children of 'bzrlib', across 2000 revisions:
 54.7s py time_delta_search.py --num-revs 2000 --filter
 21.6s py time_delta_search.py --num-revs 2000 --no-cache
  3.3s py time_delta_search.py --num-revs 2000

There are 2 basic ideas at play:

1) If the parent_id_basename_file_id map hasn't changed, then we can assume the tree shape is identical, and thus avoid recomputing the information for every revision. According to the above numbers, the tree shape changes in only about 1 in 6 revs, so we skip roughly 5/6ths of the computation (see the sketch after this list).
2) Not using InventoryEntry objects. The way Inventory.filter() works, it has to realize the raw Inventory, pull out all the bits it cares about, and then build up a new Inventory object. There is no shared cache, so it has to create a new InventoryEntry for every child of the search path, for every revision, whether anything has changed or not.
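
As a minimal sketch of idea (1) (which also reflects idea (2), since it works on the raw map rather than on InventoryEntry objects), assuming the CHK inventory exposes its parent_id,basename->file_id map with a key() method, and with compute_children() standing in for the actual recursive-children search:

    def iter_children_per_rev(rev_trees, search_file_id):
        """Yield (rev_tree, children_set), recomputing the set only when
        the parent_id_basename_to_file_id map's root key changes."""
        cached_key = None
        cached_children = None
        for rev_tree in rev_trees:
            pid_map = rev_tree.inventory.parent_id_basename_to_file_id
            key = pid_map.key()
            if key != cached_key:
                # Tree shape changed, so recompute the recursive
                # children (compute_children is a hypothetical helper).
                cached_children = compute_children(pid_map, search_file_id)
                cached_key = key
            yield rev_tree, cached_children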

There are other ideas to explore:

1) Do iter_changes first, and then only compute whether each file_id in the delta is a child, rather than computing all known children at each step. You could still use a pid_map.key() based cache. (So for all file_ids in a given delta, compute whether each one is a child and cache it, resetting the cache when the pid_map.key() changes; a sketch follows below.)

If the directory is big, the average delta is small, and tree shape changes are frequent, this could be a win.
(Basically, the work done now is O(size_recursive_children * num_tree_changes); this would change it to O(size_deltas * num_tree_changes), plus presumably larger constant overheads.)
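
A rough sketch of this alternative, with parent_of() standing in for a hypothetical lookup of a file_id's parent directory in the current tree shape (the returned predicate must be thrown away and rebuilt whenever pid_map.key() changes):

    def make_is_child(parent_of, root_file_id):
        """Return a memoized 'is this file_id under root_file_id?' check."""
        cache = {root_file_id: True}

        def is_child(file_id):
            walked = []
            current = file_id
            # Walk towards the tree root until we hit a cached verdict
            # (or fall off the top, in which case the answer is False).
            while current is not None and current not in cache:
                walked.append(current)
                current = parent_of(current)  # hypothetical parent lookup
            result = current is not None and cache[current]
            for fid in walked:
                cache[fid] = result
            return result

        return is_child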

2) The pid_map is a recursive trie, so while we can detect that anything changed because the root key changes, we can also tell when only a subset has changed. We could play tricks with our caching logic so that tree-shape changes that couldn't contain the file_ids we care about don't require rebuilding that part of the tree-shape info. The fact that it is a hash trie (rather than something based on the real paths) makes this a bit tricky. However, tree shape changes are usually small (especially since renaming a directory is represented as a single change, not one that touches every child), so there could be a fairly large saving here.

This changes the effort to O(size_tree_changes * num_tree_changes + size_recursive_children): you do one recursive-children computation up front, then only mutate that state based on each small tree-shape change (a sketch follows).
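
A very hand-wavy sketch of that incremental update, where every helper name is an assumption: pid_map_delta yields (old_entry, new_entry) pairs of ((parent_id, basename), file_id) tuples (None for the missing side of an add or delete), is_under_root() tests membership under the search root, and the iter_*_recursive() helpers enumerate a directory's subtree in the old/new shape:

    def update_children(children, pid_map_delta, is_under_root,
                        iter_old_recursive, iter_new_recursive):
        """Mutate the cached recursive-children set in place."""
        for old_entry, new_entry in pid_map_delta:
            if old_entry is not None:
                _, file_id = old_entry
                if file_id in children:
                    children.discard(file_id)
                    # A directory moved away takes its subtree with it.
                    children.difference_update(iter_old_recursive(file_id))
            if new_entry is not None:
                (parent_id, _), file_id = new_entry
                if parent_id in children or is_under_root(parent_id):
                    children.add(file_id)
                    # A directory moved in brings its whole subtree, even
                    # though those entries don't appear in the delta.
                    children.update(iter_new_recursive(file_id))
        return children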

3) As mentioned above, push the computation of which changes might be interesting into iter_changes(), so that it can skip comparing subtrees that can't contain interesting changes. That could avoid pulling in chk pages entirely, which could be a win.
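
No such argument exists today; purely as a hypothetical sketch of what the API could look like:

    # Hypothetical: iter_changes() does not take interesting_ids today.
    for change in rev_tree.iter_changes(basis_tree,
                                        interesting_ids=children_ids):
        ...  # only subtrees that can contain these ids get compared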