Bug #721250 “bzr log . in a subdir is slow” : Bugs : Bazaar

Revision history for this message

Per Johansson (per.j) wrote on 2011-02-18:

#1

log2.callgrind Edit (300.7 KiB, text/plain)

Revision history for this message

Vincent Ladeuil (vila) wrote on 2011-02-18:

#2

Are you using the compiled extensions ?

The python code ensures compatibility where the extensions *can't* be compiled but there is a significant performance penalty there.

There should be warning about it if that's the case.

Changed in bzr:
status:	New → Incomplete

Revision history for this message

Per Johansson (per.j) wrote on 2011-02-18:

#3

log.callgrind Edit (309.6 KiB, text/plain)

Ah, right, sorry about that. I'd done python setup.py build but not make. Attaching the callgraph for the installed 2.2.1 version while I run 3.3.0 again then (won't have time to post that one until monday).

Revision history for this message

John A Meinel (jameinel) wrote on 2011-02-18:

#4

Related to bug #374730 and bug #503071

There are a few conflicting issues. Some of it depends on the average size of changes vs the average size of SUBDIR.

If you look at the callgrind, it is pretty clear that the bulk of the time is spent in _filtered_revision_trees. Which is the point where we expand SUBDIR to all files that are children of SUBDIR. for every revision.

Things that I've tried in the past. Caching that value, based on the tree's file_id=>path map. So if a given revision and its parent have identical tree shape, you don't have to re-compute what children are available.

Right now, the loop looks something like:

  for rev in requested_revisions:
    t = get_tree(rev)
    interesting = find_children_and_direct_parents(t, SUBDIR)
    inv = filter_leaving_only(t, interesting)
    delta = inv.delta(prev_inv)
    prev_inv = inv
    if is_interesting(delta):
      yield rev, delta

The other way to do it, is something like:

  for rev in requested_revisions:
    delta = get_delta(rev, rev.parent_rev)
    if is_interesting(delta, filter_on=SUBDIR):
      yield rev, filter(delta)

Or something along those lines. The former tends to do well when SUBDIR is a very small fraction of the overall tree.

The best way to do it, would be something like:

  for rev in requested_revisions:
    delta = get_partial_delta(rev, rev.parent_rev, filter_on=SUBDIR)
    if delta:
      yield rev, delta

The idea being that we could push the filtering-by-subdir into the code that is computing the changes between two revisions (so it can ignore changes outside of that)

There is enough state involved, that likely we would want a DeltaSearch object, that would be repository specific. So that CHK repos could customize it for specifics of how the data is stored. (Because filter_on=SUBDIR can then know when the tree shape doesn't change between revisions, allowing state to be cached between revisions that we care about.)

Anyway, I have a fair amount of (old) knowledge in this space, but not enough time to focus on it. I would be happy to mentor someone who wants to get into the details.

Related to bug #374730 and bug #503071

There are a few conflicting issues. Some of it depends on the average size of changes vs the average size of SUBDIR.

If you look at the callgrind, it is pretty clear that the bulk of the time is spent in _filtered_revision_trees. Which is the point where we expand SUBDIR to all files that are children of SUBDIR. for every revision.

Things that I've tried in the past. Caching that value, based on the tree's file_id=>path map. So if a given revision and its parent have identical tree shape, you don't have to re-compute what children are available.

Right now, the loop looks something like:

for rev in requested_revisions:
    t = get_tree(rev)
    interesting = find_children_and_direct_parents(t, SUBDIR)
    inv = filter_leaving_only(t, interesting)
    delta = inv.delta(prev_inv)
    prev_inv = inv
    if is_interesting(delta):
      yield rev, delta

The other way to do it, is something like:

for rev in requested_revisions:
    delta = get_delta(rev, rev.parent_rev)
    if is_interesting(delta, filter_on=SUBDIR):
      yield rev, filter(delta)

Or something along those lines. The former tends to do well when SUBDIR is a very small fraction of the overall tree.

The best way to do it, would be something like:

for rev in requested_revisions:
    delta = get_partial_delta(rev, rev.parent_rev, filter_on=SUBDIR)
    if delta:
      yield rev, delta

The idea being that we could push the filtering-by-subdir into the code that is computing the changes between two revisions (so it can ignore changes outside of that)

There is enough state involved, that likely we would want a DeltaSearch object, that would be repository specific. So that CHK repos could customize it for specifics of how the data is stored. (Because filter_on=SUBDIR can then know when the tree shape doesn't change between revisions, allowing state to be cached between revisions that we care about.)

Anyway, I have a fair amount of (old) knowledge in this space, but not enough time to focus on it. I would be happy to mentor someone who wants to get into the details.

Revision history for this message

John A Meinel (jameinel) wrote on 2011-02-22:

#5

Download full text (3.3 KiB)

I have some experimental code that doesn't actually touch 'bzr log' in the associated branch. The results are pretty promising, but it still needs a fair amount of work to get going.

The experimental code is about finding "what paths are children of XXX", and doing so as efficiently as I can think of, as an initial building block.

Having this information, you can then do 'iter_changes()' on each rev and pass it through the set of file_ids. There are alternatives that might pass something like this into iter_changes() so it could do the filtering. The way the data is stored, I'm not really sure it would be a win.

Anyway, I also wrote a trivial script to test the performance, which looks pretty good as a starting point. Specifically, to find all file_ids that are a child of 'bzrlib', across 2000 revisions:
54.7s py time_delta_search.py --num-revs 2000 --filter
21.6s py time_delta_search.py --num-revs 2000 --no-cache
3.3s py time_delta_search.py --num-revs 2000

There are 2 basic ideas at play:

1) If the parent_id_basename_file_id map hasn't changed, then we can assume the tree shape is identical, and thus not recompute the information for every revision. According to the above numbers, only about 1 in 6 revs does the tree shape change. So we save 1/6th the computation.
2) Not using InventoryEntry objects. The way Inventory.filter() works, is that it has to realize the raw Inventory, and pull out all the bits that it cares about. Then build up a new Inventory object. There isn't a shared cache, so it has to create a new InventoryEntry for every child of the search path for every revision, whether they have changed or not.

There are other ideas to explore:

1) Do iter_changes first, and then only try to compute whether each file_id is a child, rather than computing all known children at each step. You could still use a pid_map.key() based cache. (So for all file_ids in a given delta, compute if each one is a child and cache it. Reset the cache when the pid_map.key() changes.)

If the size of the directory is big, average delta size is small, and you have frequent tree shape changes, this could be a win.
(Basically, the work done now is O(size_recursive_children * num_tree_changes), you would change it to O(size_deltas * num_tree_changes) plus presumably larger constant overheads.)

2) The pid_map is a recursive trie. So while we can detect when anything changed because the root key changes, we can also tell when only a subset changes. We could play tricks with our caching logic, so that tree-shape changes that couldn't contain the file_ids we care about, won't require rebuilding that part of the tree shape info. The fact that it is a hash trie makes this a bit tricky (vs something based on the real paths). However, tree shape changes are usually small (especially since renaming a directory is represented as a single change, not touching every child.) So there could be a fairly large savings here.

It changes the effort to O(size_tree_changes * num_tree_changes + 1*size_recursive_children) (you do one recursive children computation, then only mutate that state based on the small tree shape change.)

3) As mentioned push the computation o...

I have some experimental code that doesn't actually touch 'bzr log' in the associated branch. The results are pretty promising, but it still needs a fair amount of work to get going.

The experimental code is about finding "what paths are children of XXX", and doing so as efficiently as I can think of, as an initial building block.

Having this information, you can then do 'iter_changes()' on each rev and pass it through the set of file_ids. There are alternatives that might pass something like this into iter_changes() so it could do the filtering. The way the data is stored, I'm not really sure it would be a win.

Anyway, I also wrote a trivial script to test the performance, which looks pretty good as a starting point. Specifically, to find all file_ids that are a child of 'bzrlib', across 2000 revisions:
 54.7s py time_delta_search.py --num-revs 2000 --filter
 21.6s py time_delta_search.py --num-revs 2000 --no-cache
  3.3s py time_delta_search.py --num-revs 2000

There are 2 basic ideas at play:

1) If the parent_id_basename_file_id map hasn't changed, then we can assume the tree shape is identical, and thus not recompute the information for every revision. According to the above numbers, only about 1 in 6 revs does the tree shape change. So we save 1/6th the computation.
2) Not using InventoryEntry objects. The way Inventory.filter() works, is that it has to realize the raw Inventory, and pull out all the bits that it cares about. Then build up a new Inventory object. There isn't a shared cache, so it has to create a new InventoryEntry for every child of the search path for every revision, whether they have changed or not.

There are other ideas to explore:

1) Do iter_changes first, and then only try to compute whether each file_id is a child, rather than computing all known children at each step. You could still use a pid_map.key() based cache. (So for all file_ids in a given delta, compute if each one is a child and cache it. Reset the cache when the pid_map.key() changes.)

If the size of the directory is big, average delta size is small, and you have frequent tree shape changes, this could be a win.
(Basically, the work done now is O(size_recursive_children * num_tree_changes), you would change it to O(size_deltas * num_tree_changes) plus presumably larger constant overheads.)

2) The pid_map is a recursive trie. So while we can detect when anything changed because the root key changes, we can also tell when only a subset changes. We could play tricks with our caching logic, so that tree-shape changes that couldn't contain the file_ids we care about, won't require rebuilding that part of the tree shape info. The fact that it is a hash trie makes this a bit tricky (vs something based on the real paths). However, tree shape changes are usually small (especially since renaming a directory is represented as a single change, not touching every child.) So there could be a fairly large savings here.

It changes the effort to O(size_tree_changes * num_tree_changes + 1*size_recursive_children) (you do one recursive children computation, then only mutate that state based on the small tree shape change.)

3) As mentioned push the computation of what changes might  be interesting into iter_changes, so that it can skip comparing subtrees that would otherwise not contain interesting changes. This could avoid pulling in chk pages, which could be a win.

Revision history for this message

John A Meinel (jameinel) wrote on 2011-02-22:

#6

I also added a flag to compute iter_changes() between the trees, and filter appropriately. So comparing the existing: "filter trees then iter_changes between filtered trees", to the proposed "compute filter file_ids, cache it, iter_changes and filter changes against file_ids".

The results look pretty good there (for 100 revisions):

4.2s --delta --filter
3.3s --filter
0.7s --delta
0.3s --no-delta --no-filter

2 key takeaways, --delta is cheaper for non-filtered trees, about 0.4s vs 0.9s for filtered comparisons.

Take these results with a grain of salt, though. Because adding up how many raw 'changes' were found gives different results. --delta --filter shows 610 changes in 100 revisions, --delta shows 599 changes. I'd have to look at the details to figure out what is going on.

Revision history for this message

Per Johansson (per.j) wrote on 2011-02-25:

#7

log-3.3.0-compiled.callgrind Edit (286.7 KiB, text/plain)

Thanks for looking at this. I spent most of my week at home with a cold, but finally back at work. Not sure if it's still relevant but attaching the callgrind for 3.3.0, compiled. Time output:

real 118m29.958s
user 116m10.968s
sys 2m14.741s

I'll see if I can run those things on our tree.

Changed in bzr:
status:	Incomplete → New

Per Johansson (per.j) on 2011-02-25

description:

updated

Revision history for this message

Per Johansson (per.j) wrote on 2011-02-25:

#8

I'm probably missing something obvious, but why can't it just do a normal bzr log and then only display the relevant paths? Bzr log is fast after all.

: [dev ~/src]; time ~/bazaar/bzr-2.3.0/bzr log > /dev/null
real 0m6.980s
user 0m6.232s
sys 0m0.195s

Anyway, here's the output for your script:
: [dev ~/src/head]; python ~/bazaar/2.4-log-subdir/time_delta_search.py --path sub/dir --delta --filter
Took 25.654s for 100 revisions (256 changes, 187584 file_ids) and path: sub/dir
: [dev ~/src/head]; python ~/bazaar/2.4-log-subdir/time_delta_search.py --path sub/dir --filter
Took 20.528s for 100 revisions (0 changes, 187584 file_ids) and path: sub/dir
: [dev ~/src/head]; python ~/bazaar/2.4-log-subdir/time_delta_search.py --path sub/dir --delta
Took 13.308s for 100 revisions (173 changes, 187584 file_ids) and path: sub/dir
: [dev ~/src/head]; python ~/bazaar/2.4-log-subdir/time_delta_search.py --path sub/dir
Took 3.807s for 100 revisions (0 changes, 187584 file_ids) and path: sub/dir

Those differences in the changes part doesn't really look that good...

Revision history for this message

John A Meinel (jameinel) wrote on 2011-02-25: Re: [Bug 721250] Re: bzr log . in a subdir is slow

#9

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/25/2011 5:54 AM, Per Johansson wrote:
> I'm probably missing something obvious, but why can't it just do a
> normal bzr log and then only display the relevant paths? Bzr log is fast
> after all.
>
> : [dev ~/src]; time ~/bazaar/bzr-2.3.0/bzr log > /dev/null
> real 0m6.980s
> user 0m6.232s
> sys 0m0.195s

You have to at least do "bzr log -v".

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1nxP8ACgkQJdeBCYSNAAOMCgCgiiGUpV0CdJXOftAID9u+mJtz
NPUAoIBpSo/odim8EpuIJkgd5W25JQXH
=I+pj
-----END PGP SIGNATURE-----

Revision history for this message

Per Johansson (per.j) wrote on 2011-02-25:

#10

You're right of course, but bzr log -v still only takes a bit over 9 min (couldn't find an idle machine, but seems right). But perhaps it can be made even better than that.

Jelmer Vernooij (jelmer) on 2011-02-26

Changed in bzr:
status:	New → Confirmed
importance:	Undecided → Medium
importance:	Medium → High
tags:	added: log performance

Revision history for this message

Per Johansson (per.j) wrote on 2011-08-24:

#11

bzr log -v is faster now (using python-2.6 and cython, 21670 revs):

time bzr-dev log -v > /dev/null

real 8m23.487s
user 8m3.592s
sys 0m17.853s

and the subdir log is also slightly faster:

time bzr-dev log . > /dev/null

real 115m59.288s
user 112m19.072s
sys 3m39.049s

but they're still on the same order of magnitudes.

bzr.dev revno 6093

Revision history for this message

Per Johansson (per.j) wrote on 2012-01-03:

#12

Not sure if you want these updates but it's heading in the right direction anyway, still revno 21670:

time bzr-dev log -v > /dev/null

real 6m28.964s
user 6m20.093s
sys 0m8.714s

time bzr-dev log . > /dev/null

real 100m5.384s
user 98m14.737s
sys 1m49.466s

bzr.dev revno 6409

Jelmer Vernooij (jelmer) on 2017-11-08

tags:

added: check-for-breezy

Jelmer Vernooij (jelmer) on 2017-11-12

Changed in brz:
status:	New → Triaged
importance:	Undecided → Medium
tags:	removed: check-for-breezy

Affects		Status	Importance	Assigned to	Milestone
	Bazaar	Confirmed	High	Unassigned
	Breezy	Triaged	Medium	Unassigned

Bazaar

bzr log . in a subdir is slow

Bug Description

Related branches

Other bug subscribers

Bug attachments

Remote bug watches