[PATCH] bzr rm pathologically slow with new files and no --keep
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Bazaar | Fix Released | High | Johan Walles |
Bug Description
Try this script (in bash):
mkdir -p /tmp/bazaar-bug
cd /tmp/bazaar-bug
for i in `seq 20`; do
  for j in `seq 20`; do
    mkdir -p "directory$i/directory$j"
    for k in `seq 20`; do
      echo "This is some content in file #$k" > "directory$i/directory$j/file$k"
    done;
  done;
done;
bzr init;
bzr add -q .
Now try:
$ bzr rm *
It's pathologically slow (on the order of minutes of runtime, an earlier test with 30x30x30 files showed no signs of being done after 20 minutes), probably exponential (or at least O(n^k) in the number of files). If you run it with a smaller number of files, you can see that bzr errors out in the end, warning that those files were not previously tracked and thus can't be safely removed from the working tree. However, this:
$ bzr rm --keep *
is nice and fast. This:
$ bzr rm --new *
is *not* fast. So there's some kind of very bad behaviour associated with collecting info about newness of files that makes it take ages. It seems to be bad only computationally, I haven't noticed any increased memory usage during those 20 minutes I let it run for.
Changed in bzr:
assignee: nobody → Johan Walles (walles)
status: Triaged → In Progress
tags: added: performance
I can confirm this. On a 20x20x20 setup, 'bzr rm --keep *' takes 7s on my machine, and 'bzr rm *' has not finished after 1 min of execution.
And --lsprof of a canceled 'rm' shows:

 2974089  0   9.6543  7.5740  bzrlib.osutils:493(is_inside)
+2974088  0   2.0802  2.0802  +<method 'startswith' of 'str' objects>
    1140  0  13.5061  3.8519  bzrlib.osutils:517(is_inside_any)
+2974089  0   9.6543  7.5740  +bzrlib.osutils:493(is_inside)
    1140  0   2.4308  2.4308  <method 'difference' of 'set' objects>
 2974517  0   2.0807  2.0807  <method 'startswith' of 'str' objects>
       1  0  17.0079  1.0435  bzrlib.workingtree_4:1763(iter_changes)
+   1140  0  13.5061  3.8519  +bzrlib.osutils:517(is_inside_any)
The problem is that doing 'rm *' gives lots of paths to check, and is_inside_any does not handle that gracefully.
If you change it to:
time bzr rm directory1
it takes only 4s. However, it still ends up doing 182,174 calls to is_inside from 842 calls to is_inside_any.
It seems to be caused by this loop:

1871            import pdb; pdb.set_trace()
1872  ->        for path in specific_files:
1873                other_specific_files = specific_files.difference(set([path]))
1874                if not osutils.is_inside_any(other_specific_files, path):
1875                    # this is a top level path, we must check it.
1876                    search_specific_files.add(path)
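Stripped of the pdb markers, the loop can be sketched as a standalone function (with is_inside_any inlined here as a simplified stand-in for the osutils version):

```python
def top_level_paths(specific_files):
    # Keep only the paths that are not contained in some *other*
    # requested path; those are the top-level paths to check.
    def is_inside_any(dirs, fname):
        # Stand-in for osutils.is_inside_any: linear scan over dirs.
        return any(fname == d or fname.startswith(d + '/') for d in dirs)

    search_specific_files = set()
    for path in specific_files:
        other_specific_files = specific_files.difference({path})
        if not is_inside_any(other_specific_files, path):
            # this is a top level path, we must check it
            search_specific_files.add(path)
    return search_specific_files
```

For example, top_level_paths({'a', 'a/b', 'c'}) keeps only 'a' and 'c', since 'a/b' is already covered by 'a'.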
This is a check to make sure that any given path is not supplied by two different paths. Without --keep, it checks to see if any of the paths are considered modified, which means it does:

WT.iter_changes(specific_files=all_paths_to_be_removed)
So with 20x20x20 = 8,000 files, that is up to 8,000^2 = 64,000,000 comparisons. It probably isn't quite that many, because it stops once it hits one that *is* inside.
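To see why the cost is quadratic, here is a hypothetical standalone counter (not bzrlib code) that mimics the same all-against-all filtering:

```python
def count_filter_comparisons(paths):
    # Mimic the top-level-path filtering loop: each path is compared
    # against every other requested path until one contains it
    # (worst case, as with thousands of sibling files, none does).
    comparisons = 0
    top_level = set()
    path_set = set(paths)
    for path in paths:
        others = path_set - {path}
        contained = False
        for other in others:
            comparisons += 1
            if path == other or path.startswith(other + '/'):
                contained = True
                break
        if not contained:
            top_level.add(path)
    return comparisons, top_level

# 200 sibling files, none inside another: every pair gets compared,
# i.e. 200 * 199 = 39,800 comparisons; the same pattern at 8,000
# files costs 8,000 * 7,999, on the order of the 64 million above.
paths = ['file%d' % i for i in range(200)]
n_cmp, tops = count_filter_comparisons(paths)
```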
One possibility is to change to a smarter object for is_inside_any, that can understand prefixes and more easily skip over ones that it couldn't match.
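One hypothetical shape for such a smarter object (a sketch only, not what bzrlib ended up shipping): build a set of the candidate directories once, then answer each query by walking the file's ancestor paths, so a lookup costs O(path depth) instead of O(number of candidates):

```python
class ContainmentChecker:
    """Hypothetical replacement for repeated is_inside_any scans.

    Builds a set of directories once; each query checks only the
    ancestors of fname against that set, so it is O(depth of fname)
    rather than O(len(dirs)).
    """

    def __init__(self, dirs):
        self._dirs = {d.rstrip('/') for d in dirs}

    def is_inside_any(self, fname):
        path = fname
        while path:
            if path in self._dirs:
                return True
            slash = path.rfind('/')
            path = path[:slash] if slash >= 0 else ''
        # Mirror is_inside('', x) == True: an empty dir contains all.
        return '' in self._dirs
```

With this, checking 8,000 top-level files against 8,000 candidates is 8,000 constant-time set lookups instead of 64 million string comparisons.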
Another possibility is to figure out if we really need to be doing this check.