[PATCH] bzr rm pathologically slow with new files and no --keep

Bug #180116 reported by Maciej Katafiasz
6
Affects Status Importance Assigned to Milestone
Bazaar
Fix Released
High
Johan Walles

Bug Description

Try this script (in bash):

mkdir -p /tmp/bazaar-bug
cd /tmp/bazaar-bug
for i in `seq 20`; do
  for j in `seq 20`; do
    mkdir -p "directory$i/subdir$j";
    for k in `seq 20`; do
      echo "This is some content in file #$k" > directory$i/subdir$j/file$k.txt;
    done;
  done;
done;
bzr init;
bzr add -q .

Now try:

$ bzr rm *

It's pathologically slow (on the order of minutes of runtime, an earlier test with 30x30x30 files showed no signs of being done after 20 minutes), probably exponential (or at least O(n^k) in the number of files). If you run it with a smaller number of files, you can see that bzr errors out in the end, warning that those files were not previously tracked and thus can't be safely removed from the working tree. However, this:

$ bzr rm --keep *

is nice and fast. This:

$ bzr rm --new *

is *not* fast. So there's some kind of very bad behaviour associated with collecting info about newness of files that makes it take ages. It seems to be bad only computationally, I haven't noticed any increased memory usage during those 20 minutes I let it run for.

Revision history for this message
John A Meinel (jameinel) wrote :

I can confirm this. on a 20x20x20 setup, 'bzr rm --keep *' takes 7s on my machine, and 'bzr rm *' has not finished after 1min of execution.

And --lsprof of a canceled 'rm' shows:
    2974089 0 9.6543 7.5740 bzrlib.osutils:493(is_inside)
   +2974088 0 2.0802 2.0802 +<method 'startswith' of 'str' objects>
       1140 0 13.5061 3.8519 bzrlib.osutils:517(is_inside_any)
   +2974089 0 9.6543 7.5740 +bzrlib.osutils:493(is_inside)
       1140 0 2.4308 2.4308 <method 'difference' of 'set' objects>
    2974517 0 2.0807 2.0807 <method 'startswith' of 'str' objects>
          1 0 17.0079 1.0435 bzrlib.workingtree_4:1763(iter_changes)
      +1140 0 13.5061 3.8519 +bzrlib.osutils:517(is_inside_any)

The problem is that doing 'rm *' gives lots of paths to check, and 'is_inside_any' is not particularly gracious about this.

If you change it to:

time bzr rm directory1

it takes only 4s. However, it still ends up doing 182,174 calls to is_inside from 842 calls to is_inside_any.

It seems to be caused by this loop:
1871 import pdb; pdb.set_trace()
1872 -> for path in specific_files:
1873 other_specific_files = specific_files.difference(set([path]))
1874 if not osutils.is_inside_any(other_specific_files, path):
1875 # this is a top level path, we must check it.
1876 search_specific_files.add(path)

This is a check to make sure that any given path is not supplied by two different paths. Without --keep, it checks to see if any of the paths are considered modified. Which means it does:
WT.iter_changes(specific_files=all_paths_to_be_removed)

So with (20x20x20)^2 = 64,000,000 comparisons. It probably isn't quite that many, because it stops once it hits one that *is* inside.

One possibility is to change to a smarter object for is_inside_any, that can understand prefixes and more easily skip over ones that it couldn't match.
Another possibility is to figure out if we really need to be doing this check.

Changed in bzr:
importance: Undecided → High
status: New → Triaged
Revision history for this message
Johan Walles (walles) wrote :

Still there in bzr 1.13.1:

$ time bzr rm *
real 1m44.284s
user 1m42.586s
sys 0m0.184s

$ time bzr rm --keep *
real 0m5.072s
user 0m4.588s
sys 0m0.244s

Excellent repro instructions!

Revision history for this message
Johan Walles (walles) wrote :

Here's a patch that fixes the poor bzr rm * performance in bzr trunk.

The patch does two things:
* It calls osutils.minimum_path_selection() from within workingtree_4.iter_changes() instead of using a copy of osutils.minimum_path_selection() in there.
* It converts osutils.minimum_path_selection() into an O(n) algorithm.

Timings of bzr trunk without this patch (notice how it's slower than 1.13.1, not my fault though :-):
$ time rm --keep *
real 0m11.291s
user 0m7.304s
sys 0m3.880s

New timings on my system with this patch applied:
$ time bzr rm *
real 0m11.124s
user 0m7.088s
sys 0m3.920s

$ time bzr rm --keep *
real 0m9.936s
user 0m6.768s
sys 0m3.076s

So bzr rm now performs about the same with and without --keep. As a side effect, anybody else calling osutils.minimum_path_selection() will be faster as well.

  Have fun :-) //Johan

tags: added: patch
summary: - bzr rm pathologically slow with new files and no --keep
+ [PATCH] bzr rm pathologically slow with new files and no --keep
Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Re: [Bug 180116] Re: bzr rm pathologically slow with new files and no --keep

Johan Walles wrote:
> Here's a patch that fixes the poor bzr rm * performance in bzr trunk.
>
Johan,

Sweet!

Could you please send the patch to the list with a subject line that
starts with [MERGE]? We can then follow our usual review process to get
it landed.

Ian C.

Revision history for this message
Johan Walles (walles) wrote :

Done:
http://bundlebuggy.aaronbentley.com/project/bzr/request/<9b0752e70905050021y1fb86ba6yd1b1211ddb439c86%40mail.gmail.com>

Johan Walles (walles)
Changed in bzr:
assignee: nobody → Johan Walles (walles)
status: Triaged → In Progress
Revision history for this message
Johan Walles (walles) wrote :

This was just merged into trunk:
http://bazaar.launchpad.net/~bzr/bzr/trunk/revision/4343

IIUC the fix will be in bzr 1.15.

Changed in bzr:
status: In Progress → Fix Committed
Johan Walles (walles)
tags: added: performance
Revision history for this message
John A Meinel (jameinel) wrote :

landed as bzr.dev 4343, will be in 1.15rc1

Changed in bzr:
milestone: none → 1.15rc1
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.