bzr diff too slow (cpu intensive) on large projects

Bug #1006194 reported by Dimitrios Apostolou
This bug affects 1 person
Affects   Status      Importance   Assigned to   Milestone
Bazaar    Confirmed   High         Unassigned
Breezy    Triaged     Medium       Unassigned

Bug Description

In general, the experience with huge projects is sub-optimal in comparison to other VCSs. A particular example: I have a --no-trees local repo of lp:gcc and a single --lightweight checkout in a local directory where I work, switching between all my branches. Both dirs are on local disk (I've had an *awful* experience on NFS, but that's not the case here).

The following command takes about 2 minutes on a recent quad-core CPU and is mostly CPU-bound. Here is the full report from the 'time' utility:

$ /usr/bin/time -v bzr diff -rbranch:../notrees-repo/trunk > /dev/null
Command exited with non-zero status 1
        Command being timed: "bzr diff -rbranch:../notrees-repo/trunk"
        User time (seconds): 91.23
        System time (seconds): 12.81
        Percent of CPU this job got: 84%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:03.53
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 369488
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 30
        Minor (reclaiming a frame) page faults: 4319360
        Voluntary context switches: 8765
        Involuntary context switches: 498
        Swaps: 0
        File system inputs: 277288
        File system outputs: 880
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 1

Memory usage is also a (smaller) problem: as you can see, max RSS is about 370MB, which seems too much for diffing files. I am trying hard to use bzr for gcc development, but I can't say I have seen a lot of activity on the performance- or memory-related bug reports.
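
As a quick check of the "mostly CPU-bound" reading, the CPU share can be recomputed from the user, system and elapsed figures in the report above. This is only a small arithmetic sketch; the helper name parse_elapsed is made up for illustration and is not part of bzr or of the time utility.

```python
def parse_elapsed(text):
    """Convert an 'h:mm:ss' or 'm:ss' elapsed field into seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Figures copied from the /usr/bin/time -v report above.
user_s = 91.23
system_s = 12.81
elapsed_s = parse_elapsed("2:03.53")

cpu_s = user_s + system_s
cpu_share = cpu_s / elapsed_s
print(f"CPU time: {cpu_s:.2f}s of {elapsed_s:.2f}s wall clock ({cpu_share:.0%})")
```

The result matches the 84% that time reports, so the process spent most of the two minutes computing rather than waiting on the disk.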

Dimitrios Apostolou (jimis) wrote :

Bazaar (bzr) 2.5.0
  Python interpreter: /usr/bin/python2 2.7.3
  Python standard library: /usr/lib/python2.7
  Platform: Linux-3.3.7-1-ARCH-i686-with-glibc2.0
  bzrlib: /usr/lib/python2.7/site-packages/bzrlib

Dimitrios Apostolou (jimis) wrote :

Just a clue that might help: both "bzr status" and "bzr diff" with no other arguments need about 3 seconds. So it's diffing against a *different* branch (even though at most 1-2 files actually differ) that imposes the overhead.

Dimitrios Apostolou (jimis) wrote :

Hmmm, more measurements:

"bzr diff -r -1" is fast (~3s, I/O bound, 93 MB RSS).

"bzr diff -r -2" is slow (~2min, CPU bound, 370 MB RSS).

I'd appreciate it if you could explain what imposes this overhead.
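
For anyone who wants to repeat the fast/slow comparison, here is a rough sketch of a harness that runs a command and reports wall-clock time against CPU time. It assumes a Unix system (it uses Python's resource module); the function names measure and classify and the 50% threshold are arbitrary choices for illustration, not anything bzr provides.

```python
import resource
import subprocess
import sys
import time


def measure(argv):
    """Run argv and return (wall_seconds, cpu_seconds) for the child process."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    subprocess.run(argv, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=False)
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # User plus system time accumulated by the child we just waited for.
    cpu = ((after.ru_utime - before.ru_utime)
           + (after.ru_stime - before.ru_stime))
    return wall, cpu


def classify(wall, cpu, threshold=0.5):
    """Label a run by its CPU share (the threshold is an assumed cut-off)."""
    return "CPU-bound" if cpu / wall >= threshold else "I/O-bound or waiting"


if __name__ == "__main__":
    # Usage: python measure.py bzr diff -r -2
    if len(sys.argv) > 1:
        wall, cpu = measure(sys.argv[1:])
        print(f"wall {wall:.2f}s, cpu {cpu:.2f}s: {classify(wall, cpu)}")
```

Running it once with "bzr diff -r -1" and once with "bzr diff -r -2" should show the same split as the figures above, with the first dominated by waiting and the second by CPU time.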

Dimitrios Apostolou (jimis) wrote :

On further investigation, "bzr diff -r -2" spends most of its time seeking and reading through the (few but big) pack files.

Dimitrios Apostolou (jimis) wrote :

"bzr diff -r -2" is slow, even if only one file differs.

"bzr diff -r -2 specific_file.c" is fast.

So I'm assuming bzr consumes the CPU trying to find which files actually differ.
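
To illustrate why that assumption would explain the difference, here is a toy model, which is not how bzr actually stores or compares trees. Each tree is a hypothetical {path: content_hash} mapping, and changed_paths is a made-up helper. Without a path filter every entry has to be examined, while naming a file limits the work to that single entry.

```python
def changed_paths(old_tree, new_tree, only=None):
    """Return the sorted paths whose content hash differs between two trees.

    old_tree and new_tree are hypothetical {path: content_hash} mappings.
    If `only` is given, just those paths are examined.
    """
    if only is not None:
        candidates = set(only)
    else:
        candidates = old_tree.keys() | new_tree.keys()
    return sorted(p for p in candidates if old_tree.get(p) != new_tree.get(p))


# Toy trees: 100,000 identical files, one of which is then modified.
old = {f"src/file{i}.c": f"hash{i}" for i in range(100_000)}
new = dict(old)
new["src/file42.c"] = "modified"

print(changed_paths(old, new))                          # examines all 100,000 entries
print(changed_paths(old, new, only=["src/file42.c"]))   # examines a single entry
```

Both calls report the same changed file, but the first one pays for a scan of the whole tree. Whether the real cost in bzr comes from a scan like this or from something else is exactly what the profile would have to show.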

Martin Packman (gz) wrote :

Thanks for looking into this. I wonder if there are two different repository references, one for each branch, rather than bzr realising the repository is shared. It may be something trickier, though. Have you tried the --lsprof-file flag to dig into where the time is spent?

There's more information on these general lines in some mailing list threads, which you may find useful:

<https://lists.ubuntu.com/archives/bazaar/2011q2/072713.html>
<https://lists.ubuntu.com/archives/bazaar/2011q4/073727.html>

Changed in bzr:
status: New → Incomplete
Dimitrios Apostolou (jimis) wrote :

Thanks, I didn't know about the --lsprof-file option. I'm attaching the profiling output for "bzr diff -r-2". FWIW, "bzr log -r-2" is instantaneous.

Dimitrios Apostolou (jimis) wrote :

In the first link, point 2, it is mentioned that iter_changes() has been replaced in many places with optimized_iter_changes() for iterating over the differences between two trees. Yet in my profile output I can see that iter_changes() is being used, and a lot of time is consumed in it.

Launchpad Janitor (janitor) wrote :

[Expired for Bazaar because there has been no activity for 60 days.]

Changed in bzr:
status: Incomplete → Expired
Changed in bzr:
status: Expired → New
Martin Packman (gz)
Changed in bzr:
importance: Undecided → High
status: New → Confirmed
Jelmer Vernooij (jelmer)
tags: added: check-for-breezy
Jelmer Vernooij (jelmer)
tags: added: diff performance
removed: check-for-breezy
Changed in brz:
importance: Undecided → Medium
status: New → Triaged