Repo.revision_history() is slow

Bug #535118 reported by BenjaminBerg
22
This bug affects 4 people
Affects Status Importance Assigned to Milestone
Dulwich
Fix Released
Medium
Unassigned

Bug Description

The revision_history function is implemented in a really slow manner, which also uncovers some things in the library that seem suboptimal to me.

The main issues I see are:
  * Commit objects are compared to each other (really inefficient)
  * A search is done on a list instead of a set() or similar to see if a commit already exists
  * (sorting is done by a pretty dumb insertion sort, not sure what the effect of this is)

With the first point one issue is that comparing two commit objects needs to call the __eq__ function, and may result in the hash to be calculated (which, in theory is not neccessary as the hash was used to retrieve the data in the first place).

It is not very hard to work around these issues, by just storing the hash as a string, and then doing the checks in the function based on this. However, I am thinking that a better way may exist.

For this I considered the following:
  * Commit objects (and other objects) that are written to the repository are basically immutable. They should never be modified.
  * A commit that may be modified does not belong to any repository
  * If Commits may not be modified, the same object could be returend if an object is retrieved multiple times

Now, if that happens, ie. the same object is retrieved when repo.get_object() is called twice on one ID, then the check whether two commits are the same becomes a simple python object equality tests.

For this to work, one would need to keep a hash table that contains all objects from the repository in a hash table. Of course it needs to use weak refs, so that the objects will be freed again when they are not needed anymore.

Tags: performance
Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Comparing commits just compares their hashes so this shouldn't be a problem - it's not significantly less efficient than working with those hashes as strings directly. We eventually retrieve the full contents of the commit anyway to be able see what its parents and commit timestamp are.

Commits are only retrieved once for each sha anyway, so I don't see the point of caching them.

history is a list intentionally, so it can be sorted. Optionally we could keep a set of sha's that have already been processed so that we don't have to iterate the lsit each time.

Another optimization I see is using binary search to find the right place to insert a commit sha.

Jelmer Vernooij (jelmer)
summary: - dulwich slow in places (revision_history)
+ Repo.revision_history() is slow
Revision history for this message
BenjaminBerg (benjamin-sipsolutions) wrote :

The current implementation now runs for 19minutes without returning. This implementation returns after 12 seconds cpu time with the same repository. I am not completely sure that the sorting will be correct (in the extreme case that an older commit has a newer timestamp).

def revision_history(self, head):
    """Returns a list of the commits reachable from head.

    Returns a list of commit objects. the first of which will be the commit
    of head, then following theat will be the parents.

    Raises NotCommitError if any no commits are referenced, including if the
    head parameter isn't the sha of a commit.

    XXX: work out how to handle merges.
    """
    # We build the list backwards, as parents are more likely to be older
    # than children
    try:
        pending_commits = [self.commit(head)]
    except KeyError:
        raise MissingCommitError(head)

    all_commits = set()
    sorted_commits = []
    while pending_commits != []:
        pending_commits.sort(key=lambda x:x.commit_time, reverse=True)
        commit = pending_commits.pop(0)

        if commit.id in all_commits:
            continue

        all_commits.add(commit.id)
        sorted_commits.append(commit)
        for parent in commit.parents:
            try:
                pending_commits.append(self.commit(parent))
            except KeyError:
                raise MissingCommitError(parent)
    sorted_commits.reverse()
    return sorted_commits

Revision history for this message
BenjaminBerg (benjamin-sipsolutions) wrote :

Using commit instead of commit.id for the all_commits set does not make any significant difference.

Jelmer Vernooij (jelmer)
Changed in dulwich:
status: New → Triaged
importance: Undecided → Low
Jelmer Vernooij (jelmer)
Changed in dulwich:
milestone: none → 0.7.0
Revision history for this message
Dave Borowitz (dborowitz) wrote :

Benjamin:
Thanks for the suggestions on improving the revision-walking code. One more improvement that could make a lot of difference in your implementation would be to avoid re-sorting the list on every iteration by using a priority queue (e.g. with the heapq module in Python's standard library).

I have a long patch series that implements revision-walking in a general way using a pqueue with lots of additional options. For some solid benchmark numbers: I can walk the >23k commits in git.git, starting from a loose repo and a hot disk cache, in about 8 seconds wall time. About 80% of that time is spent in commit parsing, so I think my commit-walking code is pretty fast. To put that in perspective, 'git log' on the same repo is about 10x faster.

My benchmark script is attached. I'll share my implementation when I can, but unfortunately it's at the end of a 40+ patch series at the moment.

Jelmer Vernooij (jelmer)
Changed in dulwich:
milestone: 0.7.0 → none
Jelmer Vernooij (jelmer)
Changed in dulwich:
milestone: none → 0.7.1
Jelmer Vernooij (jelmer)
tags: added: performance
Jelmer Vernooij (jelmer)
Changed in dulwich:
importance: Low → Medium
Jelmer Vernooij (jelmer)
Changed in dulwich:
milestone: 0.7.1 → none
Jelmer Vernooij (jelmer)
Changed in dulwich:
status: Triaged → Fix Committed
milestone: none → 0.7.2
Jelmer Vernooij (jelmer)
Changed in dulwich:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.