Push efficiency: objects in packs should be written as deltas

Bug #562673 reported by Jelmer Vernooij
30
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Dulwich
In Progress
High
Unassigned

Bug Description

reported by abderrahim in github:

I've made some patches some time ago, they are in my fork branch pack. All but the last patch are ready for me.
Basically, I've fixed and enabled the commented out code of writing deltas, the result is that it's very slow, I've tried to speed things up by various means : reuse deltas in existing packs, reusing the SequenceMatcher, etc. The last patch contains a C extension for diffing (actually finding blocks that match), I'm not sure this is the best idea, it reduced the time for a writing a pack (push to an empty repository) from 2m20 to 1m30. I still find this excessive, so I tried other things: doing a line-based (as opposed to character-diff) diff is way faster, only 10 seconds, but the pack is somewhat larger (216k vs 161k). "word" based diff (splitting on spaces) is somewhere in the middle (30 seconds, 166k). I find that this, or some "adaptive" method (like splitting on the most common char, may be the best idea. (the line-based and word based are both using pure python)

Tags: performance
Revision history for this message
Jelmer Vernooij (jelmer) wrote :

I think we should aim to do something similar to what C git is doing - their implementation has been heavily optimized. Have you looked at what they're doing at all?

Changed in dulwich:
status: New → Triaged
importance: Undecided → Medium
importance: Medium → High
Jelmer Vernooij (jelmer)
Changed in dulwich:
milestone: none → 0.7.0
Revision history for this message
Abderrahim Kitouni (akitouni) wrote :

I've tried to look at what git does but it seems to be too different from dulwich. What's missing in (my branch of) dulwich for now is a fast implementation of pack.py:create_delta but other than that, it's pretty much useable right now. (well, this means waiting for 1m30 for packing 1k objects).
What I was discussing in my first comment is how to make it faster while still being pure python. I've tried writing a C replacement, it's much faster but still too slow IMO. By splitting words, I got reasonable performance, and I think this is the best we can get (still a bit slow, but I guess a C implementation can do the rest).

All in all, what's implemented is :
* correct algorithm for packing (using Linus' heuristic).
* reusing deltas from existing packs (I'm not sure but it seems git reuses even the fact that the file was not a delta in the pack).
* various attempts at speeding things up.

I'd like to get a review on the implementation. And any ideas on how to get more speed.

http://github.com/abderrahim/dulwich/compare/master...pack

Jelmer Vernooij (jelmer)
Changed in dulwich:
milestone: 0.7.0 → none
Jelmer Vernooij (jelmer)
tags: added: performance
Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Unfortunately we can't use the C code from C git here, as it is GPLv2 only, whereas Dulwich is GPLv2 or later. :-(

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

I've attached an alternative branch to this bug - it's a fair bit simpler and seems to work properly well here.

It doesn't address two things that Abderrahim's patch does address:

 * Optimised version in C
 * Ability to write pack entries without delta'ing

It currently also lacks tests.

Jelmer Vernooij (jelmer)
Changed in dulwich:
status: Triaged → In Progress
assignee: nobody → Jelmer Vernooij (jelmer)
Jelmer Vernooij (jelmer)
Changed in dulwich:
assignee: Jelmer Vernooij (jelmer) → nobody
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.