"Out of memory" error when pushing a large repository

Bug #813268 reported by Jacek Antonelli
This bug affects 1 person
Affects: Dulwich
Status: Fix Released
Importance: Medium
Assigned to: Jelmer Vernooij
Milestone: 0.7.2

Bug Description

When pushing a large repository, Dulwich may (depending on repository size and available system memory) crash with the following message:

  abort: out of memory
  fatal: write error: Broken pipe

I observed this when using hg-git, and a related issue has been filed against hg-git: <https://github.com/schacon/hg-git/issues/203>. However, I'm certain that this is an issue in Dulwich itself.

I'm not familiar enough with Dulwich's API to write a Dulwich-only script that reproduces the problem. But if you need a large repository to test with, you can use <https://github.com/jacek/viewer-development>, which was converted from <https://bitbucket.org/lindenlab/viewer-development/>. A word of warning: the conversion process took over 6 hours on my computer. The bug seems to occur while creating/pushing the Git packs, so perhaps you could reproduce it using only the already-converted Git repository.

I tracked the crash to the write_pack_data() function in dulwich/pack.py. In particular, the following code triggers it for me:

    recency = list(objects)
    # FIXME: Somehow limit delta depth
    # FIXME: Make thin-pack optional (its not used when cloning a pack)
    # Build a list of objects ordered by the magic Linus heuristic
    # This helps us find good objects to diff against us
    magic = []
    for obj, path in recency:
        magic.append( (obj.type_num, path, 1, -obj.raw_length(), obj) )
    magic.sort()
    # Build a map of objects and their index in magic - so we can find
    # preceeding objects to diff against
    offs = {}
    for i in range(len(magic)):
        offs[magic[i][4]] = i

Creating "recency" (the list copy of "objects", which is an ObjectStoreIterator) uses a great deal of memory if the repository has many objects (mine had over 150,000). Even if creating "recency" succeeds, creating the "magic" list and "offs" dict uses even more memory.

Amusingly, all that memory is gobbled up for no purpose. None of those lists or dicts are actually used, because the only code that used them is commented out:

        #for i in range(offs[o]-window, window):
        #    if i < 0 or i >= len(offs): continue
        #    b = magic[i][4]
        #    if b.type_num != orig_t: continue
        #    base = b.as_raw_string()
        #    delta = create_delta(base, raw)
        #    if len(delta) < len(winner):
        #        winner = delta
        #        t = 6 if magic[i][2] == 1 else 7

Removing or commenting out all the code in the first chunk I pasted above significantly reduces Dulwich's memory footprint and speeds up the pushing process. After removing that code, I was able to push the repository successfully without running out of memory. It also has no negative impact on Dulwich's behavior, since the results of that code weren't being used anyway.

In the short term, I'd recommend commenting out that code. In the long term, Dulwich should split up large repositories into several smaller packs, so that it doesn't use so much memory at once.

Revision history for this message
Jacek Antonelli (jacek-antonelli) wrote:

One more thing I forgot to mention: if the code is commented out, this line:

    for o, path in recency:

needs to be changed to this:

    for o, path in objects:
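
Putting the two changes together, the trimmed section behaves like this minimal, self-contained sketch (write_object and objects are placeholder names here, not Dulwich functions):

    def write_all_objects(f, objects, write_object):
        """Write each object straight from the iterator; with the
        recency/magic/offs bookkeeping gone, memory no longer grows with
        the number of objects in the pack."""
        count = 0
        for o, path in objects:     # iterate the object store directly
            write_object(f, o)      # existing per-object serialisation
            count += 1
        return count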

Revision history for this message
Jelmer Vernooij (jelmer) wrote:

It shouldn't be necessary to split up large repositories into multiple pack files - rather, the memory consumption of write_pack_data should not scale with the size of the pack that's written.

Have you tried running the dulwich testsuite with this change?
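
For illustration, a streaming approach along those lines might look like the following sketch (standalone code, not Dulwich's actual API or the committed fix; zlib compression stands in for the real pack encoding, and obj.type_num/obj.as_raw_string() are the accessors used in the quoted code above):

    import zlib

    def iter_pack_entries(objects):
        """Yield one compressed entry at a time instead of building
        intermediate lists, so peak memory stays constant."""
        for obj, path in objects:
            raw = obj.as_raw_string()   # only one object's data in memory
            yield obj.type_num, zlib.compress(raw)

    def write_entries(f, objects):
        count = 0
        for type_num, compressed in iter_pack_entries(objects):
            f.write(compressed)         # write and release each entry
            count += 1
        return count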

Jelmer Vernooij (jelmer)
Changed in dulwich:
status: New → Fix Committed
importance: Undecided → Medium
assignee: nobody → Jelmer Vernooij (jelmer)
milestone: none → 0.7.2
Jelmer Vernooij (jelmer)
Changed in dulwich:
status: Fix Committed → Fix Released