migrate-upload-data uses lots of memory.

Bug #1189808 reported by Daniel Holbach on 2013-06-11
This bug affects 1 person
Affects: D-A-T Overview
Importance: Medium
Assigned to: Andrew Starr-Bochicchio

Bug Description

I couldn't quite figure out why at a quick glance, but on a machine with little else using memory (1.5G of RAM and 2G of swap), this process gets killed.

Arg... Was this doing an initial import or just updating?

My quick offhand guess is that it's because I use Uploads.objects.bulk_create(bulk_insert). This speeds things up dramatically, as it allows adding a lot of rows to the database in one call instead of saving each row individually, but it also means every row is held in a single list in memory first. Maybe there's a way to do it in batches?

Daniel Holbach (dholbach) wrote :

It was an initial import.

Daniel Holbach (dholbach) wrote :

Maybe we could do something like this (pseudo-code):

bulk_insert = []
for row in cursor.fetchall():
    if row.index % 1000:
        Uploads.objects.bulk_create(bulk_insert)
    else:
        bulk_insert.append(row)

If the cursor.fetchall() call is actually problematic, we would need to slice the database results.

Not sure if the above isn't actually confusing. :-)

Daniel Holbach (dholbach) wrote :

What I wrote above is wrong, we'd drop every 1000th upload. It's been a long day, but I guess you can see what I'm up to in my suggestion. :)
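
For reference, a minimal sketch of the batching idea that does not drop rows. The names `batched` and `insert_in_batches` are hypothetical, and `bulk_create` here is any callable that accepts a list, standing in for Uploads.objects.bulk_create. It assumes the rows can be iterated one at a time (for example by looping over the cursor instead of calling fetchall(), where the database driver supports that), so only one batch is held in memory at once.

```python
from itertools import islice


def batched(rows, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_in_batches(rows, bulk_create, size=1000):
    """Call bulk_create once per batch and return the number of rows handled.

    `bulk_create` is a stand-in for Uploads.objects.bulk_create; the final,
    partially filled batch is flushed as well, so no row is skipped.
    """
    total = 0
    for batch in batched(rows, size):
        bulk_create(batch)
        total += len(batch)
    return total


if __name__ == "__main__":
    calls = []  # records each batch instead of touching a database
    total = insert_in_batches(range(2500), calls.append, size=1000)
    print(total, [len(c) for c in calls])
```

This only bounds the size of the insert list; if fetchall() itself is the problem, the query results still need to be read in slices.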

Changed in dat-overview:
importance: Undecided → Medium
status: New → Triaged
Changed in dat-overview:
status: Triaged → In Progress
assignee: nobody → Andrew Starr-Bochicchio (andrewsomething)

I tried to do some profiling of the memory usage of an initial import. The whole thing is at:

http://paste.ubuntu.com/5885892/

This in particular seems strange:

Filename: uploads/management/commands/migrate-upload-data.py

Line #    Mem usage     Increment   Line Contents
=================================================
    93                              @profile
    94   425.523 MB      0.000 MB   def email_to_lp(self, e):
    95  1612.133 MB   1186.609 MB       try:
    96  1612.133 MB      0.000 MB           lp_person = self.launchpad.people.getByEmail(email=e)
    97  1612.105 MB     -0.027 MB           lpid = lp_person.name
    98  1612.105 MB      0.000 MB       except:
    99   425.582 MB  -1186.523 MB           lpid = ''
   100  1612.133 MB   1186.551 MB       return lpid

Some hints from our friends at OpenHatch:

We noticed the issue on migrate-upload-data memory consumption (https://bugs.launchpad.net/dat-overview/+bug/1189808) and have done some work on reducing the memory consumption (https://github.com/openhatch/oh-greenhouse/blob/master/greenhouse/uploads/management/commands/migrate-upload-data.py) that I think will be helpful for you.

The SQL you currently execute loads all of the data into memory. We changed the code to keep a current time pointer, execute queries that grab uploads later than that pointer, limit each query to a chunk size of 5000, and then update the time pointer for the next query. With a chunk of 5000 it's currently only eating about 55MB (30MB for the process and 25MB for the 5000 chunk query). This Stack Overflow post helped a lot: http://stackoverflow.com/questions/14144408/memory-efficient-constant-and-speed-optimized-iteration-over-a-large-table-in
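
A rough sketch of the time-pointer approach described above, not the actual OpenHatch code. It uses an in-memory sqlite3 table so it can be run standalone; the table name `uploads`, its columns `id` and `timestamp`, and the function name `iter_upload_chunks` are all assumptions made for illustration. One addition beyond the description: the pointer also tracks the row id, so that rows sharing the same timestamp at a chunk boundary are not skipped.

```python
import sqlite3


def iter_upload_chunks(conn, chunk_size=5000):
    """Yield lists of (id, timestamp) rows, oldest first, one chunk at a time."""
    last_seen = None  # pointer: (timestamp, id) of the last row handed out
    while True:
        if last_seen is None:
            rows = conn.execute(
                "SELECT id, timestamp FROM uploads "
                "ORDER BY timestamp, id LIMIT ?",
                (chunk_size,),
            ).fetchall()
        else:
            # Only fetch rows strictly after the pointer; the id comparison
            # breaks ties between rows with an identical timestamp.
            rows = conn.execute(
                "SELECT id, timestamp FROM uploads "
                "WHERE timestamp > ? OR (timestamp = ? AND id > ?) "
                "ORDER BY timestamp, id LIMIT ?",
                (last_seen[0], last_seen[0], last_seen[1], chunk_size),
            ).fetchall()
        if not rows:
            return
        yield rows
        # Advance the pointer to the last row of this chunk.
        last_seen = (rows[-1][1], rows[-1][0])


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE uploads (id INTEGER PRIMARY KEY, timestamp TEXT)")
    conn.executemany(
        "INSERT INTO uploads (timestamp) VALUES (?)",
        [("2013-06-%02d" % (i // 4 + 1),) for i in range(12)],
    )
    for chunk in iter_upload_chunks(conn, chunk_size=5):
        print(len(chunk), chunk[0], chunk[-1])
```

Each chunk can be processed and discarded before the next query runs, so memory use stays roughly proportional to the chunk size rather than to the size of the table.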
