D-A-T Overview

migrate-upload-data uses lots of memory.

Bug #1189808 reported by Daniel Holbach on 2013-06-11

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	D-A-T Overview	In Progress	Medium	Andrew Starr-Bochicchio

Bug Description

I couldn't quite figure out why at a quick glance, but on a machine with little other memory usage, 1.5G of RAM and 2G of swap, this process gets killed.

Related branches

lp:~andrewsomething/dat-overview/lp1189808

Merged into lp:dat-overview at revision 34

Daniel Holbach: Approve on 2013-07-11

Andrew Starr-Bochicchio (andrewsomething) wrote on 2013-06-11:

Arg... Was this doing an initial import or just updating?

My quick offhand guess is that it's because I use Uploads.objects.bulk_create(bulk_insert). This speeds things up dramatically, as it adds a lot of rows to the database in one call instead of saving each row individually, but it also means everything is loaded into one variable. Maybe there's a way to do it in batches?

Daniel Holbach (dholbach) wrote on 2013-06-11:

It was an initial import.

Daniel Holbach (dholbach) wrote on 2013-06-11:

Maybe we could do something like this (pseudo-code):

bulk_insert = []
for row in cursor.fetchall():
    if row.index % 1000:
        Uploads.objects.bulk_create(bulk_insert)
    else:
        bulk_insert.append(row)

If the cursor.fetchall() call is actually problematic, we would need to slice the database results.

Not sure if the above isn't actually confusing. :-)
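
One possible way to slice the results is the DB-API fetchmany() call, which returns a limited number of rows per call instead of the whole result set. The following is only a sketch under assumptions: it uses an in-memory SQLite database so that it runs on its own, and the table name "upload", its columns, and the helper iter_rows are hypothetical. Whether fetchmany() really bounds memory depends on the database driver; some drivers still buffer the full result on the client unless a server-side cursor is used.

```python
import sqlite3


def iter_rows(cursor, chunk_size=1000):
    """Yield rows one at a time, fetching at most chunk_size rows per call."""
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        for row in rows:
            yield row


# Stand-in data: a hypothetical "upload" table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE upload (id INTEGER PRIMARY KEY, package TEXT)")
conn.executemany(
    "INSERT INTO upload (package) VALUES (?)",
    [("pkg%d" % i,) for i in range(2500)],
)

cursor = conn.execute("SELECT id, package FROM upload ORDER BY id")
row_count = sum(1 for _ in iter_rows(cursor, chunk_size=1000))
print(row_count)
```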

Andrew Starr-Bochicchio (andrewsomething) wrote on 2013-06-11:

Just dropping this here for my own reference:

http://thebuild.com/blog/2010/12/13/very-large-result-sets-in-django-using-postgresql/

Daniel Holbach (dholbach) wrote on 2013-06-11:

What I wrote above is wrong, we'd drop every 1000th upload. It's been a long day, but I guess you can see what I'm up to in my suggestion. :)
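
For reference, the batching idea might be written roughly as follows. This is a sketch, not the code that was merged: insert_in_batches is a hypothetical helper, and bulk_create here is any callable that accepts a list, so that Uploads.objects.bulk_create could be passed in. Every row is appended, the list is flushed once it reaches the batch size, and the leftover rows are flushed at the end so nothing is dropped.

```python
def insert_in_batches(rows, bulk_create, batch_size=1000):
    """Pass rows to bulk_create in lists of at most batch_size items.

    Returns the total number of rows handed over.
    """
    batch = []
    total = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            bulk_create(batch)
            total += len(batch)
            batch = []  # start a fresh list; the callee may keep the old one
    if batch:
        # Flush the final, partially filled batch.
        bulk_create(batch)
        total += len(batch)
    return total


# Stand-in for the database: record each batch that would be inserted.
received = []
total = insert_in_batches(range(2500), received.append, batch_size=1000)
batch_sizes = [len(b) for b in received]
print(total, batch_sizes)
```

This only helps if rows is itself produced lazily. If the rows come from cursor.fetchall(), the whole result set is still in memory before the first batch is written.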

Andrew Starr-Bochicchio (andrewsomething) on 2013-06-19

Changed in dat-overview:
importance:	Undecided → Medium
status:	New → Triaged

Andrew Starr-Bochicchio (andrewsomething) on 2013-07-10

Changed in dat-overview:
status:	Triaged → In Progress
assignee:	nobody → Andrew Starr-Bochicchio (andrewsomething)

Andrew Starr-Bochicchio (andrewsomething) wrote on 2013-07-17:

I tried to do some profiling of the memory usage of an initial import. The whole thing is at:

http://paste.ubuntu.com/5885892/

This in particular seems strange:

Filename: uploads/management/commands/migrate-upload-data.py

Line #    Mem usage     Increment   Line Contents
=================================================
    93                              @profile
    94   425.523 MB      0.000 MB   def email_to_lp(self, e):
    95  1612.133 MB   1186.609 MB       try:
    96  1612.133 MB      0.000 MB           lp_person = self.launchpad.people.getByEmail(email=e)
    97  1612.105 MB     -0.027 MB           lpid = lp_person.name
    98  1612.105 MB      0.000 MB       except:
    99   425.582 MB  -1186.523 MB           lpid = ''
   100  1612.133 MB   1186.551 MB       return lpid

Andrew Starr-Bochicchio (andrewsomething) wrote on 2013-08-13:

Some hints from our friends at OpenHatch:

We noticed the issue on migrate-upload-data memory consumption (https://bugs.launchpad.net/dat-overview/+bug/1189808) and have done some work on reducing the memory consumption (https://github.com/openhatch/oh-greenhouse/blob/master/greenhouse/uploads/management/commands/migrate-upload-data.py) that I think will be helpful for you.

The SQL you currently execute loads all of the data into memory. We changed the code to keep a current time pointer, execute queries that grab uploads later than the current time pointer, limit each query to a chunk size of 5000, and then update the time pointer for the next query. With a chunk of 5000 it's currently only eating about 55MB (30MB for the process and 25MB for the 5000 chunk query). This Stack Overflow post helped a lot: http://stackoverflow.com/questions/14144408/memory-efficient-constant-and-speed-optimized-iteration-over-a-large-table-in
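
The time-pointer approach described above could look something like the sketch below. It is not the OpenHatch code: it runs against an in-memory SQLite database, and the table "upload", the column "uploaded_at", and the helper iter_chunks are hypothetical. One assumption goes beyond the description: the row id is used as a tie-breaker, because with a strict "later than" comparison alone, rows sharing a timestamp across a chunk boundary could be skipped.

```python
import sqlite3


def iter_chunks(conn, chunk_size=5000):
    """Yield lists of (id, uploaded_at) rows, oldest first, chunk by chunk.

    Each query starts after the last row of the previous chunk, so only
    one chunk is held in memory at a time.
    """
    last_time, last_id = "", 0  # pointer to the last row already processed
    while True:
        rows = conn.execute(
            "SELECT id, uploaded_at FROM upload "
            "WHERE uploaded_at > ? OR (uploaded_at = ? AND id > ?) "
            "ORDER BY uploaded_at, id LIMIT ?",
            (last_time, last_time, last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        yield rows
        last_id, last_time = rows[-1]  # advance the pointer


# Stand-in data with many duplicate timestamps to exercise the tie-breaker.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE upload (id INTEGER PRIMARY KEY, uploaded_at TEXT)"
)
conn.executemany(
    "INSERT INTO upload (uploaded_at) VALUES (?)",
    [("2013-06-%02d" % (i % 28 + 1),) for i in range(12000)],
)

chunk_sizes = []
seen_ids = set()
for chunk in iter_chunks(conn, chunk_size=5000):
    chunk_sizes.append(len(chunk))
    seen_ids.update(row[0] for row in chunk)
print(chunk_sizes, len(seen_ids))
```

An index on the ordering columns would normally be needed for each chunk query to stay cheap on a large table.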
