migrate-upload-data uses lots of memory.
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| D-A-T Overview | In Progress | Medium | Andrew Starr-Bochicchio | |
Bug Description
I couldn't quite figure out why from a quick glance, but on a machine with little other memory in use (1.5G of RAM and 2G of swap), this process gets killed.
Related branches
- Daniel Holbach: Approve on 2013-07-11
- Diff: 60 lines (+22/-5), 1 file modified: overview/uploads/management/commands/migrate-upload-data.py (+22/-5)
Daniel Holbach (dholbach) wrote (#2):
It was an initial import.
Daniel Holbach (dholbach) wrote (#3):
Maybe we could do something like this (pseudo-code):

bulk_insert = []
for row in cursor.fetchall():
    if row.index % 1000:
        bulk_insert.append(row)
    else:
        Uploads.objects.bulk_create(bulk_insert)
        bulk_insert = []

If the cursor.fetchall() call is actually problematic, we would need to slice the database results.
Not sure if the above isn't actually confusing. :-)
Just dropping this here for my own reference:
http://
Daniel Holbach (dholbach) wrote (#5):
What I wrote above is wrong, we'd drop every 1000th upload. It's been a long day, but I guess you can see what I'm up to in my suggestion. :)
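(For reference, a version of that idea which flushes every N rows without skipping any might look like the sketch below. `Uploads` and `cursor` come from the thread above; `row_to_upload` is a made-up placeholder for however the command turns a raw database row into a model instance.)

```python
# Sketch only: append every row first, then flush the buffer once it
# reaches CHUNK_SIZE, so no upload is skipped.
CHUNK_SIZE = 1000

bulk_insert = []
for row in cursor.fetchall():
    # row_to_upload() is a placeholder for building an Uploads instance
    # from a raw database row.
    bulk_insert.append(row_to_upload(row))
    if len(bulk_insert) >= CHUNK_SIZE:
        Uploads.objects.bulk_create(bulk_insert)
        bulk_insert = []

# Flush whatever is left over after the loop.
if bulk_insert:
    Uploads.objects.bulk_create(bulk_insert)
```

Note that cursor.fetchall() still pulls the whole result set into memory; this only caps the size of the list handed to bulk_create, which is what the later comments get at.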
Changed in dat-overview:
importance: Undecided → Medium
status: New → Triaged

Changed in dat-overview:
status: Triaged → In Progress
assignee: nobody → Andrew Starr-Bochicchio (andrewsomething)
I tried to do some profiling of the memory usage of an initial import. The whole thing is at:
http://
This in particular seems strange:
Filename: uploads/

Line #    Mem usage    Increment   Line Contents
================================================
    93                             @profile
    94    425.523 MB     0.000 MB  def email_to_lp(self, e):
    95   1612.133 MB  1186.609 MB      try:
    96   1612.133 MB     0.000 MB          lp_person = self.launchpad.
    97   1612.105 MB    -0.027 MB          lpid = lp_person.name
    98   1612.105 MB     0.000 MB      except:
    99    425.582 MB -1186.523 MB          lpid = ''
   100   1612.133 MB  1186.551 MB      return lpid
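(The table above appears to be output from the memory_profiler package. For anyone wanting to reproduce that kind of measurement, a minimal, self-contained example, assuming memory_profiler is installed, is:)

```python
# pip install memory-profiler
# Running this script prints a per-line "Mem usage / Increment" table
# like the one above for the decorated function.
from memory_profiler import profile


@profile
def build_big_list():
    # Allocate something sizeable so the Increment column shows a jump.
    data = [str(i) * 10 for i in range(10 ** 6)]
    return len(data)


if __name__ == "__main__":
    build_big_list()
```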
Some hints from our friends at OpenHatch:
We noticed the issue with migrate-upload-data memory consumption (https:/
The SQL you currently execute loads all of the data into memory. We changed the code to keep a current-time pointer, execute queries that grab uploads later than that pointer, limit the result to a chunk size of 5000, and then update the pointer for the next query. With a chunk of 5000 it's currently only eating about 55MB (30MB for the process and 25MB for the 5000-row chunk query). This Stack Overflow post helped a lot: http://
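(A sketch of that approach, with made-up table and column names — `upload_history`, `date`, `source`, `version` — standing in for whatever the real source schema uses:)

```python
# Sketch: keep a pointer to the timestamp of the last row processed,
# fetch at most CHUNK_SIZE newer rows, yield them, advance the pointer,
# repeat.  Only one chunk is ever held in memory at a time.
CHUNK_SIZE = 5000

def fetch_in_chunks(cursor, start_time):
    current_time = start_time
    while True:
        cursor.execute(
            "SELECT date, source, version FROM upload_history "
            "WHERE date > %s ORDER BY date LIMIT %s",
            (current_time, CHUNK_SIZE),
        )
        rows = cursor.fetchall()
        if not rows:
            break
        yield rows
        # The date column is selected first, so rows[-1][0] is the
        # timestamp of the last row in this chunk.
        current_time = rows[-1][0]
```

One wrinkle with a strict `date > pointer` condition: rows that share the boundary timestamp but fall past the LIMIT would be skipped, so ordering on (or also comparing) a unique column is safer in practice.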
Arg... Was this doing an initial import or just updating?
My quick offhand guess is that it's because I use Uploads.objects.bulk_create(bulk_insert). This speeds things up dramatically, as it adds a lot of rows to the database while only calling save once instead of saving each row individually, but it also means everything is loaded into one variable. Maybe there's a way to do it in batches?
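(If that list is indeed the problem, one small mitigation in newer Django versions (1.5+) is the `batch_size` argument to `bulk_create`. It only splits the INSERT statements, though; the full Python list is still built, so the chunked-flush approach sketched earlier is what actually bounds memory.)

```python
# Sketch: Django 1.5+ can split the insert itself into batches of 5000.
# This keeps each query small, but bulk_insert is still one big list,
# so it helps the database more than the importer's own footprint.
Uploads.objects.bulk_create(bulk_insert, batch_size=5000)
```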