Produce Regular Data Dumps - full & incremental

Bug #128399 reported by Aaron Swartz
This bug affects 1 person
Affects: Open Library
Status: Confirmed
Importance: Medium
Assigned to: Anand Chitipothu

Bug Description

We need regular dumps of our database. Many people have been complaining about this.

Aaron Swartz (aaronsw)
Changed in openlibrary:
assignee: nobody → anandology
importance: Undecided → Medium
status: New → Confirmed
Anand Chitipothu (anandology) wrote :

Are you talking about the dump.txt text file in the repository?

Aaron Swartz (aaronsw) wrote : Re: [Bug 128399] Re: data dump export

> Are you talking about the dump.txt text file in the repository?

No, people want dumps of all the data we've imported, not really the templates.

Aaron Swartz (aaronsw) wrote : Re: data dump export

Assigning this to Edward. Edward, can you make sure some tdb log files or something like that are available as a result of the next import you do?

Changed in openlibrary:
assignee: anandology → edward-debian
Anand Chitipothu (anandology) wrote :

I have added an action to dump all the data, but it was taking far too much time.
I think we also need to generate a fresh tdb.log for the entire database once. I'll file a new bug for that.

Anand Chitipothu (anandology) wrote :

jsondumps for all editions and authors are generated and kept here: http://openlibrary.org/static/jsondump

Need to set up a cron job to generate them every week.

Changed in openlibrary:
assignee: edward-debian → anandology
solrize (solrize) wrote :

Should also include books.json.gz (the expanded dump).

Changed in openlibrary:
milestone: 1.0 → 1.7
Edward Betts (edwardbetts) wrote :

We should make this run daily.

Changed in openlibrary:
assignee: Anand Chitipothu (anandology) → Edward Betts (edwardbetts)
George (george-archive) wrote :

2010-03-29
---------------

I'm planning to generate 3 types of dumps with OL data.

Open Library Dump:
     description: Latest revisions of all documents
     filename: ol_dump_${date}.txt.gz
     columns: key, type, revision, json
     frequency: monthly
     sort-order: unspecified

Open Library Complete Dump:
     description: All revisions of all documents
     filename: ol_cdump_${date}.txt.gz
     columns: key, type, revision, json
     frequency: monthly
     sort-order: unspecified

Open Library Incremental Dump:
     description: All revisions of all documents modified in a given day
     filename: ol_idump_${date}.txt.gz
     columns: key, type, revision, json
     frequency: daily
     sort-order: modification time

Each of these dumps will be stored as an item in the Internet Archive
cluster.
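
For anyone planning to consume these files, here is a minimal Python sketch of a reader for the four-column layout described above. It assumes one record per line with tab-separated columns; the delimiter is not stated in this proposal, so treat that as an assumption. The names `DumpRecord` and `read_dump` are hypothetical and not part of the Open Library code.

```python
import gzip
import json
from typing import Iterator, NamedTuple


class DumpRecord(NamedTuple):
    """One row of a dump: key, type, revision, json (hypothetical helper type)."""
    key: str
    type: str
    revision: int
    data: dict


def read_dump(path: str) -> Iterator[DumpRecord]:
    """Yield one DumpRecord per non-empty line of a gzipped dump file.

    Assumes tab-separated columns in the order key, type, revision, json.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split at most three times so the JSON column stays intact.
            key, type_, revision, raw_json = line.split("\t", 3)
            yield DumpRecord(key, type_, int(revision), json.loads(raw_json))
```

Because the function is a generator, a full dump can be processed line by line without loading the whole file into memory.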

URL Format:

Even though these files are stored in IA, there will be an
openlibrary.org/* URL for each file.

I'm considering the following two URL formats.

Option#1:
http://openlibrary.org/dumps/ol_dump_2010-03-31.txt.gz
http://openlibrary.org/dumps/ol_cdump_2010-03-31.txt.gz
http://openlibrary.org/dumps/ol_idump_2010-03-31.txt.gz

Option#2:
http://openlibrary.org/dumps/2010/03/ol_dump_2010-03-31.txt.gz
http://openlibrary.org/dumps/2010/03/ol_cdump_2010-03-31.txt.gz
http://openlibrary.org/dumps/2010/03/ol_idump_2010-03-31.txt.gz
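
To make the difference between the two options concrete, here is a small Python sketch that builds both URL forms from a dump kind and a date. The function names and the `BASE` constant are hypothetical; only the filename pattern and the example URLs come from the proposal above.

```python
from datetime import date

# Base path taken from the example URLs in the proposal.
BASE = "http://openlibrary.org/dumps"

# "dump" = latest revisions, "cdump" = complete, "idump" = incremental.
KINDS = ("dump", "cdump", "idump")


def dump_filename(kind: str, day: date) -> str:
    """Return the filename ol_<kind>_<YYYY-MM-DD>.txt.gz."""
    if kind not in KINDS:
        raise ValueError(f"unknown dump kind: {kind!r}")
    return f"ol_{kind}_{day.isoformat()}.txt.gz"


def flat_url(kind: str, day: date) -> str:
    """Option #1: every file directly under /dumps/."""
    return f"{BASE}/{dump_filename(kind, day)}"


def nested_url(kind: str, day: date) -> str:
    """Option #2: files grouped under /dumps/<year>/<month>/."""
    return f"{BASE}/{day:%Y/%m}/{dump_filename(kind, day)}"
```

With daily incremental dumps, Option #1 puts hundreds of files per year in one listing, while Option #2 keeps each month's files together.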

Which one should we pick?

Any other suggestions/feedback?

Anand

Changed in openlibrary:
milestone: 1.7 → upstream-to-www
assignee: Edward Betts (edwardbetts) → Anand Chitipothu (anandology)
importance: Medium → Critical
George (george-archive) wrote :

Suggested location - http://openlibrary.org/data/ *

Also - are you suggesting that you'll stop producing edition/author dumps separately? Or a Works dump of some sort, separately?

George (george-archive)
summary: - data dump export
+ Produce Regular Data Dumps - full & incremental
George (george-archive) wrote :

Brewster asks: how about being less database-like and using our JSON logs instead?
Periodically, compress the JSON log and put it into an archive item.

Changed in openlibrary:
milestone: upstream-to-www → general-bucket
bradymatthews (brady-matthews) wrote :

Any updates on when a dump might happen?

George (george-archive) wrote :

Anand?

Anand Chitipothu (anandology) wrote : Re: [Bug 128399] Re: Produce Regular Data Dumps - full & incremental

Will be ready by end of this week.

Anand Chitipothu (anandology) wrote :
bradymatthews (brady-matthews) wrote :

Thanks! :D

bradymatthews (brady-matthews) wrote :

I'm looking through the works data, and I'm seeing many, many entries for Tom Sawyer. Is there an identifier in there somewhere that I can use to only get the original, or is that something I'll need to write a bot to try to figure out?

Anand Chitipothu (anandology) wrote :

Found a bug in the dump generation. Will regenerate a new dump shortly.

Anand Chitipothu (anandology) wrote :

Regenerated the dumps and uploaded them to http://www.archive.org/details/ol_dump_2010-04-30.

George (george-archive) wrote :

This is finished now, right?

Or, do you still publish the dumps manually?

George (george-archive) wrote :

Anand - is this done now?

George (george-archive) wrote :

Anand - we were under the impression (perhaps incorrectly) that you are making daily dumps. If that's not right, can you confirm for us what's happening?

Changed in openlibrary:
milestone: general-bucket → stability-july-28
Anand Chitipothu (anandology) wrote :

On 01-Jul-10, at 3:26 AM, George wrote:

> Anand - we were under the impression (perhaps incorrectly) that you are
> making daily dumps. If that's not right, can you confirm for us what's
> happening?

No, daily dumps are not happening right now. Only monthly dumps are
happening.

http://www.archive.org/details/ol_exports

Anand Chitipothu (anandology) wrote :

Uploading the dumps automatically to archive.org is taken care of now.

Generation of monthly dumps is completely in place. Need to work on generating daily incremental dumps. Reducing the priority, as that is not very critical.

Changed in openlibrary:
importance: Critical → Medium
George (george-archive) wrote :

Great stuff, Anand. Agreed about the priority.
