parse amazon data

Bug #152793 reported by Aaron Swartz
Affects: Open Library
Status: Confirmed
Importance: Medium
Assigned to: Edward Betts
Milestone:

Bug Description

There are around 6M Amazon books now up at:

http://www.archive.org/details/amazon_crawl.catalog/

They should be parsed and eventually integrated. (Also, about another million have been added since the last time you grabbed the ISBNs from here.)

Aaron Swartz (aaronsw)
Changed in openlibrary:
assignee: nobody → edward-debian
importance: Undecided → High
milestone: none → launch
status: New → Confirmed
Edward Betts (edwardbetts) wrote :

The catalog.txt file contains duplicates, for example:

0002165163 1 Amazon.com: Spinner's yarn: Books: Ian Alexander Ross Peebles
0002165163 o-0 Amazon.com: Spinner's yarn: Books: Ian Alexander Ross Peebles
0002165171 1 Amazon.com: Memoirs: Books: Jean Monnet
0002165171 o-0 Amazon.com: Memoirs: Books: Jean Monnet
000216518X 1 Amazon.com: Media Mob: Books: George Melly
000216518X o-0 Amazon.com: Media Mob: Books: George Melly
000216521X 1 Amazon.com: Old Glory an American Voyage: Books: Johnathan Raban
000216521X o-0 Amazon.com: Old Glory an American Voyage: Books: Johnathan Raban
0002165252 1 404 - Document Not Found
0002165252 o-0 404 - Document Not Found
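
A minimal Python sketch of how these duplicates could be found. It assumes each catalog.txt line is whitespace-separated into an ISBN, a crawl identifier (such as 1 or o-0) and the page title, as in the sample above; the function names are hypothetical and the real file layout may differ.

```python
from collections import defaultdict


def parse_catalog_line(line):
    """Split one catalog line into (isbn, crawl_id, title).

    Assumes three whitespace-separated columns; the title keeps its spaces.
    """
    isbn, crawl_id, title = line.rstrip("\n").split(None, 2)
    return isbn, crawl_id, title


def group_by_isbn(lines):
    """Map each ISBN to the list of (crawl_id, title) entries seen for it."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        isbn, crawl_id, title = parse_catalog_line(line)
        groups[isbn].append((crawl_id, title))
    return groups


def duplicated_isbns(groups):
    """Return only the ISBNs that appear in more than one catalog entry."""
    return {isbn: entries for isbn, entries in groups.items() if len(entries) > 1}


def is_not_found(title):
    """True for entries whose title is an error page, as in the 404 rows."""
    return title.startswith("404")
```

Grouping by ISBN rather than dropping repeated lines keeps the crawl identifier of every copy, which would help when checking whether the two downloads actually differ.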

Aaron Swartz (aaronsw) wrote : Re: [Bug 152793] Re: parse amazon data

Hmm. The catalogs for 1 and o-0 are different, so some things must
have been downloaded twice by accident.

Edward Betts (edwardbetts) wrote :

Amazon parser is working. Got some more fields to add:

has_cover_img: boolean, done
amazon_availability: string, like "In Stock.", done
list_price, amazon_price, used_price: value in $
editorial_reviews: list
more_editorial_reviews: boolean
customer_review_count: int
average_customer_review: string
other_editions: list
statistically_improbable_phrases: list, done
capitalized_phrases: list, done
tags: list, done

Also lists of ISBNs and page numbers for books cited and books citing.
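
As a rough illustration, the fields listed above could be collected in a record like the following Python sketch. The class name, the isbn key, the defaults, the element type of the lists and the parse_price helper are all assumptions made for the example, not the parser's actual code.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List, Optional


@dataclass
class AmazonRecord:
    """Hypothetical container for the fields named in the comment above."""
    isbn: str  # assumed key; not part of the field list itself
    has_cover_img: bool = False
    amazon_availability: Optional[str] = None  # e.g. "In Stock."
    list_price: Optional[Decimal] = None       # value in $
    amazon_price: Optional[Decimal] = None
    used_price: Optional[Decimal] = None
    editorial_reviews: List[str] = field(default_factory=list)
    more_editorial_reviews: bool = False
    customer_review_count: Optional[int] = None
    average_customer_review: Optional[str] = None
    other_editions: List[str] = field(default_factory=list)
    statistically_improbable_phrases: List[str] = field(default_factory=list)
    capitalized_phrases: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)


def parse_price(text):
    """Turn a string such as "$12.95" into Decimal("12.95")."""
    return Decimal(text.strip().lstrip("$").replace(",", ""))
```

Decimal is used for the prices here so that dollar amounts are stored exactly; a plain float would also work if exactness does not matter.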

Aaron Swartz (aaronsw) wrote :

Isn't average customer review a float?

Edward Betts (edwardbetts) wrote :

Average customer review is a fixed-point number with one decimal place. It could be represented as a float.
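
For illustration only, two exact ways to hold a one-decimal-place rating in Python are a Decimal or an integer count of tenths. The function names are hypothetical, and the input is assumed to be just the number as text, such as "4.5".

```python
from decimal import Decimal


def rating_as_decimal(text):
    """Parse a rating string and fix it to one decimal place."""
    return Decimal(text).quantize(Decimal("0.1"))


def rating_as_tenths(text):
    """Store the same rating as an integer number of tenths (4.5 becomes 45)."""
    return int(rating_as_decimal(text) * 10)
```

A float is simpler and is probably fine for display, but most one-decimal values (4.1, for instance) have no exact binary representation, which matters if ratings are compared for equality.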

Changed in openlibrary:
importance: High → Medium