Crawl LC catalog for 2007 books

Bug #277512 reported by Karen Coyle
Affects: Open Library
Status: New
Importance: Undecided
Assigned to: Edward Betts
Milestone: (none set)

Bug Description

The original LC file contained books through the end of 2006. The weekly subscription covers 2008. To obtain the LC records for 2007, we should be able to crawl their catalog.

BACKGROUND:
Every LC book has a Library of Congress Control Number (LCCN). Since 2001, the first four digits represent the year, so the records for 2007 will all begin with '2007'. This is followed by a 6-digit serial number. We can assume that these begin at '000001' and go forward. There will be about 350,000 books for the year.
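
A minimal sketch of how the candidate numbers could be enumerated, assuming the serials really do run upward from '000001' with no other structure. The function name candidate_lccns and its parameters are made up for illustration; the upper bound is simply the largest 6-digit serial, not a known limit.

```python
from itertools import islice


def candidate_lccns(year=2007, start=1, stop=999_999):
    """Yield candidate LCCNs: 4-digit year followed by a zero-padded 6-digit serial."""
    for serial in range(start, stop + 1):
        yield f"{year:04d}{serial:06d}"


# Peek at the first few candidates without generating the whole range.
first_three = list(islice(candidate_lccns(), 3))
print(first_three)  # ['2007000001', '2007000002', '2007000003']
```

Not every candidate will correspond to a real record, so a crawl has to expect gaps.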

METHODS:
We can use either Z39.50, requesting the MARC output based on the record number, or we may be able to use the LC stable URL for each book: http://lccn.loc.gov/2007999999. I'm not sure how to request the MARC record through the latter, but we can experiment or ask.

Edward Betts (edwardbetts) wrote :

Looks like we can get MARC XML. For example: http://lccn.loc.gov/2007000001/marcxml

This should be relatively easy to load.
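
A rough sketch of fetching one record this way, using only the Python standard library. The URL pattern is the one from the example above; the helper names marcxml_url and fetch_marcxml are hypothetical, and no error handling is attempted here.

```python
import urllib.request

BASE_URL = "http://lccn.loc.gov"


def marcxml_url(lccn):
    """Build the MARC XML permalink for an LCCN, following the example URL above."""
    return f"{BASE_URL}/{lccn}/marcxml"


def fetch_marcxml(lccn, timeout=30):
    """Download the raw MARC XML bytes for one LCCN (makes a network request)."""
    with urllib.request.urlopen(marcxml_url(lccn), timeout=timeout) as resp:
        return resp.read()


print(marcxml_url("2007000001"))  # http://lccn.loc.gov/2007000001/marcxml
```

Calling fetch_marcxml("2007000001") would return the response body as bytes, which could then be handed to a MARC XML parser.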

Jeff Suttor (jeff-suttor) wrote :

MARCXML is available using LCCN Permalinks, http://lccn.loc.gov/#n9

to test, a simple lc_crawl.py was used to get the first 1k records for 2007. results:

* pymarc used to parse MARCXML and convert to MARC21

* 88 records returned either:
   * <error xmlns:marc="http://www.loc.gov/MARC21/slim">record not found</error>
   * <error xmlns:marc="http://www.loc.gov/MARC21/slim">Temporarily Unavailable.<a href="http://lcweb2.loc.gov/lccn/2007######">Retry</a></error>

* crawl rate needs to be throttled to 1 request per 2 seconds, otherwise the lccn server returns 500 Server Error responses for the next several requests
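
The two error bodies listed above could be told apart from real records before any MARC parsing, roughly as follows. This is a sketch only: the classify function and its return labels are invented for illustration, and it assumes the error responses always look like the two samples quoted above.

```python
import xml.etree.ElementTree as ET


def classify(body):
    """Label a response body as 'record', 'not_found', 'retry', 'error' or 'unparseable'."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return "unparseable"
    # The sample error documents use an unprefixed <error> root element.
    if root.tag == "error":
        text = (root.text or "").strip().lower()
        if text.startswith("record not found"):
            return "not_found"
        if text.startswith("temporarily unavailable"):
            return "retry"
        return "error"
    # Anything else is assumed to be a MARC XML record.
    return "record"


not_found = '<error xmlns:marc="http://www.loc.gov/MARC21/slim">record not found</error>'
unavailable = (
    '<error xmlns:marc="http://www.loc.gov/MARC21/slim">Temporarily Unavailable.'
    '<a href="http://lcweb2.loc.gov/lccn/2007######">Retry</a></error>'
)
print(classify(not_found), classify(unavailable))  # not_found retry
```

Responses labelled 'retry' can be queued for another attempt, while 'not_found' numbers can be skipped for good.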

if this is of value, the script can be enhanced:

  * better HTTP error recovery, e.g. retry
  * better logging
  * explicit User-Agent: for transparent crawl, e.g. URI to this bug

and a larger test run.
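
A possible shape for those enhancements, sketched with the standard library only. It is not the actual lc_crawl.py: the names http_get and fetch_with_retry, the User-Agent string, the bug URL in it, and the backoff schedule are all assumptions. The 2 second base delay comes from the throttling observation above.

```python
import logging
import time
import urllib.error
import urllib.request

log = logging.getLogger("lc_crawl")

# Assumed User-Agent; the bug URL is a guess at this report's address.
USER_AGENT = "lc_crawl.py (see https://bugs.launchpad.net/bugs/277512)"


def http_get(url, timeout=30):
    """Fetch a URL with an explicit User-Agent header and return the body bytes."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def fetch_with_retry(url, get=http_get, sleep=time.sleep, delay=2.0, max_tries=4):
    """Return the body, or None if every attempt fails.

    Sleeps `delay` seconds after a success (throttle) and backs off
    exponentially after a failure (delay, 2*delay, 4*delay, ...).
    """
    for attempt in range(max_tries):
        try:
            body = get(url)
        except (urllib.error.URLError, TimeoutError) as exc:
            # HTTPError (e.g. a 500 response) is a subclass of URLError.
            wait = delay * 2 ** attempt
            log.warning("attempt %d for %s failed: %s; waiting %.0fs",
                        attempt + 1, url, exc, wait)
            sleep(wait)
            continue
        sleep(delay)
        return body
    log.error("giving up on %s after %d attempts", url, max_tries)
    return None
```

The get and sleep parameters are injectable so the loop can be exercised without touching the network. At one request every 2 seconds, about 350,000 records would take roughly 8 days, before counting retries.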

lccn permalink faq, http://lccn.loc.gov/#5 (#5-7) indicates that non-Roman data, authority records and some misc records are not currently available with lccn permalinks. relevant?

Changed in openlibrary:
assignee: nobody → Edward Betts (edwardbetts)