MARCXML is available using LCCN Permalinks, http://lccn.loc.gov/#n9
to test, a simple lc_crawl.py was used to get the first 1k records for 2007. results:
* pymarc used to parse MARCXML and convert to MARC21
* 88 records returned either:
  * <error xmlns:marc="http://www.loc.gov/MARC21/slim">record not found</error>
  * <error xmlns:marc="http://www.loc.gov/MARC21/slim">Temporarily Unavailable.<a href="http://lcweb2.loc.gov/lccn/2007######"> Retry</a></error>
* crawl rates need to be throttled to 1 req/2 sec or lccn server returns 500 Server Error responses for the next several requests
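
a minimal sketch of how the two error bodies quoted above could be told apart from a real record, using only the standard library. the function name `classify` and its return labels are hypothetical and not taken from lc_crawl.py; it also assumes a successful response has a root element `record` or `collection` in the MARC21 slim namespace.

```python
import xml.etree.ElementTree as ET

# Namespace that appears in the error bodies quoted in this comment.
MARC_NS = "http://www.loc.gov/MARC21/slim"


def classify(body):
    """Label a permalink response body (hypothetical helper).

    Returns 'record', 'not_found', 'unavailable', or 'unknown'.
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return "unknown"

    # The error element itself is unprefixed, so its tag is plain 'error'.
    if root.tag == "error":
        text = (root.text or "").strip().lower()
        if text.startswith("record not found"):
            return "not_found"
        if text.startswith("temporarily unavailable"):
            return "unavailable"
        return "unknown"

    # Assumption: a good response is a MARCXML record or collection.
    if root.tag in ("{%s}record" % MARC_NS, "{%s}collection" % MARC_NS):
        return "record"
    return "unknown"
```

a "temporarily unavailable" result is probably worth a retry, while "record not found" is probably not.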
if this is of value, the script can be enhanced:
* better HTTP error recovery, e.g. retry
* better logging
* explicit User-Agent: for transparent crawl, e.g. URI to this bug
and a larger test run.
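
the enhancements listed above could look roughly like this. it is a sketch under assumptions, not the actual lc_crawl.py: the names `crawl`, `fetch` and `USER_AGENT`, the doubling backoff, and the retry count are all hypothetical, and the User-Agent value is a placeholder. only the 2 second delay comes from the test run described here.

```python
import logging
import time
import urllib.error
import urllib.request

log = logging.getLogger("lc_crawl")

# Placeholder: replace with a URI that identifies the crawl, e.g. this bug.
USER_AGENT = "lc_crawl-sketch/0.1 (+URI-of-this-bug)"


def fetch(url, timeout=30):
    """Fetch one URL with an explicit User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def crawl(urls, fetch=fetch, delay=2.0, retries=3, sleep=time.sleep):
    """Fetch each URL at most once per `delay` seconds, retrying on errors.

    Returns a dict mapping url -> response bytes, or None if every
    attempt failed. `fetch` and `sleep` are injectable for testing.
    """
    results = {}
    for url in urls:
        for attempt in range(1, retries + 1):
            try:
                results[url] = fetch(url)
                break
            except urllib.error.URLError as exc:  # HTTPError (e.g. 500) is a subclass
                wait = delay * 2 ** attempt  # assumed backoff: 4s, 8s, 16s, ...
                log.warning("attempt %d for %s failed: %s; waiting %.0fs",
                            attempt, url, exc, wait)
                sleep(wait)
        else:
            # All attempts failed; record the miss and move on.
            log.error("giving up on %s after %d attempts", url, retries)
            results[url] = None
        sleep(delay)  # throttle: roughly 1 request per 2 seconds
    return results
```

passing `fetch` and `sleep` as parameters lets the retry logic be tried with a fake fetcher, without touching the lccn server.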
lccn permalink faq, http://lccn.loc.gov/#5 (#5-7) indicates that non-Roman data, authority records and some misc records are not currently available with lccn permalinks. relevant?