MARC import isn't picking up URLs

Bug #151151 reported by Aaron Swartz
Affects | Status | Importance | Assigned to | Milestone
Open Library | Confirmed | Medium | Edward Betts |

Bug Description

Also, I would like to pick up these fields that have URLs, if we can. Here's an example:

856 42 $3Publisher description
$uhttp://www.loc.gov/catdir/description/dover031/00022415.html

This should be stored so that we can generate a link out of it like:

<a href="http://www.loc.gov/catdir/description/dover031/00022415.html">Publisher description</a>
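
A minimal sketch of how the stored subfields could be turned into that link. This is not Open Library's import code: the function names `parse_subfields` and `link_from_856` are hypothetical, and it assumes the field arrives as text using the same `$`-prefixed subfield notation as the example above, with `$u` holding the URL and `$3` the label.

```python
from html import escape


def parse_subfields(field_text):
    """Split '$3Publisher description $uhttp://...' into (code, value) pairs.

    Assumption: '$' only ever introduces a subfield. A URL that itself
    contains '$' would be split incorrectly by this simple approach.
    """
    pairs = []
    for chunk in field_text.split("$")[1:]:
        if chunk:
            pairs.append((chunk[0], chunk[1:].strip()))
    return pairs


def link_from_856(field_text, default_label="Link"):
    """Return an HTML anchor for an 856 field, or None if it has no $u."""
    subfields = parse_subfields(field_text)
    url = next((value for code, value in subfields if code == "u"), None)
    if url is None:
        return None
    label = next((value for code, value in subfields if code == "3"), default_label)
    # Escape both parts so record data cannot inject markup into the page.
    return '<a href="%s">%s</a>' % (escape(url, quote=True), escape(label))


if __name__ == "__main__":
    print(link_from_856(
        "$3Publisher description "
        "$uhttp://www.loc.gov/catdir/description/dover031/00022415.html"
    ))
```

Fields without a `$3` fall back to a generic label here; whether that is the right default is a design choice left open.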

But more importantly, we should generate a list of these URLs and start crawling them so we can integrate the data ourselves.

Edward, how hard would it be to dump a list of 856 URLs from the LC data?
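
For reference, a rough sketch of what such a dump could look like. It assumes the LC data is available as binary MARC 21 (ISO 2709) records, which this report does not confirm, and the names `iter_records`, `urls_from_record` and `all_856_urls` are hypothetical. It walks each record's directory, picks out the 856 fields, and collects their `$u` subfields.

```python
FIELD_END = b"\x1e"   # field terminator
SUBFIELD = b"\x1f"    # subfield delimiter
LEADER_LEN = 24
ENTRY_LEN = 12        # directory entry: tag (3) + length (4) + start (5)


def iter_records(data):
    """Yield each raw record from a bytes blob of concatenated MARC records."""
    pos = 0
    while pos + LEADER_LEN <= len(data):
        length = int(data[pos:pos + 5].decode("ascii"))
        if length < LEADER_LEN:
            break  # malformed length; stop rather than loop forever
        yield data[pos:pos + length]
        pos += length


def urls_from_record(record):
    """Return the $u values of every 856 field in one raw record."""
    base = int(record[12:17].decode("ascii"))  # base address of field data
    directory = record[LEADER_LEN:base - 1]    # drop the directory terminator
    urls = []
    for i in range(0, len(directory), ENTRY_LEN):
        entry = directory[i:i + ENTRY_LEN]
        if entry[0:3] != b"856":
            continue
        length = int(entry[3:7].decode("ascii"))
        start = int(entry[7:12].decode("ascii"))
        field = record[base + start:base + start + length]
        # The first chunk holds the two indicators; subfields follow it.
        for sub in field.rstrip(FIELD_END).split(SUBFIELD)[1:]:
            if sub[:1] == b"u":
                urls.append(sub[1:].decode("utf-8", "replace").strip())
    return urls


def all_856_urls(data):
    """Yield every 856 $u URL found in a blob of MARC records."""
    for record in iter_records(data):
        for url in urls_from_record(record):
            yield url
```

Reading a file with `open(path, "rb").read()` and passing the bytes to `all_856_urls` would print one URL per loop iteration; for a very large dump, streaming record by record would be a better fit than loading everything at once.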

Tags: marc
Aaron Swartz (aaronsw)
Changed in openlibrary:
assignee: nobody → daniel-mybuttocks
importance: Undecided → Medium
status: New → Confirmed
Aaron Swartz (aaronsw)
Changed in openlibrary:
assignee: daniel-mybuttocks → edward-debian
Aaron Swartz (aaronsw)
description: updated
Changed in openlibrary:
milestone: none → 1.1
Revision history for this message
Edward Betts (edwardbetts) wrote :

I've just written some code to output all the URLs in the LC import; it is running now and should finish in a few hours.

Edward Betts (edwardbetts) wrote :

The URL scan finished. It found 147 URLs. Here they are:

http://nrs.harvard.edu/urn-3:FHCL:452603
http://hdl.loc.gov/loc.gdc/lhbtn.00087
http://nrs.harvard.edu/urn-3:FHCL:123454
http://nrs.harvard.edu/urn-3:GSE.LIBR:460212
http://name.umdl.umich.edu/ACA2444
http://name.umdl.umich.edu/AEA4624
http://www.loc.gov/catdir/toc/fy054/32025414.html
http://digital.library.wisc.edu/1711.dl/Meiklejohn.MeikExpColl
http://digital.library.wisc.edu/1711.dl/HistSciTech.RootNodule
http://purl.dlib.indiana.edu/iudl/wright2/wright2-2392
http://hdl.loc.gov/loc.music/musdi.130
http://www.loc.gov/catdir/toc/fy055/59012755.html
http://www.loc.gov/catdir/toc/onix05/59013038.html
http://www.loc.gov/catdir/toc/fy043/59014861.html
http://www.sil.si.edu/digitalcollections/hst/atlantic-cable/
http://www.loc.gov/catdir/toc/fy051/59036858.html
http://hdl.loc.gov/loc.rbc/mtfrb.59616
http://purl.access.gpo.gov/GPO/LPS33246
http://hdl.loc.gov/loc.gdc/mtfgc.1017
http://hdl.loc.gov/loc.gdc/mtfgc.1014
http://www.rand.org/publications/P/P1888/
http://www.loc.gov/catdir/description/har041/60001895.html
http://www.loc.gov/catdir/toc/fy037/60002096.html
http://www.loc.gov/catdir/toc/fy037/60004006.html
http://www.loc.gov/catdir/toc/fy052/60004513.html
http://www.loc.gov/catdir/description/hc044/60006370.html
http://www.loc.gov/catdir/toc/fy052/60008394.html
http://www.loc.gov/catdir/toc/fy055/60008475.html
http://www.loc.gov/catdir/toc/fy052/60008981.html
http://www.loc.gov/catdir/toc/fy052/60009440.html
http://hdl.loc.gov/loc.gdc/calbk.079
http://www.loc.gov/catdir/toc/fy055/60012081.html
http://www.loc.gov/catdir/toc/fy055/60012109.html
http://www.loc.gov/catdir/toc/fy052/60012259.html
http://www.loc.gov/catdir/toc/fy054/67001038.html
http://www.loc.gov/catdir/toc/fy052/67001099.html
http://www.loc.gov/catdir/toc/fy054/67001886.html
http://digital.library.wisc.edu/1711.dl/EcoNatRes.DNRBull42
http://digital.library.wisc.edu/1711.dl/EcoNatRes.DNRBull59
http://digital.library.wisc.edu/1711.dl/EcoNatRes.DNRBull60
http://digital.library.wisc.edu/1711.dl/EcoNatRes.DNRBull61
http://digital.library.wisc.edu/1711.dl/EcoNatRes.DNRBull63
http://digital.library.wisc.edu/1711.dl/EcoNatRes.DNRBull62
http://hdl.loc.gov/loc.gdc/gcesp.0013
http://www.loc.gov/catdir/toc/fy045/75509702.html
http://www.loc.gov/catdir/description/prin051/75510731.html
http://hdl.loc.gov/loc.rbc/voll.33436
http://hdl.loc.gov/loc.gdc/mtfgc.64665
http://www.law.umaryland.edu/marshall/usccr/documents/cr12si7.pdf
http://www.law.umaryland.edu/marshall/usccr/documents/cr12v943a.pdf
http://www.law.umaryland.edu/marshall/usccr/documents/cr11048.pdf
http://www.law.umaryland.edu/marshall/usccr/documents/cr12as4.pdf
http://www.law.umaryland.edu/marshall/usccr/documents/cr11047.pdf
http://www.law.umaryland.edu/marshall/usccr/documents/cr12in24a.pdf
http://www.law.umaryland.edu/marshall/usccr/documents/cr12c767.pdf
http://www.law.umaryland.edu/marshall/usccr/documents/cr12d81z.pdf
http://www.law.umaryland.edu/marshall/usccr/documents/cr12f222.pdf
http://www.loc.gov/catdir/toc/fy052/75611401.html
http://digital.library.wisc.edu/1711.dl/EcoNatRes.DNRBull77
http://digital.library.wisc.edu/1711.dl/EcoNatRes.DNRBull80
http://digital.li...

Aaron Swartz (aaronsw) wrote : Re: [Bug 151151] Re: MARC import isn't picking up URLs

> The URL scan finished. It found 147 URLs. Here they are:

Wow, that seems really low. I found over 100K or so by scanning the LC
website. I'm crawling them now.

Aaron Swartz (aaronsw) wrote :

OK, they're downloaded in apollonius:/0/pharos/crawl/lc_catdir for
whenever we get around to it.

tags: added: marc