import new fulltext into SE

Bug #134164 reported by solrize
This bug report is a duplicate of: Bug #276853: Please update totem to 2.24.1.
Affects: Open Library
Status: In Progress
Importance: High
Assigned to: solrize

Bug Description

Current searchable fulltext all comes from an OCA snapshot taken in April, and quite a few more books have been added to OCA since then. At minimum the new books should be imported into the SE and this should be redone periodically. Best would be a way to make this happen automatically, either in real time, nightly, weekly, or whatever. But if there's a repeatable manual process, that's not so bad.

Right now there's no natural API to detect new OCA contents. The current snapshot was done with a bunch of hand-operated spidering scripts starting from an archive.org solr search. Maybe some improvements on the OCA side are possible.

solrize (solrize)
Changed in openlibrary:
assignee: nobody → solrize
importance: Undecided → Medium
solrize (solrize)
Changed in openlibrary:
status: New → Confirmed
solrize (solrize) wrote :

Per discussion with Siznax:

petabox/www/common/WorkBase.inc calls a function updateSearchEngine() when a new book appears (this is to update the www.archive.org search engine, not the openlibrary engine). updateSearchEngine lives in petabox/www/common/SearchEngine.inc and the relevant function is update(). Siznax suggests subclassing SearchEngine but isn't sure this is feasible (might have to write a new class). Any changes to this code will also have to be discussed with Tracey.

solrize (solrize) wrote :

I'm doing this now, based on a crawl that I did in late February. That is still somewhat out of date but I'm trying to make the process more repeatable, and eventually automatic.

Changed in openlibrary:
importance: Medium → High
status: Confirmed → In Progress
solrize (solrize) wrote :

Most of the new fulltext based on the Feb crawl has been added, but the import stopped near the end due to a power outage. The result is that some books are still missing (though the index would still be way behind even if those last few were there). Current plan is to semi-automate indexing with two processes:

1. Daily or so, use an IA solr query to find all scanned books, noticing which ones are new by maintaining a file containing locators of all those that have already been seen. (IA query might use a date range to keep result size down, but getting the full set of 250k or so scanned books takes just 2 minutes or so, not too bad for an infrequent operation like this). This could be run manually or from a cron job.

For each of these books, retrieve and save both the fulltext/pagetext and the MARC info. Make a file containing all the day's MARC records and dump it into a queue directory for infobase import by Edward.

2. A second process monitors the infobase update log (which right now doesn't exist, but there used to be something like it for tdb) and updates the bibliographic index as before, but now it also notices scanned books and inserts the fulltext (gathered in the previous step) into the fulltext index.
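
The "which ones are new" part of step 1 could look roughly like the sketch below. This is only an illustration: the function names, the one-locator-per-line seen file, and the idea that locators arrive as plain strings are all assumptions, and the IA solr query itself is not shown.

```python
from pathlib import Path
from typing import Iterable, List, Set


def load_seen(seen_path: Path) -> Set[str]:
    """Return the locators recorded on previous runs (empty set if no file yet)."""
    if not seen_path.exists():
        return set()
    lines = seen_path.read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def find_new(locators: Iterable[str], seen_path: Path) -> List[str]:
    """Return locators not seen before, in input order, and record them as seen.

    `locators` stands in for whatever the day's query returns (hypothetical
    input); duplicates within one batch are reported only once.
    """
    seen = load_seen(seen_path)
    new: List[str] = []
    for loc in locators:
        if loc not in seen:
            seen.add(loc)
            new.append(loc)
    if new:
        # Append only, so a crash mid-run never loses earlier entries.
        with seen_path.open("a") as f:
            for loc in new:
                f.write(loc + "\n")
    return new
```

A cron job or a manual run would pass the full query result to find_new() and then fetch fulltext and MARC only for the locators it returns. Note that this sketch marks a locator as seen before its fulltext and MARC are fetched, so a failed fetch would need a separate retry list.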

solrize (solrize) wrote :

This bug is now basically an aspect of bug #244359.
