Real time search update crashed again

Bug #267853 reported by John Miedema on 2008-09-08
Affects Status Importance Assigned to Milestone
Open Library
Anand Chitipothu

Bug Description

When I add a new title to Open Library, it does not automatically get added to the search index. Users cannot find the titles they have just added. The ability to add titles is one of the great features of Open Library, but this feature is effectively unusable at present.

I have discussed this with Aaron and Paul. Paul has explained that a background process is supposed to monitor manual updates to add them to the index. The process is often not running. I have yet to successfully add a title and have it available to search.

IMHO, titles should always be available to search immediately after they have been added. I can't imagine users would be interested in adding titles if that is not the case. Is it possible to automatically add titles to the search index as soon as the user adds them?

solrize (solrize) wrote :

Please try it now. Books added won't show up in the search engine instantaneously, but they should show up within a few minutes.

If you added a book recently and it didn't show up, can you tell me the title and author? Thanks.

solrize (solrize) wrote :

Actually, hold off testing for now, something is broken and I'm trying to fix it.

John Miedema (mail-johnmiedema) wrote :

Ok. For reference, I have created two records of the same book that do not show up with search. The first one was created in June of this year. The second was created recently.

solrize (solrize) wrote :

Thanks. The one added in June was probably from before we started capturing updates, so it won't show up until the coming full re-import. The one added more recently (it says Sept 4) should have been imported incrementally, but for some reason it doesn't appear in the import log. I will see if I can find the surrounding update activity from when it was added, which should help figure out what's going wrong.

On Tue, Sep 9, 2008 at 6:07 AM, solrize <email address hidden> wrote:
> Thanks. The one added in June probably was from before we started
> capturing updates and so it won't show up til the coming full re-import.

I think you have already re-imported once using the JSON dump that I provided.

The Sept 4th book really doesn't seem to show up in the update stream. Right now I'm re-importing that full dump on h02 (it should finish tomorrow) and will check then for both books mentioned.

Changed in openlibrary:
assignee: nobody → solrize
importance: Undecided → High
status: New → In Progress

solrize (solrize) wrote :

The h02 import finished (doesn't have new additions since the json dump). I'll test it and if it works then I'll add the additions (same batch of additions as last time, i.e. there will still be books missing until a new dump is available). FWIW, this import was much faster than previous dumps due to using a newer solr/lucene version and running on more cores. If h02 or similar hardware stays available then we can do these imports with significantly less hassle than before.

solrize (solrize) wrote :

Realtime update was working and up to date until a few days ago when it crashed for reasons not yet completely diagnosed. Working on it.

Also, changing title of bug. This is about real time update including for books added by the import bot, not just manual additions.

solrize (solrize) wrote :

Finally got to spend several hours trying to reproduce the last crash, without success, but the search engine should be caught up with new imports now. The realtime updater is running and pointed at the production solr; will keep an eye on it for further crashes.

solrize (solrize) wrote :

Crashed again: a record from the log had a missing timestamp. I made a patch to treat missing timestamps as 0, and restarted the updater. But these records are malformed, so a fix further upstream is needed.
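For illustration only, the workaround described above might look roughly like the sketch below. This is not the actual patch: it assumes each update-log line is a JSON object with a `timestamp` field, and the names `record_timestamp` and `parse_log_line` are hypothetical.

```python
import json


def record_timestamp(record):
    """Return the record's timestamp, or 0 if it is missing or null."""
    ts = record.get("timestamp")  # "timestamp" is an assumed field name
    return 0 if ts is None else ts


def parse_log_line(line):
    """Parse one update-log line (assumed to be a JSON object)."""
    record = json.loads(line)
    # Normalise so that later code can always rely on a timestamp being present.
    record["timestamp"] = record_timestamp(record)
    return record
```

Defaulting to 0 keeps the updater running, but it only hides the symptom: the record is still malformed and still needs fixing where it is produced.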

solrize (solrize) wrote :

Crash reoccurred. It looks like there are records with intact timestamps that are garbled, so they mis-parse and get converted to error objects with no timestamp. One is attached. Notice at location 4116, right after "Mark D. Meyerson", it looks like the record is truncated and a new one begins. Is it like that in the original logs on pharosdb, i.e. could this be the result of data corruption over the network? I doubt this, but it's possible.
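One possible way to keep truncated records like this from stopping the updater is to set them aside instead of indexing them. The sketch below is an assumption-laden illustration, not the updater's real code: it assumes one JSON object per log line, and `split_records` is a hypothetical helper.

```python
import json


def split_records(lines):
    """Separate parseable update records from malformed ones.

    Returns (good, bad): good is a list of parsed records, bad is a list
    of (line_number, reason) pairs for lines that could not be used.
    """
    good, bad = [], []
    for lineno, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            bad.append((lineno, str(exc)))
            continue
        if not isinstance(record, dict):
            bad.append((lineno, "not a JSON object"))
            continue
        good.append(record)
    return good, bad
```

The line numbers collected in `bad` could then be compared against the original logs to check whether the truncation is already present at the source.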

solrize (solrize) wrote :

There have been a few more records like this since last night. They all seem to mention Mark D. Meyerson. I haven't looked at them more closely than that.

solrize (solrize) wrote :

Assigning to Anand, since it's a data problem in the update log.

Changed in openlibrary:
assignee: solrize → anandology

Edward Betts (edwardbetts) wrote :

Is this fixed?

solrize (solrize) wrote :

I can't be sure there are no more errors in the update stream, but the index has been rebuilt from json dumps a few times since this bug was opened, and search seems to find the books mentioned in the comments.

Changed in openlibrary:
status: In Progress → Fix Released