Search failure for words always appearing with punctuation

Bug #1503654 reported by M Tyson
This bug affects 1 person
Affects: Internet Archive BookReader
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Summary: For at least some indexed texts, a class of searches fails when it shouldn't.

The problem appears to be that the search mechanism doesn't find words that occur in the text only with adjacent (preceding or following) punctuation.

For instance, if the entire text were "One fish, two fish, two, three." a search for "one" would succeed. A search for "two" would find both instances, because one of them has no adjacent punctuation. A search for "fish" (or "fish,") would fail, as there are no instances of "fish" without adjacent punctuation.
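The behavior described above can be reproduced with a small Python sketch. This is only a hypothetical model of the symptom, not the actual Internet Archive or BookReader search code, which this report does not show. It assumes that the set of searchable words is built from whitespace-separated tokens with punctuation left attached, while the query and the step that locates matches both ignore punctuation. All function names here are invented for illustration.

```python
import string

# Example text from this report.
TEXT = "One fish, two fish, two, three."


def raw_tokens(text):
    """Split on whitespace only; punctuation stays attached (assumed index behavior)."""
    return text.lower().split()


def normalize(word):
    """Lowercase a word and strip leading and trailing punctuation."""
    return word.lower().strip(string.punctuation)


def buggy_search(text, query):
    """Hypothetical model of the bug: the vocabulary keeps punctuation,
    but the query and the match-locating step do not."""
    tokens = raw_tokens(text)
    vocabulary = set(tokens)           # contains "fish," but never "fish"
    term = normalize(query)
    if term not in vocabulary:         # word is "unknown", so nothing is returned
        return []
    return [i for i, tok in enumerate(tokens) if normalize(tok) == term]


def fixed_search(text, query):
    """Same lookup, but punctuation is stripped from the indexed tokens as well."""
    term = normalize(query)
    return [i for i, tok in enumerate(raw_tokens(text)) if normalize(tok) == term]


if __name__ == "__main__":
    for q in ("one", "two", "fish", "fish,"):
        print(q, buggy_search(TEXT, q), fixed_search(TEXT, q))
```

Under these assumptions, "two" is found at both positions because one bare occurrence puts it in the vocabulary, while "fish" and "fish," return nothing until the indexed tokens are normalized the same way as the query.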

In most texts, most words and names appear at least once without adjacent punctuation, so most searches work. But in books that are useful to genealogists, a name often appears only once. If punctuation is involved (such as in a comma-separated list or a period at the end of a sentence), the name may be unsearchable.

Recently, a well-known genealogy blog devoted a post to working around an "Internet Archive search bug":
http://www.ancestryinsider.org/2015/09/how-to-navigate-around-internet-archive.html

An example is
https://archive.org/stream/cu31924028848483#page/n95/mode/2up
The text on the book's page 73 begins with "Jonas Messerly,". If you search for "Jonas", the search is successful. If you search for "Messerly" or "Messerly,", the search fails to find anything.
(Note that "Kidwell." on the previous page is missed by a search for "Kidwell", but both "Paul " and "Paul." are found by a search for "Paul". On the book's page 57, the instance of '"Das' (with a preceding quotation mark) is not found by a search.)

The ABBYY information indicates that "Messerly", "Kidwell", and both instances of "Paul" were properly recognized by the OCR.

This book is from the Cornell University Library collection (70,000 books added in 2009), digitizing sponsor MSN. I believe many other books may have been processed similarly and suffer from the same issue. The blog author seems to feel this is a general issue.

M Tyson (tyson-3)
summary: - Failed search for text originally tokenized with punctuation
+ Search failure for words always appearing with punctuation
M Tyson (tyson-3)
description: updated