periods should be normalized to empty string for search

Bug #965430 reported by Galen Charlton on 2012-03-26
This bug affects 3 people
Affects: Evergreen
Importance: Medium
Assigned to: Unassigned

Bug Description

At present, search_normalizer() and naco_normalizer() map periods to blanks, so using the default index definitions, strings that contain periods would get normalized like this:

"U.S.S.R." => "U S S R"
"USSR" => "USSR"

It would be desirable to allow "U.S.S.R." and "USSR" to retrieve the same sets of records. One way to implement this would be to tweak search_normalizer() to collapse all periods to empty strings.
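
To make the difference concrete, here is a minimal Python sketch of the two behaviors. It is an illustration only, not the Evergreen normalizer code: the function names periods_to_blank and periods_to_empty are hypothetical, and the sketch leaves out the other steps a real normalizer performs (case folding, other punctuation).

```python
import re


def periods_to_blank(text: str) -> str:
    """Hypothetical stand-in for the current behavior: period -> space."""
    replaced = text.replace(".", " ")
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", replaced).strip()


def periods_to_empty(text: str) -> str:
    """Hypothetical stand-in for the proposed behavior: period -> ''."""
    replaced = text.replace(".", "")
    return re.sub(r"\s+", " ", replaced).strip()


current = periods_to_blank("U.S.S.R.")   # "U S S R"
proposed = periods_to_empty("U.S.S.R.")  # "USSR"
```

Under the proposed mapping, "U.S.S.R." and "USSR" produce the same string, so a search for either form could match the other.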

So, as a discussion item: does anybody see any pitfalls?

Evergreen: 2.0 and later

Galen Charlton (gmc) on 2012-03-26
Changed in evergreen:
importance: Undecided → Wishlist
milestone: none → 2.2.0beta1
tags: added: indexing search
Mike Rylander (mrylander) wrote :

I wonder if we should instead add a collapse_periods normalizer and put that before search_normalize on the appropriate fields?

Galen Charlton (gmc) wrote :

That would deal with the immediate problem -- though a naive implementation of collapse_periods is simply replace(".", "").

Mike Rylander (mrylander) wrote :

Indeed it would.

FWIW, my concern is about, for example, author names where (IIRC) cataloging rules say to use spaces [ex: "Rowling, J. K."], so normalizing an incorrectly cataloged name [ex: "Rowling, J.K."] with spaces would be better for the end user.
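
A small Python sketch of that concern, under simplifying assumptions: the normalize helper below is hypothetical, only handles periods and whitespace, and is not how the Evergreen normalizers are written.

```python
import re


def normalize(text: str, period_replacement: str) -> str:
    """Replace periods, then collapse runs of whitespace and trim."""
    replaced = text.replace(".", period_replacement)
    return re.sub(r"\s+", " ", replaced).strip()


correct = "Rowling, J. K."   # spaced initials
incorrect = "Rowling, J.K."  # unspaced initials

# Periods -> spaces: both forms normalize to "Rowling, J K".
spaced_match = normalize(correct, " ") == normalize(incorrect, " ")

# Periods -> empty string: the forms diverge ("J K" versus "JK").
collapsed_match = normalize(correct, "") == normalize(incorrect, "")
```

Here spaced_match is True and collapsed_match is False, which is the trade-off: collapsing periods helps "U.S.S.R." match "USSR" but stops the two author forms from matching each other.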

Elaine Hardy (ehardy) wrote :

I think the end user expects to be able to search any abbreviation -- USA, USSR, JK Rowling, etc. -- and retrieve results for all permutations, with or without periods and with or without spaces. That means a search on any one of USA, U S A, U. S. A., or U.S.A. would be expected to retrieve records containing any of the others. While that may be unrealistic, I think that is the expectation.
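
One way to picture that expectation is a normalization that gives all four permutations the same key. The Python sketch below is only an assumption about how that could be done (the name initialism_key is hypothetical and nothing like it is proposed in this thread): it drops periods and then joins runs of adjacent single-letter tokens.

```python
def initialism_key(text: str) -> str:
    """Drop periods, then join runs of adjacent single-letter tokens."""
    tokens = text.replace(".", " ").split()
    result = []
    run = []  # consecutive single-letter tokens seen so far
    for token in tokens:
        if len(token) == 1 and token.isalpha():
            run.append(token)
            continue
        if run:
            result.append("".join(run))
            run = []
        result.append(token)
    if run:
        result.append("".join(run))
    return " ".join(result)


keys = {initialism_key(s) for s in ["USA", "U S A", "U. S. A.", "U.S.A."]}
# keys contains the single value "USA"
```

The cost is false merges: any run of unrelated single letters in a title would also be joined, which may be part of why the expectation is hard to meet fully.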

While the space is required in controlled fields such as the 1xx and 7xx fields, other fields are transcribed as they appear on the piece. If the title page has J.K. Rowling, the 245 |c will have J.K. Rowling, for example.

Changed in evergreen:
milestone: 2.2.0beta1 → 2.2.0rc1
Changed in evergreen:
milestone: 2.2.0rc1 → 2.2.0
Changed in evergreen:
milestone: 2.2.0 → 2.2.1
Changed in evergreen:
milestone: 2.2.1 → 2.3.0-alpha2
Changed in evergreen:
milestone: 2.3.0-alpha2 → 2.3.0-beta1
Changed in evergreen:
milestone: 2.3.0-beta1 → none
Changed in evergreen:
status: New → Incomplete
Changed in evergreen:
status: Incomplete → Triaged
Changed in evergreen:
status: Triaged → Confirmed
importance: Wishlist → Medium
Kathy Lussier (klussier) wrote :

Six years later, I concur with what Elaine said in comment #4.

Mike Rylander (mrylander) wrote :

And, six years on, I have another option to offer: a normalizer that recognizes abbreviations that include periods and expands them to various forms in the same way that the ISBN normalizer provides both 10 and 13 digit versions for indexing, so the search works either way.
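
A rough Python sketch of what such an expansion could look like. Everything here is an assumption for illustration: the ABBREVIATION pattern and the expand_abbreviations name are hypothetical, and a real normalizer would need a more careful definition of what counts as an abbreviation.

```python
import re

# Hypothetical pattern: two or more single letters, each followed by a
# period and optionally a space, e.g. "U.S.A.", "U. S. A.", "J.K.".
ABBREVIATION = re.compile(r"\b(?:[A-Za-z]\.\s?){2,}")


def expand_abbreviations(text: str) -> list[str]:
    """Return spaced and joined variants for each abbreviation found."""
    variants = []
    for match in ABBREVIATION.finditer(text):
        letters = re.findall(r"[A-Za-z]", match.group())
        variants.append(" ".join(letters))  # e.g. "U S S R"
        variants.append("".join(letters))   # e.g. "USSR"
    return variants


ussr_forms = expand_abbreviations("U.S.S.R.")
author_forms = expand_abbreviations("Rowling, J. K.")
```

Indexing every returned variant alongside the original text would let either the spaced or the joined form match, at the price of extra index entries for each abbreviation.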

Bill Erickson (berick) wrote :

+1 to Mike's normalizer-expansion suggestion. We handle this locally by indexing some fields twice, one with and one without periods, which adds index bulk.

Elaine Hardy (ehardy) wrote :

+1
