periods should be normalized to empty string for search

Bug #965430 reported by Galen Charlton
This bug affects 7 people
Affects    Status     Importance  Assigned to  Milestone
Evergreen  Confirmed  Medium      Unassigned

Bug Description

At present, search_normalizer() and naco_normalizer() map periods to blanks, so using the default index definitions, strings that contain periods would get normalized like this:

"U.S.S.R." => "U S S R"
"USSR" => "USSR"

It would be desirable to allow "U.S.S.R" and "USSR" to retrieve the same sets of records, and one way of implementing this would be to tweak search_normalizer() to collapse all periods to empty strings.
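
For illustration only, here is a minimal Python sketch of the two behaviours being compared. It is not the Evergreen code: the function names periods_to_blanks and periods_to_empty are hypothetical, and the real normalizers do considerably more than handle periods and whitespace.

```python
import re


def periods_to_blanks(text: str) -> str:
    """Approximates the behaviour described above: each period becomes a blank."""
    text = text.replace(".", " ")
    # Squeeze runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def periods_to_empty(text: str) -> str:
    """Proposed alternative (hypothetical): each period becomes an empty string."""
    text = text.replace(".", "")
    return re.sub(r"\s+", " ", text).strip()


for value in ("U.S.S.R.", "USSR"):
    print(repr(value), "=>", repr(periods_to_blanks(value)), "vs", repr(periods_to_empty(value)))
```

Under the first function the two inputs normalize to different strings ("U S S R" and "USSR"); under the second they both normalize to "USSR" and would therefore match each other.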

So, as a discussion item: does anybody see any pitfalls?

Evergreen: 2.0 and later

Galen Charlton (gmc)
Changed in evergreen:
importance: Undecided → Wishlist
milestone: none → 2.2.0beta1
tags: added: indexing search
Mike Rylander (mrylander) wrote :

I wonder if we should instead add a collapse_periods normalizer and put that before search_normalize on the appropriate fields?

Galen Charlton (gmc) wrote :

That would deal with the immediate problem -- though a naive implementation of collapse_periods is simply replace(".", "").

Mike Rylander (mrylander) wrote :

Indeed it would.

FWIW, my concern is about, for example, author names where (IIRC) cataloging rules say to use spaces [ex: "Rowling, J. K."], so normalizing an incorrectly cataloged name [ex: "Rowling, J.K."] with spaces would be better for the end user.
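
A rough sketch of this trade-off, in Python rather than the database functions themselves. The helper names are hypothetical and only periods and whitespace are handled, so treat it as an illustration of the concern, not of actual Evergreen behaviour.

```python
import re


def squeeze(text: str) -> str:
    """Collapse runs of whitespace to one blank and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def periods_to_blanks(text: str) -> str:
    # Hypothetical stand-in for the current period handling.
    return squeeze(text.replace(".", " "))


def periods_to_empty(text: str) -> str:
    # Hypothetical stand-in for the proposed period handling.
    return squeeze(text.replace(".", ""))


cataloged = "Rowling, J. K."      # spaced initials, per the cataloging rules
miscataloged = "Rowling, J.K."    # initials entered without the space

# Mapping periods to blanks makes the two forms agree.
print(periods_to_blanks(cataloged) == periods_to_blanks(miscataloged))

# Mapping periods to empty strings leaves "J K" and "JK" distinct.
print(periods_to_empty(cataloged), "|", periods_to_empty(miscataloged))
```

So a blanket replace(".", "") would fix the "U.S.S.R." case while making the spaced and unspaced initials stop matching each other.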

Elaine Hardy (ehardy) wrote :

I think the end user expects to be able to search for any abbreviation (USA, USSR, JK Rowling, etc.) and retrieve results for all permutations, with or without periods and with or without spaces. That means a search for any one of USA, U S A, U. S. A., or U.S.A. would be expected to retrieve records containing any of the others. While that may be unrealistic, I think that is the expectation.

While the space is required in controlled fields such as the 1xx and 7xx fields, other fields are transcribed as they appear on the piece. If the title page has J.K. Rowling, the 245 |c will have J.K. Rowling, for example.

Changed in evergreen:
milestone: 2.2.0beta1 → 2.2.0rc1
Changed in evergreen:
milestone: 2.2.0rc1 → 2.2.0
Changed in evergreen:
milestone: 2.2.0 → 2.2.1
Changed in evergreen:
milestone: 2.2.1 → 2.3.0-alpha2
Changed in evergreen:
milestone: 2.3.0-alpha2 → 2.3.0-beta1
Changed in evergreen:
milestone: 2.3.0-beta1 → none
Changed in evergreen:
status: New → Incomplete
Changed in evergreen:
status: Incomplete → Triaged
Changed in evergreen:
status: Triaged → Confirmed
importance: Wishlist → Medium
Kathy Lussier (klussier) wrote :

Six years later, I concur with what Elaine said in comment #4.

Mike Rylander (mrylander) wrote :

And, six years on, I have another option to offer: a normalizer that recognizes abbreviations that include periods and expands them to various forms in the same way that the ISBN normalizer provides both 10 and 13 digit versions for indexing, so the search works either way.
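
To make the idea concrete, here is one possible shape for such a normalizer, sketched in Python. Everything in it is an assumption made for illustration: the name expand_abbreviations, the regular expression used to spot dotted abbreviations, and the choice to emit exactly two forms (letters separated by blanks, and letters run together).

```python
import re

# Assumed definition of an abbreviation: two or more single letters, each
# followed by a period, optionally separated by one blank ("U.S.S.R.", "J. K.").
ABBREV = re.compile(r"(?<![A-Za-z])[A-Za-z]\.(?:[ ]?[A-Za-z]\.)+")


def expand_abbreviations(text: str) -> list[str]:
    """Return the distinct normalized forms of text to index (hypothetical)."""

    def rewrite(joiner: str) -> str:
        def repl(match: re.Match) -> str:
            # Keep only the letters of the abbreviation, joined as requested.
            return joiner.join(re.findall(r"[A-Za-z]", match.group(0)))

        out = ABBREV.sub(repl, text)
        # Any remaining periods are treated as blanks, then whitespace is squeezed.
        out = out.replace(".", " ")
        return re.sub(r"\s+", " ", out).strip()

    # One form with blanks between the letters, one with the letters joined.
    variants = [rewrite(" "), rewrite("")]
    # Drop duplicates while preserving order.
    return list(dict.fromkeys(variants))


for value in ("U.S.S.R.", "USSR", "J.K. Rowling", "Rowling, J. K."):
    print(repr(value), "=>", expand_abbreviations(value))
```

With both forms indexed, a record containing "U.S.S.R." can be found by a search for either "U S S R" or "USSR". A record that contains only "USSR" still yields a single form, so the search side would presumably need the same treatment for "U.S.S.R." to find it.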

Bill Erickson (berick) wrote :

+1 to Mike's normalizer-expansion suggestion. We handle this locally by indexing some fields twice, once with and once without periods, which adds index bulk.

Elaine Hardy (ehardy) wrote :

+1

Terran McCanna (tmccanna) wrote :

2 years later, adding a +1

Garry Collum (gcollum) wrote :

Adding another +1.

This seems to also affect the expert search (searching the MARC record). If a value has an embedded period, at least in the 020c, the value cannot be retrieved. An example is record 209 in the concerto data: its 020c contains the value $7.35, and no results are returned for the search terms 7.35, 7 35, or 735. If you remove the period ($735), you can retrieve the record using 735.

However, the same record contains an 852y with the value 12.00, and this can be retrieved using 12 00 as a search term.
