Evergreen

periods should be normalized to empty string for search

Bug #965430 reported by Galen Charlton on 2012-03-26

This bug affects 7 people

Affects		Status	Importance	Assigned to	Milestone
	Evergreen	Confirmed	Medium	Unassigned

Bug Description

At present, search_normalize() and naco_normalize() map periods to blanks, so using the default index definitions, strings that contain periods would get normalized like this:

"U.S.S.R." => "U S S R"
"USSR" => "USSR"

It would be desirable for "U.S.S.R." and "USSR" to retrieve the same set of records. One way of implementing this would be to tweak the search normalizer to collapse all periods to empty strings.
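
For discussion purposes, the difference between the two behaviors can be sketched in a few lines of Python. This is not the Evergreen implementation (the real normalizers live in the database and do much more than handle periods); the function names periods_to_spaces and periods_to_empty are hypothetical, and case folding is left out so the output lines up with the examples above.

```python
import re


def periods_to_spaces(text: str) -> str:
    """Model of the behavior described above: each period becomes a blank."""
    spaced = text.replace(".", " ")
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", spaced).strip()


def periods_to_empty(text: str) -> str:
    """Model of the proposed change: each period is simply dropped."""
    return text.replace(".", "")


for raw in ("U.S.S.R.", "USSR"):
    print(f"{raw!r}: spaces={periods_to_spaces(raw)!r} empty={periods_to_empty(raw)!r}")
```

Under the first function the two inputs normalize to different strings; under the second they normalize to the same string, which is the retrieval behavior being asked for.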

So, as a discussion item: does anybody see any pitfalls?

Evergreen: 2.0 and later

Tags:

Galen Charlton (gmc) on 2012-03-26

Changed in evergreen:
importance:	Undecided → Wishlist
milestone:	none → 2.2.0beta1
tags:	added: indexing search

Mike Rylander (mrylander) wrote on 2012-03-26:

I wonder if we should instead add a collapse_periods normalizer and put that before search_normalize on the appropriate fields?

Galen Charlton (gmc) wrote on 2012-03-26:

That would deal with the immediate problem -- though a naive implementation of collapse_periods is simply replace(".", "").

Mike Rylander (mrylander) wrote on 2012-03-27:

Indeed it would.

FWIW, my concern is about, for example, author names where (IIRC) cataloging rules say to use spaces [ex: "Rowling, J. K."], so normalizing an incorrectly cataloged name [ex: "Rowling, J.K."] with spaces would be better for the end user.
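
To make that concern concrete, here is a small Python sketch. It is an illustration only, not Evergreen code: collapse_periods follows the naive replace(".", "") mentioned earlier in the thread, and spaces_for_periods is a hypothetical stand-in for the current periods-to-blanks behavior.

```python
import re


def collapse_periods(text: str) -> str:
    """Naive version from the thread: drop every period."""
    return text.replace(".", "")


def spaces_for_periods(text: str) -> str:
    """Stand-in for the current behavior: period -> blank, then tidy whitespace."""
    return re.sub(r"\s+", " ", text.replace(".", " ")).strip()


correct = "Rowling, J. K."    # spaced initials, per the cataloging rule cited above
incorrect = "Rowling, J.K."   # the incorrectly cataloged form

for normalize in (collapse_periods, spaces_for_periods):
    a, b = normalize(correct), normalize(incorrect)
    print(f"{normalize.__name__}: {a!r} vs {b!r} -> match={a == b}")
```

With the naive collapse the two forms of the name end up as different strings ("J K" versus "JK"), while mapping periods to spaces brings them together. That is the trade-off against the "U.S.S.R." case, where the opposite holds.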

Elaine Hardy (ehardy) wrote on 2012-03-28:

I think the end user expects to be able to search any abbreviation -- USA, USSR, JK Rowling, etc. -- and retrieve results for all permutations, with or without periods and with or without spaces. That means a search on any one of USA, U S A, U. S. A. and U.S.A. would be expected to retrieve records containing any of the others. While that may be unrealistic, I think that is the expectation.

While the space is required in controlled fields such as the 1xx and 7xx fields, other fields are transcribed as they appear on the piece. If the title page has J.K. Rowling, the 245 |c will have J.K. Rowling, for example.

Jason Stephenson (jstephenson) on 2012-04-28

Changed in evergreen:
milestone:	2.2.0beta1 → 2.2.0rc1

Jason Stephenson (jstephenson) on 2012-05-15

Changed in evergreen:
milestone:	2.2.0rc1 → 2.2.0

Jason Stephenson (jstephenson) on 2012-06-13

Changed in evergreen:
milestone:	2.2.0 → 2.2.1

Jason Stephenson (jstephenson) on 2012-07-09

Changed in evergreen:
milestone:	2.2.1 → 2.3.0-alpha2

Jason Stephenson (jstephenson) on 2012-07-19

Changed in evergreen:
milestone:	2.3.0-alpha2 → 2.3.0-beta1

Jason Stephenson (jstephenson) on 2012-08-03

Changed in evergreen:
milestone:	2.3.0-beta1 → none

Jason Stephenson (jstephenson) on 2012-12-12

Changed in evergreen:
status:	New → Incomplete

Jason Stephenson (jstephenson) on 2012-12-12

Changed in evergreen:
status:	Incomplete → Triaged

Jason Stephenson (jstephenson) on 2012-12-13

Changed in evergreen:
status:	Triaged → Confirmed
importance:	Wishlist → Medium

Kathy Lussier (klussier) wrote on 2018-10-23:

Six years later, I concur with what Elaine said in comment #4.

Mike Rylander (mrylander) wrote on 2018-10-23:

And, six years on, I have another option to offer: a normalizer that recognizes abbreviations that include periods and expands them to various forms in the same way that the ISBN normalizer provides both 10 and 13 digit versions for indexing, so the search works either way.
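
A rough Python sketch of what such an expanding normalizer might look like follows. Everything in it is an assumption made for illustration: the name expand_abbreviations, the regular expression used to spot dotted abbreviations, and the choice to emit a joined and a spaced form side by side. It is not how Evergreen's ISBN normalizer or any existing normalizer is written.

```python
import re

# Assumed pattern: single letters separated by periods, with an optional
# trailing period, e.g. "U.S.S.R.", "U.S.S.R", "J.K."
ABBREV = re.compile(r"(?<![A-Za-z])[A-Za-z](?:\.[A-Za-z])+(?![A-Za-z])\.?")


def expand_abbreviations(text: str) -> str:
    """Replace each dotted abbreviation with its joined and spaced forms."""

    def variants(match: re.Match) -> str:
        letters = match.group(0).replace(".", "")
        # Emit both forms so either one can be found in the index.
        return f"{letters} {' '.join(letters)}"

    expanded = ABBREV.sub(variants, text)
    # Any remaining periods are treated as blanks, then whitespace is tidied.
    return re.sub(r"\s+", " ", expanded.replace(".", " ")).strip()


for raw in ("U.S.S.R.", "J.K. Rowling", "USSR"):
    print(f"{raw!r} => {expand_abbreviations(raw)!r}")
```

In this sketch a record containing "U.S.S.R." would be indexed with both "USSR" and "U S S R", so either query form finds it. The reverse case, a record containing only "USSR" and a query of "U S S R", is not handled, and the cost is extra index terms for every abbreviation recognized.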

Bill Erickson (berick) wrote on 2018-10-23:

+1 to Mike's normalizer-expansion suggestion. We handle this locally by indexing some fields twice, once with periods and once without, which adds index bulk.

Elaine Hardy (ehardy) wrote on 2018-10-23:

Terran McCanna (tmccanna) wrote on 2021-02-10:

2 years later, adding a +1

Garry Collum (gcollum) wrote on 2021-03-08:

Adding another +1.

This also seems to affect the expert search (searching the MARC record). If a value has an embedded period, at least in the 020c, the value cannot be retrieved. An example is record 209 in the concerto data, where the 020c contains a value of $7.35. No results are returned using the search terms 7.35, 7 35, or 735. If you remove the period ($735), you can retrieve the record using 735.

However, the same record contains an 852y with the value 12.00. This can be retrieved using 12 00 as a search term.

Duplicates of this bug

Bug #1039149
