Browse index punctuation and capitalization causing multiple entries

Bug #1350831 reported by Don Butterworth on 2014-07-31
This bug affects 3 people
Affects: Evergreen
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

Evergreen version: 2.5.2

In an Author Browse index, a period (full stop) at the end of a name results in two separate entries on the results screen.

Example:
Rees, Paul S. (Paul Stromberg) (at least 100)
Rees, Paul S. (Paul Stromberg). (2)

In all Browse Indexes, a full stop that is the last character of a phrase should be treated as a blank for indexing purposes, so that both forms appear as a single entry on the results screen.
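As a rough illustration of the requested behavior (this is not Evergreen's actual normalizer, and the function name `browse_sort_key` is hypothetical), a trailing full stop could be dropped before two headings are compared:

```python
def browse_sort_key(heading: str) -> str:
    """Hypothetical helper: ignore trailing whitespace and trailing full stops.

    Only the comparison key changes; the displayed heading is left as-is.
    """
    return heading.rstrip().rstrip(".").rstrip()


a = "Rees, Paul S. (Paul Stromberg)"
b = "Rees, Paul S. (Paul Stromberg)."

# Both headings produce the same key, so they would group as one browse entry.
assert browse_sort_key(a) == browse_sort_key(b)
```

Note that the period inside the heading ("S.") is untouched; only a full stop at the very end of the phrase is ignored.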

Mike Rylander (mrylander) wrote :

This can be addressed by configuration, specifically indexing normalization, though I don't think there's a staff UI for that at this point.

Yamil (ysuarez) wrote :

Should this be brought up on the cataloging or general list to see if we should change the default configuration value moving forward?

Also, if someone posts how to make the change on the back end, I could add it as a "tip" in the official docs.

tags: added: authority cataloging
Kathy Lussier (klussier) wrote :

I can say that everyone I have spoken to (this goes beyond MassLNC consortia) has said they would prefer entries like these to collapse into one. It's not just the period, but also entries that have a forward slash at the end or that vary in punctuation.

+1 from me to make it part of the default configuration. Looks like I've been consistent on this point - https://bugs.launchpad.net/evergreen/+bug/1177810/comments/17

Capitalization is also causing multiple entries. Example:

Title Browse Index

I believe in the Church (2)
I believe in the church (4)
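To make the desired outcome concrete, here is a small sketch of collapsing headings that differ only by capitalization or trailing punctuation. The function names and the set of stripped characters are assumptions for illustration, not Evergreen's configuration, and summing the counts assumes each record appears under only one variant:

```python
from collections import defaultdict


def fold_heading(heading: str) -> str:
    """Hypothetical normalizer: case-fold and drop trailing spaces and punctuation."""
    return heading.casefold().rstrip(" ./,;:")


def collapse(entries):
    """Sum record counts for headings that share the same folded form."""
    totals = defaultdict(int)
    for heading, count in entries:
        totals[fold_heading(heading)] += count
    return dict(totals)


entries = [("I believe in the Church", 2), ("I believe in the church", 4)]
print(collapse(entries))  # {'i believe in the church': 6}
```

A real browse list would still need to pick one of the original forms to display; this sketch only shows the grouping step.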

Srey Seng (sreyseng) wrote :

I am not sure where in the back end to make the indexing normalization changes implied in the comments above.

However, I was able to fold "duplicate" entries into one by modifying the re-ingest (at the point where it decides whether to insert a new browse entry) to compare only on sort_value from the browse_entry table, instead of on both the value and the sort_value.

With the original comparison, the insertion criteria are based on both the actual value and the sort_value. Even if the sort_value (the normalized version) is the same, a different value causes a new insertion into the browse_entry table, so similar entries appear in browse results.

With this workaround, however, as long as the sort_value (the normalized version) is the same, the entries are considered the same and will not result in a new insertion into the browse table. A potential downside is that if, for example, you have three similar entries differing only in punctuation, the one that gets ingested first will be the one that displays in browse results (as the rest will get folded into that).

This workaround requires at the very least a re-ingest of the browse entries (if not a total wipe of the browse entries + the re-ingest).
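The difference between the two insertion checks described above can be sketched in a few lines. This is a simplified in-memory model, not the actual re-ingest code; `ingest`, `normalize`, and the `match_on_value` flag are hypothetical names, and the stand-in normalizer is an assumption:

```python
def normalize(value: str) -> str:
    # Stand-in normalizer (an assumption): lowercase, drop trailing " ", "." and "/".
    return value.lower().rstrip(" ./")


def ingest(headings, match_on_value):
    """Return the (value, sort_value) browse entries created, in ingest order."""
    entries = []
    for value in headings:
        sort_value = normalize(value)
        if match_on_value:
            # Original check: both value and sort_value must match an existing entry.
            exists = any(e == (value, sort_value) for e in entries)
        else:
            # Workaround: a matching sort_value alone counts as an existing entry.
            exists = any(e[1] == sort_value for e in entries)
        if not exists:
            entries.append((value, sort_value))
    return entries


headings = [
    "Rees, Paul S. (Paul Stromberg).",
    "Rees, Paul S. (Paul Stromberg)",
    "Rees, Paul S. (Paul Stromberg) /",
]

original = ingest(headings, match_on_value=True)     # three separate entries
workaround = ingest(headings, match_on_value=False)  # one entry: the first ingested
```

In the workaround case the surviving display value is "Rees, Paul S. (Paul Stromberg)." simply because it was ingested first, which is the downside noted above.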

Kathy Lussier (klussier) on 2014-12-15
Changed in evergreen:
status: New → Confirmed
importance: Undecided → Medium
Kathy Lussier (klussier) wrote :

Adding a note that we no longer see duplicate entries for authors due to an ending period. See bug 1308090.

We still have distinct browse entries for headings that differ only by capitalization. I'm also not sure if there is other ending punctuation that causes problems. I'm going to update the title of this bug to address the capitalization issue.

summary: - Browse index punctuation causing multiple entries
+ Browse index punctuation and capitalization causing multiple entries