Bug #1350831 “Browse index punctuation and capitalization causin...” : Bugs : Evergreen

Revision history for this message

Mike Rylander (mrylander) wrote on 2014-07-31:

#1

This can be addressed by configuration, specifically indexing normalization, though I don't think there's a staff UI for that at this point.

Yamil (ysuarez) wrote on 2014-07-31:

#2

Should this be brought up to the cataloging or general list to see if we should change the default configuration value moving forward?

Also, if someone posts how to make the change on the back end, I could add it as a "tip" in the official docs.

tags:

added: authority cataloging

Kathy Lussier (klussier) wrote on 2014-07-31:

#3

I can say that everyone I have spoken to (this goes beyond MassLNC consortia) has said they would prefer entries like these to collapse into one. It's not just the trailing period, but also entries that have a forward slash at the end or that vary in punctuation.

+1 from me to make it part of the default configuration. Looks like I've been consistent on this point - https://bugs.launchpad.net/evergreen/+bug/1177810/comments/17

Don Butterworth (don-butterworth) wrote on 2014-08-05:

#4

Capitalization is also causing multiple entries. Example:

Title Browse Index

I believe in the Church (2)
I believe in the church (4)

Srey Seng (sreyseng) wrote on 2014-09-18:

#5

I am not sure where in the back end to make the indexing normalization changes implied in the comments.

But I was able to fold "duplicate" entries into one by modifying the re-ingest (at the point where it decides whether to insert a new browse entry) to compare only on sort_value from the browse_entry table, instead of on both value and sort_value.

With the original comparison, the insertion criterion is based on both the actual value and the sort_value. Even if the sort_value (the normalized version) is the same, a different value causes a new insertion into the browse_entry table, resulting in similar entries appearing in browse results.

With this workaround, however, as long as the sort_value (the normalized version) is the same, the entries are considered the same and will not result in a new insertion into the browse_entry table. A potential downside is that if, for example, you have three similar entries differing only in punctuation, the one that gets ingested first will be the one that displays in browse results (as the rest will get folded into it).

This workaround requires at the very least a re-ingest of the browse entries (if not a total wipe of the browse entries + the re-ingest).
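To make the difference between the two comparisons concrete, here is a minimal Python sketch. It is not the Evergreen ingest code: the `sort_value_of` function is a hypothetical stand-in for whatever normalization produces sort_value, and `ingest` is an invented helper that only models "insert unless a matching entry already exists".

```python
import re


def sort_value_of(display: str) -> str:
    """Hypothetical stand-in for the normalized sort value:
    lowercase, drop punctuation, collapse whitespace."""
    without_punct = re.sub(r"[^\w\s]", "", display.lower())
    return " ".join(without_punct.split())


def ingest(values, key):
    """Keep the first entry seen for each key; later matches fold into it."""
    entries = {}
    for value in values:
        entries.setdefault(key(value), value)
    return list(entries.values())


headings = [
    "I believe in the Church",
    "I believe in the church",
    "I believe in the church.",
]

# Original behavior: an entry is "the same" only if value AND sort_value match.
both = ingest(headings, key=lambda v: (v, sort_value_of(v)))

# Workaround: an entry is "the same" whenever sort_value matches.
sort_only = ingest(headings, key=sort_value_of)

print(both)       # three separate entries
print(sort_only)  # one entry, the first one ingested
```

Under these assumptions the first comparison keeps all three headings, while the second keeps only "I believe in the Church", which illustrates the "first one ingested wins" downside described above.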

Kathy Lussier (klussier) on 2014-12-15

Changed in evergreen:
status:	New → Confirmed
importance:	Undecided → Medium

Kathy Lussier (klussier) wrote on 2017-02-28:

#6

Adding a note that we no longer see duplicate entries for authors due to an ending period. Bug 1308090.

We still have distinct browse entries for headings that differ only by capitalization. I'm also not sure if there is other ending punctuation that causes problems. I'm going to update the title of this bug to address the capitalization issue.

summary:

- Browse index punctuation causing multiple entries
+ Browse index punctuation and capitalization causing multiple entries

Blake GH (bmagic) wrote on 2019-11-08:

#7

Based on comment #5

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/blake/LP1350831_Browse_index_punctuation_and_capitalization_causing_multiple_entries

Fixed it for us. Seems pretty straightforward, although I would imagine a more elegant solution involving config.index_normalizer might be more appropriate. Thoughts?

Blake GH (bmagic) on 2020-01-07

tags:

added: pullrequest

Gina Monti (gmonti90) wrote on 2020-02-21:

#8

I tested the code and sign off. Gina Monti, email: <email address hidden>

Jeff Davis (jdavis-sitka) on 2020-02-21

tags:

added: signedoff

Mike Rylander (mrylander) wrote on 2020-02-21:

#9

While this does have the effect intended, I argue that it's for the wrong reason. The sort value can be the same for different "real" values because of more than just casing and punctuation. In particular, non-filing characters may be present in one "real" value but not another, but the sort value will be the same. In that case, one of the "real" values with, say, the word "the" at the front will be ignored incorrectly.
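A rough Python sketch of the non-filing case follows. The `filing_sort_value` function and the per-title non-filing counts are assumptions made for illustration (the counts mimic a MARC non-filing indicator); this is not how Evergreen computes sort values.

```python
def filing_sort_value(display: str, nonfiling: int = 0) -> str:
    """Hypothetical sort value: skip `nonfiling` leading characters, then lowercase."""
    return display[nonfiling:].lower()


# (display value, number of non-filing characters); illustrative data only
titles = [("Big bang", 0), ("The big bang", 4), ("A big bang", 2)]

by_sort = {}
for display, nonfiling in titles:
    by_sort.setdefault(filing_sort_value(display, nonfiling), []).append(display)

print(by_sort)
```

All three display values land under the single sort value "big bang" here, so a uniqueness check on the sort value alone would keep only whichever one was ingested first.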

If ISBD punctuation should not play a role in browse entry inclusion, then we should use a normalizer to remove that punctuation. For authority records, the normalizer mappings live in authority.heading_field_norm_map and for bib-side indexing they live in config.metabib_field_index_norm_map.

I've removed the signedoff and pullrequest tags because merging this will cause incorrect behavior for some data.

tags:	added: needsdiscussion removed: signedoff
tags:	removed: pullrequest

Elaine Hardy (ehardy) wrote on 2020-02-21:

#10

If the main purpose of the browse search is to support user discovery, then having multiple entries for what is in reality the same thing is not helpful, particularly if those differences are due to errors in cataloging. While I often use browse search to find errors, and having punctuation and capitalization differences create separate entries is a big help for me, having those separate entries does not serve the average user and really only serves catalogers.

There are many reasons why entries in the browse list are separate, some of which will not be fixed by this and some of which will persist because of the nature of the data; however, getting to a cleaner browse list is in the best interest of patrons, so I would argue for the reinstatement of the signedoff and pullrequest tags.

Mike Rylander (mrylander) wrote on 2020-02-21:

#11

Thanks Elaine, I think we all agree that we want a better outcome for patrons.

However, there are and will continue to be browse entries that are provably not the "same thing" to end users, and should all exist and have different linked endpoints, and which have *different display values*, but also have *identical sort values*. The branch above won't allow those, and in fact will *collapse* all those different things into one entry, pretending they are the same.

I'll be as blunt as I possibly can regarding the patch offered, as a purely technical matter: it is a hammer looking for a nail, and only makes things worse for both users and code maintainers down the road. It shows a lack of understanding of the existing code, both in intent and extent, and it will drop valid, differentiated browse entries, decreasing user discovery.

If we simply want ISBD punctuation to be removed from the [end of] user-visible strings then we should do that, and we already have a way to do that without breaking anything else. I've pointed to where the mapping could be added to stock seed data, and if I can find time to come up with a good normalizer and mapping set for the stock data I will do so, but regardless I'm going to continue to suggest we not add new landmines to trip over in the future.

Jeff Davis (jdavis-sitka) wrote on 2020-02-21:

#12

Mike, do you have examples in mind of browse entries with identical sort values but meaningfully different display values? In the case of non-filing characters, it would be a defensible choice to ignore those anyway -- the distinction between "Lord of the Rings" and "The Lord of the Rings" is not meaningful or even desirable for many users. But of course there may be cases I'm not thinking of where the distinction does matter.

Also, just a note that ISBD punctuation isn't the only issue. Inconsistent capitalization has also been identified as a problem.

Mike Rylander (mrylander) wrote on 2020-02-21:

#13

Thanks for asking, Jeff.

Sure, the first one I thought might be an example, title browse search for "big bang", was indeed one. There are different titles under "Big bang" and "The big bang", unsurprisingly. Then I tried a title browse for "girl", and found "A Girl" and "The Girl" and "Girl", all different titles by different authors (and different formats). Similarly, I did a title browse for "stone" and got actually-different titles (and formats) of "stone" and "the stone".

That's just the easy stuff that I guessed might happen in a couple minutes on a test server. I can probably come up with some cross-language ones as well, which would be even more problematic for patrons, IMO.

Anyway, the point of all this is I believe we should fix the issue at the correct layer (which I see as configurably defining the meaning of "sameness" by normalizing "real value" strings in predetermined and predictable ways -- read: normalizers), and I strongly disagree that the offered patch is indeed changing things at the correct layer.

For the room generally: is it perhaps the issue that nobody wants to tackle creating (or verifying we have already) an appropriate ISBD-trimming normalizer? I really don't want to be re-implementing browse in a year or two because "it's broken and doesn't respect the data as cataloged".

Mike Rylander (mrylander) wrote on 2020-02-22:

#14

To follow up with some more thoughts ...

It's also true that we may need different display values (that is, what we should be showing the user) which can come from different fields (subject vs title vs author) while the sort value may very well normalize to the same string. If we only compare uniqueness on the sort value then the different display variants will be lost, and we'll get the "wrong" title, or subject, or whatever displayed.

And when, in some glorious future, most Evergreen instances use authority linking and controlled values (where the *real* value of browse becomes apparent) then ignoring the richness of the data (read: complexity of handling the data) will really bite us, especially in the from-different-fields case.

Jeff Davis (jdavis-sitka) wrote on 2020-02-22:

#15

What one person calls "ignoring the richness of the data," another person might call "not exposing the user to the messiness of the data." :)

In our catalogue, title browse for "big bang" shows the following results:

Big bang (1)
The Big Bang (1)
The Big bang (1)
The big bang (4)

We'd prefer for at least the last three entries to be collapsed into a single entry. This means separate titles would be grouped together, but that's already the case: that last entry includes three unrelated works, one of which also falls under the second-last entry due to having multiple bib records for that work. If we wanted to disambiguate them, we could add a statement of responsibility to the browse field definition or something. But for now, having a single entry for all those different works would be an improvement.

If I understand correctly, to do this by normalizing the display values directly, we would need at least two normalizers:

(1) Strip trailing punctuation when appropriate. We could use the existing "Trim Trailing Punctuation" normalizer, but it literally just strips a single trailing comma or period from the end of the string, which doesn't cover some common ISBD punctuation. It's also not smart enough to avoid trimming the last period from "Ph.D." and so on.

(2) Normalize capitalization. There isn't currently a normalizer for this that is appropriate for user-displayed strings, and I'm not sure how feasible it is to come up with one. In English-language contexts I'd vote for normalizing to title case ("The Big Bang") except for name and subject browse entries, but other languages/locales would need different rules, which may be difficult to implement with a simple algorithm -- French title case gets tricky pretty quickly, for example.

It seems difficult to do this correctly.

Thinking aloud, fingerprinting browse fields and using whatever we already have for that fingerprint as our display value (even if it's not a perfect match) is a feasible alternative which may produce better results in most cases than what we currently have. So "The Big Bang" and "The Big bang" and "The big bang" would all get "the big bang" as a fingerprint, and all would appear in title browse under "The Big Bang" if that's the browse entry that already exists for that fingerprint. That's not dissimilar to what Blake's branch is trying to do, but hopefully with fewer pitfalls.
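As a sketch only, the fingerprinting idea could look like the Python below. The `fingerprint` function (casefold plus whitespace collapse) and the way counts are simply summed are assumptions for illustration, not an existing Evergreen function or the behavior of any branch.

```python
def fingerprint(display: str) -> str:
    """Hypothetical fingerprint: casefold and collapse whitespace."""
    return " ".join(display.casefold().split())


def fold_by_fingerprint(entries):
    """Fold (display, count) pairs that share a fingerprint.
    The first display value seen for a fingerprint is the one shown."""
    folded = {}
    for display, count in entries:
        fp = fingerprint(display)
        if fp in folded:
            shown, total = folded[fp]
            folded[fp] = (shown, total + count)
        else:
            folded[fp] = (display, count)
    return list(folded.values())


# The title browse results quoted above
browse = [
    ("Big bang", 1),
    ("The Big Bang", 1),
    ("The Big bang", 1),
    ("The big bang", 4),
]

print(fold_by_fingerprint(browse))
# [('Big bang', 1), ('The Big Bang', 6)]
```

Because the fingerprint here keeps the leading article, "Big bang" stays separate from the three "The big bang" variants, which matches the "at least the last three entries" preference.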

Mike Rylander (mrylander) wrote on 2020-02-22:

#16

Thanks, Jeff.

To be clear, by richness of the data I'm referring to what can be encoded (and made use of), not how well it is, in fact, encoded.

Re "the big bang", would you also want "a big bang" to fold into "big bang", especially under a title browse? I'm not saying that exists in your instance, but I am saying it would fold in, given proper non-filing characters. I'd certainly argue they should be different display strings -- see my "girl"/"a girl"/"the girl" example.

I can certainly see a case (heh) being made for optional case-insensitive comparison on the display field, though I think that would need to be configurable at least per browse class, if not per field. Case may matter for differentiating names in the author class, say. I can think of at least 2 different ways to do that in a performant manner, either using citext or our evergreen.lowercase() function and an additional index. And, as you say, there's titlecasing the display field as a normalizer, which has all the language-oriented algorithmic pitfalls you allude to, but might be a perfectly reasonable choice for some catalogs to make. We just shouldn't force it on them.
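A small Python sketch of what "optional, per-class case-insensitive comparison on the display field" might mean in practice. The `CASE_INSENSITIVE_CLASSES` setting and both helper functions are hypothetical; a real implementation would live in the database (for example via citext or a lowercased index, as mentioned above) rather than in application code.

```python
# Hypothetical per-class setting: which browse classes ignore case
# when looking for a pre-existing entry.
CASE_INSENSITIVE_CLASSES = {"title", "subject"}


def lookup_key(browse_class: str, display: str, sort_val: str):
    """Build the key used to decide whether an entry already exists."""
    shown = display.lower() if browse_class in CASE_INSENSITIVE_CLASSES else display
    return (browse_class, shown, sort_val)


def find_or_add(entries: dict, browse_class: str, display: str, sort_val: str) -> str:
    """Return the display value of the matching entry, adding one if needed."""
    return entries.setdefault(lookup_key(browse_class, display, sort_val), display)


entries = {}
a = find_or_add(entries, "title", "The Big Bang", "big bang")
b = find_or_add(entries, "title", "The big bang", "big bang")   # folds into a
c = find_or_add(entries, "title", "Big bang", "big bang")       # stays separate
d = find_or_add(entries, "author", "MacDonald, Ann", "macdonald ann")
e = find_or_add(entries, "author", "Macdonald, Ann", "macdonald ann")  # stays separate
```

The display value still takes part in the comparison, so "Big bang" and "The big bang" remain distinct entries even though their sort values match; only the case difference is ignored, and only for the classes that opt in.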

As for trailing ISBD punctuation in an author field -- which, I want to highlight, is the specific original driver for this LP bug -- that actually seems simple. We just remove trailing punctuation that follows a non-word character. That won't strip the period at the end of "Ph.D." but will strip the OP's "." following a ")". It'll also strip commas following periods (ex: "Rowling, J.K.,") which are a common source of this issue in a couple instances I've looked at this morning.
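The rule "remove trailing punctuation that follows a non-word character" can be sketched with a regular expression. This is an illustration in Python, not the normalizer shipped with Evergreen; the character set `. , ; : /` and the sample headings (other than "Ph.D." and "Rowling, J.K.,") are assumptions.

```python
import re

# One or more trailing punctuation marks, but only when the character
# just before them is a non-word character (space, ")", ".", and so on).
TRAILING_ISBD = re.compile(r"(?<=\W)[.,;:/]+\s*$")


def trim_trailing_punctuation(heading: str) -> str:
    """Strip trailing punctuation that follows a non-word character."""
    return TRAILING_ISBD.sub("", heading).rstrip()


print(trim_trailing_punctuation("Ph.D."))            # Ph.D.
print(trim_trailing_punctuation("Rowling, J.K.,"))   # Rowling, J.K.
print(trim_trailing_punctuation("Name (Group)."))    # Name (Group)
print(trim_trailing_punctuation("The big bang /"))   # The big bang
```

The lookbehind is what protects dotted abbreviations: the final period in "Ph.D." follows the word character "D", so nothing matches, whereas the comma in "J.K.," follows a period and is removed.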

There is also the case of dangling ISBD punctuation at the end of titles, where there should be a statement of responsibility (probably) but for whatever reason there is not. The above would handle this because those are always (supposed to be, and seem to be in examples I can find in the wild) preceded by a space.

All of this still doesn't address differing normalization rules for different fields. Do we want to add articles to the display of subjects because a title normalizes away an article in the sort value due to non-filing characters?

Mike Rylander (mrylander) wrote on 2020-02-22:

#17

Whoa, hold the phone! Side quest time!

Jeff, your primary examples ("lord of the rings", and your instance's version of "big bang") are about leading articles being harmful, at least in your EG instance, and you'd like to ignore them for browse (both sort, as specified by cataloging rules, AND for display, seen as an improvement for the patron), is that right?

Assuming so (and ignoring case for the moment) you can already do that. Assuming a stock title|browse definition (adjust as needed), all you need to do is:

UPDATE config.metabib_field SET browse_xpath = browse_sort_xpath WHERE field_class='title' and name='browse';

Mike Rylander (mrylander) wrote on 2020-02-22:

#18

After my side quest I decided to step back, re-read the whole thread, and take another look at the situation globally.

First, I think we missed or lost sight of the fact that this bug is a duplicate of bug 1308090, which already fixed the OP issue (though, looking at the code, I think it could be adjusted to protect the "Ph.D." case and other dotted abbreviations; I will offer a branch to address that within the next few days). I suspect that Blake does not have the fix from that bug in his installation if his current complaint is about trailing ISBD on author fields, which is what this bug is about. (If not, there needed to be a new LP bug anyway because his patch is confusing things.)

With the browse and browse-sort xpath (there since day-1 of browse) and existing normalizers (see bug 1308090), case differences are really the only remaining "duplicate" issue that can't be covered by the code as it stands today. I recommend that we mark this bug as a duplicate of 1308090, which the OP definitely was to begin with, and open a new LP bug to consider proper ways of ignoring case differences in display values for browse entries, because that's what's left, and is fixable without breaking browse generally.

IOW, I think too many (already addressed) wires are getting crossed on this bug now.

Objections to that?

Mike Rylander (mrylander) wrote on 2020-02-24:

#19

As promised in the top half of comment #18, please see bug 1864507 which builds on the fix on bug 1308090 to protect some types of trailing punctuation from removal (particularly, dotted abbreviations) and expands it to cover dangling colons and slashes on title fields.

I'll be looking at optional case-insensitive display value comparison soon, and update here when I have something to share.

Mike Rylander (mrylander) wrote on 2020-02-24:

#20

And, as promised in comment #19, please see bug 1864516 which adds the ability to ignore case when looking for pre-existing browse entries.

I'm going to mark this bug as wont-fix since there are now two specifically targeted bugs with pullrequest tags that, AFAICT, address the remaining issues that are tangled up in this bug. If there are other cases, edge or otherwise, unrelated to either trailing ISBD punctuation (bug 1864507) or the ability to case-fold when looking for a pre-existing browse entry (bug 1864516), please open a new bug for each.

Thanks, all.

Changed in evergreen:
status:	Confirmed → Won't Fix
