Comment 15 for bug 1350831

Jeff Davis (jdavis-sitka) wrote :

What one person calls "ignoring the richness of the data," another person might call "not exposing the user to the messiness of the data." :)

In our catalogue, title browse for "big bang" shows the following results:

Big bang (1)
The Big Bang (1)
The Big bang (1)
The big bang (4)

We'd prefer at least the last three entries to be collapsed into a single entry. That would group separate titles together, but that's already the case: the last entry covers three unrelated works, one of which also appears under the second-last entry because that work has multiple bib records. If we wanted to disambiguate them, we could add a statement of responsibility to the browse field definition, or something similar. But for now, a single entry for all those different works would be an improvement.

If I understand correctly, to do this by normalizing the display values directly, we would need at least two normalizers:

(1) Strip trailing punctuation when appropriate. We could use the existing "Trim Trailing Punctuation" normalizer, but it literally just strips a single trailing comma or period from the end of the string, which doesn't cover some common ISBD punctuation. It's also not smart enough to avoid trimming the last period from "Ph.D." and so on.

(2) Normalize capitalization. There isn't currently a normalizer for this that is appropriate for user-displayed strings, and I'm not sure how feasible it is to come up with one. In English-language contexts I'd vote for normalizing to title case ("The Big Bang") except for name and subject browse entries, but other languages/locales would need different rules, which may be difficult to implement with a simple algorithm -- French title case gets tricky pretty quickly, for example.

Getting both of these right, especially across languages and locales, seems difficult.
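
To give a sense of what (1) involves, here is a rough Python sketch of a more aggressive trailing-punctuation trimmer. It is not the existing "Trim Trailing Punctuation" normalizer and not Evergreen code; the function name, the abbreviation list, and the set of separators are all assumptions made up for illustration.

```python
import re

# Hypothetical list of abbreviations whose final period should be kept.
# A real list would need to be much longer and locale-aware.
KEEP_FINAL_PERIOD = ("Ph.D.", "Jr.", "Sr.", "Inc.", "etc.")

# One trailing separator: "/", ":", ";", "=", "," or ".", with optional
# whitespace on either side, anchored at the end of the string.
TRAILING = re.compile(r"\s*[/:;=,.]\s*$")


def trim_trailing_punctuation(value: str) -> str:
    """Strip trailing separators one at a time until none remain,
    stopping early if the string ends in a listed abbreviation."""
    value = value.strip()
    while True:
        if value.endswith(KEEP_FINAL_PERIOD):
            return value
        trimmed = TRAILING.sub("", value)
        if trimmed == value:
            return value
        value = trimmed


print(trim_trailing_punctuation("The big bang /"))
print(trim_trailing_punctuation("Smith, John, Ph.D."))
```

Even this small sketch shows the problem: the abbreviation list is never complete, and a title that legitimately ends in an ellipsis would lose it.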

Thinking aloud: a feasible alternative is to fingerprint browse fields and use whatever entry we already have for that fingerprint as the display value, even if it's not a perfect match. In most cases that may produce better results than what we currently have. So "The Big Bang" and "The Big bang" and "The big bang" would all get "the big bang" as a fingerprint, and all would appear in title browse under "The Big Bang" if that's the browse entry that already exists for that fingerprint. That's not dissimilar to what Blake's branch is trying to do, but hopefully with fewer pitfalls.
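
As a minimal sketch of the fingerprint idea (in Python, purely for illustration; it is not Blake's branch and not how Evergreen stores browse entries), assume a fingerprint is the case-folded string with punctuation removed and whitespace collapsed. The first display value seen for a fingerprint wins. The function names and the choice to sum the counts are my assumptions.

```python
import re


def fingerprint(value: str) -> str:
    """Case-fold, drop punctuation, and collapse runs of whitespace."""
    lowered = value.casefold()
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    return " ".join(no_punct.split())


def collapse_browse_entries(entries):
    """Group (display_value, count) pairs by fingerprint.

    The first display value encountered for a fingerprint is kept as
    the label, and the counts of later matches are added to it.
    """
    display = {}
    totals = {}
    for value, count in entries:
        key = fingerprint(value)
        display.setdefault(key, value)  # keep the entry that already exists
        totals[key] = totals.get(key, 0) + count
    return [(display[key], totals[key]) for key in display]


entries = [
    ("Big bang", 1),
    ("The Big Bang", 1),
    ("The Big bang", 1),
    ("The big bang", 4),
]
print(collapse_browse_entries(entries))
# [('Big bang', 1), ('The Big Bang', 6)]
```

Note that "Big bang" stays separate here because the sketch does not drop leading articles; whether it should is a separate question.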