Comment 16 for bug 1350831

Mike Rylander (mrylander) wrote :

Thanks, Jeff.

To be clear, by richness of the data I'm referring to what can be encoded (and made use of), not how well it is, in fact, encoded.

Re "the big bang", would you also want "a big bang" to fold into "big bang", especially under a title browse? I'm not saying that exists in your instance, but I am saying it would fold in, given proper non-filing characters. I'd certainly argue they should be different display strings -- see my "girl"/"a girl"/"the girl" example.
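To illustrate the folding concern, here is a minimal sketch, not Evergreen's actual browse code. It assumes each heading carries a count of non-filing characters (as a MARC title field's second indicator does), and the function name `sort_value` is made up for the example:

```python
from collections import defaultdict

def sort_value(display: str, nonfiling: int) -> str:
    """Drop the leading non-filing characters to get the filing/sort form."""
    return display[nonfiling:]

# (display string, number of non-filing characters) -- illustrative data only
headings = [
    ("the big bang", 4),  # skips "the "
    ("a big bang", 2),    # skips "a "
    ("big bang", 0),
]

# Group display strings by their sort value, as a browse that
# collapses on the sort value alone would.
groups = defaultdict(list)
for display, nonfiling in headings:
    groups[sort_value(display, nonfiling)].append(display)

for key, displays in groups.items():
    print(key, "<-", displays)
```

All three headings share one sort value here, so collapsing on the sort value alone would merge entries whose display strings differ.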

I can certainly see a case (heh) being made for optional case-insensitive comparison on the display field, though I think that would need to be configurable at least per browse class, if not per field. Case may matter for differentiating names in the author class, say. I can think of at least 2 different ways to do that in a performant manner, either using citext or our evergreen.lowercase() function and an additional index. And, as you say, there's titlecasing the display field as a normalizer, which has all the language-oriented algorithmic pitfalls you allude to, but might be a perfectly reasonable choice for some catalogs to make. We just shouldn't force it on them.

As for trailing ISBD punctuation in an author field -- which, I want to highlight, is the specific original driver for this LP bug -- that actually seems simple. We just remove trailing punctuation that follows a non-word character. That won't strip the period at the end of "Ph.D." but will strip the OP's "." following a ")". It'll also strip commas following periods (ex: "Rowling, J.K.,") which are a common source of this issue in a couple instances I've looked at this morning.

There is also the case of dangling ISBD punctuation at the end of titles, where there should (probably) be a statement of responsibility but for whatever reason there is not. The above would handle this because that punctuation is always (or is supposed to be, and seems to be in the examples I can find in the wild) preceded by a space.
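A rough sketch of the "strip trailing punctuation only after a non-word character" rule, covering both the author and title cases. This is an illustration in Python rather than the actual normalizer; the function name, the exact set of punctuation characters, and the sample author heading with the date are assumptions made for the example:

```python
import re

# One trailing ISBD-style punctuation mark (assumed set: . , ; : / =),
# only when the character immediately before it is a non-word character.
_TRAILING_PUNCT = re.compile(r"(?<=\W)[.,;:/=]\s*$")

def strip_trailing_isbd(value: str) -> str:
    """Remove a single dangling punctuation mark, then any leftover whitespace."""
    return _TRAILING_PUNCT.sub("", value).rstrip()

samples = [
    "Ph.D.",                 # "." follows a word character: kept
    "Smith, John (1950-).",  # hypothetical heading: "." follows ")": stripped
    "Rowling, J.K.,",        # "," follows ".": stripped
    "The big bang /",        # "/" follows a space: stripped
]

for s in samples:
    print(repr(s), "->", repr(strip_trailing_isbd(s)))
```

Only one mark is removed per pass on purpose. A greedy version would go on to eat the ")" in the second sample, since it follows "-".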

All of this still doesn't address differing normalization rules for different fields. Do we want to add articles to the display of subjects because a title normalizes away an article in the sort value due to non-filing characters?