ASCII apostrophe and Unicode right single quotation mark should be normalized

Bug #1657171 reported by Dan Scott on 2017-01-17
This bug affects 4 people
Affects     Importance   Assigned to
Evergreen (status tracked in Master)
3.1         Medium       Unassigned
3.2         Medium       Unassigned
Master      Medium       Unassigned

Bug Description

* Evergreen 2.10

Given two records with the following 245 $a:

Les faces cachées de l’intervention en situation de crise
Les faces cachées de l'intervention en situation de crise

... one would expect a search for "Les faces cachées de l'intervention en situation de crise" to turn up both records.

However, as given away by the bug title, the first record uses Unicode U+2019, and does not get returned in the search results. Unexpected! Sad!

According to a few sources such as https://www.quora.com/Punctuation-Why-is-the-right-single-quote-U+2019-and-not-the-semantically-distinct-apostrophe-U+0027-the-preferred-apostrophe-character-in-Unicode, U+2019 is the preferred way to represent apostrophes in text, vs. ye olde ASCII-era U+0027.

Long story short, most people are going to type U+0027, so we should probably normalize U+2019 to U+0027 when indexing.
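To make the mismatch concrete, here is a minimal Python sketch of the proposed normalization. It is an illustration only: `normalize_apostrophes` is a hypothetical helper, not Evergreen's search_normalize(), and it handles just the one character discussed above.

```python
def normalize_apostrophes(text: str) -> str:
    """Hypothetical helper: fold U+2019 into the ASCII apostrophe U+0027."""
    return text.replace("\u2019", "\u0027")


# The same word as it might be stored in the record and as typed by a user.
indexed = "l\u2019intervention"  # right single quotation mark
typed = "l'intervention"         # ASCII apostrophe

# Without normalization the two strings are different.
print(indexed == typed)

# Normalizing both sides the same way makes them comparable.
print(normalize_apostrophes(indexed) == normalize_apostrophes(typed))
```

The point of the sketch is that the same function has to run on both the indexed text and the search input; normalizing only one side would leave the mismatch in place.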

Dan Scott (denials) on 2017-01-17
description: updated
Mike Rylander (mrylander) wrote :

This[1] is licensed under the Apache 2.0 license, which is compatible with the GPL, so maybe we should steal some of the normalizations from it and add them to our search_normalize() function?

[1] https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark/blob/master/scripts/normalize-punctuation.perl

Kathy Lussier (klussier) wrote :

We just stumbled across this issue too. If a fix is applied, it would be good to add it to the naco_normalize function as well as search_normalize for those of us who are using NACO.

Changed in evergreen:
status: New → Confirmed
importance: Undecided → Medium
Elizabeth Thomsen (et-8) wrote :

We see this problem with users copying and pasting text from other sources.

Also note that curly quotes are ignored and not treated as quotation marks to search a phrase: “new england” is searched as new england

Michele Morgan (mmorgan) on 2018-10-03
Changed in evergreen:
assignee: nobody → Michele Morgan (mmorgan)
Michele Morgan (mmorgan) on 2018-10-12
tags: added: pullrequest search
Changed in evergreen:
assignee: Michele Morgan (mmorgan) → nobody
Michele Morgan (mmorgan) wrote :

Working branch is at:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=commit;h=6c598d48cbe06ca01f6907b16df5e4a9c4d27999

The patch changes search_normalize and naco_normalize to make the following replacements: left- and right-leaning single quotes are replaced with U+0027, and double quotes with U+0022.

The following are replaced with U+0027 - Apostrophe

U+2018 - Left single quotation mark
U+2019 - Right single quotation mark
U+201B - Single high-reversed-9 quotation mark
U+FF07 - Fullwidth apostrophe
U+201A - Single low-9 quotation mark

The following are replaced with U+0022 - Quotation Mark

U+201C - Left double quotation mark
U+201D - Right double quotation mark
U+201F - Double high-reversed-9 quotation mark
U+FF02 - Fullwidth Quotation Mark
U+201E - Double Low-9 Quotation Mark
U+2E42 - Double Low Reversed-9 quotation mark

This fix does not address the curly quoted phrase search issue from comment #3, which is far less common and can be handled in a separate bug.
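As a rough illustration of the replacements described in this comment, the mapping can be sketched in Python with a translation table. This is not the patch itself; the names `SINGLE_QUOTES`, `DOUBLE_QUOTES`, `QUOTE_TABLE`, and `normalize_quotes` are made up for the example, and the fullwidth quotation mark is taken to be U+FF02, the code point Unicode assigns to FULLWIDTH QUOTATION MARK.

```python
# Characters folded into U+0027 (apostrophe).
SINGLE_QUOTES = "\u2018\u2019\u201B\uFF07\u201A"

# Characters folded into U+0022 (quotation mark).
# Assumption: U+FF02 is used for the fullwidth quotation mark.
DOUBLE_QUOTES = "\u201C\u201D\u201F\uFF02\u201E\u2E42"

# Build one translation table covering both groups.
QUOTE_TABLE = str.maketrans(
    SINGLE_QUOTES + DOUBLE_QUOTES,
    "'" * len(SINGLE_QUOTES) + '"' * len(DOUBLE_QUOTES),
)


def normalize_quotes(text: str) -> str:
    """Hypothetical helper: replace typographic quotes with ASCII ones."""
    return text.translate(QUOTE_TABLE)


print(normalize_quotes("l\u2019intervention"))
print(normalize_quotes("\u201Cnew england\u201D"))
```

A single table keeps the character list in one place, so adding or removing a code point is a one-line change. The sketch only rewrites characters; it does not make a search parser treat the resulting quotes as a phrase delimiter.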

Michele Morgan (mmorgan) wrote :

Adding a link to a relevant IRC discussion stating that iOS browsers are using the "pretty" apostrophe by default (Unicode U+2019, UTF-8 e2 80 99):

http://irc.evergreen-ils.org/evergreen/2018-12-27#i_389512
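For anyone checking what a browser actually sent, the byte sequence quoted above can be confirmed with a short Python snippet. The variable names are illustrative only.

```python
# Encode both apostrophe variants as UTF-8 to compare their bytes.
pretty_bytes = "\u2019".encode("utf-8")  # right single quotation mark
ascii_bytes = "\u0027".encode("utf-8")   # ASCII apostrophe

print(pretty_bytes.hex(" "))  # e2 80 99
print(ascii_bytes.hex(" "))   # 27
```

Seeing the three-byte sequence in a logged query string is a quick way to tell that the typographic apostrophe, not the ASCII one, was submitted.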
