ASCII apostrophe and Unicode right single quotation mark should be normalized

Bug #1657171 reported by Dan Scott
This bug affects 4 people
Affects     Status         Importance  Assigned to  Milestone
Evergreen   Status tracked in Main
  3.1       Won't Fix      Medium      Unassigned
  3.2       Won't Fix      Medium      Unassigned
  Main      Fix Released   Medium      Unassigned

Bug Description

* Evergreen 2.10

Given two records with the following 245 $a:

Les faces cachées de lʹintervention en situation de crise
Les faces cachées de l'intervention en situation de crise

... one would expect a search for "Les faces cachées de l'intervention en situation de crise" to turn up both records.

However, as given away by the bug title, the first record uses Unicode U+2019, and does not get returned in the search results. Unexpected! Sad!

According to a few sources, such as https://www.quora.com/Punctuation-Why-is-the-right-single-quote-U+2019-and-not-the-semantically-distinct-apostrophe-U+0027-the-preferred-apostrophe-character-in-Unicode, U+2019 is the preferred way to represent apostrophes in text, vs. ye olde ASCII-era U+0027.

Long story short, most people are going to type U+0027, so we should probably normalize U+2019 to U+0027 when indexing.
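
To make the mismatch concrete, here is a minimal Python sketch. It is not Evergreen code, and the helper name fold_apostrophe is hypothetical. It assumes, as described above, that the first title contains U+2019 and the second U+0027, and shows that the two strings only compare equal once U+2019 is folded to U+0027.

```python
# Hypothetical helper, not part of Evergreen: fold U+2019 to U+0027.
def fold_apostrophe(text: str) -> str:
    return text.replace("\u2019", "\u0027")

# The two 245 $a values, written with explicit escapes so the
# difference between the apostrophe characters is visible.
curly = "Les faces cachées de l\u2019intervention en situation de crise"
plain = "Les faces cachées de l\u0027intervention en situation de crise"

print(curly == plain)                                    # False
print(fold_apostrophe(curly) == fold_apostrophe(plain))  # True
```

If both the indexed text and the search query pass through the same folding step, either form of apostrophe should match either record.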

Tags: search
Dan Scott (denials)
description: updated
Mike Rylander (mrylander) wrote :

This[1] is licensed under the Apache License 2.0, which is compatible with the GPL, so maybe we should steal some of the normalizations from it and add them to our search_normalize() function?

[1] https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark/blob/master/scripts/normalize-punctuation.perl

Kathy Lussier (klussier) wrote :

We just stumbled across this issue too. If a fix is applied, it would be good to add it to the naco_normalize function as well as search_normalize for those of us who are using NACO.

Changed in evergreen:
status: New → Confirmed
importance: Undecided → Medium
Elizabeth Thomsen (et-8) wrote :

We see this problem with users copying and pasting text from other sources.

Also note that curly quotes are ignored and not treated as quotation marks to search a phrase: “new england” is searched as new england
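
A toy illustration of why this can happen, assuming a query parser that only treats the ASCII double quote (U+0022) as a phrase delimiter. The function extract_phrases is hypothetical and is not Evergreen's actual query parser.

```python
import re

# Toy phrase detector (hypothetical, not Evergreen's query parser):
# only ASCII double quotes (U+0022) delimit a phrase.
def extract_phrases(query: str) -> list[str]:
    return re.findall(r'"([^"]+)"', query)

print(extract_phrases('"new england"'))            # ['new england']
print(extract_phrases('\u201Cnew england\u201D'))  # []
```

With curly quotes, no phrase is detected, so the words would be searched as ordinary terms.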

Michele Morgan (mmorgan)
Changed in evergreen:
assignee: nobody → Michele Morgan (mmorgan)
Michele Morgan (mmorgan)
tags: added: pullrequest search
Changed in evergreen:
assignee: Michele Morgan (mmorgan) → nobody
Michele Morgan (mmorgan) wrote :

Working branch is at:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=commit;h=6c598d48cbe06ca01f6907b16df5e4a9c4d27999

The patch changes search_normalize and naco_normalize to make the following replacements: left- and right-leaning single quotes with U+0027, and double quotes with U+0022.

The following are replaced with U+0027 - Apostrophe

U+2018 - Left single quotation mark
U+2019 - Right single quotation mark
U+201B - Single high-reversed-9 quotation mark
U+FF07 - Fullwidth apostrophe
U+201A - Single low-9 quotation mark

The following are replaced with U+0022 - Quotation Mark

U+201C - Left double quotation mark
U+201D - Right double quotation mark
U+201F - Double high-reversed-9 quotation mark
U+FF0C - Fullwidth Quotation Mark
U+201E - Double low-9 quotation mark
U+2E42 - Double low-reversed-9 quotation mark

This fix does not address the curly quoted phrase search issue from comment #3, which is far less common and can be handled in a separate bug.
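
The replacements listed above can be sketched in Python as a single translation table. This is only an illustration of the mapping, not the patch itself (the real change is in the search_normalize and naco_normalize database functions), and the names below are hypothetical. One assumption to flag: the list gives U+FF0C for the fullwidth quotation mark, but in the Unicode code charts U+FF0C is FULLWIDTH COMMA and U+FF02 is FULLWIDTH QUOTATION MARK, so the sketch uses U+FF02. That substitution is an assumption of this sketch, not a statement about what the patch contains.

```python
# Sketch only: mirrors the code points listed in the comment above,
# not the actual search_normalize/naco_normalize implementation.
SINGLE_QUOTE_LIKE = "\u2018\u2019\u201B\uFF07\u201A"

# Assumption: U+FF02 (FULLWIDTH QUOTATION MARK) is used here where the
# list says U+FF0C, which Unicode names FULLWIDTH COMMA.
DOUBLE_QUOTE_LIKE = "\u201C\u201D\u201F\uFF02\u201E\u2E42"

# Map every single-quote-like character to U+0027 and every
# double-quote-like character to U+0022.
QUOTE_TABLE = str.maketrans(
    SINGLE_QUOTE_LIKE + DOUBLE_QUOTE_LIKE,
    "'" * len(SINGLE_QUOTE_LIKE) + '"' * len(DOUBLE_QUOTE_LIKE),
)

def normalize_quotes(text: str) -> str:
    return text.translate(QUOTE_TABLE)

print(normalize_quotes("l\u2019intervention"))         # l'intervention
print(normalize_quotes("\u201Cnew england\u201D"))     # "new england"
```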

Michele Morgan (mmorgan) wrote :

Adding a link to a relevant IRC discussion stating that iOS browsers are using the "pretty" apostrophe by default (Unicode U+2019, UTF-8 e2 80 99):

http://irc.evergreen-ils.org/evergreen/2018-12-27#i_389512
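
The code point and byte sequence quoted above can be checked with a couple of lines of Python (a quick verification sketch, nothing Evergreen-specific):

```python
apostrophe = "\u2019"  # RIGHT SINGLE QUOTATION MARK

# UTF-8 encodes U+2019 as three bytes; ASCII U+0027 is a single byte.
print(apostrophe.encode("utf-8").hex(" "))  # e2 80 99
print("'".encode("utf-8").hex())            # 27
```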

Rogan Hamby (rogan-hamby) wrote :

Worked in my testing.

Sign-off pushed to user/rogan/lp1657171_signoff

tags: added: signedoff
Chris Sharp (chrissharp123) wrote :

Pushed to master. Thanks, Michele and Rogan!

tags: removed: pullrequest signedoff