ASCII apostrophe and Unicode right single quotation mark should be normalized
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Evergreen | Status tracked in Main | | |
3.1 | Won't Fix | Medium | Unassigned |
3.2 | Won't Fix | Medium | Unassigned |
Main | Fix Released | Medium | Unassigned |
Bug Description
* Evergreen 2.10
Given two records with the following 245 $a:
Les faces cachées de lʹintervention en situation de crise
Les faces cachées de l'intervention en situation de crise
... one would expect a search for "Les faces cachées de l'intervention en situation de crise" to turn up both records.
However, as given away by the bug title, the first record uses Unicode U+2019, and does not get returned in the search results. Unexpected! Sad!
According to a few sources such as https:/
Long story short, most people are going to type U+0027, so we should probably normalize U+2019 to U+0027 when indexing.
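To illustrate the proposed normalization, here is a minimal Python sketch. It is not Evergreen's actual indexing code; the names `APOSTROPHE_MAP` and `normalize_apostrophes` are hypothetical, and the only mapping taken from this report is U+2019 to U+0027. The idea is that the same folding is applied both to the indexed text and to the search query, so the two forms compare equal.

```python
# Hypothetical mapping table: only U+2019 -> U+0027 comes from the bug report.
APOSTROPHE_MAP = str.maketrans({"\u2019": "\u0027"})


def normalize_apostrophes(text: str) -> str:
    """Fold the Unicode right single quotation mark into the ASCII apostrophe."""
    return text.translate(APOSTROPHE_MAP)


# Apply the same normalization at index time and at query time.
indexed = normalize_apostrophes("Les faces cach\u00e9es de l\u2019intervention")
query = normalize_apostrophes("Les faces cach\u00e9es de l'intervention")
assert indexed == query
```

Any other apostrophe-like characters would need their own entries in the table; this sketch deliberately leaves them untouched.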
description: updated
Changed in evergreen:
assignee: nobody → Michele Morgan (mmorgan)
tags: added: pullrequest search
Changed in evergreen:
assignee: Michele Morgan (mmorgan) → nobody
tags: removed: pullrequest signedoff
This[1] is licensed under the Apache 2.0 license, which is compatible with the GPL, so maybe we should steal some of the normalizations from it and add them to our search_normalize() function?
[1] https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark/blob/master/scripts/normalize-punctuation.perl