Comment 7 for bug 235986

Colin Watson (cjwatson) wrote :

So, at the risk of being the developer who bores everyone with Unicode corner cases, which of these characters should be considered case-insensitively equivalent?

 1) I (U+0049 LATIN CAPITAL LETTER I)
 2) i (U+0069 LATIN SMALL LETTER I)
 3) İ (U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE)
 4) ı (U+0131 LATIN SMALL LETTER DOTLESS I)

The answer, of course, is that it depends on the language: a Turkish speaker will probably give you a different answer than an English speaker. In Turkish, the lower-case form of 1) is 4), and the upper-case form of 2) is 3). In (say) en_GB.UTF-8, ulower follows a reasonable extension of the English rules: it folds 1), 2), and 3) to "i", and folds 4) to itself. But if we naïvely used that for Turkish text, then a search for the upper-case version of a Turkish word containing the lower-case dotless "ı" would not match, and vice versa. (Yes, this is a real problem, and it's one that speakers of Turkic languages have to run around filing bugs for; let's not make their lives harder when we can anticipate the problem.)
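
To make the mismatch concrete, here is a small sketch using Python 3's built-in str.lower() and str.upper(), which apply language-independent rules. This is only a stand-in for ulower, not necessarily how ulower is implemented, and naive_match is a hypothetical helper I made up for illustration.

```python
# The four characters from the list above, written as escapes.
CAPITAL_I = "\u0049"         # I
SMALL_I = "\u0069"           # i
CAPITAL_DOTTED_I = "\u0130"  # İ
SMALL_DOTLESS_I = "\u0131"   # ı


def naive_match(query, text):
    """Hypothetical case-insensitive match using language-independent folding."""
    return query.lower() == text.lower()


# A Turkish word written with two dotless i's, and its upper-case form.
word = SMALL_DOTLESS_I + "s" + SMALL_DOTLESS_I
shouted = word.upper()       # the dotless i upper-cases to plain "I"

# Generic lower-casing turns "I" into dotted "i", so we never get back
# to the dotless form and the two spellings fail to match.
print(shouted == "ISI")             # True
print(shouted.lower() == word)      # False
print(naive_match(shouted, word))   # False
```

A Turkish reader would consider those two strings the same word in different case, so an index built this way silently misses the match.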

Given that this is Translations, we know the language and we really ought to make the case-folding in the index be language-sensitive, rather than just applying some kind of generic rules which will be wrong for some languages. It would thus be wrong to handle this just by changing the database locale, which is too big a hammer and can't be customised per-row. Instead, we need a version of ulower that's context-dependent, and then reindex individual rows based on that plus the language.

(Unfortunately I'm not sure whether there's a straightforward way to do context-dependent case conversion in Python, so this might require some work.)
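
For what it's worth, a rough sketch of what a language-aware lower-casing helper might look like in plain Python follows. Everything here is an assumption for illustration: the name lower_for_language, the TURKIC_LANGUAGES set, and the idea that the language arrives as a code like "tr" or "en_GB". It only covers the Turkish and Azerbaijani dotted/dotless i rules and ignores other language-specific cases, as well as the case where other combining marks sit between "I" and a combining dot above.

```python
# Languages that distinguish dotted and dotless i (assumed list, not exhaustive).
TURKIC_LANGUAGES = frozenset({"tr", "az"})

COMBINING_DOT_ABOVE = "\u0307"
CAPITAL_DOTTED_I = "\u0130"   # İ
SMALL_DOTLESS_I = "\u0131"    # ı


def lower_for_language(text, language_code):
    """Lower-case text, applying Turkic i rules when the language calls for it.

    language_code is assumed to look like "tr", "tr_TR" or "en-GB";
    only the leading language part is examined.
    """
    base = language_code.replace("-", "_").split("_")[0].lower()
    if base in TURKIC_LANGUAGES:
        # Decomposed capital dotted I (I + combining dot) becomes plain "i".
        text = text.replace("I" + COMBINING_DOT_ABOVE, "i")
        # Precomposed capital dotted I becomes plain "i".
        text = text.replace(CAPITAL_DOTTED_I, "i")
        # Any remaining capital I is the dotless one in these languages.
        text = text.replace("I", SMALL_DOTLESS_I)
    # Everything else goes through the generic rules.
    return text.lower()


print(lower_for_language("ISI", "tr") == "\u0131s\u0131")   # True
print(lower_for_language("ISI", "en_GB") == "isi")          # True
```

The point of the sketch is just that the language has to be an input to the folding function, so each row can be reindexed with the rules for its own language. A real implementation would probably want a proper locale-aware case mapping rather than a hand-rolled special case like this.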