translation search is case-sensitive for non-ascii characters

Bug #235986 reported by Matthew Paul Thomas
This bug affects 1 person
Affects: Launchpad itself
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

At <http://launchpad.net/ubuntu/hardy/+source/bluez-gnome/+pots/bluetooth-manager/ru/+translate>, the first string contains the word "категории". Searching for "категории" returns this string as expected. But searching for the same string in capital letters, "КАТЕГОРИИ", does not return the string.

The search function is correctly case-insensitive for strings of some Latin characters.

[Originally reported by Артём Попов in launchpad-users@.]

description: updated
Changed in rosetta:
status: New → Confirmed
Adi Roiban (adiroiban)
description: updated
Adi Roiban (adiroiban) wrote : Re: Search is case-sensitive for some characters

Here is a new example.

Try to find the following strings on Tomboy, Romanian translation:

First search „șab” -> 5 results
https://translations.edge.launchpad.net/ubuntu/hardy/+source/tomboy/+pots/tomboy/ro/+translate?batch=10&show=all&search=%C8%99ab

Second search „Șab” -> 2 results
https://translations.edge.launchpad.net/ubuntu/hardy/+source/tomboy/+pots/tomboy/ro/+translate?batch=10&show=all&search=%C8%98ab

I tried with other characters like ț,Ț,ă,Ă,î,Î and I got the same behaviour.

In the case of the Romanian language, the search is not case-insensitive for any non-ASCII character.

Данило Шеган (danilo) wrote :

This is related to how our Postgres is set up (it uses the C locale), and Stuart doesn't feel comfortable with changing LC_CTYPE to a UTF-8 locale with proper Unicode character mappings (Postgres used to have some problems with that in the past). FWIW, this seemed correct during testing because staging used to use en_US.UTF-8, but Stuart asked for it to be made consistent with our other production system.

I feel the best solution is to start using Unicode-enabled locales for Postgres, but would have to do some tests to make sure everything is fine like that, and that the performance hit is not too big.

Changed in rosetta:
assignee: nobody → danilo
importance: Undecided → High
Artem Popov (artfwo) wrote :

This may also be the reason behind bug 207625.

Данило Шеган (danilo) wrote :

Gary, this is the thing we discussed before: we need to get Postgres running in any Unicode-enabled locale for this bug to be fixed.

Changed in rosetta:
assignee: Данило Шеган (danilo) → nobody
Stuart Bishop (stub) wrote :

You can use ulower() instead of lower() to do case-insensitive searches:

# select lower('КАТЕГОРИИ');
   lower
-----------
 КАТЕГОРИИ
(1 row)

# select ulower('КАТЕГОРИИ');
  ulower
-----------
 категории
(1 row)
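
To illustrate the difference between the two results above, here is a minimal Python sketch. It assumes that lower() under the C locale only maps the ASCII letters A-Z, which is consistent with the output shown. The names `ascii_only_lower` and `unicode_lower` are hypothetical and used only for illustration; they are not Launchpad functions, and this is not necessarily how ulower() is implemented.

```python
def ascii_only_lower(text: str) -> str:
    """Lowercase only ASCII A-Z, leaving every other character untouched.

    Assumption: this approximates what lower() does in a C-locale database.
    """
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in text
    )


def unicode_lower(text: str) -> str:
    """Lowercase using Python's built-in Unicode case mappings."""
    return text.lower()


# Compare both behaviours on words taken from this report.
for word in ("КАТЕГОРИИ", "Șab", "Åben"):
    print(word, ascii_only_lower(word), unicode_lower(word))
```

With the ASCII-only variant, a search term and a stored string that differ only in the case of a non-ASCII letter never compare equal, which matches the behaviour described in this report.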

Curtis Hovey (sinzui)
Changed in launchpad-foundations:
status: New → Triaged
importance: Undecided → Low
Robert Collins (lifeless) wrote : Re: Search is case-sensitive for non-ascii characters

We'll need to reindex with content passed through ulower(), but it should be pretty straightforward to fix this.
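
As a rough sketch of that idea (not the actual Launchpad indexing code), the same folding function can be applied once to the stored strings when the index is built and again to each search term. Here `fold`, `build_index` and `search` are hypothetical names, the index is a plain in-memory list, and Python's `str.lower()` stands in for ulower().

```python
def fold(text: str) -> str:
    """Stand-in for ulower(): Unicode-aware lowercasing (an assumption)."""
    return text.lower()


def build_index(strings):
    """Store each string alongside its folded form, computed once."""
    return [(fold(s), s) for s in strings]


def search(index, term):
    """Return original strings whose folded form contains the folded term."""
    needle = fold(term)
    return [original for folded, original in index if needle in folded]


# Sample strings are made up for illustration.
index = build_index(["Выбор категории", "Åben fil", "Șabloane"])
print(search(index, "КАТЕГОРИИ"))
print(search(index, "åben"))
```

The point of the design is that stored content and search terms go through exactly the same function; if only one side is changed, existing index entries stop matching, hence the need to reindex.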

summary: - Search is case-sensitive for some characters
+ Search is case-sensitive for non-ascii characters
Gary Poster (gary)
tags: added: bugjam2010
summary: - Search is case-sensitive for non-ascii characters
+ translation search is case-sensitive for non-ascii characters
Colin Watson (cjwatson) wrote :

So, at the risk of being the developer who bores everyone with Unicode corner cases, which of these characters should be considered case-insensitively equivalent?

 1) I (U+0049 LATIN CAPITAL LETTER I)
 2) i (U+0069 LATIN SMALL LETTER I)
 3) İ (U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE)
 4) ı (U+0131 LATIN SMALL LETTER DOTLESS I)

The answer, of course, is that it depends on the language: you'll get a different answer if you ask a Turkish speaker than you probably will from an English speaker. In (say) en_GB.UTF-8, ulower follows a reasonable extension of the English rules: it folds 1), 2), and 3) to "i", and folds 4) to itself. But if we naïvely used that for Turkish text then a search for the upper-case version of a Turkish word containing the lower-case dotless "I" would not match, and vice versa. (Yes, this is actually a problem and it's one that Turkic language speakers have to run around filing bugs for; let's not make their life harder when we can anticipate the problem.)

Given that this is Translations, we know the language and we really ought to make the case-folding in the index be language-sensitive, rather than just applying some kind of generic rules which will be wrong for some languages. It would thus be wrong to handle this just by changing the database locale, which is too big a hammer and can't be customised per-row. Instead, we need a version of ulower that's context-dependent, and then reindex individual rows based on that plus the language.

(Unfortunately I'm not sure whether there's a straightforward way to do context-dependent case conversion in Python, so this might require some work.)
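
One possible shape for such a function, as a minimal sketch only: Python's built-in `str.lower()` is not language-aware, so the Turkic mappings are applied by hand before the generic lowercasing. The name `lower_for_language`, the use of ISO 639-1 codes, and the choice to cover only Turkish and Azerbaijani are assumptions for illustration. Other language-specific rules, and decomposed input such as "I" followed by a combining dot, are not handled.

```python
# Languages that distinguish dotted and dotless i.
# Assumption: languages are identified by ISO 639-1 codes.
DOTTED_I_LANGUAGES = {"tr", "az"}


def lower_for_language(text: str, language_code: str) -> str:
    """Lowercase text, applying Turkic i rules where the language needs them."""
    if language_code in DOTTED_I_LANGUAGES:
        # Map the two capital letters explicitly before generic lowercasing:
        # İ (U+0130) -> i (U+0069), and I (U+0049) -> ı (U+0131).
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()


print(lower_for_language("KAPALI", "tr"))
print(lower_for_language("KAPALI", "en"))
```

Because the language is passed per call, the same function could be applied row by row during reindexing, using each row's translation language.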

scootergrisen (scootergrisen) wrote :

The Danish language has the characters a-z plus æ, ø and å.

If I have the string "Åben", I will not find it when searching for "åben".

So for Danish, treat these pairs the same way:
æ and Æ
ø and Ø
å and Å
