transliterate text/use collation before adding to xapian db and when searching

Bug #744914 reported by Lucian Adrian Grijincu
This bug affects 1 person
Affects                    Status     Importance  Assigned to  Milestone
Ubuntu Translations        New        Undecided   Unassigned
software-center (Ubuntu)   Triaged    Medium      Unassigned
    Precise                Won't Fix  Medium      Unassigned

Bug Description

Binary package hint: software-center

Software Center currently applies only str.lower() to the search term when searching the xapian db:

utils/query.py
22: s = search_term.lower()
33: query = xapian.Query(str_to_prefix[search_prefix]+search_term.lower())

There are two problems with this:
* Many languages mark characters with diacritics, but to type faster users often enter only the base character (in Romanian, ăâșțî and ĂÂȘȚÎ are typed as aasti and AASTI by some users).

* Unicode characters can appear in two forms, composed and decomposed: the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
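
The second problem can be reproduced with the standard library alone. This is a minimal sketch, assuming Python 3 strings; the helper name nfc_lower is made up for illustration and is not part of software-center:

```python
import unicodedata

composed = "\u00c7"      # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "C\u0327"   # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# Both strings render as the same glyph, but lowercasing alone
# leaves them as different code point sequences.
lower_matches = composed.lower() == decomposed.lower()


def nfc_lower(text):
    """Hypothetical helper: bring text to one canonical form, then lowercase."""
    return unicodedata.normalize("NFC", text).lower()


# After normalizing to a single form (NFC here), the two spellings compare equal.
nfc_matches = nfc_lower(composed) == nfc_lower(decomposed)

print(lower_matches, nfc_matches)
```

Any one normalization form would do here, as long as the same form is used on both sides of the comparison.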

To solve both problems, both the text added to the xapian db and the user's search query must be normalized in the same way.

The search function in Chromium uses ICU rules to achieve this:
- http://code.google.com/p/chromium/issues/detail?id=1100
- http://www.google.com/codesearch/p?hl=en#OAMlx_jo-ck/src/third_party/WebKit/Source/WebCore/editing/TextIterator.cpp&q=file:TextIterator.cpp&l=1882

There is a python-icu library that could help achieve this. See for example http://lists.osafoundation.org/pipermail/pyicu-dev/2010-October/000214.html

Or one could just remove the diacritical marks from the string altogether: http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string
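
The simpler option of removing the diacritical marks can be sketched with the unicodedata module from the standard library. This is only an illustration, assuming Python 3: strip_diacritics, add_to_index and search are hypothetical names, and a plain dict stands in for the xapian db. The point is that the same function runs on the indexed text and on the query:

```python
import unicodedata


def strip_diacritics(text):
    """Hypothetical helper: decompose, drop combining marks, lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.lower()


# Toy stand-in for the xapian db: normalized term -> set of package names.
index = {}


def add_to_index(term, package):
    """Store the package under the normalized form of the term."""
    index.setdefault(strip_diacritics(term), set()).add(package)


def search(term):
    """Normalize the query exactly like the indexed text before lookup."""
    return index.get(strip_diacritics(term), set())


# "example-package" is a made-up package name.
add_to_index("Știință", "example-package")
print(search("stiinta"))
print(search("ȘTIINȚĂ"))
```

Both queries find the package, whether or not the user typed the diacritics. Note that this approach only handles characters that have a Unicode decomposition; letters such as ł or ø have none and pass through unchanged, which is one reason the ICU-based approach may still be preferable.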

Tags: db
description: updated
Matthew Paul Thomas (mpt) wrote :

This looks like a reasonable suggestion. Can you give an example of a search that would produce better results if this was implemented? That would help in prioritizing it.

Kiwinote (kiwinote)
tags: added: db
Michael Vogt (mvo)
Changed in software-center (Ubuntu):
status: New → Confirmed
importance: Undecided → Medium
Changed in software-center (Ubuntu Precise):
status: New → Confirmed
importance: Undecided → Medium
Changed in software-center (Ubuntu Precise):
status: Confirmed → Triaged
Steve Langasek (vorlon) wrote :

The Precise Pangolin has reached end of life, so this bug will not be fixed for that release.

Changed in software-center (Ubuntu Precise):
status: Triaged → Won't Fix