Ubuntu

Activity log for bug #744914

Date Who What changed Old value New value Message
2011-03-29 12:24:23 Lucian Adrian Grijincu bug added bug
2011-03-29 12:24:53 Lucian Adrian Grijincu description

  Binary package hint: software-center

  As of now software center uses str.lower() when searching in the xapian db:

    utils/query.py
    22: s = search_term.lower()
    33: query = xapian.Query(str_to_prefix[search_prefix]+search_term.lower())

  There are two problems with this:
  * Many languages have diacritic marks for characters, but to type faster users usually write only the base character (in Romanian: ăâșțî and ĂÂȘȚÎ are spelled AASTI by some users).
  * Characters in the Unicode set can appear in two forms, composed and decomposed: the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

  To solve both problems, both the text entered in the xapian db and the user's text query must be normalized.

  The search function in Chromium uses ICU rules to achieve this:
  - http://code.google.com/p/chromium/issues/detail?id=1100
  - http://www.google.com/codesearch/p?hl=en#OAMlx_jo-ck/src/third_party/WebKit/Source/WebCore/editing/TextIterator.cpp&q=file:TextIterator.cpp&l=1882

  There is a python-icu library that could help achieve this. See for example http://lists.osafoundation.org/pipermail/pyicu-dev/2010-October/000214.html

  Or one could just remove the diacritical marks from the string altogether: http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string

  There is also the standard unicodedata.normalize(): http://docs.python.org/library/unicodedata.html

  Binary package hint: software-center

  As of now software center uses str.lower() when searching in the xapian db:

    utils/query.py
    22: s = search_term.lower()
    33: query = xapian.Query(str_to_prefix[search_prefix]+search_term.lower())

  There are two problems with this:
  * Many languages have diacritic marks for characters, but to type faster users usually write only the base character (in Romanian: ăâșțî and ĂÂȘȚÎ are spelled AASTI by some users).
  * Characters in the Unicode set can appear in two forms, composed and decomposed: the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

  To solve both problems, both the text entered in the xapian db and the user's text query must be normalized.

  The search function in Chromium uses ICU rules to achieve this:
  - http://code.google.com/p/chromium/issues/detail?id=1100
  - http://www.google.com/codesearch/p?hl=en#OAMlx_jo-ck/src/third_party/WebKit/Source/WebCore/editing/TextIterator.cpp&q=file:TextIterator.cpp&l=1882

  There is a python-icu library that could help achieve this. See for example http://lists.osafoundation.org/pipermail/pyicu-dev/2010-October/000214.html

  Or one could just remove the diacritical marks from the string altogether: http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string
2011-09-27 20:01:29 Kiwinote tags db
2011-10-07 13:47:37 Michael Vogt software-center (Ubuntu): status New Confirmed
2011-10-07 13:47:40 Michael Vogt software-center (Ubuntu): importance Undecided Medium
2011-10-07 13:47:52 Michael Vogt nominated for series Ubuntu Precise
2011-10-07 13:47:52 Michael Vogt bug task added software-center (Ubuntu Precise)
2011-10-07 13:49:00 Michael Vogt software-center (Ubuntu Precise): status New Confirmed
2011-10-07 13:49:01 Michael Vogt software-center (Ubuntu Precise): importance Undecided Medium
2011-10-25 17:06:33 David Planella bug task added ubuntu-translations
2011-11-14 17:51:33 Pedro Villavicencio software-center (Ubuntu Precise): status Confirmed Triaged
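
The normalization proposed in the bug description can be sketched with the standard unicodedata module alone. This is a minimal illustration, not the fix that software-center adopted: the function name fold_for_search is hypothetical, the code assumes Python 3 strings, and stripping every combining mark is one possible policy among several (ICU-based folding, as in the Chromium links, would be another).

```python
import unicodedata


def fold_for_search(text):
    """Return a lower-cased, diacritic-free form of text for matching.

    Hypothetical helper: the name and the folding policy are assumptions.
    """
    # NFKD splits precomposed characters into base character + combining marks,
    # so composed and decomposed input end up in the same form.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (non-zero canonical combining class).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains and lower-case it, as str.lower() did before.
    return unicodedata.normalize("NFKC", stripped).lower()


if __name__ == "__main__":
    # Composed U+00C7 and decomposed U+0043 U+0327 fold to the same key.
    print(fold_for_search("\u00c7") == fold_for_search("C\u0327"))
    print(fold_for_search("ĂÂȘȚÎ"))
```

For this to help, the same function would have to be applied both to the text indexed into the xapian db and to the user's query, so that the two sides are compared in the same folded form.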