[dash] wrong search result of Unity in Chinese

Reported by Kevin Huang on 2011-03-29
This bug affects 3 people
Affects                    Importance   Assigned to
OEM Priority Project       Critical     Unassigned
Ubuntu Translations        High         Unassigned
Unity                      High         Brandon Schaefer
Unity Foundations          High         Mikkel Kamstrup Erlandsen
Xapian                     Unknown
Zeitgeist Extensions       Undecided    Unassigned
unity-2d                   Critical     Florian Boucault
unity-lens-applications    High         Unassigned
software-center (Ubuntu)   High         Gary Lasker
unity-2d (Ubuntu)          Undecided    Unassigned
unity (Ubuntu)             Undecided    Unassigned

Xapian status: Confirmed

Bug Description

The search result in Chinese is not correct. Please see the attached example.

Unity-2D version: 3.8.1

Kevin Huang (wasikevin) wrote :
Bill Filler (bfiller) wrote :

To clarify:
This is running Unity with Simplified Chinese selected and iBus as the keyboard input method (see steps below).
Searching using Chinese characters does not yield correct search results even though the Chinese characters appear in the displayed name. Search results are only correct if English characters are entered, and this is wrong.

To enable iBus and Chinese:
In Language Support:
 * Tab 1: Install Chinese/Simplified translations, input methods, and fonts
 * Tab 1: Apply System-Wide button
 * Tab 1: Select iBus in Keyboard Input Method System combo.
 * Tab 2 (named "Regional Formats" in en_US): Select Chinese and Apply.
 * Reboot/Relogin
 * Ctrl + Space enables/disables iBus
 * Enter the Chinese characters for Empathy, for example, and it won't be found. Enter the English characters for Empathy and it will be found in the applications place.

Changed in unity-2d:
milestone: none → 3.10
importance: Undecided → Critical
assignee: nobody → Kyle Nitzsche (knitzsche)
summary: - wrong search result in Chinese
+ [dash] wrong search result in Chinese

This bug most likely occurs in Unity 3D. It's probably not being reported because there is currently no way to enter Chinese characters in Unity 3D (see bug https://bugs.launchpad.net/ubuntu/+source/unity/+bug/663776), which is a MAJOR issue for OEMs shipping Unity in China.

Kevin Huang (wasikevin) wrote :

Supplement information for Ubuntu Software center:

1. The categories have been localized into Chinese, but
2. The package names have NOT been localized.

Please see the attached screenshot.

Bill Filler (bfiller) on 2011-03-29
Changed in unity-2d:
assignee: Kyle Nitzsche (knitzsche) → Florian Boucault (fboucault)
Steve Magoun (smagoun) on 2011-03-29
Changed in oem-priority:
importance: Undecided → Critical
Michael Vogt (mvo) wrote :

Thanks for your bug report.

I looked into this and was able to reproduce it if I only set my session to zh_CN but not the system wide language.

With a systemwide zh_CN setting (and a subsequent update-software-center call) I can see all the text translated.
I was not able to test the localized search (because I don't know what to input) but I assume it works as well, as it's using
the same data as the display. Attached are two screenshots from my box.

To make this work for non-systemwide setups software-center would have to build an index for each installed language.
We can add this feature if needed; it just means that disk space usage will increase and app-install-data package install
speed will decrease.
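
To make that trade-off concrete, here is a minimal sketch of the "one index per installed language" idea. It is not software-center's code: `build_indexes`, the app ids and the sample names are made up for illustration (only the Terminal/終端機 pairing comes from this report), and a plain dict stands in for the real Xapian database.

```python
# Illustrative sketch only: build_indexes() and the sample data are
# hypothetical and are not taken from software-center.

def build_indexes(apps, languages):
    """Build one name -> app-id lookup table per language.

    apps maps an app id to {language: display name}; the "C" entry is
    the untranslated name and is used when a translation is missing.
    """
    indexes = {}
    for lang in languages:
        index = {}
        for app_id, names in apps.items():
            display_name = names.get(lang, names["C"])
            index.setdefault(display_name.lower(), []).append(app_id)
        indexes[lang] = index
    return indexes


apps = {
    "gnome-terminal": {"C": "Terminal", "zh_TW": "終端機"},
    "empathy": {"C": "Empathy"},
}
indexes = build_indexes(apps, ["C", "zh_TW"])

# Every extra language adds another full index, which is where the
# extra disk space and slower install would come from.
total_entries = sum(len(index) for index in indexes.values())
```

With two languages the sketch stores every app twice, so the cost grows roughly linearly with the number of installed languages.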

Michael Vogt (mvo) wrote :
Changed in unity-2d:
milestone: 3.10 → 3.8.2
David Barth (dbarth) on 2011-03-30
Changed in unity:
milestone: none → 3.8.2
assignee: nobody → Mikkel Kamstrup Erlandsen (kamstrup)
importance: Undecided → High
status: New → Triaged

I think I need the hand holding of someone versed in Chinese in order to fix this, otherwise it's like shooting blindly. Are you the man for this Kevin? Please try and catch me, kamstrup, on IRC if so.

Changed in unity-foundations:
assignee: nobody → Mikkel Kamstrup Erlandsen (kamstrup)
importance: Undecided → High
milestone: none → unity-3.8.2
status: New → Triaged

I've been looking into CJK indexing in Xapian and the prospects are slightly dire... See http://trac.xapian.org/ticket/180

The upshot is that this will require some work. There are libs we can pull in for this (like http://code.google.com/p/cjk-tokenizer/); we'll then have to manually write some glue code to wire it up in the indexing and query parsing subsystems for Software Center (S-C) and unity-place-applications (u-p-a).

Anyone with a simpler solution is more than welcome to chime in :-)

Looks like http://code.google.com/p/cjk-tokenizer is more or less the only option out there, but it depends on libunicode, which is not installed by default and is not even in the Ubuntu archives :-/

We may be able to replace the libunicode bits with libicu bits (since libicu44 is installed by default as a dep of webkit)

Changed in xapian:
status: Unknown → Confirmed

Looking deeper into the linked Xapian bug, it seems we may be able to shoplift some code from the Pinot engine, which is based on cjk-tokenizer but ported to glib2 instead of libunicode. As described in the Xapian bug, it does depend on the Dijon namespace, though, so we must remove the Dijon usage, as was done for the Xapian patch based on the Pinot code.

As olly describes in the Xapian bug this is slightly dangerous though and may have unpredictable consequences if we ever see Unicode version mismatches between glib2 and Xapian or if they differ in their error handling (which they almost certainly do).

All of this still leaves the question open for how to handle this in S-C with Python as it's crucial that S-C and u-p-a use the *exact* same method for tokenization. If there's a mismatch between the query parser in u-p-a and how the indexed terms are generated in the S-C index we'll see no-, weird-, or random results.
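
A minimal sketch of why the two sides must tokenize identically. This is plain Python with no Xapian involved; `whitespace_terms`, `bigram_terms` and `matches` are hypothetical helpers written for this comment, and the bigram scheme only stands in for whatever CJK tokenizer ends up being used.

```python
# Illustrative sketch only: these helpers are hypothetical and do not
# come from S-C, u-p-a or Xapian.

def whitespace_terms(text):
    """Split on whitespace only, so unspaced CJK text stays one term."""
    return set(text.lower().split())


def bigram_terms(text):
    """Emit overlapping two-character terms for each whitespace-delimited word."""
    terms = set()
    for word in text.lower().split():
        if len(word) < 2:
            terms.add(word)
        for i in range(len(word) - 1):
            terms.add(word[i:i + 2])
    return terms


def matches(index_terms, query_terms):
    """AND semantics: every query term must be present in the index."""
    return bool(query_terms) and query_terms <= index_terms


# The index side tokenizes the application name into bigrams.
index = bigram_terms("終端機")

# Query side uses the same tokenizer: the terms line up.
same_tokenizer = matches(index, bigram_terms("終端機"))

# Query side uses a different tokenizer: the single term "終端機"
# was never indexed, so nothing is found.
different_tokenizer = matches(index, whitespace_terms("終端機"))
```

Here `same_tokenizer` is true and `different_tokenizer` is false, even though the query string is identical to the indexed name.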

Olly Betts (ojwb) wrote :

Xapian should have all the Unicode support you need for this built in, so you shouldn't need to add a dependency on libunicode, icu, or glib.

Does SC use Xapian::QueryParser and u-p-a use Xapian::TermGenerator? If not, that could be fun...

Also, Xapian is taking part in GSoC this year, and "CJK support" is one of the potential projects. We've had promising interest in it, though it's too soon to know if that'll happen, and it wouldn't be done until August anyway. It might also be just Chinese support or just Japanese (or possibly students working on each separately). So a patch with a more generic approach may still be useful (probably would be for Korean at least).

I've got some code lying around which is a hacked version of cjk-tokenizer which uses xapian's unicode routines; it wasn't hard to make. I'll shove a copy of it up on github in a moment. It still requires linking into an indexing and query parser, though.

@Olly, @Richard: Thanks for chiming in! Afaik we use Xapian::TermGenerator and Xapian::QueryParser everywhere. So maybe we should aim for Richard's solution for Natty and then hope we get CJK support out of the box for Natty+1.

The complicating factor here is that the Software Center index is created from a Python program (and also consumed by that program), but also consumed from a C program (unity-place-applications). So we'll need the CJK support available for Python as well. If it's built into Xapian this is a non-issue of course, but using Richard's cjk-tokenizer for Natty may be too complex for this late point in the cycle (considering we need to add Python bindings for it) - I'll talk to Michael Vogt about this.

Olly Betts (ojwb) wrote :

Hmm, it does indeed seem awfully late in the release process for some fairly major distro-specific patching of xapian-core. It's quite likely there will be a better solution before 11.10, and if not we can probably get the cjk-tokeniser approach in cleanly upstream by then.

My thought would be to package the cjk-tokeniser code in its own little C++ library (which can link to libxapian for the Unicode stuff since that's a public API), and then knock up a simple Python wrapper around it (with SWIG or similar or even by hand). Then you can use this for CJK locales, and Xapian's code for others, which means that any breakage won't affect other users of Xapian, and can only break for S-C in CJK locales, where the search doesn't really work currently anyway.
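
The locale switch suggested above could look roughly like the sketch below. The function names and the two backend labels are hypothetical, and the set of language codes is an assumption (Chinese, Japanese, Korean); the input format follows the `LANGUAGE=zh_CN:en_US:en` style values quoted elsewhere in this report.

```python
# Illustrative sketch only: function names and backend labels are
# hypothetical; nothing here is an existing S-C or Xapian API.

CJK_LANGUAGE_CODES = {"zh", "ja", "ko"}  # assumed set of affected languages


def is_cjk_locale(value):
    """Return True if the first entry of a LANGUAGE/LANG style value is CJK.

    Accepts values such as "zh_CN:en_US:en" or "ja_JP.UTF-8".
    """
    first = value.split(":")[0]
    language = first.split(".")[0].split("_")[0].lower()
    return language in CJK_LANGUAGE_CODES


def pick_tokenizer(value):
    """Use the separate CJK wrapper only for CJK locales, Xapian's own code otherwise."""
    return "cjk-wrapper" if is_cjk_locale(value) else "xapian-default"
```

Keeping the decision in one small function means non-CJK users never touch the new code path, which is the point of the proposal.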

Didier Roche (didrocks) on 2011-04-01
Changed in unity:
milestone: 3.8.2 → 3.8.4
Changed in unity-2d:
status: New → Confirmed
David Barth (dbarth) on 2011-04-04
Changed in unity-foundations:
milestone: unity-3.8.2 → unity-3.8.4
Didier Roche (didrocks) on 2011-04-05
Changed in unity (Ubuntu):
status: New → Triaged
Changed in unity-2d:
milestone: 3.8.2 → 3.10
Kevin Huang (wasikevin) on 2011-04-06
summary: - [dash] wrong search result in Chinese
+ [dash] wrong search result of Unity in Chinese
Didier Roche (didrocks) on 2011-04-07
Changed in unity:
milestone: 3.8.4 → 3.8.6
Didier Roche (didrocks) on 2011-04-11
Changed in unity:
milestone: 3.8.6 → 3.8.8

Attached a branch with my WIP to add support for CJK handling in Xapian. Development details will be in the Xapian bug tracker. Once there is something to ship/test in Ubuntu I'll put a note on this bug.

David Barth (dbarth) wrote :

For reference, here is the summary of IRC discussions on this topic, including support from Platform with regards to a release plan for the fix.

We quickly considered the 2 alternatives today. The alternatives being:
 1. workaround in apps, no change in the library
 2. patch in the library and regression testing in the apps

The library approach (2.) was chosen, as it's easier to implement and is
closer to a long term solution.

The plan of action considered now looks as follows:

1. kamstrup to start integrating the tokenizer + library patch into libxapian
2. seb128 to integrate the lib and impacted packages into a PPA for testing
3. mvo to check potential regressions with the SC test suite (western languages mostly)
4. At that point, we'd like the test teams in OEM, and CJK users in particular, to verify the resulting packages.

Didier Roche (didrocks) on 2011-04-14
Changed in unity:
milestone: 3.8.8 → 3.8.10
tags: added: i18n
Didier Roche (didrocks) on 2011-04-19
Changed in unity:
milestone: 3.8.10 → 3.8.12
Olly Betts (ojwb) wrote :

"1. kamstrup to start integrating the tokenizer + library patch into libxapian"

With my Xapian upstream hat on, that really doesn't seem a good plan to me. It would mean that databases built by anything using Ubuntu's packages of Xapian risk being incompatible with those built on other platforms or with other builds of Xapian. We take a lot of care to avoid introducing any such incompatibilities within a release series, and the feedback I've had suggests users appreciate that.

David Barth (dbarth) on 2011-04-22
Changed in unity:
milestone: 3.8.12 → 3.8.14
Changed in unity-foundations:
milestone: unity-3.8.4 → none

@Olly: Agreed - that's not a situation we want to get into. That's also why we'll kick this off from a PPA so one has to specifically opt in to this and we can put a big fat Caveat Emptor sticker on it.

Then if we can guarantee database- and result set compatibility with vanilla libxapian for non-CJK corpuses we can *consider* it for update in main. And if you object *if* we get to that point I am pretty sure the platform team will listen to you - I don't expect that they enjoy maintaining a broken platform :-)

Olly Betts (ojwb) wrote :

@Mikkel: thanks, that's reassuring.

There's been a relevant development too - Xapian has a student working on adding support for a Chinese segmentation algorithm as part of Google's Summer of Code this year. Assuming that project goes well and we can get it merged in, this ticket should be addressed for Chinese.

That still leaves Japanese and Korean, which aren't explicitly mentioned in this ticket so far that I saw, but suffer from the same issues.

Changed in software-center (Ubuntu):
status: New → Triaged
importance: Undecided → High
David Barth (dbarth) on 2011-05-31
Changed in unity:
milestone: 3.8.14 → 3.8.16
Didier Roche (didrocks) on 2011-05-31
Changed in unity-2d (Ubuntu):
status: New → Confirmed
Changed in software-center (Ubuntu):
assignee: nobody → Gary Lasker (gary-lasker)
Changed in unity-2d:
milestone: 3.10 → none
David Barth (dbarth) on 2011-06-15
Changed in unity:
milestone: 3.8.16 → alpha2
David Barth (dbarth) on 2011-06-30
Changed in unity:
assignee: Mikkel Kamstrup Erlandsen (kamstrup) → David Barth (dbarth)
Didier Roche (didrocks) on 2011-07-05
Changed in unity:
milestone: 4.2.0 → 4.4.0
Changed in software-center (Ubuntu):
milestone: none → ubuntu-11.10-beta-1

Here is a patch for Xapian. It edits the Term Generator and Query Parser used in unity-places-applications, so everything is handled under the hood of Xapian. This was taken over from Mikkel's branch, picking up where he left off.

This is the branch with the patch applied:
https://code.launchpad.net/~brandontschaefer/xapian/cjk-support-patch

It allows searching for CJK text in the Dash.

tags: added: patch
Didier Roche (didrocks) on 2011-07-21
Changed in unity:
milestone: 4.4.0 → 4.6.0
David Barth (dbarth) on 2011-07-22
Changed in unity:
assignee: David Barth (dbarth) → Brandon Schaefer (brandontschaefer)
status: Triaged → Fix Committed

New patch, hopefully will be merged with Xapian.

Hopefully this will be merged with Xapian.

Kent Lin (kent-jclin) wrote :

David & Brandon,

Since there is a fix for this bug, is there any way for our colleagues in Asia to help test it and move this bug forward?

Thank you.
Kent

Didier Roche (didrocks) on 2011-07-29
Changed in unity (Ubuntu):
status: Triaged → Fix Committed
Didier Roche (didrocks) on 2011-08-01
Changed in unity:
milestone: 4.6.0 → 4.8.0
Olly Betts (ojwb) wrote :

Kent Lin: It would be very useful for people who can read these languages to try out Brandon's latest patch, and report if it works well, or if there are any issues. Some more test cases would be good too - I've not had a chance to check the test coverage yet, but it would be good to have most of the new code covered.

Not sure if the latest patch here is the same as the one in Xapian's trac or not, but the latter is what I'll be looking at, and at least for me it's simpler if discussion about the patch itself happens there rather than being split:

http://trac.xapian.org/ticket/180

With xapian 1.2.5-2 on Ubuntu 11.10, I see that searching for the same Chinese character in the Unity dash returns relevant applications, including Empathy. By relevant I mean that the Chinese character is in the application names shown in the Unity dash.

See attached screenshot, the translated Chinese name of the first and second applications has the Chinese character, but not the third (Sudoku).

Ubuntu 11.10 Alpha i386 (20110803.1)
Unity-2d: 3.8.14.1-0ubuntu1
xapian-tools: 1.2.5-2~ppa1
LANGUAGE=zh_CN:en_US:en
language-pack-zh-hans: 1:11.10+20110630.1
language-pack-gnome-zh-hans:1:11.10+20110630

Another example where it returns relevant but also irrelevant applications.

Here is a bug where the user cannot find the correct application by typing the full Chinese name of the application in the Unity search entry;
the attachment is a screenshot of this bug.

Test Case
=======

The user wants to find an application named "terminal" (終端機 in Chinese). The Chinese word "終端機" has 3 characters: the first is "終", the second is "端", and the third is "機".

Steps to Reproduce:

1. Type "終端" in the unity2d search entry

Expected Result:

2. The application "終端機" shows in the results (the user can find the application they want)

Actual Result:

No result

Other Related Info:

* Typing "terminal" in the unity2d search entry finds the application "終端機" (the user can find the application they want)
* Typing "終" in the unity2d search entry finds the application "終端機" (the user can find the application they want)
* Typing "端機" in the unity2d search entry also finds the application "終端機", even though "端機" is not actually a Chinese word and does not mean anything here. (Users would not usually try "端機" as a keyword to find the "終端機" application.)

Env
====

Ubuntu 11.10 Alpha i386 (201108010.1)
Unity-2d: 3.8.14.1-0ubuntu1
libxapian22: 1.2.5-2~ppa1
xapian-tools: 1.2.5-2~ppa1
LANGUAGE=zh_TW:en_US:en
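
The expected behaviour in this test case can be written down as a small check against a toy n-gram index. This is only a sketch of the general n-gram idea and makes no claim about what the patched Xapian does internally: `is_cjk`, `ngrams` and `find` are hypothetical helpers, only the CJK Unified Ideographs block is recognised, and non-CJK text is ignored for brevity.

```python
# Illustrative sketch only: a toy n-gram matcher, not the Xapian patch.

def is_cjk(ch):
    """True for characters in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def ngrams(text, max_n=2):
    """Return all 1..max_n character n-grams of each run of CJK characters."""
    terms = set()
    run = []
    for ch in text + " ":  # the trailing space flushes the last run
        if is_cjk(ch):
            run.append(ch)
            continue
        for n in range(1, max_n + 1):
            for i in range(len(run) - n + 1):
                terms.add("".join(run[i:i + n]))
        run = []
    return terms


def find(query, names):
    """Return the names whose n-grams contain every n-gram of the query."""
    wanted = ngrams(query)
    return [name for name in names if wanted and wanted <= ngrams(name)]


names = ["終端機"]
results = {query: find(query, names) for query in ("終", "終端", "端機", "終端機")}
```

In this model every one of the four queries finds "終端機", including "終端", which is the query that returns nothing in the test case above.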

Steve Magoun (smagoun) on 2011-08-15
Changed in oem-priority:
status: New → In Progress
Gabor Kelemen (kelemeng) on 2011-08-18
Changed in ubuntu-translations:
status: New → Triaged
importance: Undecided → High
David Barth (dbarth) wrote :

The Xapian part of the bug is now fixed upstream and released in Ubuntu Oneiric, with a distro-patch. See also https://bugs.launchpad.net/ubuntu/+source/xapian-core/+bug/833172

David Barth (dbarth) wrote :

The "Applications" lens has been fixed to trigger support for the new Xapian CJK tokenizer.

Changed in unity-lens-applications:
importance: Undecided → High
status: New → Fix Committed
David Barth (dbarth) wrote :

The FTS extension for Zeitgeist that serves for indexing keywords for Files & Folders has been fixed to support the new CJK tokenizer in Xapian.

Changed in zeitgeist-extensions:
status: New → Fix Committed
David Barth (dbarth) wrote :

Unity-2D is not the source of the issue.

Changed in unity-2d (Ubuntu):
status: Confirmed → Invalid
Changed in unity-2d:
status: Confirmed → Invalid
David Barth (dbarth) wrote :

All the foundations aspects of the bug have now been taken care of. See previous comments.

Changed in unity-foundations:
status: Triaged → Fix Committed

I will merge the fix tomorrow :)


Changed in software-center (Ubuntu):
milestone: ubuntu-11.10-beta-1 → ubuntu-11.10-beta-2
David Barth (dbarth) on 2011-08-31
Changed in unity-foundations:
milestone: none → oneiric-beta-2
Changed in software-center (Ubuntu):
status: Triaged → In Progress
Changed in software-center (Ubuntu):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package software-center - 4.1.21

---------------
software-center (4.1.21) oneiric; urgency=low

  [ Kiwinote ]
  * AUTHORS:
    - add credits for the new icon (LP: #834882)
  * a stash of unicode fixes to make s-c-gtk3 usable around the world
    (LP: #831865, LP: #834409, LP: #834312)
  * softwarecenter/db/update.py:
    - fix reinstall previous purchases (LP: #834984)
  * softwarecenter/ui/gtk3/panes/availablepane.py:
    - set title for 'previous purchases' list view (LP: #833960)
  * softwarecenter/ui/gtk3/panes/softwarepane.py:
    - fix None.copy() such that switching panes works again (LP: #834196)
  * softwarecenter/ui/gtk3/widgets/buttons.py:
    - escape application name in tiles (LP: #835876)

  [ Jacob Johan Edwards ]
  * softwarecenter/ui/gtk3/panes/softwarepane.py:
    - fix the spinner display when loading slow views (LP: #830682)

  [ Gabor Kelemen ]
  * po/POTFILES.in,
    po/POTFILES.skip:
    - update per latest configuration, add new gtk3 files

  [ Matthew McGowan ]
  * softwarecenter/ui/gtk3/widgets/buttons.py :
    - resize fix for Top Rated and What's New tiles (LP: #833697)
  * softwarecenter/ui/gtk3/views/catview_gtk.py,
    softwarecenter/ui/gtk3/widgets/containers.py:
    - disable the rendering of the checkboard pattern in the
      grid views (at request of mpt)
  * lp:~mmcg069/software-center/description-tweaks:
    - fix badly rendered package descriptions, other tweaks
      (LP: #833954)
  * lp:~mmcg069/software-center/globalpane-themeability:
    - various theming fixes (LP: #828092, LP: #830681,
      LP: #830738 and LP: #838382)

  [ Gary Lasker ]
  * software-center,
    software-center-gtk3,
    softwarecenter/db/update.py:
    - enable CJK support in Xapian (LP: #745243)
  * po/software-center.pot:
    - refresh .pot file
  * softwarecenter/ui/gtk/widgets/thumbnail.py:
    - fix missing icon in theme to let non-gtk3 version
      launch again, also fixes all gtk unit tests
  * test/test_database.py:
    - update unit test

  [ Didier Roche ]
  * softwarecenter/ui/gtk3/panes/installedpane.py,
    softwarecenter/ui/gtk3/views/appview.py,
    softwarecenter/ui/gtk3/widgets/menubutton.py,
    softwarecenter/ui/gtk3/widgets/oneconfviews.py,
    softwarecenter/db/appfilter.py,
    softwarecenter/ui/gtk3/app.py,
    data/ui/gtk3/SoftwareCenter.ui:
    - brings back OneConf to software center gtk3 with a fresh new design
      (LP: #838623)
  * debian/control:
    - depends on latest oneconf
 -- Gary Lasker <email address hidden> Thu, 01 Sep 2011 11:55:14 -0400

Changed in software-center (Ubuntu):
status: Fix Committed → Fix Released
Didier Roche (didrocks) on 2011-09-02
Changed in unity:
status: Fix Committed → Fix Released
Changed in unity (Ubuntu):
status: Fix Committed → Fix Released
Changed in unity-lens-applications:
status: Fix Committed → Fix Released
Changed in zeitgeist-extensions:
status: Fix Committed → Fix Released
Gabor Kelemen (kelemeng) on 2011-09-02
Changed in ubuntu-translations:
status: Triaged → Fix Released
Ray Wang (raywang) wrote :

Hi,
I'm sorry but no matter what Chinese character I input from Unity 2D dash, it returns nothing.
System is updated.

apt-xapian-index 0.44ubuntu2
libxapian-dev 1.2.5-1ubuntu1
libxapian22 1.2.5-1ubuntu1
python-xapian 1.2.5-2ubuntu1
python2.6-xapian
python2.7-xapian
xapian-doc
xapian-examples 1.2.5-1ubuntu1
xapian-tools 1.2.5-1ubuntu1

I saw the same issue as Ray said in #37.
libxapian22 1.2.5-1ubuntu1.
It is a fresh install of the 0906 Oneiric build.

Software Center 4.1.21

Application name and description are translated after 'update-software-center'.

Search by Chinese character doesn't show all expected applications.

See attached screen shot.
The first window shows 3 applications that have '文' in either the name or the description or both.
The second window shows search by '文' returns only 2 of those 3 applications.

This problem was fixed and the fix is in the current daily build of unity-2d. It was a regression in unity-2d, not libxapian.

See the 2 screenshots:

Danny Hsu (dannyhsu) wrote :

Software Center Version: 4.1.21

Different search results for the same program

Steps:
1) Input '播' to search for the program; 'VLC媒體播放器' is found.
2) Input '播放' to search for the program; 'VLC媒體播放器' is not in the search results.
3) Input '播放器' to search for the program; 'VLC媒體播放器' is not in the search results.

Steve Magoun (smagoun) wrote :

Marking the oem-priority task fix released because the main problem was addressed in 11.10. Some additional improvements are desirable; they will be addressed in separate bugs.

Changed in oem-priority:
status: In Progress → Fix Released

On 25/10/11 17:35, Steve Magoun wrote:
> Marking the oem-priority task fix released because the main problem was
> addressed in 11.10. Some additional improvements are desirable; they
> will be addressed in separate bugs.
>
> ** Changed in: oem-priority
> Status: In Progress => Fix Released
>
Thank you for your answer. My problems with Ubuntu 11.10 are already
resolved.
My old hard disk was in bad condition.
I have a new one and the installation is OK.
Thanks again to you and to all the people who make the free software
project possible.
Sincerely, Víctor.
