Please sort programs in an even more locale-friendly way (by using python-pyicu)

Bug #427568 reported by Michael Terry
Affects | Status | Importance | Assigned to | Milestone
Ubuntu Translations | Triaged | Low | Unassigned |
software-center (Ubuntu) | Triaged | Low | Unassigned |

Bug Description

Binary package hint: software-store

The listing of programs is currently sorted by name, but it seems to sort by character code (i.e. strcmp). This breaks down when a program name contains accented characters.

I propose that it should instead sort using the Unicode Collation Algorithm, which is locale-sensitive. Different languages have very different sorting rules.

I recently did this with the Ubuntu installer, ubiquity, and found the Python wrapper for libicu (python-pyicu) useful. It can generate a collation key for a given string in a given locale.

Michael Terry (mterry) wrote :

For testing purposes without switching your locale, the program name I noticed this on was Déjà Dup. It should sort between De and Df in English, but instead shows up between Dz and Ea.
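The behaviour described here can be reproduced with a plain code-point sort. This is only a minimal sketch: apart from Déjà Dup, Debian Reference and DeskScribe, the program names are made up for illustration, and Python's default sorted() stands in for whatever comparison the application used at the time.

```python
# Sample names; "Dasher", "Dzen" and "Easytag" are illustrative placeholders.
names = [
    "Easytag",
    "D\u00e9j\u00e0 Dup",  # "Déjà Dup" with precomposed accented letters
    "Dzen",
    "DeskScribe",
    "Debian Reference",
    "Dasher",
]

# Default string comparison orders by Unicode code point, much like strcmp
# on the raw characters. "é" (U+00E9) has a higher code point than "z",
# so the accented name lands after every unaccented "D..." name.
by_code_point = sorted(names)

for name in by_code_point:
    print(name)
```

With this input, Déjà Dup comes out after Dzen and before Easytag, which is the misplacement reported above.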

Michael Vogt (mvo)
Changed in software-store (Ubuntu):
importance: Undecided → Low
status: New → Confirmed
Michael Vogt (mvo) wrote :

Thanks for your bug report.

I had used the stock Python sort() implementation. I have now switched to strcoll.

Andrew (and471)
Changed in software-store (Ubuntu):
status: Confirmed → Fix Committed
Andrew (and471)
Changed in software-store (Ubuntu):
status: Fix Committed → Fix Released
Michael Terry (mterry) wrote :

I don't think using strcoll fixed anything. Again, look at the Accessories department and note the D's. You'll see Déjà Dup at the end, when it should be earlier (in English locale -- and I want to say all locales?).

(While looking to see which version I have to re-confirm this bug, I noted that Help->About gives a different version than apt does. About says 0.2.2, apt says 0.3.2.)

strcoll isn't a full implementation of the Unicode Collation Algorithm (UCA) [1], AFAIK. I'm not really sure what possibilities there are for LC_COLLATE or how it works. But I do know that Python doesn't have a native implementation of UCA (surprisingly).

To fix this, you'll want to use a library that does implement it, like python-pyicu.

[1] http://unicode.org/reports/tr10/

Changed in software-store (Ubuntu):
status: Fix Released → Confirmed
Matthew Paul Thomas (mpt) wrote :

Spec updated: '“Alphabetically” means case-insensitive and accent-insensitive. For example, “Déjà Dup Backup Utility” should be sorted between “Debian Documentation Browser” and “DeskScribe”.'
<https://wiki.ubuntu.com/SoftwareStore?action=diff&rev2=184&rev1=183>
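The "case-insensitive and accent-insensitive" wording of the spec could be approximated with a sort key like the one below. This is a rough sketch using only the Python standard library, not the code software-store uses; the helper name fold_key is hypothetical.

```python
import unicodedata


def fold_key(name):
    """Return a sort key that ignores case and accents (hypothetical helper)."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() intended for caseless matching.
    return stripped.casefold()


names = [
    "DeskScribe",
    "D\u00e9j\u00e0 Dup Backup Utility",  # "Déjà Dup Backup Utility"
    "Debian Documentation Browser",
]
ordered = sorted(names, key=fold_key)
print(ordered)
```

This places "Déjà Dup Backup Utility" between the other two names, as the spec example requires, but it is only a stripping approach and knows nothing about language-specific rules.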

(Michael Terry, the problem with the About window was bug 428677.)

Michael Terry (mterry) wrote :

mpt, that's a weak definition of 'alphabetically' for most non-Latin languages. When the translated package name contains non-Latin characters, what's the sorting plan? Even among Romance languages, sorting is trickier than just stripping accents. From the Unicode report linked above:

"In French and a few other languages, however, it is the last accent difference that determines the order, as in row 2.
Normal Accent Ordering cote < coté < côte < côté
French Accent Ordering cote < côte < coté < côté"

Admittedly, that specific example is contrived, but you get my drift.
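To make the difference between the two orderings concrete, here is a toy two-level sort key in Python. It is a deliberately simplified sketch, not an implementation of UCA: the function name accent_key and its backwards parameter are made up, and the accent weights are just the code points of the combining marks.

```python
import unicodedata


def accent_key(word, backwards=False):
    """Toy two-level key: base letters first, then accents (hypothetical)."""
    base = []
    accents = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            # Attach the combining mark to the preceding base letter, if any.
            if accents:
                accents[-1] += (ord(ch),)
        else:
            base.append(ch.lower())
            accents.append(())
    if backwards:
        # French-style: the LAST accent difference decides, so compare the
        # accent level from the end of the word.
        accents.reverse()
    return ("".join(base), tuple(accents))


words = ["cot\u00e9", "c\u00f4te", "c\u00f4t\u00e9", "cote"]  # coté côte côté cote

normal = sorted(words, key=accent_key)
french = sorted(words, key=lambda w: accent_key(w, backwards=True))

print("normal:", " < ".join(normal))
print("french:", " < ".join(french))
```

All four words share the same base letters, so only the accent level decides. Comparing it from the front gives the "normal" order quoted above, and comparing it from the back gives the French order.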

Again, my favored solution is this:
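import os
import PyICU
# Strip the codeset and modifier from LANG, e.g. "fr_FR.UTF-8@euro" -> "fr_FR"
# (falling back to en_US if LANG is unset is an assumption)
locale = os.environ.get('LANG', 'en_US').split('.')[0].split('@')[0]
collator = PyICU.Collator.createInstance(PyICU.Locale(locale))
# "names" is a placeholder for the list of program names to sort
names.sort(key=lambda x: collator.getCollationKey(x).getByteArray())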

python-pyicu is in main, on the CD.

Michael Vogt (mvo) wrote :

Hey Michael, thanks for your update.

I am no expert on Unicode sorting myself, but I know that Python uses wscoll internally for locale.strcoll(). So wide chars should be supported, and strcoll should sort according to the rules of the selected locale. For me, e.g., Déjà Dup is sorted between Debian Documentation and DeskScribe (my locale is en_US) with the current version of software-store.

Attached is a small test program that tests the French example you gave above. It seems to be sorted correctly with locale.strcoll() (unless I am missing something). I'm fine with using python-pyicu of course, I just want to understand the issue first.

For me the test app prints:
unsorted: coté côte côté cote
normal: cote coté côte côté
strcoll: cote côte coté côté

That seems to match your example above.
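A test program along these lines might look like the sketch below. The attachment itself is not reproduced in this report, so this is a reconstruction under assumptions: the function name sort_with_strcoll and its locale_name parameter are hypothetical, and the result depends entirely on which locales are installed and on the C library's collation tables.

```python
import functools
import locale


def sort_with_strcoll(words, locale_name=""):
    """Sort words with locale.strcoll under the given LC_COLLATE locale.

    An empty locale_name means "use the locale from the environment".
    Returns None if the requested locale is not available on this system.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(words, key=functools.cmp_to_key(locale.strcoll))
    finally:
        # Restore the previous collation locale so callers are unaffected.
        locale.setlocale(locale.LC_COLLATE, previous)


words = ["cot\u00e9", "c\u00f4te", "c\u00f4t\u00e9", "cote"]  # coté côte côté cote

print("unsorted:", " ".join(words))
print("normal:  ", " ".join(sorted(words)))
result = sort_with_strcoll(words)
print("strcoll: ", " ".join(result) if result is not None else "(locale unavailable)")
```

To try a specific locale, pass a name such as "fr_FR.UTF-8" as locale_name, assuming that locale is generated on the machine.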

Michael Vogt (mvo) wrote :

The function is wcscoll. I just checked the glibc source, and the algorithm implemented there appears to be ISO 14651. UCA and TR10 go beyond that, as explained in http://unicode.org/faq/collation.html#13, so it seems worthwhile to use ICU. I set this to medium priority.

Changed in software-store (Ubuntu):
importance: Low → Medium
summary: - Please sort programs in a locale-friendly way
+ Please sort programs in an even more locale-friendly way (by using
+ python-pyicu)
Michael Terry (mterry) wrote :

Ah, partly my mistake then. Deja Dup now appears in the right place for me.

I had tried a software-store 0.3.2 (which had your fix I thought) and I saw Deja Dup still at the end of the D's. I must have been mistaken about either whether your fix was in or where Deja Dup was.

But yeah, I guess UCA is slightly better and that would be a neat enhancement.

Matthew Paul Thomas (mpt) wrote :

Michael Terry, can you suggest human-readable language with which I should specify the sort order in the specification? Would "sorted following the rules of Unicode Technical Standard #10" be accurate and precise enough?

Michael Terry (mterry) wrote :

Calling it the "Unicode Collation Algorithm" (UCA) I believe is sufficient. UTS #10 sounds stuffy. :)

Matthew Paul Thomas (mpt) wrote :

"Déjà Dup" now correctly appears between "Debian Reference" and "DeskScribe" in the "Accessories" department. We should fix this properly eventually, but I'm going to mark it as Low importance unless/until there are some specific real-world examples of how the incorrect sorting makes programs hard to find. If you have some, please update the description. Thanks.

Changed in software-center (Ubuntu):
importance: Medium → Low
Changed in software-center (Ubuntu):
status: Confirmed → Triaged
David Planella (dpm)
Changed in ubuntu-translations:
status: New → Triaged
importance: Undecided → Low