Incorrect alphabetical sort order in thunar with non-latin (eg. cyrillic) file names

Bug #684317 reported by li_yun on 2010-12-02
This bug affects 4 people
Affects            Status          Importance   Assigned to   Milestone
thunar             Fix Released    Low
thunar (Ubuntu)    Fix Released    Wishlist     Unassigned

Bug Description

Binary package hint: thunar

Incorrect alphabetical sort order in thunar with cyrillic file names.
Sort order of files with latin names is correct.
Thunar Settings: Arrange Items -> Sort By Name.
In ls, mc, Nautilus, PCManFM, gnome-commander, emelfm2, the GTK Open File dialog, etc., the sort order of such files is correct.

Ubuntu release: Ubuntu natty (development branch) 11.04
Package: thunar 1.0.2-1ubuntu1

The same problem was found on Ubuntu ... 10.04, 10.10 ...

li_yun (vtorchermet2) on 2010-12-05
description: updated
Charlie Kravetz (charlie-tca) wrote :

Thank you for reporting this issue. I am marking it confirmed, but I think it is related to Bug 376156, if not a duplicate of that bug. Thunar sorts using "sort -n", which does not always sort correctly for non-numeric names outside English.

Changed in thunar (Ubuntu):
importance: Undecided → Wishlist
status: New → Confirmed

Created attachment 3358
the screenshot

Sometimes when I open a folder in Thunar (item arrangement set to "By Name") I see it put some folders into the "second series" of arrangement.

For example, it goes 0-9, then A-Z, then А-Я (Cyrillic), and then... again 0-9, then A-Z, then А-Я! The folders in the first and the second series aren't the same (I have never seen this happen with files, only with folders), but I have no idea what determines which series a folder ends up in.

So, having both non-Cyrillic and Cyrillic folders in one folder may lead to having this bug. Workaround: change item arrangement to any other (e. g. "By Modification Date"), then change it back to "By Name".

Screenshot included: http://www4.picturepush.com/photo/a/4880597/img/4880597.png

Changed in thunar (Ubuntu):
status: Confirmed → Triaged

I can confirm that this happens on Ubuntu 11.10 as well

Created attachment 4311
Screenshot in Japanese

This issue seems to always occur and jufofu's workaround doesn't work on thunar-git.
I get this issue in Japanese.
Attached is a screenshot of thunar and bash.

Created attachment 4312
test sample

Attached are the sample test files for Comment #1.
The numbers following the Japanese characters in each file name are the Unicode code points of those characters.

The problem seems to be caused by (the use of) glib.

g_utf8_get_char () returns '0' on the first character.

To clarify: 'utf-8' ordering fails due to the problem described above.

Created attachment 4375
A fix.

Seems to work here.

I know nothing about Thunar internals so I can't guarantee that the patch is correct.

(thank you Hashimoto-san, greetings from Japan).

This patch would sort the items as follows:

(test) John Doe.txt
あおい輝彦.3042-304a-3044-8f1d-5f66.txt
Alan Smithee.txt
一ノ瀬泰造.4e00-30ce-702c-6cf0-9020.txt
一条忠頼.4e00-6761-5fe0-983c.txt
一青窈.4e00-9752-7a88.txt
堀口雅也.5800-53e3-96c5-4e5f.txt
堀孝史.5800-5b5d-53f2.txt
朱謙之.6731-8b19-4e4b.txt

Did we run into another bug with the sorting algorithm?

Created attachment 4377
More fixes.

I've found two more bugs:
- comparison function should not return 0 ("equal"), even if we're using case insensitive sorting
- arguments of strcoll were truncated to a single character - strcoll doesn't like it and returns a different result than with a longer string.

The sorting order now closely follows behavior of strcoll, so if there are
any problems with it, they are likely coming from strcoll.
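
The first of the two bugs listed above can be illustrated with a minimal Python sketch. This is not Thunar's C code: `compare_names` and `sort_names` are hypothetical helpers, and plain code point comparison stands in for strcoll. A case-insensitive comparator that reports "equal" for names differing only in case leaves their relative order to the input order; falling back to a case-sensitive comparison makes the result deterministic.

```python
from functools import cmp_to_key


def compare_names(a: str, b: str) -> int:
    """Case-insensitive comparison that returns 0 only for identical names."""
    fa, fb = a.casefold(), b.casefold()
    if fa != fb:
        return -1 if fa < fb else 1
    # The names differ only in case (or are identical): fall back to a
    # case-sensitive comparison so distinct names never compare as equal.
    if a != b:
        return -1 if a < b else 1
    return 0


def sort_names(names):
    """Sort names with the comparator above (illustration only)."""
    return sorted(names, key=cmp_to_key(compare_names))
```

With this tie-break, `sort_names` gives the same result whatever order the directory listing arrives in.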

Created attachment 4378
More fixes.

Added one more bugfix - a check for filename length of otherwise identical utf-8 filenames.

IMHO the code is ready to be used, I don't have anything else to add.

There are some remaining issues, which cannot be easily fixed:

1. 'A' < 'a' but 'ą' < 'Ą' - this is because the former comes from ASCII code comparison and the latter from strcoll.
Reported upstream: http://sourceware.org/bugzilla/show_bug.cgi?id=14039

  Possible solutions:
  - always use strcoll - gives a consistent ('a' < 'A' and 'ą' < 'Ą') ordering but is slower for ascii characters, especially in case insensitive mode.
  - just flip 'a-z' and 'A-Z' codes manually [1] (also gives 'a' < 'A' and 'ą' < 'Ą')
  - wait for http://sourceware.org/bugzilla/show_bug.cgi?id=14039 to be resolved (would give 'A' < 'a' and 'Ą' < 'ą' but that's very unlikely)

2. あ < a < あa < aa < あaa
Reported upstream: http://sourceware.org/bugzilla/show_bug.cgi?id=14038

  No solution (but hopefully this will be fixed upstream). If fixed, then the workaround in the patch (g_strconcat) will not be necessary, so we can then improve performance a bit by removing it.

Created attachment 4379
Swap ascii codes a-z and A-Z

This is a patch implementing the solution 1.2 from comment #9. It's likely much faster than solution 1.1.

It *changes* the sorting order of ascii characters to make it consistent with the order of non-ascii ones.
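
A rough Python sketch of the idea, for readers who do not want to open the attachment. The patch itself is C, and `swap_ascii_case_key` is a hypothetical name. ASCII letters are remapped so that lowercase sorts before uppercase, matching the 'ą' < 'Ą' order reported for strcoll, while every other character keeps its code point.

```python
def swap_ascii_case_key(name: str) -> list:
    """Build a sort key in which ASCII a-z and A-Z trade places."""
    key = []
    for ch in name:
        if "a" <= ch <= "z":
            key.append(ord(ch) - 32)  # 'a'..'z' take the codes of 'A'..'Z'
        elif "A" <= ch <= "Z":
            key.append(ord(ch) + 32)  # 'A'..'Z' take the codes of 'a'..'z'
        else:
            key.append(ord(ch))       # non-ASCII-letter characters are untouched
    return key
```

Sorting with `sorted(names, key=swap_ascii_case_key)` then places "a" before "A" in a case-sensitive comparison.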

Got some feedback from glibc bugzilla

1. They recommend using strxfrm for converting the string so that it matches strcoll ordering during simple comparison.

   However, strxfrm itself is pretty heavy; if we wanted "proper" sorting we could simply switch to using strcoll on all strings. So my suggestion is to use the patch swapping 'a-z' for 'A-Z'. It may not be the prettiest, but it does 90% of what strxfrm does at almost no cost.
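
For reference, the relationship between the two functions can be sketched in Python, whose `locale.strcoll` and `locale.strxfrm` wrap the C library calls. The sample names are made up, and the resulting order depends on the active LC_COLLATE. The sketch only shows that a plain comparison of transformed keys agrees with strcoll.

```python
import locale
from functools import cmp_to_key

# Use whatever collation the environment requests; fall back to "C"
# if that locale is not installed.
try:
    locale.setlocale(locale.LC_COLLATE, "")
except locale.Error:
    locale.setlocale(locale.LC_COLLATE, "C")

names = ["beta.txt", "Alpha.txt", "alpha.txt", "Gamma.txt"]

# Pairwise comparison: strcoll is called for every comparison in the sort.
by_strcoll = sorted(names, key=cmp_to_key(locale.strcoll))

# Key-based comparison: strxfrm is called once per name, and the
# transformed keys are then compared directly.
by_strxfrm = sorted(names, key=locale.strxfrm)
```

Both lists come out in the same order; the difference is only in where the collation cost is paid.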

2. Weird ordering of Japanese characters and our workaround - apparently there are no Japanese language definitions in iso14651_t1_common file, which means they are ignored in the first pass and handled in the second one.

   They said that the "workaround" is indeed a correct way of using strcoll as there might be other ignored characters.

   There was no indication whether Japanese definitions will be added to the iso14651_t1_common file, but the bug was not closed, so I imagine that is still on the table.

My conclusion:
Current patches are doing as much as we can without sacrificing performance in ascii case (otherwise we could switch to strcoll completely). Other errors are mostly caused by limitations of strcoll in glibc (possibly will be resolved later).

Created attachment 4380
sort using g_utf8_collate_key_for_filename()

After discussion on IRC we have decided to try the g_utf8_collate_key_for_filename() function. It doesn't support number sort (and there is no way to add it efficiently), but should do a better job at sorting, and can potentially be faster (sorting itself is done by a key comparison, cost of collation is unknown).

Created attachment 4381
plugged a memory leak in the previous patch

(In reply to comment #13)
> Created attachment 4381 [details]
> plugged a memory leak in the previous patch

Andrzej-san:

Sorry for late reply.
Your patch works fine!!
Thank you for your quick work in spite of the Golden Week :)

Note: "ū" is in the wrong place. It is between "j" and "k", but it should be at the end of the alphabet (I looked at the Maori, Hawaiian, Marshallese, Lithuanian, Livonian, Latvian and Cornish alphabets; in all of them that letter is near the end, before "v" or "w").

(In reply to comment #15)
> note: "ū" is in wrong place, it's between "j" and "k" but it should be in
> the end of alphabet

Which patch are you using, and what is your locale (LC_COLLATE)?

I've checked that with LC_COLLATE=POSIX "ū" is after "z"
I don't have Lithuanian locale installed so I can't check it here but different locales yield different results (e.g. with LC_COLLATE=pl_PL.UTF8 "ū" is between "u" and "v")

Note that with patch #13 sorting is done by glib (and ultimately by glibc), so if you see any errors they come either from an error in your system configuration or from a bug in these libraries (glibc).

(In reply to comment #16)
> (In reply to comment #15)
> > note: "ū" is in wrong place, it's between "j" and "k" but it should be in
> > the end of alphabet
>
> Which patch are you using, and what's is your locale (LC_COLLATE)?
>
> I've checked that with LC_COLLATE=POSIX "ū" is after "z"
> I don't have Lithuanian locale installed so I can't check it here but
> different locales yield different results (e.g. with LC_COLLATE=pl_PL.UTF8
> "ū" is between "u" and "v")
>
> Note that with patch #13 sorting is done by glib (and ultimately by glibc),
> so if you see are any errors they come either from an error in your system
> configuration or from a bug in these libraries (glibc).

I'm using the patch from comment #13.
LC_COLLATE=C
On my system, the "LANG" variable holds the kind of value you mention; in my case it is lt_LT.UTF-8.

Most likely there is no bug (if you use correct LC_COLLATE), or if there is, it is not in thunar.

Try this:
/close *all* thunar windows /
$ thunar -q
$ LC_COLLATE=lt_LT.UTF-8 thunar
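
The override above matters because of the standard lookup order for the collation locale: LC_ALL takes precedence over LC_COLLATE, which takes precedence over LANG. A small Python sketch of that lookup, where `effective_collate` is a hypothetical helper and empty values count as unset:

```python
import os


def effective_collate(env=None) -> str:
    """Return the locale name that collation would use for this environment."""
    env = os.environ if env is None else env
    for var in ("LC_ALL", "LC_COLLATE", "LANG"):
        value = env.get(var)
        if value:  # unset or empty variables are skipped
            return value
    return "C"  # default when nothing is set
```

With the settings reported in the previous comment (LC_COLLATE=C, LANG=lt_LT.UTF-8), this returns "C", so the Lithuanian collation rules are not in effect until LC_COLLATE is overridden.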

If there are any problems tell me about it on #xfce (irc.freenode.net). Bugzilla is not a support forum.

The latest version of Thunar (1.4.0) incorrectly sorts the contents of folders with Cyrillic letters in file and folder names.

Here is the contents of one folder "sorted" by name (ascending).
Looking from top to bottom I see file names starting with...
* cyrillic upper letters
* digits
* cyrillic lower letters
* again digits
* again cyrillic lower letters
* english lower letters
* and again cyrillic lower letters

Concrete example of another folder "sorted" by name:
> голубь.txt
> иволга.txt
> аист.txt
> орёл.txt
> сова.txt

Absolutely wrong order. The file whose name starts with 'а' (аист.txt) is in the middle.

The Cyrillic alphabet is:

Аа Бб Вв Гг Дд Ее Ёё Жж Зз Ии Йй Кк Лл Мм Нн Оо Пп Рр Сс Тт Уу Фф Хх Цц Чч Шш Щщ Ъъ Ыы Ьь Ээ Юю Яя

Meanwhile, "ls -1" gives the right order:
> аист.txt
> голубь.txt
> иволга.txt
> орёл.txt
> сова.txt

PCManFM and other file managers give the right sort order.

So, Thunar DOES NOT sort with "ls". Also, sort order in Thunar CAN NOT be changed using LC_COLLATE. It ignores this variable, but instead uses its own "mega-wise" algorithm.

What's the matter, guys?! Prior to version 1.4.0, everything was OK in Thunar.

guoxh (guoxh) wrote :

Same problem here, for Chinese filenames.

8-nick (8-nick) wrote :

Can people help here a bit with some test files?

Name the files the following way: $(name).$(expected_position).txt, so for example "аист.1.txt", "голубь.2.txt"

This applies to Cyrillic, Chinese, and anything else that doesn't fit into [a-Z].
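
A listing of files named this way can be checked mechanically. The following Python sketch assumes the `$(name).$(expected_position).txt` scheme above; `expected_position` and `is_in_expected_order` are hypothetical helper names, not part of any attached test case.

```python
import re

# name.position.txt, e.g. "аист.1.txt"
_NAME_RE = re.compile(r"^(?P<name>.+)\.(?P<pos>\d+)\.txt$")


def expected_position(filename: str) -> int:
    """Extract the expected position encoded in a test file name."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a test file name: {filename!r}")
    return int(match.group("pos"))


def is_in_expected_order(listing) -> bool:
    """True if the listing (top to bottom) matches the encoded positions."""
    positions = [expected_position(name) for name in listing]
    return positions == sorted(positions)
```

Passing the names in the order the file manager displays them shows immediately whether the sort matches the reporter's expectation.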

8-nick (8-nick) wrote :

And please mention the used LC_COLLATE.

My variants in Cyrillic:

Variant 1

> голубь.2.txt
> аист.1.txt

Variant 2

> вишня.4.txt
> груша.5.txt
> апельсин.2.txt
> банан.3.txt
> ананас.1.txt
> киви.6.txt
> лимон.7.txt
> яблоко.8.txt

Locale settings. All is English.

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Created attachment 4647
test case

Sorting order as in nautilus. ls uses slightly different sort order (no numeric sort, non-alphanumeric characters). andrzejr/utf8_collate behaves mostly like nautilus except for the '#' sign.

8-nick (8-nick) wrote :

Created attachment 4648
Updated patch

Slightly updated patch that peeks at the case-folded name and, if the names are equal, uses the case hash. This saves some hashing and memory.

We could reduce the hashing to do this on the fly in the thunar_file_compare_by_name function, but that's too much imho.
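
The two-level comparison described here can be sketched roughly as follows. This is Python rather than the patch's C, `NameKeys` and `sort_by_name` are hypothetical names, and `str.casefold()` stands in for whatever collation keys the real code computes.

```python
from dataclasses import dataclass


@dataclass(order=True)
class NameKeys:
    # Compared first: the case-folded form of the name.
    folded: str
    # Compared only when the folded forms are equal: the original name.
    exact: str

    @classmethod
    def for_name(cls, name: str) -> "NameKeys":
        return cls(folded=name.casefold(), exact=name)


def sort_by_name(names):
    # Keys are computed once per name, then compared during the sort.
    return sorted(names, key=NameKeys.for_name)
```

Because the fields are compared in declaration order, the second key is only consulted when the case-folded forms are equal.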

The special case in nautilus for '.' and '#' might be useful. I don't know if there are locales that put other characters in front of '.'/'#'.

IMHO we should keep the hidden no-case option

Works well for me. Good idea with the optimization. Lazy hashing could make sense when other methods of sorting are used (e.g. by modification time) and only if they don't fall back to compare by name. IMHO benefit not worth the complexity.

I have no preference for special characters ("#", "."). I don't know why nautilus is treating them differently.

I also feel leaving case-sensitive option for POSIX locale users is OK. We should probably change the default to case insensitive sort, to avoid confusion.

8-nick (8-nick) wrote :

The default is already case-insensitive in Thunar, so that doesn't need to change.

Nautilus sorts 'hidden' files after the other names, instead of showing them first. GTK+ doesn't and there are also bugs for that in the gnome bugtracker: https://bugzilla.gnome.org/show_bug.cgi?id=358812

The change obviously fixed sorting locales, but are there also situations where Thunar does a better job?

8-nick (8-nick) wrote :

Pushed the patch in 1fcb0e7. If there are sorting regressions, please open a new bug.

8-nick (8-nick) wrote :

*** Bug 9218 has been marked as a duplicate of this bug. ***

tags: added: fixed-in-master
8-nick (8-nick) wrote :

*** Bug 3724 has been marked as a duplicate of this bug. ***

Changed in thunar (Ubuntu):
status: Triaged → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package thunar - 1.6.0-0ubuntu1

---------------
thunar (1.6.0-0ubuntu1) raring; urgency=low

  * Upload to raring.
  * Remaining Ubuntu change:
    - debian/control: recommend udisks2 for mounting devices. lp: #1014632
  * Drop obsolete Ubuntu changes:
    - debian/patches/02_guard-for-no-supported-vfs-schemas.patch,
      debian/patches/xubuntu_fix-duplicate-volumes.patch: included upstream.
  * Bugs fixed:
    - "Thunar: sendto_printer broken" lp: #1061846
    - "segfault when a specific html file is selected" lp: #751739
    - "can't book mark remote shares" lp: #778268
    - "Thunar crashed with SIGSEGV in thunarx_menu_provider_get_file_actions()
      thinking a directory was a file" lp: #852410
    - "Left or right-clicking on 3MB or bigger svg file is unresponsive"
      lp: #893330
    - "Thunar crashed with SIGSEGV in fast_validate()" lp: #913041
    - "Thunar crashed with SIGSEGV in thunar_file_get_display_name()"
      lp: #931101
    - "Thunar crashed with SIGSEGV in sort_by_mime_type()" lp: #931842
    - "Thunar crashed with SIGSEGV in thunar_util_parse_parent()" lp: #969222
    - "thunar crashed with SIGSEGV in thunar_standard_view_cancel_thumbnailing()"
      lp: #1059397
    - "Does not unmount USB drive when you try first time" lp: #1059997
    - "regression: thunar no longer shows all unmounted, but mountable, volumes
      in sidepane" lp: #1068947
    - "Thunar shows folder sizes wrong" lp: #59235
    - "Right-click "Open With" list not refreshing" lp: #107392
    - "no thunar contextmenu with GTK setting "gtk-menu-popup-delay = 0""
      lp: #127372
    - "rename folder, still active but answers not on 'Enter'" lp: #479975
    - "Thunar hangs on first launch of each session" lp: #775117
    - "emblems disappear on rename" lp: #877755
    - "Remote Deleted file in Thunar remains visible until resfresh" lp: #999824
    - "Incorrect alphabetical sort order in thunar with non-latin (eg. cyrillic)
      file names" lp: #684317
    - "Thunar does not display current folder name" lp: #875193
    - "Thunar crashed with SIGSEGV in g_file_equal()" lp: #900306
    - "Hard to see, if volume is mounted or not" lp: #838917

thunar (1.6.0-1) UNRELEASED; urgency=low

  [ Lionel Le Folgoc ]
  * Drop the "Send to printer" action, xfprint4 is obsolete.
  * debian/control:
    - dropped libtdb-dev from b-deps, emblems have been moved to gvfs.
    - bumped minimum required exo version to 0.10.0 for the new symbol.

  [ Yves-Alexis Perez ]
  * New upstream release.
 -- Lionel Le Folgoc <email address hidden> Mon, 03 Dec 2012 13:13:58 +0100

Changed in thunar (Ubuntu):
status: Fix Committed → Fix Released
Changed in thunar:
importance: Unknown → Low
status: Unknown → Fix Released