'sort' does not correctly sort non-latin utf-8 encoded text

Bug #71386 reported by Luzius Thöny
Affects: coreutils (Ubuntu)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Trying to sort the characters of a UTF-8 encoded file 'ipa_phrase.txt', containing this:
--------
wen wiː lʊk tuː ðə ɪndɪvɪdʒuːəlz ɒv ðə seɪm veraɪetiː ɔːr sʌb veraɪetiː ɒv aʊr ɒʊldɜː kʌltɪveɪtɪd plɑːnts ænd ænɪməlz, wʌn ɒv ðə fɜːst pɒɪnts wɪtʃ straɪks ʌs, ɪz, ðæt ðeɪ dʒenɜːəliː dɪfɜː mʌtʃ mɔːr frɒm iːtʃ ʌðɜː, ðæn duː ðə ɪndɪvɪdʒuːəlz ɒv eniː wʌn spesiːs ɔːr veraɪetiː ɪn ə stət ɒv nætʃɜː.
---------
...with a command like this:
------
sed "s/\(.\)/\1\n/g" < ipa_phrase.txt | sort
-----
does not give the correct order for the non-latin characters.

I'm aware that there's a relevant FAQ entry in the upstream documentation (http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021) which seems to have something to do with it, suggesting that the issue is with the 'locale' settings.
In fact, doing 'export LC_ALL=POSIX' before sorting does change the result, but then the output is garbled.

Luzius Thöny (lucius-antonius) wrote :

attaching the mentioned file containing non-latin characters.

Micah Cowan (micahcowan) wrote :

Sorry, I'm not at all familiar with how the text you've given is intended to sort; could you please be more specific about what the expected results are, and what you actually obtained?

Also, did you perhaps mean one of:
  sed "s/\(.*\)/\1\n/g" < ipa_phrase.txt | sort
or
  sed "s/\(.\)/\1\n/g" < ipa_phrase.txt | uniq | sort
?

Changed in coreutils:
assignee: nobody → micahcowan
status: Unconfirmed → Needs Info
Luzius Thöny (lucius-antonius) wrote :

ok, let me explain my purpose: in a nutshell, i want to make some statistics as to the frequencies of the individual symbols in a specific text. for example, i want to know how much more frequent an 's' is compared to a 't'. the way to achieve this is to split the text up so that every letter/symbol occurs on an individual line, then sort it, and finally count the lines with the same symbol using 'uniq -c'. my sed script is intended to do just this (except the 'uniq -c' part), and i believe it is correct the way i wrote it.
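
For comparison, the same kind of frequency count can be sketched in Python without going through 'sort' at all. This is only an illustration, not the reporter's method: the name `symbol_frequencies` is made up here, whitespace is skipped (the sed pipeline would keep it), and counting is done per Unicode code point, so a letter followed by a combining mark would count as two symbols.

```python
from collections import Counter

def symbol_frequencies(text):
    """Count how often each non-whitespace code point occurs in text."""
    return Counter(ch for ch in text if not ch.isspace())

# A short excerpt of the IPA phrase from the bug description.
sample = "wen wiː lʊk tuː ðə"
for symbol, count in symbol_frequencies(sample).most_common():
    print(count, symbol)
```

Because the counting happens on code points in memory, the result does not depend on the collation rules of the current locale.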

the result i'm currently getting from the script run on the above text is attached, and it just looks very wrong to me. you may see that the normal letters (like 'n', 'r', or 's') are correctly sorted onto adjacent lines in the result, but not the IPA symbols like 'ʃ' or 'ʌ', which occur in different places in the result file.

Micah Cowan (micahcowan)
Changed in coreutils:
assignee: micahcowan → nobody
status: Incomplete → New
Luzius Thöny (lucius-antonius) wrote :

this is getting too complicated, let's try a simpler example.

start with this text:
-----
aaa
aab
ʌʌʌ
aba
ɒbb
ɒcc
ʌbb
-----

and run it through 'sort'. the result (on my machine) is:
-----
ʌʌʌ
aaa
aab
aba
ɒbb
ʌbb
ɒcc
-----

that's not properly sorted at all! what i want is this:

-----
aaa
aab
aba
ɒbb
ɒcc
ʌbb
ʌʌʌ
-----

(i may be wrong wrt the order of 'ʌ' and 'ɒ', but since the former is hex 'CA 8C' and the latter is 'C9 92', i'm guessing it should be this way.)
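
The byte values quoted above can be checked with a few lines of Python. The sketch below (Python 3.8 or later, variable names chosen for illustration) prints the UTF-8 encoding of the two letters and sorts the sample list both by raw UTF-8 bytes and by code point. UTF-8 is designed so that these two orders agree.

```python
words = ["aaa", "aab", "ʌʌʌ", "aba", "ɒbb", "ɒcc", "ʌbb"]

# UTF-8 encodings of the two IPA letters, shown as hex bytes.
print("ʌ".encode("utf-8").hex(" "))  # ca 8c
print("ɒ".encode("utf-8").hex(" "))  # c9 92

# Sort by the encoded bytes, then by code point (Python's default for str).
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
by_code_point = sorted(words)

print(by_bytes == by_code_point)  # True
print(by_bytes)
```

Both orders put 'ɒ' (C9 92) before 'ʌ' (CA 8C) and both after the ASCII letters, which is the order requested above.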

Luzius Thöny (lucius-antonius) wrote :

here is a tiny python script that will do exactly what i expect from 'sort':

-----
#!/usr/bin/env python
# coding: utf-8

list = [u'aaa', u'aab', u'ʌʌʌ', u'aba', u'ɒbb', u'ɒcc', u'ʌbb']

list.sort()

for s in list:
 print s
------

output on the console:

------
aaa
aab
aba
ɒbb
ɒcc
ʌbb
ʌʌʌ
------

Steve Langasek (vorlon) wrote :

Thank you for taking the time to report this issue and help to improve Ubuntu.

The sort order you're seeing is in fact correct according to the locale that you're using. Sort, or collation, order is defined on a per-locale basis, because languages don't all have the same alphabetization rules, and for most locales the practice is to ignore "unknown" characters when sorting. This behavior, while debatable, is not something that is ever likely to change, because doing so will break existing software that expects the current behavior from these locales.
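
To make the "ignored characters" idea concrete, here is a toy model in Python. It is not how glibc actually implements collation (real locales use multi-level weights). It only assumes that anything other than an ASCII letter or digit is skipped on the first comparison and used as a tie-breaker afterwards. The function name `ignoring_unknown_key` is made up for this sketch.

```python
def ignoring_unknown_key(word):
    """Toy sort key: compare ASCII letters and digits first, whole word second."""
    known = "".join(ch for ch in word if ch.isascii() and ch.isalnum())
    # The original word breaks ties between entries whose known parts match.
    return (known, word)

words = ["aaa", "aab", "ʌʌʌ", "aba", "ɒbb", "ɒcc", "ʌbb"]
for word in sorted(words, key=ignoring_unknown_key):
    print(word)
```

Under this simplified model 'ʌʌʌ' compares as an empty string and sorts first, while 'ɒbb' and 'ʌbb' both compare as 'bb' and land next to each other, which is the order the reporter observed.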

You are correct both that setting LC_ALL=POSIX will fix the sorting problem, and that it will break display of the output. The solution to this is to instead set LC_COLLATE=C (or LC_COLLATE=POSIX, if you prefer), which will let you change the sorting order independently of the character set, output language, and other features of the locale.
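
A minimal sketch of that advice, driven from Python: build an environment in which only the collation category is forced to C, then run the external 'sort' command with it. The helper names `collation_env` and `sort_lines` are hypothetical. The sketch drops LC_ALL because, when set, it takes precedence over LC_COLLATE, and it assumes the character set comes from LANG or LC_CTYPE rather than from LC_ALL.

```python
import os
import subprocess

def collation_env(base_env):
    """Copy base_env, forcing only the collation category to the C locale."""
    env = dict(base_env)
    # LC_ALL, when set, takes precedence over LC_COLLATE, so drop it.
    env.pop("LC_ALL", None)
    env["LC_COLLATE"] = "C"
    return env

def sort_lines(lines):
    """Pipe lines through the external sort command with C collation."""
    result = subprocess.run(
        ["sort"],
        input="".join(line + "\n" for line in lines),
        env=collation_env(os.environ),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout.splitlines()
```

Calling `sort_lines(["aaa", "ʌʌʌ", "ɒbb"])` on a system with 'sort' on the PATH should then compare the lines byte by byte, independently of the language settings used for display.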

Changed in coreutils:
status: New → Invalid
Yannis Tsop (ogiannhs) wrote :

input:
@3
a3
ae
@e

yannis@earth:~/NetBeansProjects/PrErg1$ LC_ALL=en_us.utf8 sort
@3
@e
a3
ae
yannis@earth:~/NetBeansProjects/PrErg1$ sort
@3
a3
ae
@e

This cannot be correct: all the '@' lines should be together. Or, as I understand it, '@' is just ignored.

era (era) wrote :

@Yannis T: Your example seems distinct from what is discussed here. Your input does not contain any utf-8 characters.

Please submit a new bug report (if, after reading the FAQ which is linked in earlier comments to this bug report, you are confident that you have a genuine bug).

Micah Cowan (micahcowan) wrote :

I'm fairly confident it isn't a genuine bug; in some locales, the @ will definitely be ignored (relative to surrounding alphabetic chars?). The output of "locale" will hopefully clarify what is being used (the LC_COLLATE value should be the important one).
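
As a rough guide to reading that output, the collation locale is chosen by the usual POSIX precedence: LC_ALL first, then LC_COLLATE, then LANG. The sketch below encodes that lookup. The function name `effective_collation` is invented here, and the fallback to "C" when nothing is set is an assumption (POSIX leaves the default implementation-defined).

```python
def effective_collation(env):
    """Return the locale name that governs collation for an environment mapping."""
    for name in ("LC_ALL", "LC_COLLATE", "LANG"):
        value = env.get(name)
        if value:  # unset or empty variables are skipped
            return value
    # Assumption: with nothing set, fall back to the C locale.
    return "C"

print(effective_collation({"LANG": "en_US.UTF-8", "LC_COLLATE": "C"}))
print(effective_collation({"LC_ALL": "en_US.UTF-8", "LC_COLLATE": "C"}))
```

Passing `os.environ` (after `import os`) applies the same lookup to the current shell environment. The second call shows why setting LC_COLLATE has no effect while LC_ALL is still exported.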

era (era) wrote :

Me too; but the issue has been rehashed many times, so I figured the FAQ would be useful to point to. But it turns out it's kind of uninformative, actually. My bad.
