Ubuntu
coreutils package

'sort' does not correctly sort non-latin utf-8 encoded text

Bug #71386 reported by Luzius Thöny on 2006-11-11

Affects             Status   Importance  Assigned to  Milestone
coreutils (Ubuntu)  Invalid  Undecided   Unassigned

Bug Description

Trying to sort the characters of a utf-8 encoded file 'ipa_phrase.txt', containing this:
--------
wen wiː lʊk tuː ðə ɪndɪvɪdʒuːəlz ɒv ðə seɪm veraɪetiː ɔːr sʌb veraɪetiː ɒv aʊr ɒʊldɜː kʌltɪveɪtɪd plɑːnts ænd ænɪməlz, wʌn ɒv ðə fɜːst pɒɪnts wɪtʃ straɪks ʌs, ɪz, ðæt ðeɪ dʒenɜːəliː dɪfɜː mʌtʃ mɔːr frɒm iːtʃ ʌðɜː, ðæn duː ðə ɪndɪvɪdʒuːəlz ɒv eniː wʌn spesiːs ɔːr veraɪetiː ɪn ə stət ɒv nætʃɜː.
---------
...with a command like this:
------
sed "s/\(.\)/\1\n/g" < ipa_phrase.txt | sort
-----
does not give the correct order for the non-latin characters.

i'm aware that there's a relevant FAQ entry in upstream documentation (http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021) which seems to have something to do with it, suggesting that the issue is with the 'locale' settings.
In fact, doing 'export LC_ALL=POSIX' before sorting does change the result, however that way the output will be garbled.

Luzius Thöny (lucius-antonius) wrote on 2006-11-11:

Attachment: utf-8 non-latin example file (395 bytes, text/plain)

attaching the mentioned file containing non-latin characters.

Micah Cowan (micahcowan) wrote on 2007-05-16:

Sorry, I'm not at all familiar with how the text you've given is intended to sort; could you please be more specific about what the expected results are, and what you actually obtained?

Also, did you perhaps mean one of:
sed "s/\(.*\)/\1\n/g" < ipa_phrase.txt | sort
or
sed "s/\(.\)/\1\n/g" < ipa_phrase.txt | uniq | sort
?

Changed in coreutils:
assignee:	nobody → micahcowan
status:	Unconfirmed → Needs Info

Luzius Thöny (lucius-antonius) wrote on 2007-05-16:

Attachment: result of the 'sorting' sed command (688 bytes, text/plain)

ok, let me explain my purpose: in a nutshell, i want to compile some statistics on the frequencies of the individual symbols in a specific text. for example, i want to know how much more frequent an 's' is compared to a 't'. the way to achieve this is to split the text up so that every letter/symbol occurs on an individual line, then sort it, and finally count the lines with the same symbol using 'uniq -c'. my sed script is intended to do just this (except the 'uniq -c' part), and i believe it is correct the way i wrote it.

the result i'm currently getting from the script run on the above text is attached, and it just looks very wrong to me. you may see that the normal letters (like 'n', 'r', or 's') are correctly sorted onto adjacent lines in the result, but not the IPA symbols like 'ʃ' or 'ʌ', which end up scattered across different places in the result file.
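
For reference, the counting described above can be sketched without 'sort' at all. This is only an illustration in Python 3, not what the reporter ran: the function name symbol_frequencies is made up for this example, and it assumes that whitespace should be skipped, which the sed pipeline does not do.

```python
from collections import Counter


def symbol_frequencies(text):
    """Count each non-whitespace character of text, keyed by Unicode code point."""
    return Counter(ch for ch in text if not ch.isspace())


if __name__ == "__main__":
    # Short excerpt from the IPA sample in the bug description.
    sample = "wen wiː lʊk tuː ðə"
    # Same layout as 'uniq -c': count first, then the symbol.
    for symbol, count in sorted(symbol_frequencies(sample).items()):
        print("{0:7d} {1}".format(count, symbol))
```

Because the counts are keyed on the characters themselves, identical symbols are grouped regardless of the collation rules of the current locale.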

Micah Cowan (micahcowan) on 2007-07-16

Changed in coreutils:
assignee:	micahcowan → nobody
status:	Incomplete → New

Luzius Thöny (lucius-antonius) wrote on 2007-08-23:

this is getting too complicated, let's try a simpler example.

start with this text:
-----
aaa
aab
ʌʌʌ
aba
ɒbb
ɒcc
ʌbb
-----

and run it through 'sort'. the result (on my machine) is:
-----
ʌʌʌ
aaa
aab
aba
ɒbb
ʌbb
ɒcc
-----

that's not properly sorted at all! what i want is this:

-----
aaa
aab
aba
ɒbb
ɒcc
ʌbb
ʌʌʌ
-----

(i may be wrong wrt the order of 'ʌ' and 'ɒ', but since the former is hex 'CA 8C' and the latter is 'C9 92', i'm guessing it should be this way.)
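
The byte values quoted above can be checked with a short Python 3 sketch. The helper name utf8_hex is hypothetical; the sketch only assumes that "this way" means ordering by raw UTF-8 bytes, which for valid UTF-8 is the same as ordering by code point.

```python
def utf8_hex(ch):
    """Return the UTF-8 encoding of ch as space-separated uppercase hex bytes."""
    return " ".join("{0:02X}".format(b) for b in ch.encode("utf-8"))


for ch in ("a", "ɒ", "ʌ"):
    print("{0}  U+{1:04X}  {2}".format(ch, ord(ch), utf8_hex(ch)))

lines = ["aaa", "aab", "ʌʌʌ", "aba", "ɒbb", "ɒcc", "ʌbb"]

# Compare the encoded bytes, as a byte-wise sort would.
by_bytes = sorted(lines, key=lambda s: s.encode("utf-8"))
# Compare the code points, which is what Python does for str by default.
by_code_point = sorted(lines)

print(by_bytes == by_code_point)
print("\n".join(by_bytes))
```

Both orderings place 'ɒ' (C9 92) before 'ʌ' (CA 8C), and both after every ASCII letter.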

Luzius Thöny (lucius-antonius) wrote on 2007-09-08:

here is a tiny python script that will do exactly what i expect from 'sort':

-----
#!/usr/bin/env python
# coding: utf-8

# unicode strings compare by code point, which is the order wanted here
words = [u'aaa', u'aab', u'ʌʌʌ', u'aba', u'ɒbb', u'ɒcc', u'ʌbb']

words.sort()

for s in words:
    print(s)
------

output on the console:

------
aaa
aab
aba
ɒbb
ɒcc
ʌbb
ʌʌʌ
------

Steve Langasek (vorlon) wrote on 2008-06-27:

Thank you for taking the time to report this issue and help to improve Ubuntu.

The sort order you're seeing is in fact correct according to the locale that you're using. Sort, or collation, order is defined on a per-locale basis, because languages don't all have the same alphabetization rules, and for most locales the practice is to ignore "unknown" characters when sorting. This behavior, while debatable, is not something that is ever likely to change, because doing so will break existing software that expects the current behavior from these locales.

You are correct both that setting LC_ALL=POSIX will fix the sorting problem, and that it will break display of the output. The solution to this is to instead set LC_COLLATE=C (or LC_COLLATE=POSIX, if you prefer), which will let you change the sorting order independently of the character set, output language, and other features of the locale.
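
A minimal sketch of that suggestion, driven from Python 3 so the result can be checked programmatically. The function name sort_with_c_collation is made up for this example, and it assumes a POSIX 'sort' is on the PATH. Note that LC_ALL, if set, overrides LC_COLLATE, so the sketch removes it from the child environment.

```python
import os
import subprocess


def sort_with_c_collation(lines):
    """Pipe lines through the system 'sort' with LC_COLLATE=C and return the result."""
    env = dict(os.environ)
    # LC_ALL takes precedence over every other LC_* variable, so drop it.
    env.pop("LC_ALL", None)
    env["LC_COLLATE"] = "C"
    data = "".join(line + "\n" for line in lines).encode("utf-8")
    result = subprocess.run(
        ["sort"], input=data, stdout=subprocess.PIPE, env=env, check=True
    )
    return result.stdout.decode("utf-8").splitlines()


if __name__ == "__main__":
    sample = ["aaa", "aab", "ʌʌʌ", "aba", "ɒbb", "ɒcc", "ʌbb"]
    print("\n".join(sort_with_c_collation(sample)))
```

From a shell the equivalent is to prefix the command with the variable, as in 'LC_COLLATE=C sort', provided LC_ALL is not set in that session.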

Changed in coreutils:
status:	New → Invalid

Yannis Tsop (ogiannhs) wrote on 2009-03-30:

input:
@3
a3
ae
@e

yannis@earth:~/NetBeansProjects/PrErg1$ LC_ALL=en_us.utf8 sort
@3
@e
a3
ae
yannis@earth:~/NetBeansProjects/PrErg1$ sort
@3
a3
ae
@e

this cannot be correct since all (@) values should be together, or as I get it @ is just ignored.

era (era) wrote on 2009-03-30:

@Yannis T: Your example seems distinct from what is discussed here. Your input does not contain any utf-8 characters.

Please submit a new bug report (if, after reading the FAQ which is linked in earlier comments to this bug report, you are confident that you have a genuine bug).

Micah Cowan (micahcowan) wrote on 2009-03-30:

I'm fairly confident it isn't; in some locales, the @ will definitely be ignored (relative to surrounding alphabetic chars?). The output of "locale" will hopefully clarify what is being used (the LC_COLLATE value should be the important one).

era (era) wrote on 2009-03-30:

#10

Me too; but the issue has been rehashed many times, so I figured the FAQ would be useful to point to. But it turns out it's kind of uninformative, actually. My bad.
