'sort' does not correctly sort non-latin utf-8 encoded text
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
coreutils (Ubuntu) | Invalid | Undecided | Unassigned |
Bug Description
Trying to sort the characters of a UTF-8 encoded file 'ipa_phrase.txt', containing this:
--------
wen wiː lʊk tuː ðə ɪndɪvɪdʒuːəlz ɒv ðə seɪm veraɪetiː ɔːr sʌb veraɪetiː ɒv aʊr ɒʊldɜː kʌltɪveɪtɪd plɑːnts ænd ænɪməlz, wʌn ɒv ðə fɜːst pɒɪnts wɪtʃ straɪks ʌs, ɪz, ðæt ðeɪ dʒenɜːəliː dɪfɜː mʌtʃ mɔːr frɒm iːtʃ ʌðɜː, ðæn duː ðə ɪndɪvɪdʒuːəlz ɒv eniː wʌn spesiːs ɔːr veraɪetiː ɪn ə stət ɒv nætʃɜː.
---------
...with a command like this:
------
sed "s/\(.\)/\1\n/g" < ipa_phrase.txt | sort
-----
does not give the correct order for the non-latin characters.
I'm aware that there's a relevant FAQ entry in upstream documentation (http://
In fact, running 'export LC_ALL=POSIX' before sorting does change the result; however, the output is then garbled.
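
A possible explanation for the garbled output, offered as a guess rather than a confirmed diagnosis: with LC_ALL=POSIX exported for the whole pipeline, sed also runs in the POSIX locale, where '.' matches a single byte, so each multi-byte UTF-8 character is broken apart before sort ever sees it. The sketch below sets the C locale for sort only. Input that was already split into whole characters stays intact and is ordered by byte value, which for valid UTF-8 is the same as code point order (not a linguistic order). The function name sort_by_codepoint and the five sample characters are made up for illustration.

```shell
# Hypothetical helper: sort lines by raw byte value, ignoring the
# collation rules of the user's locale. Only sort sees LC_ALL=C.
sort_by_codepoint() {
    LC_ALL=C sort
}

# A few characters from the phrase, already split one per line.
sorted=$(printf '%s\n' 'ɪ' 'a' 'ð' 'z' 'ə' | sort_by_codepoint)

# ASCII letters come first, then the others in code point order:
# a, z, ð (U+00F0), ə (U+0259), ɪ (U+026A)
printf '%s\n' "$sorted"
```

For the original file, the equivalent would be `sed "s/\(.\)/\1\n/g" < ipa_phrase.txt | LC_ALL=C sort`, assuming the session locale is a UTF-8 one so that sed splits on whole characters.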
Changed in coreutils:
assignee: micahcowan → nobody
status: Incomplete → New
Attaching the mentioned file containing non-latin characters.