sort doesn't sort and uniq loses data for many non-Latin scripts on UTF-8 locales

Bug #1774857 reported by Miikka-Markus Alhonen
This bug affects 1 person
Affects             Status  Importance  Assigned to  Milestone
coreutils (Ubuntu)  New     Undecided   Unassigned
glibc (Ubuntu)      New     Undecided   Unassigned

Bug Description

I’ve found that sort doesn’t sort strings in many non-Latin scripts at all when the locale is en_US.UTF-8, fr_FR.UTF-8 or fi_FI.UTF-8 (probably others too, but these are the ones I have tested). With the locales “C” and ko_KR.UTF-8, things work as expected. Here’s a test case:

Open xterm, launch sort and enter a few lines of Syriac, Ethiopic, Korean, Japanese (Hiragana or Katakana, not Han) or Thai text, entering one of the lines twice. Here’s an example in Syriac:

ܡܠܬܐ
ܒܝܬܐ
ܒܪܢܫܐ
ܡܠܬܐ

Sort produces the following:

ܡܠܬܐ
ܒܝܬܐ
ܡܠܬܐ
ܒܪܢܫܐ

Here the strings are ordered only by their length, not by their characters. Even the two instances of the word ܡܠܬܐ end up on non-adjacent lines (1 and 3). The expected sort order based on Unicode code points would be:

ܒܝܬܐ
ܒܪܢܫܐ
ܡܠܬܐ
ܡܠܬܐ
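
As a minimal sketch of how to get this order on an affected system, the pipeline below forces the “C” locale for sort only (the report states that “C” works as expected). It assumes the input is valid UTF-8, in which case plain byte order coincides with code point order; the variable name `sorted` is only illustrative.

```shell
# Run sort in the C locale for this one command only. In the C locale sort
# compares raw bytes, and for valid UTF-8 that matches code point order.
sorted=$(printf '%s\n' 'ܡܠܬܐ' 'ܒܝܬܐ' 'ܒܪܢܫܐ' 'ܡܠܬܐ' | LC_ALL=C sort)

# Print the result, one word per line.
printf '%s\n' "$sorted"
```

Byte order is not a linguistically correct collation, so this only helps where a stable, deterministic order is enough.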

If you further pass sort’s output to uniq, it produces the following:

ܡܠܬܐ
ܒܪܢܫܐ

Here the word on line 2 ܒܝܬܐ is completely lost since, like sort, uniq seems to consider all Syriac strings of equal length as the same.
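
A possible way to avoid the data loss, sketched under the assumption that byte-wise comparison is acceptable, is to run both sort and uniq in the “C” locale so that lines are treated as equal only when their bytes are identical. The variable names are illustrative.

```shell
# The four input lines from the test case above (one word appears twice).
input=$(printf '%s\n' 'ܡܠܬܐ' 'ܒܝܬܐ' 'ܒܪܢܫܐ' 'ܡܠܬܐ')

# Both commands run in the C locale, so only byte-identical lines are merged.
unique=$(printf '%s\n' "$input" | LC_ALL=C sort | LC_ALL=C uniq)

# Three distinct words should remain, including the one lost above.
printf '%s\n' "$unique"
```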

Although this issue depends on the locale, I think it is not a locale issue per se, since perl seems to handle similar cases as expected. For instance, the following command produces the expected result:

perl -CDS -e 'use locale; use utf8; @str = ("ܡܠܬܐ", "ܒܝܬܐ", "ܒܪܢܫܐ", "ܡܠܬܐ"); foreach $i (sort @str) { print "$i\n"; }'

Curiously enough, codepoints in Plane 1 seem to count as two codepoints of the basic plane, so that if you sort | uniq the following (six codepoints of Syriac and three codepoints of Phoenician):

ܥܠܝܟܘܢ
𐤁𐤉𐤕

you get “ܥܠܝܟܘܢ” as the result, whereas “𐤁𐤉𐤕” is lost. This is presumably because a Plane 1 character takes four bytes in UTF-8, twice the two bytes of a Syriac character, so both lines are 12 bytes long. (Surrogate pairs exist only in UTF-16, not in UTF-8.)
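
The lengths involved can be checked by counting the bytes of each line's UTF-8 encoding. This is only a sketch: `utf8_bytes` is a hypothetical helper, not part of coreutils, and it assumes the strings are stored as UTF-8.

```shell
# Hypothetical helper: number of bytes in the UTF-8 encoding of a string.
# wc -c counts bytes (not characters); tr strips any padding wc may add.
utf8_bytes() {
    printf '%s' "$1" | LC_ALL=C wc -c | tr -d ' '
}

syriac_len=$(utf8_bytes 'ܥܠܝܟܘܢ')    # six code points, two bytes each
phoenician_len=$(utf8_bytes '𐤁𐤉𐤕')   # three code points, four bytes each

echo "Syriac: $syriac_len bytes, Phoenician: $phoenician_len bytes"
```

Both lines come out at the same byte length, which would be consistent with the two being treated as equal, although the report itself does not confirm that byte length is the deciding factor.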

Also curiously, LTR scripts seem to conflate with each other and RTL scripts among themselves but not across the directionality line, so that if you sort | uniq the following (three codepoints each in Ethiopic, Hangul, Syriac, Hiragana and Thai):

ዘመን
스물셋
ܐܢܐ
わたし
ฟ้า

you are left with:

ܐܢܐ
ዘመን

That’s one line of Syriac and one line of Ethiopic; everything else was lost. This issue does not seem to affect most Indic scripts (Devanagari, Bengali, Telugu etc.) or Arabic. For CJK, things work as expected for the main Unicode block (4E00..9FFF) but not for Extension A (3400..4DBF, such as 㗖 or 㡘 or 㰋). For Greek, monotonic accents work fine but all polytonic letters are conflated (αὐλὸς and αὐλῆς conflate to αὐλῆς). For Hebrew, letters and vowel marks work fine but cantillation marks are conflated.
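
For the five-line example above, a byte count per line may be a simpler explanation than directionality. The sketch below assumes UTF-8 input and precomposed Hangul syllables; `report` is an illustrative variable name, and the reading that byte length drives the conflation is an assumption, not something established in this report.

```shell
# Print "<byte length><TAB><line>" for each of the five test lines.
report=$(
    for line in 'ዘመን' '스물셋' 'ܐܢܐ' 'わたし' 'ฟ้า'; do
        bytes=$(printf '%s' "$line" | LC_ALL=C wc -c | tr -d ' ')
        printf '%s\t%s\n' "$bytes" "$line"
    done
)
printf '%s\n' "$report"
```

The Syriac line is the only one encoded with two bytes per code point; the Ethiopic, Hangul, Hiragana and Thai lines all use three bytes per code point, so those four share one length and Syriac has another.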

Description: Ubuntu 18.04 LTS
Release: 18.04

coreutils:
  Installed: 8.28-1ubuntu1
  Candidate: 8.28-1ubuntu1
  Version table:
 *** 8.28-1ubuntu1 500
        500 http://mr.archive.ubuntu.com/ubuntu bionic/main amd64 Packages
        100 /var/lib/dpkg/status

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: coreutils 8.28-1ubuntu1
ProcVersionSignature: Ubuntu 4.15.0-22.24-generic 4.15.17
Uname: Linux 4.15.0-22-generic x86_64
ApportVersion: 2.20.9-0ubuntu7.1
Architecture: amd64
CurrentDesktop: ubuntu:GNOME
Date: Sun Jun 3 10:13:06 2018
InstallationDate: Installed on 2017-02-13 (474 days ago)
InstallationMedia: Ubuntu 16.10 "Yakkety Yak" - Release amd64 (20161012.2)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=fi_FI.UTF-8
 SHELL=/bin/bash
SourcePackage: coreutils
UpgradeStatus: Upgraded to bionic on 2018-05-31 (2 days ago)

Revision history for this message
Miikka-Markus Alhonen (malhonen) wrote :

Since nobody has reacted to this report for a couple of months, I decided to file an upstream report at https://debbugs.gnu.org/cgi/bugreport.cgi?bug=32472

Miikka-Markus Alhonen (malhonen) wrote :

One user on debbugs.gnu.org reported that the problem is more likely related to the locale / glibc than coreutils, and that it occurs on Ubuntu 18.04 but not Fedora 28, in case that helps any. He thought it might have already been fixed in glibc, since Fedora tends to be more up to date than Ubuntu.

Adam Conrad (adconrad) wrote :

Using the first test case, this does appear to be fixed in cosmic (glibc 2.28) and beyond, and to affect only bionic (glibc 2.27), which certainly implies that either an upstream or a Debian fix slipped in between the two. I'm not sure I'll have the bandwidth to dig into it this SRU cycle, but I'll try to look again when I can.
