sort order incorrect

Bug #75705 reported by Kevin Scannell on 2006-12-14
Affects: glibc (Ubuntu)
Importance: Medium
Assigned to: Unassigned
Nominated for Jaunty by Yannis Tsop

Bug Description

Binary package hint: coreutils

sort is not behaving as expected under ga_IE.utf8

To reproduce: create a UTF-8 file "test" with "a" on one line, and "á" on another. Then run:

$ LC_ALL=ga_IE.utf8 sort test
a
á

This is correct - the accented character collates after the unaccented one.

But now change the two lines to "aá" and "áa":
$ LC_ALL=ga_IE.utf8 sort test
áa
aá

Now the accented character collates first. The second command gives the correct order ("aá" followed by "áa")
on all other distros I've used (Gentoo for example).

Kevin Scannell (kscanne) wrote :

More information. The same error occurs with the other locales I've tried, including en_US.UTF-8, fr_FR.UTF-8, etc.

If this is a locale definition problem, one wouldn't be surprised to see the same thing in all locales, since LC_COLLATE is defined via a common file: /usr/share/i18n/locales/iso14651_t1

To see if the locale files were at fault, I copied over the "iso14651_t1" file from my Gentoo machine (where sort works correctly), ran "locale-gen", and rebooted to be safe. But sort is still broken. Could there be some Ubuntu-specific glibc patch that is causing the different behavior?

Kevin Scannell (kscanne) wrote :

More information still. The problem lies with the strcoll function. I verified this with a small C program that tries to sort the two strings in question: with any of the utf8 locales, strcoll gives the wrong answer.
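A check along these lines can be sketched as follows. This is a minimal sketch in Python rather than the C program mentioned above (which is not included in this report); it relies on Python's `locale.strcoll`, which uses the C library's collation. The helper name `collation_order` is hypothetical, and the locale names in the usage note are the ones discussed in this report and may not be installed on a given machine.

```python
import locale


def collation_order(a, b, locale_name):
    """Return -1, 0 or 1 for how `a` compares to `b` under the
    LC_COLLATE rules of `locale_name` (hypothetical helper)."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        # Raises locale.Error if the locale has not been generated.
        locale.setlocale(locale.LC_COLLATE, locale_name)
        result = locale.strcoll(a, b)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore the old setting
    # Normalise to a sign, since only the sign of strcoll is meaningful.
    return (result > 0) - (result < 0)
```

For the strings in this report, `collation_order("aá", "áa", "ga_IE.utf8")` would be expected to return -1 if "aá" collates first; a result of 1 would correspond to the behavior reported here.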

I don't know how to change the "Affects" field of this bug, but it should therefore be a libc bug. My version of libc:
ii libc6-dev 2.4-1ubuntu12 GNU C Library: Development Libraries

I posted this to the bug-coreutils list too and there was some discussion, including one person who reproduced it with Debian unstable - his glibc is:
ii libc6-dev 2.3.6.ds1-9 GNU C Library: Development Libraries

That thread starts here:
http://lists.gnu.org/archive/html/bug-coreutils/2007-01/msg00184.html

This seems like something pretty serious since it affects default locales and any application that requires sorting, which would be just about everything.

Micah Cowan (micahcowan) wrote :

Confirmed on Ubuntu 6.10.1

Changed in coreutils:
status: Unconfirmed → Confirmed

Micah Cowan (micahcowan) wrote :

sort appears to work properly in Ubuntu 7.04, with libc6-2.5-0ubuntu14. I don't know why I confirmed this bug for coreutils, as Kevin's comments clearly indicate that this is a problem with libc.

Changed in coreutils:
importance: Undecided → Medium

Micah Cowan (micahcowan) wrote :

Actually, I'm not sure it does work properly. Collation works when I use the given example (LC_ALL=ga_IE.utf8, which I don't have installed) or if I use ja_JP.utf8 (which I do); but if I use en_US.utf8, it's still broken. Must be a bug in the locale definition. I do have a "ga_IE" in /usr/share/i18n, though, and it seems to use the same iso14651_t1 file for collation as does en_US.

Micah Cowan (micahcowan) wrote :

Okay, sorry: used locale-gen to produce a ga_IE.utf8; still broken now. Must have been silently falling back to POSIX or somesuch.
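To rule out the kind of fallback suspected here, it can help to check that a locale is actually available before trusting a test result. The sketch below assumes Python's `locale.setlocale`, which raises `locale.Error` when the C library cannot load the requested locale; the helper name `locale_is_installed` is hypothetical.

```python
import locale


def locale_is_installed(name):
    """Return True if the C library can load `name` for LC_COLLATE
    (hypothetical helper; restores the previous setting afterwards)."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        return True
    except locale.Error:
        # The locale has not been generated or does not exist.
        return False
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)
```

If `locale_is_installed("ga_IE.utf8")` returns False, a run with LC_ALL=ga_IE.utf8 says nothing about that locale's collation rules.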

Matthias Klose (doko) wrote :

> Okay, sorry: used locale-gen to produce a ga_IE.utf8; still broken now.
> Must have been silently falling back to POSIX or somesuch.

closing based on this comment. please reopen if you disagree.

Changed in glibc:
status: Confirmed → Fix Released

Kevin Scannell (kscanne) wrote :

> closing based on this comment. please reopen if you disagree.

Micah's comment is that it is *still broken* in the ga_IE.utf8 locale. And indeed it remains broken for me.

Changed in glibc:
status: Fix Released → Confirmed

Yannis Tsop (ogiannhs) wrote :

input:
@3
a3
ae
@e

yannis@earth:~/NetBeansProjects/PrErg1$ LC_ALL=en_us.utf8 sort
@3
@e
a3
ae
yannis@earth:~/NetBeansProjects/PrErg1$ sort
@3
a3
ae
@e

declare -x LANG="en_US.UTF-8"
declare -x LANGUAGE="en_US:en"

latimerio (fomember) wrote :

To me it appears that it's not a locale problem, but that the -f option is on by default,
e.g.
{ echo a
   echo j
   echo A
   echo i
   echo AA
   echo B
} | sort

produces
a
A
AA
B
i
j

instead of
A
AA
B
a
i
j
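The second listing is plain code-point (byte) order, in which all uppercase ASCII letters come before all lowercase ones; that is what the "C" locale gives. The sketch below contrasts it with locale-aware sorting via `locale.strxfrm`. The helper names are hypothetical, and `locale_sorted` only differs from code-point order when given a locale with its own collation rules, such as en_US.UTF-8 where installed.

```python
import locale


def codepoint_sorted(items):
    """Sort by raw code point, as the "C" locale does."""
    return sorted(items)


def locale_sorted(items, locale_name):
    """Sort using the LC_COLLATE rules of `locale_name`
    (hypothetical helper; raises locale.Error if it is not installed)."""
    saved = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(items, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


words = ["a", "j", "A", "i", "AA", "B"]
print(codepoint_sorted(words))  # ['A', 'AA', 'B', 'a', 'i', 'j']
```

The first listing, with "a" and "A" adjacent, is consistent with locale collation rather than with -f being switched on.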

Kevin Scannell (kscanne) wrote :

latimerio, thanks for looking at this. In comment #2 I verified that it doesn't have anything to do with "sort" per se, but it's really a libc (specifically strcoll) issue. My best test case is the following set of strings

aa
aá
áa
áá

which should sort as they appear above (since a < á), but still sort like this:

aa
áa
aá
áá

tneems (tneems) wrote :

I recently ran across this on 12.04 and 14.04 and managed to reproduce it with only ASCII characters.

Input
echo "Z
X
Signal
Sign
Sign Problem
Signal Problem
Cc a
Cc
B
A" | sort

Output
A
B
Cc
Cc a
Sign
Signal
Signal Problem
Sign Problem
X
Z
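This output is consistent with a collation that ignores spaces and punctuation, and treats case as secondary, on its first comparison pass. That is an assumption about the mechanism, not something confirmed in this thread. The sketch below only approximates such a first pass with a hypothetical key function; it does not use the real locale tables.

```python
def first_pass_key(line):
    """Approximate a first collation pass (assumption): keep only letters
    and digits, then fold case. Not the actual glibc algorithm."""
    return "".join(ch for ch in line if ch.isalnum()).casefold()


def approx_locale_sort(lines):
    """Sort lines by the approximate first-pass key above."""
    return sorted(lines, key=first_pass_key)


lines = ["Z", "X", "Signal", "Sign", "Sign Problem",
         "Signal Problem", "Cc a", "Cc", "B", "A"]
for line in approx_locale_sort(lines):
    print(line)
```

With this key, "Signal Problem" compares as "signalproblem" and "Sign Problem" as "signproblem", so the former sorts first, matching the output above. The same key also reproduces the @3, a3, ae, @e order from the earlier comment.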
