wrong glibc sort order on pt_BR

Bug #82302 reported by Eduardo
Affects          Status     Importance  Assigned to  Milestone
GLibC            Confirmed  Medium
glibc (Suse)     Unknown    Medium
glibc (Ubuntu)   Confirmed  Medium      Unassigned

Bug Description

Tested on Ubuntu 6.06.1, fully updated.

In pt_BR, glibc doesn't take spaces into account in the sort order.

An example, using the "sort" command on this list:

GABRIELA HELEDA DE SOUZA
GABRIEL ALCIDES KLIM PERONDI
GABRIELA LETICIA BATISTA NUNES
GABRIELA JACOBY NOS
GABRIEL ALEXANDRE DA SILVA MANICA
GÁBRIEL ALCIDES KLIM PERONDI
GÁBRIELA JACOBY NOS

But the right order is:

GABRIEL ALCIDES KLIM PERONDI
GÁBRIEL ALCIDES KLIM PERONDI
GABRIEL ALEXANDRE DA SILVA MANICA
GABRIELA HELEDA DE SOUZA
GABRIELA JACOBY NOS
GÁBRIELA JACOBY NOS
GABRIELA LETICIA BATISTA NUNES

I found that I can change that in the pt_BR file under /usr/share/i18n/locales, by adding:

reorder-after <U00A0>
<U0020><CAP>;<CAP>;<CAP>;<U0020>
reorder-end

in the LC_COLLATE section. After generating the locale again, I get the right
sort order.
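
The ordering requested in this report can be sketched in a few lines of Python. This is only an illustration of the expected result, not glibc's collation algorithm and not a reading of NBR 6033. The helper names `strip_accents` and `word_by_word_key` are made up for this example, and the key rests on a simplifying assumption: compare the accent-stripped text with spaces significant, then fall back to the accented form to break ties.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining marks removed (for example, Á becomes A)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def word_by_word_key(text):
    """Hypothetical sort key: spaces count, accents only break ties."""
    composed = unicodedata.normalize("NFC", text)
    # Space (U+0020) has a lower code point than any letter, so a shorter
    # first word sorts ahead of a longer one: "GABRIEL ..." < "GABRIELA ...".
    return (strip_accents(composed), composed)


names = [
    "GABRIELA HELEDA DE SOUZA",
    "GABRIEL ALCIDES KLIM PERONDI",
    "GABRIELA LETICIA BATISTA NUNES",
    "GABRIELA JACOBY NOS",
    "GABRIEL ALEXANDRE DA SILVA MANICA",
    "GÁBRIEL ALCIDES KLIM PERONDI",
    "GÁBRIELA JACOBY NOS",
]

ordered = sorted(names, key=word_by_word_key)
print("\n".join(ordered))
```

With the seven names from the report, this prints them in the order given above as "the right order".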

Revision history for this message
In , Walter Cruz (waltercruz) wrote :

Hi all.

In pt_BR, glibc doesn't take spaces into account in the sort order.

An example, with this list:

GABRIELA HELEDA DE SOUZA
GABRIEL ALCIDES KLIM PERONDI
GABRIELA LETICIA BATISTA NUNES
GABRIELA JACOBY NOS
GABRIEL ALEXANDRE DA SILVA MANICA
GÁBRIEL ALCIDES KLIM PERONDI
GÁBRIELA JACOBY NOS

But the right order is:

GABRIEL ALCIDES KLIM PERONDI
GÁBRIEL ALCIDES KLIM PERONDI
GABRIEL ALEXANDRE DA SILVA MANICA
GABRIELA HELEDA DE SOUZA
GABRIELA JACOBY NOS
GÁBRIELA JACOBY NOS
GABRIELA LETICIA BATISTA NUNES

I found that I can change that in the pt_BR file under /usr/share/i18n/locales, by adding:

reorder-after <U00A0>
<U0020><CAP>;<CAP>;<CAP>;<U0020>
reorder-end

in the LC_COLLATE section. After generating the locale again, I get the right
sort order.

Revision history for this message
In , Eduardo (edurbs-gmail) wrote :

When using the "sort" command, this is the wrongly sorted list:
~$ sort list.txt
GABRIELA HELEDA DE SOUZA
GABRIELA JACOBY NOS
GÁBRIELA JACOBY NOS
GABRIEL ALCIDES KLIM PERONDI
GÁBRIEL ALCIDES KLIM PERONDI
GABRIELA LETICIA BATISTA NUNES
GABRIEL ALEXANDRE DA SILVA MANICA

Tested on Ubuntu 6.06, Fedora Core 3, Red Hat 9 and openSUSE 10.2 (all i386),
with the same wrong sort order.
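
The output above is consistent with a comparison that skips spaces entirely on the first pass. The following Python sketch reproduces that order under this assumption. It is an emulation for illustration only, not glibc's implementation, and `strip_accents` and `space_blind_key` are hypothetical names.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining marks removed (for example, Á becomes A)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def space_blind_key(text):
    """Hypothetical key that drops spaces before comparing base letters."""
    composed = unicodedata.normalize("NFC", text)
    # With spaces removed, "GABRIEL ALCIDES" is compared as "GABRIELALCIDES",
    # so it lands between "GABRIELA JACOBY" and "GABRIELA LETICIA".
    primary = strip_accents(composed).replace(" ", "")
    return (primary, composed)


names = [
    "GABRIELA HELEDA DE SOUZA",
    "GABRIEL ALCIDES KLIM PERONDI",
    "GABRIELA LETICIA BATISTA NUNES",
    "GABRIELA JACOBY NOS",
    "GABRIEL ALEXANDRE DA SILVA MANICA",
    "GÁBRIEL ALCIDES KLIM PERONDI",
    "GÁBRIELA JACOBY NOS",
]

observed = sorted(names, key=space_blind_key)
print("\n".join(observed))
```

This prints the same sequence as the `sort list.txt` output quoted in this comment.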

Revision history for this message
In , Petter Reinholdtsen (pere-hungry) wrote :

Can you provide any references specifying that space should be handled
as a letter when sorting in Brazilian Portuguese? Because if not, I suspect
you are mistaken when you believe space should be sorted that way.

Revision history for this message
In , Walter Cruz (waltercruz) wrote :

(In reply to comment #2)
> Can you provide any references specifying that space should be handled
> as a letter when sorting in Brazilian Portuguese? Because if not, I suspect
> you are mistaken when you believe space should be sorted that way.

The rules are defined by ABNT (Associação Brasileira de Normas Técnicas) in a
standard called NBR 6033, but the document isn't publicly available.

But, as edurbs and I are native speakers, I think that you should believe us :D

[]'s
- Walter

Revision history for this message
In , Keld Simonsen (keld) wrote :

Subject: Re: sort order on pt_BR

On Tue, Jan 30, 2007 at 04:23:22PM -0000, pere at hungry dot com wrote:
>
> ------- Additional Comments From pere at hungry dot com 2007-01-30 16:23 -------
> Can you provide any references specifying that space should be handled
> as a letter when sorting in Brazilian Portuguese? Because if not, I suspect
> you are mistaken when you believe space should be sorted that way.

In most languages using a script with letters, you have two ordering
schemes, the standard one, and the word-by-word one. In the latter, space
is significant on the first level. So both are correct, culturally.

I don't know of an easy way to make both schemes available
to the user, other than providing two locales, with a small delta
(reorder-after) to make the word-by-word locale, plus a general
naming scheme so the user can choose easily, like the @euro variants.

best regards
Keld

Changed in glibc:
status: Unknown → Confirmed
Changed in glibc:
status: Confirmed → Needs Info
Revision history for this message
In , Danielcristian (danielcristian) wrote :

(In reply to comment #0)
> I found that I can change that in the pt_BR file under /usr/share/i18n/locales, by adding:
>
> reorder-after <U00A0>
> <U0020><CAP>;<CAP>;<CAP>;<U0020>
> reorder-end
>
> in the LC_COLLATE section. After generating the locale again, I get the right
> sort order.

It didn't work on Fedora 5. After changing the pt_BR file and running
the following command, I still have the same problem...
localedef -i pt_BR -c -f ISO-8859-1 -A /usr/share/locale/locale.alias pt_BR

Did I do something wrong?

Kind regards...

Revision history for this message
Matthias Klose (doko) wrote :

> Can you provide any references specifying that space should be handled
> as a letter when sorting in Brazilian Portuguese? Because if not, I suspect
> you are mistaken when you believe space should be sorted that way.

The rules are defined by ABNT (Associação Brasileira de Normas Técnicas) in a
standard called NBR 6033, but the document isn't publicly available.

But, as edurbs and I are native speakers, I think that you should believe us :D

Changed in glibc:
importance: Undecided → Medium
status: Unconfirmed → Needs Info
Changed in glibc:
status: Unknown → Confirmed
Revision history for this message
In , Danielcristian (danielcristian) wrote :

(In reply to comment #5)
> Did I do something wrong?

Yes, I did. I put a space between <U0020> and <CAP>.

But the ordering is still strange: 'a', 'á', 'ã' and 'à' should be treated as
the same character, yet they are ordered as if they were different.

Sorry...

Revision history for this message
In , Luiz-planit (luiz-planit) wrote :

Hi Daniel

How is it ordering them?
I ran some tests, and the behavior when ordering these characters is the same
with and without the proposed change.
Maybe this is another bug?

(In reply to comment #6)
> (In reply to comment #5)
> > Did I do something wrong?
>
> Yes, I did. I put a space between <U0020> and <CAP>.
>
> But the ordering is still strange: 'a', 'á', 'ã' and 'à' should be treated as
> the same character, yet they are ordered as if they were different.
>
> Sorry...
>

Changed in glibc:
status: Confirmed → Needs Info
Revision history for this message
In , Pierre Habouzit (madcoder) wrote :

(In reply to comment #0)
> Hi all.
>
> In pt_BR, glibc doesn't take spaces into account in the sort order.

FWIW fr_FR is hit as well, and many other locales are too.

cat a; echo "==========="; LC_ALL=fr_FR sort a
GABRIELA HELEDA DE SOUZA
GABRIEL ALCIDES KLIM PERONDI
GABRIELA LETICIA BATISTA NUNES
GABRIELA JACOBY NOS
GABRIEL ALEXANDRE DA SILVA MANICA
GÁBRIEL ALCIDES KLIM PERONDI
GÁBRIELA JACOBY NOS
===========
GABRIELA HELEDA DE SOUZA
GABRIELA JACOBY NOS
GÁBRIELA JACOBY NOS
GABRIEL ALCIDES KLIM PERONDI
GÁBRIEL ALCIDES KLIM PERONDI
GABRIELA LETICIA BATISTA NUNES
GABRIEL ALEXANDRE DA SILVA MANICA

> I found that I can change that in the pt_BR file under /usr/share/i18n/locales, by adding:
>
> reorder-after <U00A0>
> <U0020><CAP>;<CAP>;<CAP>;<U0020>
> reorder-end
>
> in the LC_COLLATE section. After generating the locale again, I get the right
> sort order.

Changed in glibc:
status: Incomplete → Invalid
Revision history for this message
In , Guilherme de S. Pastore (fatalerror) wrote :

Petter,

I can assure you that the proposed one is the behaviour any Brazilian would
expect since the age of 6, when they learn how to sort at school, right after
learning the alphabet.

If it is *really* necessary, I can pay for web access to the already mentioned
lousy 5-page document from ABNT which defines the technical norm for sorting
just to show you, but you may guess I'm not eager to :)

Revision history for this message
Daniel Vainsencher (dvainsencher) wrote :

It's not a mistake. I confirm the information above.
The ABNT site is http://www.abnt.org.br/ and the standard for Brazilian sort order is NBR 6033:1989 (NB 106). But as Eduardo said, the document isn't publicly available, and it is in Brazilian Portuguese ;-)
Is there another way to convince you of that?

Revision history for this message
Matthias Klose (doko) wrote :

> Is there another way to convince you of that?

No. It's not the language of the document, it's the availability.

Revision history for this message
In , Email-daniel-h (email-daniel-h) wrote :

Hi, everybody. First of all, I apologize for my poor writing skills. English is
not my native language.

The pt_BR sort order seems odd to me. If this behavior is not a bug, I agree with
Keld's suggestion: define a new locale, like pt_BR@abnt, using the "right"
sort order.

Can the sample reorder statement handle lowercase and uppercase properly? The result
of a sort without the suggested change in the locale definition file can't:

LC_ALL=pt_BR LANG=pt_BR LANGUAGE=pt_BR sort a.txt
gabriela heleda de souza
GABRIELA HELEDA DE SOUZA
gabriela jacoby nos
GABRIELA JACOBY NOS
gábriela jacoby nos
GÁBRIELA JACOBY NOS
gabriel alcides klim perondi
GABRIEL ALCIDES KLIM PERONDI
gábriel alcides klim perondi
GÁBRIEL ALCIDES KLIM PERONDI
gabriela leticia batista nunes
GABRIELA LETICIA BATISTA NUNES
gabriel alexandre da silva manica
GABRIEL ALEXANDRE DA SILVA MANICA

The expected output:
gabriel alcides klim perondi
gábriel alcides klim perondi
gabriel alexandre da silva manica
gabriela heleda de souza
gabriela jacoby nos
gábriela jacoby nos
gabriela leticia batista nunes
GABRIEL ALCIDES KLIM PERONDI
GÁBRIEL ALCIDES KLIM PERONDI
GABRIEL ALEXANDRE DA SILVA MANICA
GABRIELA HELEDA DE SOUZA
GABRIELA JACOBY NOS
GÁBRIELA JACOBY NOS
GABRIELA LETICIA BATISTA NUNES

This is "tricky" because we don't just perform a lexicographical comparison of
each character (a Portuguese Java user will be happy to know that
String.compareTo is not enough to produce the sorted result that he expects, for
several reasons).
We first sort ignoring accents, then we use the accents as a
"tiebreaker/disambiguation criterion" (I don't know the correct term in English)
between otherwise equal full names. In the first step, a = á, but in the latter step, a < á.

Well, that is all I know. I will try to get a copy of the Norma NBR 6033:1989
(NB 106) from ABNT to confirm (or not :-)) these examples.

Thanks.

Revision history for this message
In , Email-daniel-h (email-daniel-h) wrote :

And I don't know if the Norma is "case sensitive" or "case insensitive".

Revision history for this message
In , Keld Simonsen (keld) wrote :

Subject: Re: sort order on pt_BR

On Sun, Jun 27, 2010 at 02:25:55PM -0000, email_daniel_h at yahoo dot com dot br wrote:
>
> ------- Additional Comments From email_daniel_h at yahoo dot com dot br 2010-06-27 14:25 -------
> And I don't know if the Norma is "case sensitive" or "case insensitive".

All the European language sorting standards I know of are case insensitive on the first
level, case only counts on the 3rd level. I expect this also to be true for
Portuguese. That is: most important distinction is base letter, second
is accent, third is case.

best regards
keld
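
The three levels described in this comment (base letter, then accent, then case) can be sketched as a tuple sort key in Python. This is a rough illustration rather than glibc's algorithm or the text of any standard. The name `three_level_key` is hypothetical, spaces are kept significant at the first level as requested in the original report, and placing lowercase before uppercase at the third level is an assumption made only for this example.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining marks removed (for example, á becomes a)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def three_level_key(text):
    """Hypothetical three-level key: base letters, then accents, then case."""
    composed = unicodedata.normalize("NFC", text)
    # Level 1: base letters only, case-insensitive, spaces significant.
    base = strip_accents(composed).lower()
    # Level 2: per-character accent flags (False sorts before True).
    accents = tuple(strip_accents(ch) != ch for ch in composed)
    # Level 3: per-character case flags (assumption: lowercase sorts first).
    case = tuple(ch.isupper() for ch in composed)
    return (base, accents, case)


names = [
    "GABRIELA JACOBY NOS",
    "gábriela jacoby nos",
    "gabriel alcides klim perondi",
    "GÁBRIELA JACOBY NOS",
    "gabriela jacoby nos",
    "GABRIEL ALCIDES KLIM PERONDI",
]

ordered = sorted(names, key=three_level_key)
print("\n".join(ordered))
```

Because case only matters at the last level, upper and lowercase variants of the same name end up next to each other instead of forming two separate blocks.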

Revision history for this message
In , Email-daniel-h (email-daniel-h) wrote :

For those interested in a workaround, for a CentOS 5.5 box (use at your own risk):

1. Copy the base locale definition file

cp /usr/share/i18n/locales/pt_BR pt_BR\@abnt\.src

2. Edit <email address hidden> and add

reorder-after <U00A0>
<U0020><CAP>;<CAP>;<CAP>;<U0020>
reorder-end

before END LC_COLLATE

3. Create new directories

mkdir /usr/lib/locale/pt_BR\@abnt
mkdir /usr/lib/locale/pt_BR\.utf8\@abnt

4. Compile the new locales

localedef --verbose -c -i pt_BR\@abnt.src -f ISO-8859-1 /usr/lib/locale/pt_BR\@abnt
localedef --verbose -c -i pt_BR\@abnt.src -f UTF-8 /usr/lib/locale/pt_BR\.utf8\@abnt

5. Check the new locales

locale -a | grep pt_BR

I don't know if this is the best way, but it is one way.

The directories may be different on other Linux distributions.

I think it would be better to create a new <email address hidden> containing a "copy"
statement for each section than to copy the whole source from
/usr/share/i18n/locales/pt_BR
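
One way to check the result of the steps above from a script is to sort a test list with the new locale's collation rules. The Python sketch below uses the standard `locale` module. The helper name `sort_with_locale` is hypothetical, and the locale name `pt_BR.utf8@abnt` is an assumption taken from the directory created in step 3, so it may need adjusting on other systems.

```python
import locale


def sort_with_locale(lines, locale_name):
    """Sort lines using locale_name's collation rules.

    Returns None if the locale cannot be selected on this machine.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strxfrm turns each string into a key that compares per LC_COLLATE.
        return sorted(lines, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


names = [
    "GABRIELA HELEDA DE SOUZA",
    "GABRIEL ALCIDES KLIM PERONDI",
    "GABRIELA LETICIA BATISTA NUNES",
    "GABRIELA JACOBY NOS",
    "GABRIEL ALEXANDRE DA SILVA MANICA",
]

# Assumed locale name, derived from the directory name used in step 3.
result = sort_with_locale(names, "pt_BR.utf8@abnt")
if result is None:
    print("locale pt_BR.utf8@abnt is not available here")
else:
    print("\n".join(result))
```

If the workaround took effect, the two GABRIEL entries should come out ahead of the three GABRIELA entries.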

Revision history for this message
In , Email-daniel-h (email-daniel-h) wrote :
Revision history for this message
rusivi2 (rusivi2-deactivatedaccount) wrote :

Thank you for reporting this bug.

Is this still an issue in Lucid?

Revision history for this message
Daniel Vainsencher (dvainsencher) wrote :

Yes.
The sort order has to be patched by anyone who knows about this issue and uses a database such as PostgreSQL, which relies on glibc for its sorting criteria.

Revision history for this message
In , Email-daniel-h (email-daniel-h) wrote :

Hi, everybody. I've got a copy of the Norma NBR 6033:1989 (NB 106). Can I send
it (in private) to the person who will fix this bug? The "catch": the document
is a PDF file made up of images, in Portuguese.

Thanks.

Revision history for this message
Daniel Vainsencher (dvainsencher) wrote :

I'm tired of applying the same patch to every PostgreSQL server I configure (it uses glibc for ordering).
If I buy this document and send it as a gift to someone who can decide on this implementation, can the fix be included in some release?

Revision history for this message
Everton Zanella Alvarenga (everton137) wrote :
Changed in glibc:
importance: Unknown → Medium
Changed in glibc (Suse):
importance: Unknown → Medium
status: Invalid → Unknown
Changed in glibc:
status: Incomplete → Confirmed
Changed in glibc (Ubuntu):
status: Incomplete → Confirmed
status: Confirmed → Incomplete
status: Incomplete → Confirmed
Revision history for this message
Eduardo (edurbs-gmail) wrote :

So... here we are, 14 years after my first post. Maybe when I die it will be patched.
What is missing to fix it?

Revision history for this message
Gunnar Hjalmarsson (gunnarhj) wrote :

@Eduardo: Better to ping on the upstream bug. That's where it hopefully will be fixed.

Revision history for this message
In , Eduardo (edurbs-gmail) wrote :

So... here we are, 14 years after my first post. Maybe when I die it will be patched.
What is missing to fix it?
