Caseless collate sequence in en_GB.UTF8

Reported by Dirk on 2007-06-16
This bug affects 7 people
Affects: bash (Ubuntu)
Importance: Low
Assigned to: Unassigned

Bug Description

Do this in a gash (temporary) directory:

touch A a B b
ls [A-Z]*

you get:-

A b B

What most people (especially unix users with >25 years experience) expect is:-

A B

I found out about this by accident yesterday by doing "rm [A-Z]*" in a directory, expecting only files with an initial uppercase letter to be removed. You can imagine my surprise when every file (except those starting with 'a') was removed. Fortunately most of the files were either redundant or backed up, but it still caused me a completely unnecessary hour's work to restore the damage.

Obviously the collating sequence is aAbBcCdD... but that really does *not* make it right. Other linux distros do not have this problem, but then they seem to set:

export LC_COLLATE=C

as standard, which is missing in a standard ubuntu installation (6.06lts -> 7.04)

That *is* the workaround, but I would respectfully suggest that you set it as standard before someone destroys something irreplaceable!
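
A minimal sketch of the workaround described above, assuming a POSIX-style shell and the `mktemp` utility; the variable and directory names are illustrative only. It forces the C collation order for a single expansion in a throwaway directory, so the range is evaluated in ASCII order regardless of the session's locale:

```shell
# Work in a throwaway directory so no real files are involved.
demo_dir=$(mktemp -d)

# Run the expansion in a subshell with the C locale forced.
# LC_ALL is used here because it overrides LC_COLLATE if both are set.
c_matches=$(
    cd "$demo_dir" || exit 1
    touch A a B b
    LC_ALL=C
    export LC_ALL
    echo [A-Z]*
)

echo "$c_matches"
rm -r "$demo_dir"
```

With the C order in effect the expansion should list only `A B`. Without it, the result depends on the locale and on the shell in use.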

Dirk (ubuntu-tobit) wrote :

What most people (especially unix users with >25 years experience) is:

should read:

What most people (especially unix users with >25 years experience) expect is:

Ralph Janke (txwikinger) wrote :

I can confirm this behaviour on dapper as well as feisty

Ralph Janke (txwikinger) on 2007-12-13
Changed in bash:
importance: Undecided → Low
description: updated
Matthias Klose (doko) wrote :

This suggestion may make sense, but bash is not the correct place to do this; set it in /etc/environment.
If we want to support this for fresh installations, we have to change this in the installers as well.

Colin Watson (cjwatson) wrote :

Personally I like LC_COLLATE=C and set it everywhere, and I can see why people want to set this in the installer. However, when this has come up in the past, the problem has been that GUI users object; with some justification, they want the sort order in GUI applications to match that defined for their language. en_GB users largely don't care, but in other languages it makes much more of a difference.

Is there a good way to set LC_COLLATE=C for shell users but not for GUI programs other than putting it in one's shell configuration? I don't think so at present.

Gert Kulyk (gkulyk) wrote :

@Dirk: If you look at the /usr/share/doc/bash/COMPAT.gz file, §13 states:
[...]
The portable way to specify upper case letters is [:upper:] instead of A-Z; lower case may be specified as [:lower:] instead of a-z.
[...]

The default /bin/sh in Ubuntu is dash, which seems to behave like older versions of bash (other shells, e.g. zsh, should behave like bash now): it ignores the LC_COLLATE environment variable. As a result, shell scripts that use [A-Z] or the like do not destroy anything, provided they are run by /bin/sh and not /bin/bash, of course.

I do not like the idea of changing LC_COLLATE, especially for non-English environments.
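
The character classes mentioned in the COMPAT excerpt can be tried out as follows. This is a sketch under the assumption of a POSIX-style shell with `mktemp` available; the file names mirror the original report and the variable names are made up for the example.

```shell
demo_dir=$(mktemp -d)

upper_matches=$(
    cd "$demo_dir" || exit 1
    touch A a B b
    # [[:upper:]] is a character class, not a range, so it does not
    # depend on where upper-case letters fall in the collation order.
    echo [[:upper:]]*
)

lower_matches=$(
    cd "$demo_dir" || exit 1
    echo [[:lower:]]*
)

echo "upper: $upper_matches"
echo "lower: $lower_matches"
rm -r "$demo_dir"
```

Because the classes are resolved through the locale's character classification rather than its sort order, this form keeps working when LC_COLLATE is left at the language default.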

Mika Fischer (zoop) wrote :

Maybe this should be discussed on the devel mailing list, or someone could start a spec?

In any case bash is not the right package for this bug and I don't know what is...

Rolf Leggewie (r0lf) wrote :

There hasn't been any activity in this ticket for a while. Is this still a problem in Jaunty or Karmic?

Changed in bash (Ubuntu):
status: Confirmed → Incomplete
era (era) wrote :

Reproducible in Jaunty and Karmic alpha 3

Changed in bash (Ubuntu):
status: Incomplete → Confirmed
lieven moors (lievenmoors) wrote :

I think it is an important feature of a shell to be able to
specify character ranges, and distinguish between upper
and lower case while doing this.
Can somebody explain to me why 'export LC_LOCALE=C' is
not set by default in .bashrc ?
I also want to confirm this bug is still present in Karmic,
and I almost had the same accident the original bug reporter
experienced (removing important directories).
I was lucky to have double-checked the shell expansion before doing it.

lieven moors (lievenmoors) wrote :

sorry, LC_LOCALE should read LC_COLLATE

era (era) wrote :

In reply to comment #4: what's wrong with setting it in the shell's configuration? I don't believe the GUI reads your .bashrc so you could set it in /etc/skel/.bashrc
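
A sketch of what such a shell-configuration entry might look like, assuming it is placed in `~/.bashrc` (or in `/etc/skel/.bashrc` for newly created accounts). It also assumes that `LC_ALL` is not set elsewhere: if it is, it takes precedence over `LC_COLLATE` and the export has no effect, which is why the fragment prints a warning in that case.

```shell
# Use the traditional C collation order in shells that read this file,
# so that ranges such as [A-Z] are evaluated in ASCII order.
export LC_COLLATE=C

# LC_ALL overrides every other LC_* variable; warn if it would mask the
# setting above.
if [ -n "${LC_ALL:-}" ] && [ "$LC_ALL" != "C" ] && [ "$LC_ALL" != "POSIX" ]; then
    echo "warning: LC_ALL=$LC_ALL overrides LC_COLLATE" >&2
fi
```

Programs started from such a shell inherit the variable, so this separates terminal sessions from desktop sessions rather than shell programs from GUI programs as such.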

Gabe Gorelick (gabegorelick) wrote :

Still an issue on Lucid.

StephanBeal (sgbeal) wrote :

@#5: yes, the (buried) documentation states that the "preferred" way is to not assume that char ranges are case-sensitive (apparently SUSv3 also recommends this), but the fact remains that Unix users have, for 30+ years, been relying on case-insensitive ranges. Bash behaves differently on some systems when LC_ALL is _unset_. Some systems i've tested (RHEL + Ubuntu) treat an _unset_ LC_ALL as a case-insensitive locale, whereas others (e.g. Solaris and some Linuxes) treat it equivalently to LC_ALL=C. The latter behaviour is "historically correct."

i was hit by this in the same manner as the original reporter, nuking 30GB of home directories when i did:

  mv home HOME
  rm -fr [a-z]*

before cleaning up a drive for a new Ubuntu installation (full details are in #571958).
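
One defensive habit, sketched here under the assumption of a POSIX-style shell with `mktemp`, is to print what a pattern expands to before handing it to `rm`. The directory layout below is made up for illustration.

```shell
demo_dir=$(mktemp -d)

to_delete=$(
    cd "$demo_dir" || exit 1
    mkdir HOME old-stuff
    # Preview the expansion, one name per line, instead of running rm.
    # [[:lower:]] is used rather than [a-z] so that HOME cannot match.
    printf '%s\n' [[:lower:]]*
)

echo "would delete: $to_delete"
rm -r "$demo_dir"
```

Only once the printed list looks right would the same pattern be passed to the destructive command.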

StephanBeal (sgbeal) wrote :

@#13: Correction:

"been relying on case-insensitive ranges"
==
"been relying on case-SENSITIVE ranges"

StephanBeal (sgbeal) wrote :

@#11: /etc/skel/.bashrc is only useful for new installations. i've been toting around this same home directory for over 10 years, and have a .bashrc i have lovingly maintained throughout that time. This particular "bug"/behaviour nuked that lovingly-maintained home directory (along with its .bashrc), and i had to pull several tens of GB from backups to recover.

era (era) wrote :

#15: Obviously, there isn't much Ubuntu can do to help people who do not use the OS-supplied startup scripts anyway.

Personally, I routinely ditch the default .bashrc on new installations and replace it with a one-liner which will take me through future upgrades:

. /etc/skel/.bashrc

It tends to grow more additions over time, of course, but this at least should provide a healthy future-proof baseline. It would be nice if something like this was the default, but that's a separate (wishlist) bug #194108

Roberto Gordo Saez (rgs) wrote :

It is a problem for me too, and it is worse when using the es_ES locale, because we can't use LC_COLLATE=C: it is very important to match accented characters in file and directory names, which LC_COLLATE=C does not do. This is ridiculous and counterintuitive:

touch A B
ls [a-b]*
A

This would be fixed if the collation order kept upper case letters and lower case letters separate, as LC_COLLATE=C does.

era (era) wrote :

So should libc6 be updated to provide a non-standard case-sensitive collate sequence for each available locale? I think that goes outside the scope of what Ubuntu can and should do, but seems like the ultimate solution, if upstream can be persuaded.

Roberto Gordo Saez (rgs) wrote :

Of course it should go upstream, but I don't understand why it is outside the scope of Ubuntu to fix a problem for its users. Ubuntu chose Unity, a non-standard desktop environment, and refuses to choose a "non-standard" collate sequence (which actually provides more "standard" behaviour for many of its users than the upstream choice). I certainly can't understand the reasoning behind that.

sordna (sordna) wrote :

Looks like this affects programs such as GNU grep and egrep ... note I'm using quotes around the A-Z character class to avoid any shell interference:
$ echo hello | grep '[A-Z]'
hello

The above behavior is COMPLETELY WRONG AND UNACCEPTABLE. I am utterly shocked that I have to change my default environment to do a simple task such as identifying upper case characters with grep. Note I'm using en_US.UTF-8 (not en_GB).

LC_COLLATE should default to C under all circumstances, unless the user explicitly wants grep and other programs to behave in a totally weird and unexpected way. Even better, perhaps libc should treat an undefined LC_COLLATE same as being C.

Either way, regular expressions should be honored in a sane linux / unix operating system. Users should not have to jump through hoops to make a freshly installed system behave in a normal, unsurprising way.
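
For grep, the same two options discussed in this thread apply: a character class, or the C locale for that one command. A small sketch, assuming a POSIX grep (`-c` prints the number of matching lines); the variable names are illustrative.

```shell
# Count lines containing an upper-case letter using a character class.
class_count=$(printf 'hello\nHello\n' | grep -c '[[:upper:]]')

# Same test with a range, forcing the C locale for this one command only.
range_count=$(printf 'hello\nHello\n' | LC_ALL=C grep -c '[A-Z]')

echo "class: $class_count, range under C: $range_count"
```

Both counts should be 1 here, since only the line `Hello` contains an upper-case letter.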

Raphaël Droz (raphael-droz) wrote :

Any sane "user-friendly" distribution must default to LC_COLLATE=C for the terminal use.
I already lost unrecoverable data like in #571958, I now export LC_COLLATE=C in .bashrc but
I'm not perverse enough to imagine it's an obligatory step for terminal users.
(LFS users probably know about collations and read the man 1 bash a long time ago)

About GUI:
the LC_COLLATE is a shell configuration variable.
GUI can find something else, metacity may offer an option like "respect LC_COLLATE to sort files".

utf-8 LC_COLLATE is definitely far too counter-intuitive and risky, please fix, at least, /etc/skel/.bashrc

UTF-8 collating has Upper case sorted before lower case.
from: http://unicode.org/reports/tr10/#Case_Comparisons

6.6 Case Comparisons

In some languages, it is common to sort lowercase before uppercase; in other languages this is reversed. Often this is more dependent on the individual concerned, and is not standard across a single language. It is strongly recommended that implementations provide parameterization that allows uppercase to be sorted before lowercase, and provides information as to the standard (if any) for particular countries. This can easily be done to the Default Unicode Collation Element Table before tailoring by remapping the L3 weights (see Section 7, Weight Derivation). It can be done after tailoring by finding the case pairs and swapping the collation elements.

----

Anyone not following the above should probably not claim Unicode compatibility.
