bash character ranges have unexpected behavior with certain locales

Bug #571958 reported by StephanBeal
This bug affects 2 people
Affects        Status  Importance  Assigned to  Milestone
bash (Ubuntu)  New     Undecided   Unassigned

Bug Description

i am going through the process of upgrading my Jolicloud-powered netbook to Ubuntu 10.04 netbook edition...

i unfortunately cannot paste the console session here, but here's what i did:

  mount /dev/sda5 /mnt (an ext3 filesystem, my old root directory)
  cd /mnt
  ls
(list of dirs from '/' of the older installation)

  mv home HOME
  rm -fr [a-z]*

It deleted not only [a-z]*, but [a-zA-Z]*, which is contrary to my 15+ years of Unix use and every piece of documentation regarding shell patterns and case-sensitive filesystems like this one (ext3). So it just wiped out my old home directory entirely, along with 30GB of data which i will now have to re-fish out of various SVN repos and Dropbox. (The point of the 'mv' was to keep me from having to re-sync the 15GB of Dropbox data.)

To reproduce the problem:

1) Boot from the live edition (i don't yet know if it affects the "normal" install, and if it does, millions of shell scripts around the world are going to start hosing data).

2) Open a terminal
3) Type:

  touch XYZ
  rm [a-z]*

that removes XYZ along with everything else, which is just plain wrong.

>:-(

Tags: dataloss
Revision history for this message
StephanBeal (sgbeal) wrote :

The same thing here:

    touch XYZ
    ls [a-z]*

includes XYZ in the listing, which implies that some Unix-noob a&&hole thoughtlessly enabled a shell option which makes bash do case-insensitive wildcard expansion. (Again, contrary to every other Unix system/configuration i've ever worked on.)

>:-(

StephanBeal (sgbeal) wrote :

i can reproduce this on 9.10, but can find no option (e.g. in /etc/profile or /etc/bash*) which mucks with the wildcard expansion.

root@jareth:/home/root# touch XYZ
root@jareth:/home/root# ls [a-z]*
XYZ

i have often relied on case-sensitive wildcard matching over the years, and am just appalled at this behaviour :(. (And still sick to my stomach from the first 'rm'.)

Here is, for comparison, behaviour from one of the sourceforge shell servers:

-bash-3.2$ ls
R bin libfunutil mycrontab qub_icon.gif sf_bookmarks.html toc
ape cm libfunutil-cvsroot.tar.gz qub s11n tmp userweb
-bash-3.2$ ls -d [A-Z]*
R
-bash-3.2$ ls -d [a-z]*
ape cm libfunutil-cvsroot.tar.gz qub s11n tmp userweb
bin libfunutil mycrontab qub_icon.gif sf_bookmarks.html toc
-bash-3.2$ uname -a
Linux shell-24007 2.6.18-164.2.1.el5.028stab066.10 #1 SMP Sat Dec 12 18:52:53 MSK 2009 x86_64 x86_64 x86_64 GNU/Linux
-bash-3.2$ cat /etc/issue
CentOS release 5.4 (Final)
Kernel \r on an \m

Gabe Gorelick (gabegorelick) wrote :

I can confirm this behavior even with the nocaseglob option unset. It's not an rm issue, but a bash globbing one.
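
A minimal sketch of why rm is not at fault: the shell expands the pattern before the command runs, so any command receives the same file list. The scratch directory, the file names, and the `count_args` helper are made up for illustration, and the C locale is forced so the result does not depend on the machine's settings.

```shell
# Hypothetical helper: report how many arguments it was handed.
count_args() {
    printf '%s\n' "$#"
}

scratch=$(mktemp -d)    # throwaway directory, nothing real is touched
touch "$scratch/abc" "$scratch/abd" "$scratch/XYZ"

# The command substitution runs in a subshell, so the cd and the locale
# change stay away from the caller. The shell expands the glob before
# count_args is invoked.
expanded=$(
    cd "$scratch" || exit 1
    LC_ALL=C
    export LC_ALL
    count_args [a-z]*
)
printf 'arguments after expansion: %s\n' "$expanded"
rm -rf "$scratch"
```

Swapping `count_args` for `ls` or `rm` would hand those commands exactly the same expanded list.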

affects: ubuntu → bash (Ubuntu)
Changed in bash (Ubuntu):
status: New → Confirmed
summary: - DATA LOSS: "rm" command working case-insensitively
+ bash always globs case-insensitively
Gabe Gorelick (gabegorelick) wrote : Re: bash always globs case-insensitively

This link [1] provides some good info on this. A quote from the bash manual found at that link explains the issue:

"Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, inclusive, using the locale's collating sequence and character set. For example, in the default C locale, "[a-d]" is equivalent to "[abcd]". Many locales sort characters in dictionary order, and in these locales "[a-d]" is typically not equivalent to "[abcd]"; it might be equivalent to "[aBbCcDd]", for example. To obtain the traditional interpretation of bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value "C""

So if you export LC_ALL=C, then the globbing should work. Can you confirm that this solves the problem for you? I just checked and it works for me.
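
As a rough sketch of that workaround, the C locale can be limited to the one place where the range is expanded instead of being exported for the whole session. The `list_lowercase` helper and the scratch files are hypothetical names used only for this illustration.

```shell
# Hypothetical helper: list names in directory $1 that start with a
# lowercase ASCII letter. The function body is a subshell, so neither
# the cd nor the locale change leaks into the caller.
list_lowercase() (
    cd "$1" || exit 1
    LC_ALL=C
    export LC_ALL
    for f in [a-z]*; do
        # An unmatched pattern stays literal, so check the name exists.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
)

demo_dir=$(mktemp -d)
touch "$demo_dir/XYZ" "$demo_dir/abc"
found=$(list_lowercase "$demo_dir")
printf 'matched: %s\n' "$found"
rm -rf "$demo_dir"
```

With the C locale in force, only `abc` is matched and `XYZ` is left alone.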

As a footnote, apparently the Single Unix Specification version 3 advises against using range expressions, so they really shouldn't be used if you can avoid it since they run into problems like this with locales.

[1] http://bugs.centos.org/view.php?id=1511
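
One commonly suggested way to avoid range expressions altogether is a POSIX character class such as `[[:lower:]]`, which is defined by character type and not by collation order. The sketch below is only an illustration; `starts_lower` is a hypothetical helper, and in a non-C locale the class may also match non-ASCII lowercase letters.

```shell
# Hypothetical helper: succeed if the name begins with a lowercase letter.
starts_lower() {
    case $1 in
        [[:lower:]]*) return 0 ;;
        *) return 1 ;;
    esac
}

for name in abc XYZ; do
    if starts_lower "$name"; then
        printf '%s: starts with a lowercase letter\n' "$name"
    else
        printf '%s: does not start with a lowercase letter\n' "$name"
    fi
done
```

The same class works in a glob, for example `ls -d [[:lower:]]*`.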

Changed in bash (Ubuntu):
status: Confirmed → Incomplete
StephanBeal (sgbeal) wrote :

Thanks for the info (and for the bug renaming - i fired it off before experimenting enough to determine the culprit), but i still consider this to be a bug in the default Ubuntu configuration. Scripts copied from other platforms (even other Linuxes) are ticking time-bombs.

i'm using the default LC config, and my system is set up for US/English, so it should (as far as i'm aware) be behaving like the C locale as far as collation is concerned.

stephan@jareth:~$ env | grep LC
stephan@jareth:~$

If i explicitly set it to 'C' then it works as it should:

stephan@jareth:~$ export LC_ALL=C
stephan@jareth:~$ ls -d [A-Z]*
Desktop Public Documents
...
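
An empty `env | grep LC` does not by itself mean the C locale is in effect: when no LC_* variable is set, the locale falls back to LANG. The following is a rough sketch of the POSIX lookup order for collation. `effective_collate` is a hypothetical helper, and the final fallback to "C" is an assumption, since POSIX leaves the default implementation-defined.

```shell
# Hypothetical helper: print the locale name that governs collation,
# following the POSIX precedence LC_ALL, then LC_COLLATE, then LANG.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Assumption: the implementation default behaves like "C".
        printf 'C\n'
    fi
}

printf 'collation locale in effect: %s\n' "$(effective_collate)"
```

Running `env | grep -E '^(LANG|LC_)'` instead of `env | grep LC` would also show a LANG setting.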

Regardless of what SUSv3 specifies, i have never worked on a Solaris/Linux/BSD system (or programming language) which had case-insensitive character ranges. This just bit me in the ass big-time.

i have verified that it happens on my 9.10 box as well, and it's a miracle i haven't nuked huge amounts of data already, as i _instinctively_ rely on all Unix shell operations being case-sensitive (and very often make use of that).

Gabe Gorelick (gabegorelick) wrote :

Well, the character ranges actually aren't case-insensitive; they're just ordered in a way that often makes it appear so. What's going on is that for some locales, the character set is ordered aAbB..., while programmers instinctively expect the ASCII order ABC...abc. The fact that this issue existed on 9.10 means that it actually isn't such a big issue. If shell scripts relied on the flawed character ranges, then we definitely would have seen things like data loss. The fact that that's never been a big issue means that
1. Fewer shell scripts than you think are actually using character ranges, since they are basically deprecated
and/or
2. Shell scripts are setting export LC_ALL=C if they really need this kind of character range globbing.

If you absolutely need the expected character ranges, 2 seems like a perfectly good solution. I don't see anything bash could do differently, can you?

StephanBeal (sgbeal) wrote :

i agree fundamentally with [a-z] == [a-zA-Z] for certain locales (can't argue with the specs!), but i will argue that no other Linux system i've ever been on does this _in the default configuration_, and that's what i'm whining about. i've been using Unix daily since the early 90's, and i've never been bitten by this before.

Sorry about the whining, but i'm still sick to my stomach over this, and will have to wait another 16 hours before i can use my netbook (that's about how long it will take to sync my files over the local WLAN, given the current xfer speed).

In summary: i understand the reason it happened, i understand the workaround (i've already edited the /etc/profile and ~/.bashrc on all of my user accounts), but i still (objectively) feel that this is a DATA LOSS bug in the default Ubuntu configuration.

That nobody else has reported it yet does indicate that it either hasn't bitten anyone (badly enough) yet, or it bit them in such a way as to render them unable to report the bug (e.g. by removing the file where they've written down the passwords to their email and Launchpad accounts), but i don't feel that this bug deserves any less attention because of that.

"The Unix command-line is case-sensitive" is a maxim of everyday Unix practice, and the current configuration is in blatant violation of that long-standing convention.

But in any case, i very much appreciate your extremely fast response and the workaround. i can't wait to try this on the RedHat and Solaris systems at work on Monday.

:)

Gabe Gorelick (gabegorelick) wrote :

Yeah, let us know if other systems handle this better. Thanks.

summary: - bash always globs case-insensitively
+ bash character ranges have unexpected behavior with certain locales
Changed in bash (Ubuntu):
status: Incomplete → New
Gabe Gorelick (gabegorelick) wrote :

> "The Unix command-line is case-sensitive" is a maxim of everyday
> Unix practice, and the current configuration is in blatant violation
> of that long-standing convention.

I understand what you're saying, but it's not really in violation of that principle. bash is still doing everything case-sensitively, and everything besides ranges works as expected. The only principle it's violating is that the character encoding is ordered ABC...abc, like it is in ASCII. This is tempting to believe, but is obviously a false principle as evidenced by this bug, and people should be educated about that.
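
The ordering point can be illustrated with `sort`, which uses the same collating sequence that ranges depend on. This is only a sketch: it forces the C locale so the result is predictable, and the variable name is made up.

```shell
# Sort four letters by byte value (C locale): every uppercase letter
# comes before every lowercase one, as in ASCII.
sorted_c=$(printf '%s\n' b A a B | LC_ALL=C sort | tr -d '\n')
printf 'C locale order: %s\n' "$sorted_c"
```

In the C locale this prints `ABab`. Under a dictionary-ordered locale the same letters may come out interleaved by case, which is why a range like `[a-z]` can take in uppercase letters there.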

StephanBeal (sgbeal) wrote :

Follow-up, comparing two other systems:

RHEL 5.5 with bash 3.2.25 has the same behaviour as Ubuntu: if LC_ALL is NOT set then it defaults to locale settings where [a-z] is case-insensitive.

On Solaris 10 with bash 3.00.16 this is NOT the behaviour. It behaves as if LC_ALL=C, i.e. case-sensitive.

The problem appears, at least on the surface, to be a change in the default behaviour in bash somewhere along the way. i do not have access to try different bash versions on those systems, so i cannot test this hypothesis.
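
For comparing systems like these, a small probe along the following lines could be run on each machine. It is a sketch under assumptions: `probe_range` and its two result labels are hypothetical, and it relies on the shell leaving an unmatched pattern unexpanded, which is the default.

```shell
# Hypothetical probe: does [a-z]* match a file named XYZ under the
# locale currently in effect? Runs in a subshell and a scratch directory.
probe_range() (
    dir=$(mktemp -d) || exit 2
    cd "$dir" || exit 2
    : > XYZ
    set -- [a-z]*    # stays literal if nothing matches
    if [ "$1" = "XYZ" ]; then
        result=locale-ordered
    else
        result=ascii-ordered
    fi
    cd / && rm -rf "$dir"
    printf '%s\n' "$result"
)

printf 'bash %s: %s\n' "${BASH_VERSION:-unknown}" "$(probe_range)"
```

Collecting that one line from each system, together with the locale settings, would make the version comparison easier to follow.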

Andrew McCarthy (andrewmccarthy) wrote :

This looks to be the same as bug #120687. Should we mark this as a duplicate?

Gabe Gorelick (gabegorelick) wrote :

Yep, this is a duplicate of that bug. Thanks for finding that.
