ascii is a bad default filesystem encoding

Bug #794353 reported by Martin Pool
This bug affects 8 people
Affects: Bazaar
Status: Fix Released
Importance: High
Assigned to: Martin Packman

Bug Description

bzr's architectural approach is to decode filenames to unicode when they come in from the filesystem. On Unix, to do this, we need to know what encoding is used, since the OS API only works in byte strings.

It seems that often (always?) if no locale is set, we default to trying to decode them in ascii, and fail if the names are not ascii.

Modern Unix machines strongly encourage using UTF-8 and that would be a more reasonable default. We could also provide a way to configure it.
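To make the failure concrete, here is a minimal sketch (not bzr's actual code) of what happens when a UTF-8 filename arrives as bytes and is decoded as ascii. The helper name `decode_filename` is made up for illustration.

```python
def decode_filename(raw, encoding):
    """Decode a byte-string filename; return None if it is invalid in `encoding`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    # "café.txt" as the OS would hand it over on a UTF-8 system.
    raw_name = b"caf\xc3\xa9.txt"
    # ascii() keeps the output printable even on an ascii-only terminal.
    print(ascii(decode_filename(raw_name, "ascii")))
    print(ascii(decode_filename(raw_name, "utf-8")))
```

With ascii the non-ascii bytes cannot be decoded at all, while the same bytes decode cleanly as UTF-8.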

This is distinct from bug 63324, which says that if there are names that really are invalid in the encoding (even if the encoding's set properly) bzr can't represent them.

This might be complicated to implement if Python assumes the fsencoding is set only once at startup, but it's probably still possible.

Tags: unicode

Revision history for this message
Per Johansson (per.j) wrote :

As noted on unicode.org (sorry, I don't have time to look it up), it's also very unlikely that text can be interpreted as valid UTF-8 if it isn't in fact UTF-8. That makes it a somewhat safe default.
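A rough sketch of that property, assuming a simple strict decode is an acceptable validity check. The helper name `is_valid_utf8` is hypothetical.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    utf8_name = "test_file\u021b.txt".encode("utf-8")
    latin1_name = "caf\u00e9.txt".encode("latin-1")  # b'caf\xe9.txt'
    print(is_valid_utf8(utf8_name))    # genuine UTF-8 passes
    print(is_valid_utf8(latin1_name))  # a lone 0xE9 is not a valid UTF-8 sequence
```

This is only a heuristic: pure ascii names always pass, and a few byte sequences are valid in both UTF-8 and a legacy encoding, so a successful decode suggests UTF-8 but does not prove it.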

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Related, it would be nice if sys.getfilesystemencoding() would return 'utf8' on Ubuntu (isn't the policy that it has to be utf8?) rather than depending on some environment variables as appears to be the case at the moment.

Revision history for this message
Barry Warsaw (barry) wrote :

I'd be rather worried about deviating either from Debian or upstream here. What if someone runs bzr against a Python installed from source? You'd still have to work around it.

Revision history for this message
Martin Pool (mbp) wrote : Re: [Bug 794353] Re: ascii is a bad default filesystem encoding

On 18 October 2011 06:07, Jelmer Vernooij <email address hidden> wrote:
> Related, it would be nice if sys.getfilesystemencoding() would return
> 'utf8' on Ubuntu (isn't the policy that it has to be utf8?) rather than
> depending on some environment variables as appears to be the case at the
> moment.

In particular it depends on $LANG, and in the reasonably common case
of LANG=C or LANG='' you get ANSI_X3.4-1968 (ie ascii.)

As Barry says, Ubuntu doesn't want to deviate from upstream. Upstream
Python always trusts nl_langinfo(CODESET) which gives this result.
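For anyone who wants to check this on their own machine, a small sketch follows. It assumes a Unix-like platform where `locale.nl_langinfo` is available; the helper names `codeset_under` and `is_ascii_codec` are made up here, and the exact string returned depends on the C library.

```python
import codecs
import locale


def codeset_under(locale_name):
    """Return nl_langinfo(CODESET) with LC_CTYPE temporarily set to locale_name."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
        return locale.nl_langinfo(locale.CODESET)
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # always restore it


def is_ascii_codec(name):
    """True if `name` is any alias that Python resolves to its ascii codec."""
    try:
        return codecs.lookup(name).name == "ascii"
    except LookupError:
        return False


if __name__ == "__main__":
    name = codeset_under("C")
    print(name, is_ascii_codec(name))
```

Going through `codecs.lookup` rather than comparing strings matters because the ascii codec shows up under several names, ANSI_X3.4-1968 being one of them.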

This is a little hard to work around from bzr because the filesystem
encoding is held in a C variable and accessed directly from some
internal C routines. But it may be possible.

It seems like our options include:
 - give a clear error telling the user to use a unicode locale
 - possibly file a Python bug complaining there's no way to override
the fs encoding
 - overwrite it from a C extension
 - use the bytes filename apis to avoid relying on Python's own encoding
 - in any cases where the previous point won't work, perhaps even call
directly to OS APIs
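As an illustration of the "bytes filename apis" option, here is a sketch in Python 3 syntax (an assumption; it is not what bzr does). It lists a directory through the bytes API and decodes the names with an encoding chosen by the caller. `list_decoded` is a hypothetical helper.

```python
import os
import tempfile


def list_decoded(directory, encoding="utf-8"):
    """List `directory` via the bytes API and decode each name ourselves.

    Names that are not valid in `encoding` are returned as raw bytes so the
    caller can decide what to do with them.
    """
    results = []
    # Passing bytes to os.listdir makes it return bytes names untouched.
    for raw in os.listdir(os.fsencode(directory)):
        try:
            results.append(raw.decode(encoding))
        except UnicodeDecodeError:
            results.append(raw)
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw_path = os.path.join(os.fsencode(tmp), b"test_file\xc8\x9b.txt")
        with open(raw_path, "wb"):
            pass  # create an empty file whose name is UTF-8 bytes
        print([ascii(name) for name in list_decoded(tmp)])
```

The point of the design is that the decoding step no longer depends on whatever Python picked up from the locale at startup.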

m

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Since there is no set filesystem encoding on Linux, it's always going to be a guessing game.

I can understand not wanting to deviate from upstream or Debian in this case, though I do wonder if UTF-8 would also be a more appropriate guess for upstream. I suspect UTF-8 will be more often right than whatever the terminal encoding happens to be set to.

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

On 10/18/2011 01:53 AM, Martin Pool wrote:
> On 18 October 2011 06:07, Jelmer Vernooij <email address hidden> wrote:
>> Related, it would be nice if sys.getfilesystemencoding() would return
>> 'utf8' on Ubuntu (isn't the policy that it has to be utf8?) rather than
>> depending on some environment variables as appears to be the case at the
>> moment.
>
> In particular it depends on $LANG, and in the reasonably common case
> of LANG=C or LANG='' you get ANSI_X3.4-1968 (ie ascii.)
>
> As Barry says, Ubuntu doesn't want to deviate from upstream. Upstream
> Python always trusts nl_langinfo(CODESET) which gives this result.
>
> This is a little hard to work around from bzr because the filesystem
> encoding is held in a C variable and accessed directly from some
> internal C routines. But it may be possible.
>
> It seems like our options include:
> - give a clear error telling the user to use a unicode locale
I think this alone would be a big step forward; giving a nice error
message explaining to people what to do isn't as good as Doing The Right
Thing, but it is a lot better than printing an incomprehensible traceback.
>
> - possibly file a Python bug complaining there's no way to override
> the fs encoding
That seems like a good idea, independent of what bzr ends up doing.

Cheers,

Jelmer

Revision history for this message
Martin Pool (mbp) wrote :

bug 884327 is another instance of this.

Revision history for this message
Adi Roiban (adiroiban) wrote :

The shell locale is C

$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=C

----

I don't know how you can find out the encoding of the filesystem.
As far as I know, for many filesystems it doesn't matter what encoding is in use.

The current OS has no UTF-8 locale installed.
The filenames are stored in UTF-8 and I am able to perform various operations on those files with a C locale shell.

$ ls
test_fileț.txt

-----

Maybe trying C first and then falling back to UTF-8 if that fails would help, but a way for users to control and configure the filesystem encoding is desirable.
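A sketch of what I have in mind, with a user override taking priority over the guessing. The environment variable name `EXAMPLE_FS_ENCODING` and both function names are invented for this example; none of them exist in bzr.

```python
import os


def candidate_encodings(environ=None, variable="EXAMPLE_FS_ENCODING"):
    """Return the encodings to try, honouring a (hypothetical) user override."""
    environ = os.environ if environ is None else environ
    override = environ.get(variable)
    if override:
        return [override]
    return ["ascii", "utf-8"]  # C locale first, then UTF-8


def decode_with_fallback(raw, encodings):
    """Try each encoding in order; return (text, encoding_that_worked)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode %r" % (raw,))


if __name__ == "__main__":
    text, used = decode_with_fallback(b"test_file\xc8\x9b.txt", candidate_encodings({}))
    print(ascii(text), used)
```

Since ascii is a subset of UTF-8, the two-step chain produces the same text as decoding with UTF-8 alone; the only extra information is which encoding matched.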

-----

After I removed that file from my bzr branch, bzr is working again, but the bzr status is not displaying the right filename.

$ rm test_data/users_files/test_user/test_file\310\233.txt
$ bzr st
removed:
  test_data/users_files/test_user/test_file?.txt

Revision history for this message
Martin Pool (mbp) wrote :

On 1 November 2011 20:01, Adi Roiban <email address hidden> wrote:

> Maybe trying C and if failed then try UTF-8 would help, but a way for
> users to control and configure the filesystem encoding is desirable.

That's the point of this bug.

> After I removed that file from my bzr branch, bzr is working again, but
> the bzr status is not displaying the right filename.
>
> $ rm test_data/users_files/test_user/test_file\310\233.txt
> $ bzr st
> removed:
>  test_data/users_files/test_user/test_file?.txt

If you're using a C (ascii) locale, bzr will only write ascii to
stdout, so it is replacing the accented character with '?'.
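The effect can be reproduced with plain Python string encoding. This is a sketch of the behaviour described above, not bzr's output code, and `for_terminal` is a hypothetical name.

```python
def for_terminal(name, terminal_encoding):
    """Return `name` as it would appear on a terminal limited to terminal_encoding.

    Characters the encoding cannot represent are replaced with '?'.
    """
    return name.encode(terminal_encoding, "replace").decode(terminal_encoding)


if __name__ == "__main__":
    name = "test_file\u021b.txt"
    print(for_terminal(name, "ascii"))  # test_file?.txt
```

With a UTF-8 terminal encoding the same call returns the name unchanged.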

Martin Packman (gz)
Changed in bzr:
assignee: nobody → Martin Packman (gz)
status: Confirmed → In Progress

Martin Packman (gz)
Changed in bzr:
milestone: none → 2.5b5
status: In Progress → Fix Released

Revision history for this message
Benjamin Peterson (benjaminp) wrote :

Will you still file a Python bug report?

Revision history for this message
Martin Packman (gz) wrote :

I've filed <http://bugs.python.org/issue13643> for Python 3 and attached a patch along similar lines to the change made for bzr here.
