Bazaar

ascii is a bad default filesystem encoding

Bug #794353 reported by Martin Pool on 2011-06-08

This bug affects 8 people

Affects   Status         Importance   Assigned to      Milestone
Bazaar    Fix Released   High         Martin Packman   Bazaar 2.5b5

Bug Description

bzr's architectural approach is to decode filenames to unicode when they come in from the filesystem. On Unix, to do this, we need to know what encoding is used, since the OS API only works in byte strings.

It seems that often (always?) if no locale is set, we default to trying to decode them in ascii, and fail if the names are not ascii.

Modern Unix machines strongly encourage using UTF-8 and that would be a more reasonable default. We could also provide a way to configure it.
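
A minimal sketch of what such a default could look like, written in Python 3 for illustration (bzr itself targeted Python 2 at the time). The function name `choose_fs_encoding` and the `override` parameter are hypothetical, not bzr's actual API, and this is not necessarily how the eventual fix works: it only shows the idea of treating a reported ASCII encoding as "nothing useful was configured" and falling back to UTF-8, while still letting a user setting win.

```python
import codecs


def choose_fs_encoding(reported, override=None):
    """Pick the encoding used to decode filenames.

    reported: the encoding the platform claims, e.g. taken from the locale.
    override: an optional user-configured encoding (hypothetical setting).
    """
    if override is not None:
        # An explicit user choice always wins.
        return codecs.lookup(override).name
    try:
        name = codecs.lookup(reported).name
    except (LookupError, TypeError):
        # Unknown or missing encoding name: fall back to UTF-8.
        return "utf-8"
    if name == "ascii":
        # ASCII usually means "no locale set", so guess UTF-8 instead.
        return "utf-8"
    return name


print(choose_fs_encoding("ANSI_X3.4-1968"))
print(choose_fs_encoding("ascii", override="latin-1"))
```

Because ASCII is a subset of UTF-8, any name that decoded under the old default decodes to the same text under this one.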

This is distinct from bug 63324, which says that if there are names that really are invalid in the encoding (even if the encoding's set properly) bzr can't represent them.

This might be complicated to implement if Python assumes the fsencoding is set only once at startup, but it's probably still possible.

Related branches

lp:~gz/bzr/filesystem_default_encoding_794353

Merged into lp:bzr at revision 6367

Vincent Ladeuil: Approve on 2011-12-13

Jelmer Vernooij (community): Approve (code) on 2011-12-12

Per Johansson (per.j) wrote on 2011-10-17:

As noted on unicode.org (sorry don't have time to look it up), it's also very unlikely that a text can be interpreted as valid UTF-8, if it isn't indeed UTF-8. That makes it a somewhat safe default.
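
A small Python 3 illustration of that point, using a made-up file name and a hypothetical helper `is_valid_utf8`: the same name encoded as Latin-1 is rejected by a strict UTF-8 decoder, because a lone high byte is not a well-formed UTF-8 sequence.

```python
def is_valid_utf8(raw):
    """Return True if the byte string decodes cleanly as strict UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


name = "caf\u00e9.txt"  # illustrative file name containing one non-ASCII letter

print(is_valid_utf8(name.encode("utf-8")))    # encoded as UTF-8
print(is_valid_utf8(name.encode("latin-1")))  # same name encoded as Latin-1
print(is_valid_utf8(b"plain_ascii.txt"))      # pure ASCII is also valid UTF-8
```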

Jelmer Vernooij (jelmer) wrote on 2011-10-17:

Related, it would be nice if sys.getfilesystemencoding() would return 'utf8' on Ubuntu (isn't the policy that it has to be utf8?) rather than depending on some environment variables as appears to be the case at the moment.

Barry Warsaw (barry) wrote on 2011-10-17:

I'd be rather worried about deviating either from Debian or upstream here. What if someone runs bzr against a Python installed from source? You'd still have to work around it.

Martin Pool (mbp) wrote on 2011-10-17: Re: [Bug 794353] Re: ascii is a bad default filesystem encoding

On 18 October 2011 06:07, Jelmer Vernooij <email address hidden> wrote:
> Related, it would be nice if sys.getfilesystemencoding() would return
> 'utf8' on Ubuntu (isn't the policy that it has to be utf8?) rather than
> depending on some environment variables as appears to be the case at the
> moment.

In particular it depends on $LANG, and in the reasonably common case
of LANG=C or LANG='' you get ANSI_X3.4-1968 (ie ascii.)

As Barry says, Ubuntu doesn't want to deviate from upstream. Upstream
Python always trusts nl_langinfo(CODESET) which gives this result.
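
For illustration, Python's codec registry treats that codeset name as an alias of ASCII. A small Python 3 sketch follows; the helper name `codeset_to_codec` is made up here, and what the live `nl_langinfo` call prints depends entirely on the environment it runs in.

```python
import codecs
import locale


def codeset_to_codec(codeset):
    """Return the canonical Python codec name for a locale codeset string."""
    return codecs.lookup(codeset).name


# The codeset name reported for LANG=C, as quoted above.
c_locale_codec = codeset_to_codec("ANSI_X3.4-1968")
print(c_locale_codec)

# What the current process reports. nl_langinfo exists only on Unix-like
# platforms, and setlocale can fail if the configured locale is not installed.
if hasattr(locale, "nl_langinfo"):
    try:
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass
    print(locale.nl_langinfo(locale.CODESET))
```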

This is a little hard to work around from bzr because the filesystem
encoding is held in a C variable and accessed directly from some
internal C routines. But it may be possible.

It seems like our options include:
- give a clear error telling the user to use a unicode locale
- possibly file a Python bug complaining there's no way to override
the fs encoding
- overwrite it from a C extension
- use the bytes filename apis to avoid relying on Python's own encoding
- in any cases where the previous point won't work, perhaps even call
directly to OS APIs
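
A rough Python 3 sketch of the "bytes filename APIs" option, assuming a Linux-style filesystem that stores names as raw bytes. The helper `list_names` and its `encoding` parameter are hypothetical: passing a bytes path to `os.listdir` returns bytes names, which can then be decoded with an encoding the program chooses rather than the interpreter's filesystem encoding.

```python
import os
import tempfile


def list_names(directory, encoding="utf-8"):
    """List a directory through the bytes API and decode the names ourselves."""
    raw_names = os.listdir(os.fsencode(directory))
    return sorted(raw.decode(encoding) for raw in raw_names)


with tempfile.TemporaryDirectory() as tmp:
    # Create a file whose name is given as raw UTF-8 bytes.
    raw_path = os.path.join(os.fsencode(tmp), b"test_file\xc8\x9b.txt")
    with open(raw_path, "wb"):
        pass
    names = list_names(tmp)

# Compare rather than print the name, since stdout may itself be ASCII-only.
print(names == ["test_file\u021b.txt"])
```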

Jelmer Vernooij (jelmer) wrote on 2011-10-18:

Since there is no set filesystem encoding on Linux, it's always going to be a guessing game.

I can understand not wanting to deviate from upstream or Debian in this case, though I do wonder if UTF-8 would also be a more appropriate guess for upstream. I suspect UTF-8 will be right more often than whatever the terminal encoding happens to be set to.

Jelmer Vernooij (jelmer) wrote on 2011-10-18:

On 10/18/2011 01:53 AM, Martin Pool wrote:
> On 18 October 2011 06:07, Jelmer Vernooij <email address hidden> wrote:
>> Related, it would be nice if sys.getfilesystemencoding() would return
>> 'utf8' on Ubuntu (isn't the policy that it has to be utf8?) rather than
>> depending on some environment variables as appears to be the case at the
>> moment.
>
> In particular it depends on $LANG, and in the reasonably common case
> of LANG=C or LANG='' you get ANSI_X3.4-1968 (ie ascii.)
>
> As Barry says, Ubuntu doesn't want to deviate from upstream. Upstream
> Python always trusts nl_langinfo(CODESET) which gives this result.
>
> This is a little hard to work around from bzr because the filesystem
> encoding is held in a C variable and accessed directly from some
> internal C routines. But it may be possible.
>
> It seems like our options include:
> - give a clear error telling the user to use a unicode locale
I think this alone would be a big step forward; giving a nice error
message explaining to people what to do isn't as good as Doing The Right
Thing, but it's a lot better than printing an incomprehensible traceback.
>
> - possibly file a Python bug complaining there's no way to override
> the fs encoding
That seems like a good idea, independent of what bzr ends up doing.

Cheers,

Jelmer

Martin Pool (mbp) wrote on 2011-11-01:

bug 884327 is another instance of this.

Adi Roiban (adiroiban) wrote on 2011-11-01:

The shell locale is C

$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=C

----

I don't know how you can find out the encoding of the filesystem.
As far as I know, for many filesystems it doesn't matter what encoding is in use.

The current OS has no UTF-8 locale installed.
The filenames are stored in UTF-8 and I am able to perform various operations on those files with a C locale shell.

$ ls
test_fileț.txt

-----

Maybe trying C and if failed then try UTF-8 would help, but a way for users to control and configure the filesystem encoding is desirable.

-----

After I removed that file from my bzr branch, bzr is working again, but the bzr status is not displaying the right filename.

$ rm test_data/users_files/test_user/test_file\310\233.txt
$ bzr st
removed:
test_data/users_files/test_user/test_file?.txt

Martin Pool (mbp) wrote on 2011-11-01:

On 1 November 2011 20:01, Adi Roiban <email address hidden> wrote:

> Maybe trying C and if failed then try UTF-8 would help, but a way for
> users to control and configure the filesystem encoding is desirable.

That's the point of this bug.

> After I removed that file from my bzr branch, bzr is working again, but
> the bzr status is not displaying the right filename.
>
> $ rm test_data/users_files/test_user/test_file\310\233.txt
> $ bzr st
> removed:
> test_data/users_files/test_user/test_file?.txt

If you're using a C (ascii) locale, bzr will only write ascii to
stdout, so it is replacing the accented character with '?'.
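
The substitution can be reproduced in a couple of lines of Python 3. This is only an illustration of the "replace" error handler, not bzr's actual output code; the bytes are the ones shown in the `rm` command above (octal `\310\233` is hex `\xc8\x9b`).

```python
# Raw file name bytes as stored on disk (UTF-8).
raw = b"test_file\xc8\x9b.txt"

# Decoded as UTF-8, the two bytes become the single character U+021B.
name = raw.decode("utf-8")

# Encoding for an ASCII-only terminal with the "replace" error handler
# substitutes '?' for each character that ASCII cannot represent.
shown = name.encode("ascii", "replace").decode("ascii")
print(shown)  # test_file?.txt
```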

Martin Packman (gz) on 2011-12-09

Changed in bzr:
assignee:	nobody → Martin Packman (gz)
status:	Confirmed → In Progress

Martin Packman (gz) on 2011-12-15

Changed in bzr:
milestone:	none → 2.5b5
status:	In Progress → Fix Released

Benjamin Peterson (benjaminp) wrote on 2011-12-20:

Will you still file a Python bug report?

Martin Packman (gz) wrote on 2011-12-20:

I've filed <http://bugs.python.org/issue13643> for Python 3 and attached a patch along similar lines to the change made for bzr here.
