Bazaar

bzr add fails on a non-utf8 filename in an utf8 locale

Bug #77657 reported by Wouter van Heyst on 2007-01-02

This bug affects 2 people

Affects   Status         Importance   Assigned to     Milestone
Bazaar    Fix Released   Medium       John A Meinel   Bazaar 1.4

Bug Description

Snippet from a test-in-writing:

esp = 'espa\xf1ol.alias'
open(esp, 'wb').write('latin1')
self.run_bzr_decode('add')

gets you:

...

File "/home/larstiq/src/bzr/bzr.dev/bzrlib/add.py", line 300, in smart_add_tree
for subf in sorted(os.listdir(abspath)):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf1 in position 4: ordinal not in range(128)

Related branches

lp:~jameinel/bzr/non-utf8-77657 (Merged)

Wouter van Heyst (larstiq) wrote on 2007-01-03: Re: InvalidEntryName: Invalid entry name: te\*st

On Wed, Jan 03, 2007 at 10:49:39AM +0000, Lars Wirzenius wrote:
> On ti, 2007-01-02 at 17:30 -0600, John Arbash Meinel wrote:
> > This isn't strictly a bug, it probably shouldn't be giving a traceback,
> > but rather a more helpful error. Basically, we disallow '\' as a
> > character in filenames.
>
> Hm. This is quite surprising to me as a Unix user. It does not seem to
> be documented on the manual page.
>
> Given that \ (and * and ? and [ and ] and : and various other
> characters) and non-Unicode strings actually exist in program source
> file names, I'd be rather happier if bzr would allow them to be used.
> It's not a top priority for me, but for completeness the tool should, in
> my opinion, allow them.
>
> > It isn't the only one (we only support Unicode
> > names, so there are a few byte sequences that could probably be
> > considered illegal).
>
> The source tree has some non-Unicode, non-ASCII filenames, and those
> resulted in crashes (stack traces and termination of process). I removed
> them to be able to complete the test. I'll do the same with filenames
> containing \ as well. Are there any other characters I should be wary of
> as well?

The non-Unicode, non-ASCII files in a UTF-8 locale are problematic during
add: os.listdir() is fine, but the sorted() wrapping that call blows up.
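
A note on why it is sorted() and not os.listdir() that fails: in Python 2, os.listdir() called with a unicode path returns unicode names where it can decode them and leaves undecodable names as byte strings. Sorting that mixed list compares str against unicode, which triggers an implicit ASCII decode. The sketch below is written for Python 3 (an assumption; the thread predates it), where the implicit step no longer exists, so it performs the decode explicitly to reproduce the error and its position. The helper name try_decode is made up for illustration.

```python
# Python 3 sketch; the file name is the one from the bug description.
raw_name = b"espa\xf1ol.alias"  # latin-1 bytes for "español.alias"


def try_decode(raw, encoding):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


# The implicit decode Python 2 attempted when comparing str with unicode.
_, ascii_error = try_decode(raw_name, "ascii")
# The decode a UTF-8 locale would attempt for this name.
_, utf8_error = try_decode(raw_name, "utf-8")
# The encoding the name was actually written in.
latin1_text, _ = try_decode(raw_name, "latin-1")

print(ascii_error)
print(utf8_error)
print(ascii(latin1_text))
```

Both failing decodes stop at the same byte, 0xf1 at index 4, which matches the position reported in the traceback.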

Robert thought it wasn't sane to deal with a filename not decodable in
the user encoding, where I just wanted that exact byte sequence checked
in. At the very least that means the filename will be unintelligible on
windows, macosx, or in any other encodings. Worse, it might not even
check out at all. This is a general problem we also encounter with the
illegal characters for fat, upper/lower case clashes or files called
con.txt on windows.

Wouter van Heyst

to be removed (liw) wrote on 2007-01-03:

On ke, 2007-01-03 at 12:20 +0100, Wouter van Heyst wrote:
> Robert thought it wasn't sane to deal with a filename not decodable in
> the user encoding, where I just wanted that exact byte sequence checked
> in. At the very least that means the filename will be unintelligible on
> windows, macosx, or in any other encodings. Worse, it might not even
> check out at all. This is a general problem we also encounter with the
> illegal characters for fat, upper/lower case clashes or files called
> con.txt on windows.

I understand that point, but in contexts where I as a user am not
concerned with the interoperability of my filenames with other systems
than Unix, I wish I would be able to use them with bzr as well.

If the restriction stays in bzr, it should at least be prominently
documented in the manual page. Oh, and the error message could be
better, too. :)

--
"Quack, damn you!" -- Jamie Hyneman

John A Meinel (jameinel) wrote on 2007-01-03:

Well, the reason we don't support non-unicode is because internally all paths are handled as Unicode.

There is also a specific need for this, because how non-ascii characters are handled on various platforms is very different.

Specifically, Windows has an OEM codepage and a Unicode API. The OEM codepage means that you might be able to handle a non-ASCII character if it exists in your codepage, though its final value will be arbitrary. The Unicode API allows you to create any valid Unicode filename, which means that I can create an Arabic filename on a Russian Windows installation.

Further, Mac OS X handles filenames in a very different fashion, choosing to normalize Unicode names, and doing so with a method different from other common normalizations. Specifically, a filename like å.txt will show up as '\xe5.txt' on most systems, but on Mac it is 'a\u030a.txt'.

They are both valid from a Unicode standpoint. The first is "(a with circle)" and the second is "(a) (with circle)"
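
The two spellings can be compared with Python's unicodedata module. This is a Python 3 sketch, not bzr code. It uses the standard NFC and NFD forms as stand-ins; Mac OS X applies its own variant of decomposition, so treating it as plain NFD is an assumption that happens to hold for this character.

```python
import unicodedata

composed = "\xe5.txt"       # "å" as one code point, U+00E5
decomposed = "a\u030a.txt"  # "a" followed by U+030A COMBINING RING ABOVE

# Convert each spelling into the other form.
nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)

print(composed == decomposed)  # the raw strings differ
print(nfd == decomposed)       # decomposing the first gives the second
print(nfc == composed)         # composing the second gives the first
```

A tool that compares the two names without normalizing them first will treat them as two different files.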

Anyway, this is just to say that inserting an arbitrary character code on one filesystem will usually not be properly represented on another filesystem, especially once you start getting into codepages and encodings. (You want me to version \xe5, which in latin-1 is å, but in iso-8859-2 it is ĺ, in iso-8859-15 it is å, and in 'cp1251' (Russian) it is е.)
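
The codepage point can be checked directly by decoding the single byte under each of the encodings named above. This is a Python 3 sketch using the standard codecs; the variable names are illustrative only.

```python
raw = b"\xe5"
encodings = ["latin-1", "iso8859-2", "iso8859-15", "cp1251"]

# Decode the same byte under each codepage.
decoded = {name: raw.decode(name) for name in encodings}

for name, char in decoded.items():
    # ascii() keeps the output printable on any terminal encoding.
    print("%-12s U+%04X %s" % (name, ord(char), ascii(char)))
```

One byte, three different characters: the stored value says nothing about the intended character unless the encoding travels with it.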

It makes far more sense to version Unicode filenames, since they have a *chance* at being portable.

Wouter van Heyst (larstiq) wrote on 2007-03-17:

For some users renaming the files is an option; we should at least give a better message.

Changed in bzr:
status:	Unconfirmed → Confirmed

felix (woelk-f) wrote on 2007-10-25:

Renaming the file(s) is very hard, because the file name in question is not mentioned in the error message. It would definitely help to have a more specific error message stating which file cannot be handled and why.
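
A minimal sketch of the kind of error asked for here: one that names the file and the encoding it failed under. It is written for Python 3 and is not taken from bzr. The exception class UndecodableFilename and the function decode_filename are hypothetical names invented for this example.

```python
class UndecodableFilename(Exception):
    """Hypothetical error type for this sketch; not part of bzrlib."""

    def __init__(self, raw_name, encoding):
        self.raw_name = raw_name
        self.encoding = encoding
        super().__init__(
            "Filename %r is not valid in the %s encoding" % (raw_name, encoding)
        )


def decode_filename(raw_name, encoding="utf-8"):
    """Decode a byte filename, naming the file if it cannot be decoded."""
    try:
        return raw_name.decode(encoding)
    except UnicodeDecodeError:
        # Replace the bare codec error with one that identifies the file.
        raise UndecodableFilename(raw_name, encoding) from None


try:
    decode_filename(b"espa\xf1ol.alias")
except UndecodableFilename as exc:
    message = str(exc)
    print(message)
```

Printing the name with %r keeps the offending bytes visible as escapes, so the user can find the file even when the terminal cannot display it.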

David Henningsson (diwic) wrote on 2008-03-17:

This was a showstopper to me and I had to do something...

Disclaimer: no idea if this works for you. Make sure you have a backup. Don't sue me. Etc.

At line 2235 of workingtree_4.py (bzr version 1.0.0), change the code from this:

                        while path_index < len(current_dir_info[1]):
                                current_path_info = current_dir_info[1][path_index]
                                if want_unversioned:
                                    if current_path_info[2] == 'directory':

to:

                        while path_index < len(current_dir_info[1]):
                                current_path_info = current_dir_info[1][path_index]
                                print "Current path info: ", current_path_info, "\n"
                                if want_unversioned:
                                    if current_path_info[2] == 'directory':

and it will print every file, including the one it crashes on.

John A Meinel (jameinel) wrote on 2008-03-17:

The error is better, which is about all we can do for this bug.

Changed in bzr:
assignee:	nobody → jameinel
importance:	Undecided → Medium
milestone:	none → 1.4
status:	Confirmed → Fix Committed

codeslinger (codeslinger) wrote on 2008-04-23:

yet another of the file name bugs....

please see the discussion in Bug #135320

also if you will take a look at this table http://www.asciitable.com/
you will see that on windows there are many valid extended characters.

Wouldn't it be much better to not mangle the names at all? just escape them and preserve their literal values.
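
One way to read "escape them and preserve their literal values" is the surrogateescape error handler that Python 3 later adopted for file names (PEP 383). Nothing in this thread indicates that bzr works this way; the sketch below only illustrates the idea, in Python 3.

```python
raw_name = b"espa\xf1ol.alias"  # not valid UTF-8

# Undecodable bytes become lone surrogates instead of raising an error.
escaped = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = escaped.encode("utf-8", errors="surrogateescape")

# A strict encode fails, because the escaped name is not valid Unicode text.
try:
    escaped.encode("utf-8")
    portable = True
except UnicodeEncodeError:
    portable = False

print(ascii(escaped))
print(restored == raw_name)
print(portable)
```

The round trip preserves the bytes on the originating system, but the escaped name still has no meaningful representation on a platform that requires real Unicode names.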

It's all very fine and well for someone who is a unix only person to decree that he only cares about what will work for him. But what about the rest of us poor blokes who have to deal with file names that we have no control over?

on a gentoo linux system try creating a repository of /usr/share and see what happens....
answer it breaks and there is no way you can even think about changing any of those file names.

So why create such a repository? very simple, it's a great way to do security audits of your system.

Robert Collins (lifeless) wrote on 2008-04-24: Re: [Bug 77657] Re: bzr add fails on a non-utf8 filename in an utf8 locale

On Wed, 2008-04-23 at 07:37 +0000, codeslinger wrote:
> yet another of the file name bugs....
>
> please see the discussion in Bug #135320
>
> also if you will take a look at this table http://www.asciitable.com/
> you will see that on windows there are many valid extended characters.
>
> Wouldn't it be much better to not mangle the names at all? just escape
> them and preserve their literal values.
>
> It's all very fine and well for someone who is a unix only person to
> decree that he only cares about what will work for him. But what about
> the rest of us poor blokes who have to deal with file names that we have
> no control over?

I think you have some confusion present. The different bugs are not all
dups; they are indeed raising the same exception but in different places
and for different reasons.

Using unicode lets us take a file with a given name from unix to OS X,
and then to windows, even though they all have different encodings for
the same file name. bzr is not mangling file names, it's converting from
a byte stream to unicode.

For a file name to be usable on a file system, it needs to be in some
specific encoding. Some file system interfaces ignore encodings. Others,
like mac OS X, force everything to unicode.
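
To illustrate the point about one name having different encodings, here is a Python 3 sketch that encodes a single Unicode name three ways. The choice of encodings is an assumption for illustration: UTF-8 and latin-1 as possible Unix locale encodings, and UTF-16 as used by the Windows Unicode API.

```python
name = "espa\xf1ol.alias"  # "español.alias" as Unicode text

as_utf8 = name.encode("utf-8")
as_latin1 = name.encode("latin-1")
as_utf16 = name.encode("utf-16-le")

print(as_utf8)
print(as_latin1)
print(as_utf16)

# Each byte form decodes back to the same Unicode name.
round_trips = [
    as_utf8.decode("utf-8"),
    as_latin1.decode("latin-1"),
    as_utf16.decode("utf-16-le"),
]
print(all(text == name for text in round_trips))
```

Storing the Unicode name rather than any one of the byte forms is what lets each platform write the file out in its own encoding.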

-Rob

--
GPG key available at: <http://www.robertcollins.net/keys.txt>.

James Westby (james-w) wrote on 2008-07-14:

#10

Here's a slightly different error I am getting that has the same underlying
cause.

  File "/usr/lib/python2.5/site-packages/bzrlib/transform.py", line 293, in trans_id_tree_path
    path = self.canonical_path(path)
  File "/usr/lib/python2.5/site-packages/bzrlib/transform.py", line 275, in canonical_path
    abs = self._tree.abspath(path)
  File "/usr/lib/python2.5/site-packages/bzrlib/workingtree.py", line 375, in abspath
    return pathjoin(self.basedir, filename)
  File "/usr/lib/python2.5/posixpath.py", line 65, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 1: ordinal not in range(128)
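
The last frame shows a unicode base directory being joined with a byte-string file name; in Python 2 the concatenation coerced the byte string through ASCII and failed on the first non-ASCII byte. The Python 3 sketch below shows the same mix of types. Python 3 refuses it outright with a TypeError instead of attempting a decode. The directory and file name are hypothetical.

```python
import posixpath

basedir = "/home/user/tree"  # hypothetical working tree root, as text
raw_name = b"n\xedo.txt"     # hypothetical byte filename, 0xed at index 1

# Mixing text and bytes in one join is rejected.
try:
    posixpath.join(basedir, raw_name)
    mixed_join_error = None
except TypeError as exc:
    mixed_join_error = exc

# Joining works once both sides have the same type.
joined_bytes = posixpath.join(basedir.encode("utf-8"), raw_name)

print(mixed_join_error)
print(joined_bytes)
```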

Daniel Clemente (n142857) wrote on 2008-09-20:

#11

I filed bug 272444 to track support for non-ASCII targets in symlinks.

Vincent Ladeuil (vila) wrote on 2009-04-29:

#12

Really released in 1.6 but the milestone is not available anymore.

Changed in bzr:
status:	Fix Committed → Fix Released

Simon (simonjwiles) wrote on 2010-02-02:

#13

Attachment: bzr-20100202025927-17893.crash (38.8 KiB, text/plain)

In what sense is this bug 'fixed'? I've just encountered the same problem with bzr 2.0.2, which produced a crash report and suggested I file it as a bug...

From the discussion above, I appreciate the issue. My preferred solution would be to escape the offending characters and preserve their literal values, as suggested by codeslinger, if this is a viable option. Renaming the files is an option for me in this case, but neither the error message nor the crash report allows me to identify the problem files. I attempted to apply David Henningsson's hack to print the filenames (I think the equivalent block of code in 2.0.2 is in dirstate.py??), but to no avail.

Can anyone suggest a way I can at least find out where the offending files are? I have to work with a large and messy tree!

Thanks!

bzr: ERROR: exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 68: ordinal not in range(128)

*** Bazaar has encountered an internal error. This probably indicates a
  bug in Bazaar. You can help us fix it by filing a bug report at
   https://bugs.launchpad.net/bzr/+filebug
  attaching the crash file
   /home/simon/.cache/crash/bzr-20100202025927-17893.crash
  and including a description of the problem.

The crash file is plain text and you can inspect or edit it to remove
private information.

Simon (simonjwiles) wrote on 2010-02-02:

#14

for anyone with the same problem as me, here is a small quick 'n' dirty script I just wrote to determine potentially problematic files in the tree:

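#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 script: prints every file under the current directory whose
# path is not valid UTF-8.

import os

def loopfiles(root=os.curdir):
    # A byte-string root makes os.walk() yield byte-string paths.
    for path, dirs, files in os.walk(os.path.abspath(root)):
        for filename in files:
            yield os.path.join(path, filename)

if __name__ == "__main__":
    for f in loopfiles():
        try:
            # decode() checks the raw bytes. encode() on a byte string
            # would first decode as ASCII and flag every non-ASCII name.
            f.decode('utf8')
        except UnicodeDecodeError:
            print f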