Can't upload non-ascii URLs

Bug #472161 reported by Sylvain on 2009-11-03
This bug affects 2 people
Affects: Bazaar | Importance: Medium | Assigned to: Vincent Ladeuil
Affects: bzr Upload plugin | Importance: High | Assigned to: Vincent Ladeuil

Bug Description

bzr-upload reports an error whenever I try to upload a file with non-ascii characters in its name. There is no crash report, only the following error message:

<code>
Uploading xxxxx.xxxxxxxxx.xxxx/files/Contrat de travail US.doc
Uploading xxxxx.xxxxxxxxx.xxxx/files/Couverture sociale US.doc
Uploading xxxxx.xxxxxxxxx.xxxx/files/Harcèlement US.doc
bzr: ERROR: Invalid url supplied to transport: "xxxxx.xxxxxxxxx.xxxx/files/Harcèlement US.doc": URL was not a plain ASCII url: 'ascii' codec can't encode character u'\xe8' in position 31: ordinal not in range(128)
</code>
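
The failure can be reproduced outside bzr. The following is a minimal sketch in Python 3 (the report itself comes from Python 2.6), using the redacted path from the log above; `ascii_failure` is a hypothetical helper, not part of bzr or bzr-upload.

```python
# The redacted path from the log, with the accented character written
# as an escape so the source file encoding does not matter.
path = "xxxxx.xxxxxxxxx.xxxx/files/Harc\u00e8lement US.doc"


def ascii_failure(text):
    """Return the index of the first character ASCII cannot encode, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc.start
    return None


failure_index = ascii_failure(path)
print(failure_index, repr(path[failure_index]))
```

The index returned should point at the accented character, which is what the "position" in the error message refers to.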

bzr version
Bazaar (bzr) 2.0.0
  Python interpreter: /usr/bin/python 2.6.4rc2
  Python standard library: /usr/lib/python2.6
  Platform: Linux-2.6.31-14-generic-i686-with-Ubuntu-9.10-karmic
  bzrlib: /usr/lib/python2.6/dist-packages/bzrlib
  Bazaar configuration: /home/sysadmin/.bazaar
  Bazaar log file: /home/sysadmin/.bzr.log

Copyright 2005, 2006, 2007, 2008, 2009 Canonical Ltd.
http://bazaar-vcs.org/

bzr comes with ABSOLUTELY NO WARRANTY. bzr is free software, and
you may use, modify and redistribute it under the terms of the GNU
General Public License version 2 or later.

bzr plugins -v
bzrtools 2.0.1
    Various useful commands for working with bzr.
    /usr/lib/python2.6/dist-packages/bzrlib/plugins/bzrtools

dbus 0.1.0dev
    D-Bus integration for bzr/bzrlib.
    /usr/lib/python2.6/dist-packages/bzrlib/plugins/dbus

gtk 0.97.0.final
    Graphical support for Bazaar using GTK.
    /usr/lib/python2.6/dist-packages/bzrlib/plugins/gtk

launchpad 2.0.0
    Launchpad.net integration plugin for Bazaar.
    /usr/lib/python2.6/dist-packages/bzrlib/plugins/launchpad

netrc_credential_store 2.0.0
    Use ~/.netrc as a credential store for authentication.conf.
    /usr/lib/python2.6/dist-packages/bzrlib/plugins/netrc_credential_store

upload 1.0.0dev
    Upload a working tree, incrementally.
    /home/xxx/.bazaar/plugins/upload


Vincent Ladeuil (vila) wrote :

There should be a traceback in your .bzr.log ('bzr version' will tell you where to find it).
That would help a lot to debug the issue.

summary: - bzr-upload crashes on karmic koala (bazaar-2.0.0) with non-ascii characters
+ Can't upload non-ascii URLs
Changed in bzr-upload:
status: New → Confirmed
Sylvain (spouilly) wrote :

Sorry about that ... I did not know where to look for the log ... here you go ...

Vincent Ladeuil (vila) wrote :

Long story short: the bzr ftp transport doesn't support unicode paths.
The root cause is that some ftp servers don't support them either, and even the ftp
spec is far from clear about UTF8 support.
I'll mark this bug as affecting bzr too as that's where the bug should/could be fixed but bzr itself
uses only ascii file names for its own purposes.

It should be possible to fix the issue in bzr though but until then, the best you can do is:
1) avoid using non-ascii filenames (sorry for that :-/)
2) ensure your ftp server supports unicode filenames and report anything relevant in this bug

By relevant I mean at least: OS/version where the ftp server runs, what ftp server it is,
a log of successful dialog between the client and the server for such a file, any format will do as long as
it makes it clear what format is used to exchange the filename between the client and the server (presumably utf8).

Changed in bzr-upload:
importance: Undecided → High
Changed in bzr:
status: New → Confirmed
importance: Undecided → Medium
Sylvain (spouilly) wrote :

Hi there,

Well, it is unlikely to be the server failing to handle utf-8, because I am able to transfer the same files using FileZilla (default configuration), or the ftp (command line) provided by Karmic (I am providing both log files, but I am not quite sure if the information that you are looking for is visible in them).

The FTP server is running linux (kernel 2.6.18-028stab064.7), with Pure-FTPd (so we can conclude it is running RedHat RHEL 5.2).

The client (my machine) is running Ubuntu Karmic Koala (no special development environment) with locales set to en_US.UTF-8.

Kamil Szot (kamil-szot) wrote :

Please try using bzr-upload plugin in bzr with my modifications: https://code.launchpad.net/~kamil-szot/bzr/non-ascii-chars-in-ftp-filenames

I successfully uploaded with 'bzr upload' directories and files with non-ascii characters in names thanks to those modifications.

Vincent Ladeuil (vila) wrote :

@Kamil, thanks for working on that !

Unfortunately your changes break the bzr test suite so I urge you to be extremely careful if you're using that modified bzr.

Internally bzr uses unicode paths only; modifying urlutils.unescape() as you did is not an option
(the docstring says: This returns a Unicode path from a URL)

You may want to look at bzrlib/transport/ftp/__init__.py where urlutils.unescape() is called in _remote_path() with a comment fully related to this bug.

I'm sure you're on the right track though but make sure you don't break the test suite :)

Vincent Ladeuil (vila) wrote :

I forgot to mention that the bzr team expects merge proposals to track patches.
Go to your branch page on Launchpad, click the 'Propose for merging' button and follow the instructions.
You are more likely to get timely feedback that way !

Kamil Szot (kamil-szot) wrote :

Thank you for reviewing my changes. This made me inspect the problem a bit deeper.

I reverted previous changes and fixed the problem differently. I pushed this to my branch. Please take a look if you have a moment.

I am new to bzr development. If you could kindly direct me to some information about how to check whether my changes break the test suite, I'll be grateful.

Vincent Ladeuil (vila) wrote :

@Sylvain & Kamil: I pushed an alternate fix, which passes the test suite, at lp:~vila/bzr/472161-ftp-utf8

Can you try it in your config and report any problems ?

@Kamil: I filed https://code.edge.launchpad.net/~vila/bzr/472161-ftp-utf8/+merge/16967 with that branch.
That's the preferred way to submit changes to bzr.

Regarding your 'branches diverged' problem, presumably you did 'bzr uncommit', so if you wanted to push to
the same branch you should have used 'bzr push --overwrite'.

To run the test suite you issue:
  bzr selftest

You'll need to have an ftp test server available for that. bzr supports medusa up to python2.5 and
pyftpdlib (http://code.google.com/p/pyftpdlib/) for python2.5 and 2.6.

Since medusa doesn't support Unicode paths, I recommend pyftpdlib of course (You have to install the module
in a suitable directory, use PYTHONPATH if needed).

To run less than the full test suite, you can start from:
   bzr selftest FTPTestServer

which will run only the tests whose name contains FTPTestServer or:

  bzr selftest -s bt.per_transport.TransportTests.test_unicode_paths

to run only the test_unicode_paths test.

Changed in bzr:
assignee: nobody → Vincent Ladeuil (vila)
status: Confirmed → Fix Committed
Kamil Szot (kamil-szot) wrote :

By examining the changes you made http://bazaar.launchpad.net/~vila/bzr/472161-ftp-utf8/revision/4936 I doubt this goes anywhere near fixing my trouble.

FTP servers are usually encoding agnostic. They treat file names as built of 8-bit characters without any specific encoding. They don't support any encoding explicitly.

Please take a look at my changes. They should fix the problem even if your system uses some other encoding for filenames (for example iso-8859-2).

The root of my problem seems to come from the fact that python os.path functions appear not to be aware that the OS might use different encodings for file names. They just return a string, and if it is implicitly converted to unicode at some point then the 'ascii' encoding is used. I wonder how the core of bzr deals with this. Some people get around such bugs by setting the default encoding for unicode<->string conversions by calling sys.setdefaultencoding(), as criticised here http://tarekziade.wordpress.com/2008/01/08/syssetdefaultencoding-is-evil/

I have not made merge proposal yet because I wanted to hear your opinion first and test if it breaks testsuite for myself.

I think I sorted out the problems with branch divergence by merging and doing empty commit. I'll remember the --overwrite trick for later use in similar cases.

bzr selftest reports that the version of python-testtools in my Ubuntu is too old. I'll try to fix that, but it might take me a moment.

I'm using bazaar with python 2.6.4, I'll try to install medusa or pyftpdlib, run the tests the way you described and report my findings.

John A Meinel (jameinel) wrote :

Kamil Szot wrote:
> By examining the changes you made
> http://bazaar.launchpad.net/~vila/bzr/472161-ftp-utf8/revision/4936 I
> doubt this goes anywhere near fixing my trouble.
>
> FTP servers are usually encoding agnostic. They treat file names as
> built of 8-bit characters without any specific encoding. They don't
> support any encoding explicitly,

If the actual server is agnostic, then it means that Bazaar can choose
what encoding it wants to use.

>
> Please take a look at my changes. They should fix the problem even if
> your system uses some other encoding for filenames (for example
> iso-8859-2).

Except it uses the local encoding, and fancy_rename is being run on a
remote location, where what matters is the *remote* encoding.

However, I think a key change would actually be:

=== modified file 'bzrlib/osutils.py'
--- bzrlib/osutils.py 2009-12-23 00:15:34 +0000
+++ bzrlib/osutils.py 2010-01-07 17:00:45 +0000
@@ -208,7 +208,7 @@
     # sftp rename doesn't allow overwriting, so play tricks:
     base = os.path.basename(new)
     dirname = os.path.dirname(new)
-    tmp_name = u'tmp.%s.%.9f.%d.%s' % (base, time.time(), os.getpid(), rand_chars(10))
+    tmp_name = 'tmp.%s.%.9f.%d.%s' % (base, time.time(), os.getpid(), rand_chars(10))
     tmp_name = pathjoin(dirname, tmp_name)

     # Rename the file out of the way, but keep track if it didn't exist

I *believe* that fancy_rename is being called on URL fragments, which
should *not* be Unicode strings. (In bzr, paths are Unicode, urls are
url-escaped-utf8-encoded 7-bit ascii strings.)
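
The path/URL distinction described here can be illustrated with the Python 3 standard library. This is only a sketch: it uses `urllib.parse` rather than `bzrlib.urlutils`, and the two helper names are hypothetical.

```python
from urllib.parse import quote, unquote


def path_to_url_fragment(path):
    """Encode a Unicode path as UTF-8, then percent-escape it to 7-bit ASCII."""
    return quote(path.encode("utf-8"), safe="/")


def url_fragment_to_path(fragment):
    """Undo the percent-escaping and decode the UTF-8 bytes back to Unicode."""
    return unquote(fragment, encoding="utf-8", errors="strict")


fragment = path_to_url_fragment("files/Harc\u00e8lement US.doc")
fragment.encode("ascii")  # a URL fragment must survive this without raising
print(fragment)
print(url_fragment_to_path(fragment))
```

Under this convention a URL fragment never needs to be a Unicode string, because every non-ascii byte has already been escaped.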

>
> The root of my problem seems to come from the fact that python os.path
> functions appear not to be aware of the fact that OS might use different
> encodings for file names. They just return string and if it is
> implicitly converted to unicode at some point then the 'ascii' encoding
> is used. I wonder how the core of bzr deals with this. Some people get
> around such bugs by setting default encoding for unicode<->string
> conversions by calling os.setdefaultencoding() like complained about
> here http://tarekziade.wordpress.com/2008/01/08/syssetdefaultencoding-
> is-evil/
>
> I have not made merge proposal yet because I wanted to hear your opinion
> first and test if it breaks testsuite for myself.
>
> I think I sorted out the problems with branch divergence by merging and
> doing empty commit. I'll remember the --overwrite trick for later use in
> similar cases.
>
> bzr selftest reports that I have too old version of python-testtools i
> my ubuntu, I'll try to fix that but it might take me a moment.

You can get the latest version in the bzr ppa, or see this blog post:
http://code.mumak.net/

>
> I'm using bazaar with python 2.6.4, I'll try to install medusa or
> pyftpdlib, run the tests the way you described and report my findings.
>

John
=:->

Vincent Ladeuil (vila) wrote :

>>>>> "Kamil" == Kamil Szot writes:

    Kamil> By examining the changes you made
    Kamil> http://bazaar.launchpad.net/~vila/bzr/472161-ftp-utf8/revision/4936 I
    Kamil> doubt this goes anywhere near fixing my trouble.

I realized the approach is different, that's why I asked you to
test it :)

Sorry for not having commented on your changes more clearly, see
my comments below.

You started on the assumption that the client and the server will
use the same encoding which is less likely to work than using
UTF8 (see RFC2640).

    Kamil> FTP servers are usually encoding agnostic.

Not really, they have to obey whatever file system is used
underneath.

    Kamil> They treat file names as built of 8-bit characters
    Kamil> without any specific encoding. They don't support any
    Kamil> encoding explicitly,

Many file systems will refuse to create files with arbitrary
8-bits characters. Using UTF8 as an intermediate representation
guarantees that clients and servers talk about the same paths
(again, have a look at RFC2640).

And some ftp servers will just not support arbitrary 8-bits paths
(medusa for one).

    Kamil> Please take a look at my changes. They should fix the
    Kamil> problem even if your system uses some other encoding
    Kamil> for filenames (for example iso-8859-2).

Your changes only impact the client side, the problem is on the
server side.

RFC2640 says that the clients should use UTF8 and that servers
should handle UTF8 and *then* do what is needed on their side.

If the clients use iso-8859-1, mac-roman or utf8 and the server
uses iso-8859-2, then using your changes can break but using utf8
should work.

Now, it depends on whether the server is handling utf8.

If it doesn't, then if (but only if) the server and all the
clients use the same fs encoding, your changes will work.

If it does, your changes will break but using utf8 will work.

So the proposed changes may not be the final answer but it's more
likely to work under more configurations.
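
The cases above can be sketched with a toy model in Python 3. The assumption, which is a simplification, is that the client sends the name encoded in `client_encoding` and the server interprets the received bytes as `server_encoding`; `server_view` is a hypothetical helper and no real FTP traffic is involved.

```python
name = "Harc\u00e8lement"  # the file name as the user sees it on the client


def server_view(name, client_encoding, server_encoding):
    """Return the name the server ends up with under the toy model."""
    wire_bytes = name.encode(client_encoding)
    return wire_bytes.decode(server_encoding, errors="replace")


# Client and server share one encoding: sending local bytes works.
same_encoding = server_view(name, "iso-8859-1", "iso-8859-1")

# Client uses iso-8859-1, server uses iso-8859-2: the name is altered.
mixed_encoding = server_view(name, "iso-8859-1", "iso-8859-2")

# Both sides agree on utf8 for the exchange: the name round-trips.
utf8_exchange = server_view(name, "utf-8", "utf-8")

print(same_encoding == name, mixed_encoding == name, utf8_exchange == name)
```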

And that's why I wanted to know if it worked for *you* !

    Kamil> The root of my problem seems to come from the fact
    Kamil> that python os.path functions appear not to be aware
    Kamil> of the fact that OS might use different encodings for
    Kamil> file names. They just return string and if it is
    Kamil> implicitly converted to unicode at some point then the
    Kamil> 'ascii' encoding is used.

True and you re-discover what is done in bzrlib.osutils (_fsenc,
get_terminal_encoding and get_user_encoding, but mainly the first
one).

The ftp transport receives paths that have been processed by
higher layers, it should not worry about what the file system
encoding is, the higher layers did.
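
As an illustration of what "the higher layers" do, here is a rough Python 3 analogue. It does not reproduce the bzrlib.osutils helpers named above; it relies on `os.fsdecode` and `os.fsencode` from the standard library, and the wrapper names are hypothetical.

```python
import os
import sys


def to_unicode_path(raw_name):
    """Decode a byte file name from the OS using the file system encoding."""
    return os.fsdecode(raw_name)


def to_os_bytes(path):
    """Encode a Unicode path back to the bytes the OS expects."""
    return os.fsencode(path)


raw = "Harc\u00e8lement US.doc".encode("utf-8")
path = to_unicode_path(raw)
print(sys.getfilesystemencoding(), repr(path))
```

Once a path has been decoded like this, a transport only ever sees Unicode and is free to pick its own wire encoding.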

    Kamil> I wonder how the core of bzr deals with this.

By using unicode internally and decoding the paths respecting the
file system encoding when needed (outside the scope of both your
changes and mine).

    Kamil> Some people get around such bugs by setting default
    Kamil> encoding for unicode<->string conversions by calling
    Kamil> os.setdefaultencoding() like complained about here
    Kamil> http://tarekziade.wordpress.com/2008/01/08/syssetdefaultencoding-
    Ka...


Kamil Szot (kamil-szot) wrote :

Thank you for verbose responses.

@John
> > FTP servers are usually encoding agnostic. They treat file names as
> > built of 8-bit characters without any specific encoding. They don't
> > support any encoding explicitly,
>
> If the actual server is agnostic, then it means that Bazaar can chose
> what encoding it wants to use.

Not exactly. Suppose my local file system holds two files: one whose name contains non-ascii chars, and another that refers to the first one by name. After uploading both to the FTP server, I expect that reference to still be valid.

If ftp client reads filename, interprets it as iso8859-2, sends it to FTP server encoded as utf-8, but the server does not support utf-8 communication and treats incoming communication as 8-bit characters without specific encoding, then writes them to disk, then the reference between two files will be lost.

In such a case, sending file names with special chars encoded in the local system encoding is better, because it preserves the exact byte representation that the file names have on this system.
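
This argument can be sketched in Python 3 under the stated assumption that the server stores whatever bytes it receives without conversion. The file name and the referring content are made up for illustration, and `reference_survives` is a hypothetical helper.

```python
local_encoding = "iso-8859-2"
name = "\u017c\u00f3\u0142w.html"  # a hypothetical non-ascii file name

# Bytes of the name as stored on the local disk, and the same bytes
# embedded in a second file that refers to the first one.
name_on_disk = name.encode(local_encoding)
referring_file = b'<a href="' + name_on_disk + b'">link</a>'


def reference_survives(uploaded_name_bytes):
    """True if the uploaded name still appears byte-for-byte in the referring file."""
    return uploaded_name_bytes in referring_file


sent_as_local = name.encode(local_encoding)
sent_as_utf8 = name.encode("utf-8")
print(reference_survives(sent_as_local), reference_survives(sent_as_utf8))
```

The sketch only covers an encoding-agnostic server; a server that converts names itself would behave differently.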

os.path calls do not return unicode, they return a string of bytes. The same holds in other languages with varying unicode support: system calls return file names as 8-bit strings in whatever encoding they are actually stored in.

> I *believe* that fancy_rename is being called on URL fragments, which
> should *not* be Unicode strings. (In bzr, paths are Unicode, urls are
> url-escaped-utf8-encoded 7-bit ascii strings.)

So 8-bit characters in file names should be url-escaped at some point (after reading them from disk?) before they are passed to functions such as fancy_rename. This should also fix the problem in a clean way.

During my investigation I encountered two errors: one in _remote_path(), the other in the pathjoin() call inside fancy_rename().

@Vincent

> I realized the approach is different, that's why I asked you to test it :)

I'm gonna definitely do that next week.

> Kamil> FTP servers are usually encoding agnostic.
> Not really, they have to obey whatever file system is used underneath.

The FTP servers and clients I encountered seemed like they didn't care about the difference between chars and bytes. They never converted anything between encodings. Clients read streams of bytes from system calls, passed them to servers and servers wrote what they received by passing it directly to system calls. They seemed not to use unicode at any point or be aware of system locale.

Of course I might be wrong in my impression. I might also have encountered FTP software that did not obey the RFC as it should, or was just too lenient in what it accepted.

> Many file systems will refuse to create files with arbitrary 8-bits characters.
I imagine this might be the case when the file system uses a multi-byte encoding such as utf-8 for file names.

For one, I can verify that ext4 is not such a file system. Although my system locale is utf-8, I can create a file with invalid UTF-8 in its name by passing an arbitrary string of 8-bit characters to the touch() function in PHP.
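
The same check can be sketched in Python 3 instead of PHP. The result depends on the operating system and file system in use, so no particular outcome is claimed here; `accepts_raw_bytes_name` is a hypothetical helper.

```python
import os
import tempfile


def accepts_raw_bytes_name(directory):
    """Try to create a file whose name is not valid UTF-8.

    Returns True if the OS accepted the name, False if it refused.
    """
    raw_name = os.path.join(os.fsencode(directory), b"invalid-\xff\xfe.txt")
    try:
        fd = os.open(raw_name, os.O_CREAT | os.O_WRONLY, 0o600)
    except (OSError, UnicodeError):
        return False
    os.close(fd)
    os.remove(raw_name)
    return True


with tempfile.TemporaryDirectory() as tmp:
    accepted = accepts_raw_bytes_name(tmp)
print(accepted)
```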

> Your changes only impact the client side, the problem is on the server side.
Perhaps but it was the client that threw exceptions at me, an...


Vincent Ladeuil (vila) wrote :

>>>>> "Kamil" == Kamil Szot <email address hidden> writes:

<snip/>

    >> I realized the approach is different, that's why I asked
    >> you to test it :)

    Kamil> I'm gonna definitely do that next week.

Good.

<snip/>

    Kamil> The FTP servers and clients I encountered seemed like
    Kamil> they didn't care about the difference between chars
    Kamil> and bytes. They never converted anything between
    Kamil> encodings. Clients read streams of bytes from system
    Kamil> calls, passed them to servers and servers wrote what
    Kamil> they received by passing it directly to system
    Kamil> calls. They seemed not to use unicode at any point or
    Kamil> be aware of system locale.

So using utf8 should work fine, shouldn't it?

    Kamil> Of course I might be wrong in my impression. I also
    Kamil> might encounter FTP software that did not obey RFC as
    Kamil> it should or was just too lenient in what it accepted.

Again utf8 is a safe bet here.

    >> Many file systems will refuse to create files with arbitrary 8-bits characters.

    Kamil> I imagine this might be the case when system file
    Kamil> system uses multi-byte encoding for file names like
    Kamil> utf-8.

    Kamil> For one I can verify that ext4 is not such file system.

    >> Your changes only impact the client side, the problem is on the server side.

    Kamil> Perhaps but it was the client that threw exceptions at
    Kamil> me, and it was the client that I had under my control.

Sure, I just wanted to explain that the problem was server side;
the exception was a consequence of that.

    >> If the clients uses iso-8859-1, mac-roman or utf8 and the server uses iso-8859-2, then using your changes can break but using utf8 should work.
    >> Now, it depends on whether the server is handling utf8.

    Kamil> If you could apply my ideas only to servers that
    Kamil> report that they do not support utf-8 it would be
    Kamil> great. I don't want to break anything. I just want to
    Kamil> have it working in my setup that most likely does not
    Kamil> support RFC2640.

The proposed changes don't implement all of RFC2640; they just
follow its recommendations and should be compatible and
allow round-tripping paths in most configurations.

If that's not the case, we'll have to design a more complicated
dialog with the server to check whether or not it's likely to
support utf8 paths and, if not, fall back to something else but
still check that we get correct results...

A far more complicated approach.
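
One possible first step of such a dialog is sketched below in Python 3 with the standard `ftplib` module. It assumes the server implements the FEAT command (RFC 2389) and lists a UTF8 feature when it supports RFC 2640; both helper names are hypothetical, and this is not what the proposed branch does.

```python
import ftplib


def features_from_reply(reply):
    """Extract feature labels from a multi-line FEAT reply.

    The first and last lines are status lines; each line in between
    holds one feature, indented by a space.
    """
    lines = reply.splitlines()
    return {
        line.strip().split(" ")[0].upper()
        for line in lines[1:-1]
        if line.strip()
    }


def server_advertises_utf8(ftp):
    """Ask a connected ftplib.FTP object whether it lists the UTF8 feature."""
    try:
        reply = ftp.sendcmd("FEAT")
    except ftplib.error_perm:
        return False  # the server rejected FEAT, so nothing is advertised
    return "UTF8" in features_from_reply(reply)
```

A server that advertises UTF8 could then be sent utf8 paths, while the fallback for other servers would still need the round-trip checks mentioned above.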

    >> Now, it depends on whether the server is handling utf8.
    >> If it doesn't, then if (but only if) the server and all
    >> the clients use the same fs encoding, your changes will
    >> work.

    Kamil> As I mentioned file systems I encounter also tend to
    Kamil> be encoding agnostic.

On linux yes. On other unixes it could be. On at least OSX I know
for sure it's not agnostic and I suspect it's also true on
Windows.

    Kamil> Can you give an example of such file system that is
    Kamil> widely used?

hfs+ on OSX.

    >> The ftp transport receives paths that have been processed
    >> by higher layers, it should not worry about what the...


John A Meinel (jameinel) on 2010-01-21
Changed in bzr:
milestone: none → 2.1.0rc1
status: Fix Committed → Fix Released
Vincent Ladeuil (vila) on 2010-12-10
Changed in bzr-upload:
assignee: nobody → Vincent Ladeuil (vila)
status: Confirmed → Fix Released
milestone: none → 1.0.0