Bazaar

Bug #472161
Comment #15

Comment 15 for bug 472161

Revision history for this message

Vincent Ladeuil (vila) wrote on 2010-01-08: Re: [Bug 472161] Re: Can't upload non-ascii URLs

#15

>>>>> "Kamil" == Kamil Szot <email address hidden> writes:

<snip/>

>> I realized the approach is different, that's why I asked
>> you to test it :)

Kamil> I'm gonna definitely do that next week.

Good.

<snip/>

    Kamil> The FTP servers and clients I encountered seemed like
    Kamil> they didn't care about the difference between chars
    Kamil> and bytes. They never converted anything between
    Kamil> encodings. Clients read streams of bytes from system
    Kamil> calls, passed them to servers and servers wrote what
    Kamil> they received by passing it directly to system
    Kamil> calls. They seemed not to use unicode at any point or
    Kamil> be aware of system locale.

So using utf8 should work fine isn't it ?

    Kamil> Of course I might be wrong in my impression. I also
    Kamil> might encounter FTP software that did not obey RFC as
    Kamil> it should or was just too lenient in what it accepted.

Again utf8 is a safe bet here.

>> Many file systems will refuse to create files with arbitrary 8-bits characters.

    Kamil> I imagine this might be the case when system file
    Kamil> system uses multi-byte encoding for file names like
    Kamil> utf-8.

Kamil> For one I can verify that ext4 is not such file system.

>> Your changes only impact the client side, the problem is on the server side.

Kamil> Perhaps but it was the client that threw exceptions at
Kamil> me, and it was the client that I had under my control.

Sure, I just wanted to explain that the problem was server side,
the exceptions was a consequence of that.

>> If the clients uses iso-8859-1, mac-roman or utf8 and the server uses iso-8859-2, then using your changes can break but using utf8 should work.
>> Now, it depends on whether the server is handling utf8.

    Kamil> If you could apply my ideas only to servers that
    Kamil> report that they do not support utf-8 it would be
    Kamil> great. I don't want to break anything. I just want to
    Kamil> have it working in my setup that most likely does not
    Kamil> support RFC2640.

The proposed changes doesn't implement all of RFC2640, it just
follows the recommendations here and should be compatible and
allow round-tripping paths in most configurations.

If that's not the case, we'll have to design a more complicated
dialog with the server to check whether or not it's likely to
support utf8 paths and if not fallback to something else but
still check that we get correct results...

A far more complicated approach.

    >> Now, it depends on whether the server is handling utf8.
    >> If it doesn't, then if (but only if) the server and all
    >> the clients use the same fs encoding, your changes will
    >> work.

Kamil> As I mentioned file systems I encounter also tend to
Kamil> be encoding agnostic.

On linux yes. On other unixes it could be. On at least OSX I know
for sure it's not agnostic and I suspect it's also true on
Windows.

Kamil> Can you give an example of such file system that is
Kamil> widely used?

hfs+ on OSX.

    >> The ftp transport receives paths that have been processed
    >> by higher layers, it should not worry about what the file
    >> system encoding is, the higher layers did.

    Kamil> This statement actually supports my claim that my
    Kamil> fixes are applied in right places because the cause of
    Kamil> the exceptions I encountered was mixing strings and
    Kamil> unicode in these functions without caring about proper
    Kamil> encoding to be used.

It doesn't support your claim, it's too late to try guessing if
we get the path from the user and we may try to decode it with fs
encoding or if it's already in unicode. These functions should
not refer to fs encoding because they can be used in different
contexts. bzr-upload for example is more likely to acquire the
paths from bzr as they are serialized internally (utf8) and then
upcasted to unicode. The transport layer has no idea about where
things are coming from (except if they come from the server[1])
but have to deal with whether the paths should be URL-encoded or
not depending to who it is talking to.

    Kamil> fancyrename() gets string from os.dirname() and mixes
    Kamil> it with unicode without without converting it with
    Kamil> right encoding first and _remote_path() crams unicode
    Kamil> into 7-bit string without regad that unicode string
    Kamil> might contain national characters.

ftp._remote_path() was doing it on purpose to raise UnicodeError
to comply with the test suite. It's deleted now.

Kamil> Maybe proper way to detect encoding in fancy rename

fancy_rename() should not try to guess the encoding, and it
doesn't need to. John pointed out an even simpler fix for it and
I have since verified that he is correct. By avoiding the use of
unicode prefix, the full test suite pass.

That means that all uses of fancy_rename doesn't require the
safe_unicode() call I added.

Which in turn means, there is no decoding to do there.

<snip/>

Kamil> I wonder how the core of bzr deals with this.

    >> By using unicode internally and decoding the paths
    >> respecting the file system encoding when needed (outside
    >> the scope or both your changes and mine).

    Kamil> So paths in fancyrename() after reading them should be
    Kamil> passed through same functions that rest of bzr uses
    Kamil> for decoding the paths with respect of the file system
    Kamil> encoding.

No, fancy_rename() should be called with decoded paths :)

    >> But merge proposals are easier to deal with even at that
    >> point as they present a diff of your changes without the
    >> need to drill down into each revision or grab a local copy
    >> of your branch. They also allows us to discuss there
    >> instead of here :)

    Kamil> Ok. So I'm gonna make a merge proposal. I was under
    Kamil> wrong impression that this is something reserved for
    Kamil> mature code.

Merge proposals are the way we use to help the code mature :)
Once everybody agree, after discussion, that the code is indeed
mature.

[1] And I failed to mention it in my fix, but we were already
considering that paths received from the ftp server were utf8
encoded. This wasn't fully tested because bzr requires only
7-bits paths these days.

>>>>> "Kamil" == Kamil Szot <kamil.szot@gmail.com> writes:

<snip/>

>> I realized the approach is different, that's why I asked
    >> you to test it :)

Kamil> I'm gonna definitely do that next week.

Good.

<snip/>

So using utf8 should work fine isn't it ?

Kamil> Of course I might be wrong in my impression. I also
    Kamil> might encounter FTP software that did not obey RFC as
    Kamil> it should or was just too lenient in what it accepted.

Again utf8 is a safe bet here.

>> Many file systems will refuse to create files with arbitrary 8-bits characters.

Kamil> I imagine this might be the case when system file
    Kamil> system uses multi-byte encoding for file names like
    Kamil> utf-8.

Kamil> For one I can verify that ext4 is not such file system.

>> Your changes only impact the client side, the problem is on the server side.

Kamil> Perhaps but it was the client that threw exceptions at
    Kamil> me, and it was the client that I had under my control.

Sure, I just wanted to explain that the problem was server side,
the exceptions was a consequence of that.

>> If the clients uses iso-8859-1, mac-roman or utf8 and the server uses iso-8859-2, then using your changes can break but using utf8 should work.
    >> Now, it depends on whether the server is handling utf8.

The proposed changes doesn't implement all of RFC2640, it just
follows the recommendations here and should be compatible and
allow round-tripping paths in most configurations.

A far more complicated approach.

>> Now, it depends on whether the server is handling utf8.
    >> If it doesn't, then if (but only if) the server and all
    >> the clients use the same fs encoding, your changes will
    >> work.

Kamil> As I mentioned file systems I encounter also tend to
    Kamil> be encoding agnostic.

On linux yes. On other unixes it could be. On at least OSX I know
for sure it's not agnostic and I suspect it's also true on
Windows.

Kamil> Can you give an example of such file system that is
    Kamil> widely used?

hfs+ on OSX.

>> The ftp transport receives paths that have been processed
    >> by higher layers, it should not worry about what the file
    >> system encoding is, the higher layers did.

ftp._remote_path() was doing it on purpose to raise UnicodeError
to comply with the test suite.  It's deleted now.

Kamil> Maybe proper way to detect encoding in fancy rename

That means that all uses of fancy_rename doesn't require the
safe_unicode() call I added.

Which in turn means, there is no decoding to do there.

<snip/>

Kamil> I wonder how the core of bzr deals with this.

>> By using unicode internally and decoding the paths
    >> respecting the file system encoding when needed (outside
    >> the scope or both your changes and mine).

No, fancy_rename() should be called with decoded paths :)

>> But merge proposals are easier to deal with even at that
    >> point as they present a diff of your changes without the
    >> need to drill down into each revision or grab a local copy
    >> of your branch.  They also allows us to discuss there
    >> instead of here :)

Kamil> Ok. So I'm gonna make a merge proposal. I was under
    Kamil> wrong impression that this is something reserved for
    Kamil> mature code.

Merge proposals are the way we use to help the code mature :)
Once everybody agree, after discussion, that the code is indeed
mature.