environment variables are not decoded properly

Bug #832028 reported by Vincent Ladeuil on 2011-08-23
Affects: Bazaar
Importance: Medium
Assigned to: Martin Packman

Bug Description

As mentioned in https://code.launchpad.net/~vila/bzr/822571-bzr-home-unicode/+merge/70870, there is some vagueness in the way we interpret environment variables.

Addressing this may require explicitly declaring which environment variables can be paths, so that the file system encoding is respected for them and the user encoding is tried for the others.

Windows probably needs a different policy of always using mbcs, though (comments welcome; this is roughly what I remember from old discussions).
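
A minimal sketch of the policy described above, for illustration only. The set of path variables and the helper names `pick_encoding` and `decode_env_value` are hypothetical and are not taken from bzr; the sketch assumes the raw value is available as bytes.

```python
import locale
import sys

# Hypothetical declaration of which variables hold paths; illustrative only.
PATH_VARIABLES = frozenset(["BZR_HOME", "HOME"])


def pick_encoding(name, platform=sys.platform, fs_encoding=None,
                  user_encoding=None):
    """Choose the encoding used to decode the variable called `name`."""
    if platform == "win32":
        # Proposed Windows policy from the description: always use mbcs.
        return "mbcs"
    if name in PATH_VARIABLES:
        # Paths follow the file system encoding.
        return fs_encoding or sys.getfilesystemencoding()
    # Everything else follows the user (locale) encoding.
    return user_encoding or locale.getpreferredencoding(False)


def decode_env_value(name, raw, **policy):
    """Decode the raw bytes of one variable according to the policy."""
    return raw.decode(pick_encoding(name, **policy))
```

The `platform`, `fs_encoding` and `user_encoding` arguments exist only so the policy can be exercised without depending on the machine it runs on.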


Martin Packman (gz) wrote :

For common cases like getting an integer or other simple values out of the environment, it should be possible to always do the right thing. Path handling in bzr is generally confused by corner cases; here the main issues are:
* Python 2 doesn't give access to the unicode environment block on windows, only the 'ANSI' compatibility apis.
* Both the environment and paths can be arbitrary bytes on nix, so decoding to unicode is never fully correct.

Vincent Ladeuil (vila) wrote :

> * Python 2 doesn't give access to the unicode environment block on windows only the 'ANSI' compatibility apis.

What does that mean in practice? Is there at least a way for the user to specify unicode paths in *some* encoding (mbcs?)

> * Both the environment and paths can be arbitrary bytes on nix so decoding to unicode is never fully correct.

As long as we can trap invalid values, we can at least define which subset we can support (and report proper errors for the rest), no?
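
One way to picture "trap invalid values and report proper errors" is the sketch below. It is not bzr's implementation: the exception class `UndecodableEnvironment` and the helper `get_env_unicode` are hypothetical names, and the environment is assumed to be a mapping from names to raw bytes.

```python
class UndecodableEnvironment(ValueError):
    """Raised when a variable's bytes cannot be decoded (hypothetical)."""

    def __init__(self, name, raw, encoding):
        self.name = name
        self.raw = raw
        self.encoding = encoding
        super().__init__(
            "environment variable %s=%r cannot be decoded as %s"
            % (name, raw, encoding))


def get_env_unicode(raw_environ, name, encoding="utf-8"):
    """Return the decoded value, None if unset, or raise a clear error."""
    raw = raw_environ.get(name)
    if raw is None:
        return None
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # Name the offending variable instead of leaking a bare decode error.
        raise UndecodableEnvironment(name, raw, encoding) from exc
```

Values inside the supported subset decode normally; anything else fails at the point where the environment is read, with the variable named in the message.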

On 8/24/2011 9:04 AM, Vincent Ladeuil wrote:
>> * Python 2 doesn't give access to the unicode environment block on
>> windows only the 'ANSI' compatibility apis.
>
> What does that mean in practice ? Is there at least a way for the
> user to specify unicode paths in *some* encoding (mbcs ?)

There are GetEnvironmentVariableW sort of apis, where you set and
retrieve env variables in UCS-2/UTF-16. There are similar APIs for
CreateProcessW, etc.

However, *many* Windows programs aren't wide-char safe, so there are
something like 3-5 different encodings things can use. (OEM, ANSI,
MBCS, ...)

>> * Both the environment and paths can be arbitrary bytes on nix so
>> decoding to unicode is never fully correct.
>
> As long as we can trap invalid values, we can at least define
> which subset we can support (and report proper errors for the rest)
> no ?

The issue (at least partly) is that if internally we say "X is a
Unicode String", then we have trouble when on Nix it is an
8-bit-in-some-arbitrary-encoding-that-is-often-utf-8. We can't decode
it into a Unicode string, and it isn't safe to leave it as "str"
because when we do "\xb5" + u"Unicode" it blows up.

So yes, we could trap things we can't decode. I think we just need a
better story than we currently have around that.

John
=:->
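
The `"\xb5" + u"Unicode"` failure mentioned above comes from Python 2 implicitly decoding the byte string as ASCII before concatenating. The sketch below reproduces that step explicitly so it also runs on Python 3 (where mixing bytes and text raises `TypeError` instead); `python2_style_concat` is a hypothetical helper written only for this illustration.

```python
raw = b"\xb5"


def python2_style_concat(raw_bytes, text):
    """Mimic Python 2's implicit ASCII decode when mixing str and unicode."""
    return raw_bytes.decode("ascii") + text


try:
    python2_style_concat(raw, u"Unicode")
    failure = None
except UnicodeDecodeError as exc:
    # 0xb5 is outside the ASCII range, so the implicit decode fails.
    failure = exc

# Decoding explicitly at the boundary, with a known encoding, avoids the
# problem; latin-1 is only an example choice here.
joined = raw.decode("latin-1") + u"Unicode"
```

The explicit decode only helps when the encoding is actually known, which is the open question in this report.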

Martin Packman (gz) on 2011-12-12
Changed in bzr:
assignee: nobody → Martin Packman (gz)
status: Confirmed → In Progress
Martin Packman (gz) on 2011-12-15
Changed in bzr:
milestone: none → 2.5b5
status: In Progress → Fix Released