locking fails with non-ascii characters in host/username/something (russian Vista)

Bug #256550 reported by molchuvka
This bug affects 1 person
Affects: Bazaar
Status: Fix Released
Importance: High
Assigned to: Mark Hammond
Milestone: 1.7

Bug Description

C:\p\relex>bzr branch lp:relex
You have not informed bzr of your launchpad login. If you are attempting a
write operation and it fails, run "bzr launchpad-login YOUR_ID" and try again.
bzr: ERROR: Target directory "relex" already exists.

C:\p\relex>bzr branch lp:relex
You have not informed bzr of your launchpad login. If you are attempting a
write operation and it fails, run "bzr launchpad-login YOUR_ID" and try again.
bzr: ERROR: exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

Traceback (most recent call last):
  File "bzrlib\commands.pyc", line 857, in run_bzr_catch_errors
  File "bzrlib\commands.pyc", line 797, in run_bzr
  File "bzrlib\commands.pyc", line 499, in run_argv_aliases
  File "bzrlib\builtins.pyc", line 852, in run
  File "bzrlib\bzrdir.pyc", line 1055, in sprout
  File "bzrlib\bzrdir.pyc", line 1649, in initialize_on_transport
  File "bzrlib\lockable_files.pyc", line 254, in lock_write
  File "bzrlib\lockdir.pyc", line 564, in lock_write
  File "bzrlib\lockdir.pyc", line 488, in wait_lock
  File "bzrlib\lockdir.pyc", line 454, in attempt_lock
  File "bzrlib\lockdir.pyc", line 220, in _attempt_lock
  File "bzrlib\lockdir.pyc", line 278, in _create_pending_dir
  File "bzrlib\lockdir.pyc", line 436, in _prepare_info
  File "bzrlib\rio.pyc", line 115, in __init__
  File "bzrlib\rio.pyc", line 122, in add
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

bzr 1.6b3 on python 2.5.2 (win32)
arguments: ['C:\\Program Files\\Bazaar\\bzr.EXE', 'branch', 'lp:relex']
encoding: 'cp1251', fsenc: 'mbcs', lang: None
plugins:
  launchpad C:\Program Files\Bazaar\plugins\launchpad [unknown]
*** Bazaar has encountered an internal error.
    Please report a bug at https://bugs.launchpad.net/bzr/+filebug
    including this traceback, and a description of what you
    were doing when the error occurred.

John A Meinel (jameinel) wrote :

Can you give us the output of 'bzr whoami' ?

Thanks for the traceback, it is pretty clear *where* it is failing, I just need to figure out which element is including non-ascii characters. We certainly should support it.

I wonder if we are getting something as an 8-bit non-ascii string, when we thought it was unicode.

Also, what is the hostname of the machine, and your username?

John A Meinel (jameinel) wrote :

The code in question is doing this:
try:
    user = config.user_email()
except errors.NoEmailInUsername:
    user = config.username()
s = Stanza(hostname=socket.gethostname(),
           pid=str(os.getpid()),
           start_time=str(int(time.time())),
           nonce=self.nonce,
           user=user,
           )
return s.to_string()

My guess is that "socket.gethostname()" is returning non-ascii characters, but I have not confirmed it.

You seem to be running the standalone install; if you had python I would ask you to run:

python -c "import socket; print '%r' % socket.gethostname()"

Changed in bzr:
importance: Undecided → High
status: New → Triaged
John A Meinel (jameinel) wrote :

It would also be good to know what the Russian hostname is for your machine, to try to figure out what encoding the 'gethostname()' result is in. (Traditionally, I believe hostnames were supposed to be ASCII only.)

On Vista here, locale.getpreferredencoding() returns CP1252, which is almost identical to iso-8859-1. The exceptions are things like 0x80, which is € in cp1252 but a control code in iso-8859-1.

When I tried to set my hostname to non-ascii, it warned me that other machines would not be able to find it (presumably because the DNS spec doesn't directly allow for it). It let me use µ and å, but if I tried to do € or any Arabic letters, it simply changed it to "_".

It did let me do "ƒ" which seems to be CP1252 only (it isn't in iso-8859-1). So I would *guess* that the result of socket.gethostname() is in the same encoding as locale.getpreferredencoding().

I thought Alexander Belchenko (bialix) had a patch to determine OEMEncoding, which may be different yet again.

John A Meinel (jameinel) wrote :

Just to confirm, with non-ascii characters in my host name, it does fail in the same location, and the string appears to be in cp1252 encoding.

I'm sure that isn't your encoding on Russian Vista, though, so we would need to figure out what platform function we need to call to find the appropriate encoding of the socket.gethostname() function.

In the worst case, we could always fall back to something that guarantees a mapping for every character (such as iso-8859-1).
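The fallback idea above could be sketched roughly as follows. This is a Python 3 illustration, not the code that went into bzr; the helper name decode_hostname and its parameters are hypothetical. It tries the caller's preferred encoding first and falls back to iso-8859-1, which maps every byte value to some character and so cannot raise.

```python
def decode_hostname(raw, preferred):
    """Decode a byte-string host name without ever raising.

    `raw` is the undecoded host name and `preferred` is the encoding we
    believe it uses (for example the value of
    locale.getpreferredencoding()). Both names are hypothetical.
    """
    try:
        return raw.decode(preferred)
    except (UnicodeDecodeError, LookupError):
        # iso-8859-1 assigns a character to each of the 256 byte values,
        # so this cannot fail, although the text may be wrong.
        return raw.decode("iso-8859-1")


# Correct guess: the text round-trips properly.
print(decode_hostname(b"samus\xb5\xe5\x83", "cp1252"))
# Wrong guess: still no exception, only mangled text.
print(repr(decode_hostname(b"\xe4\xed\xf1-\xef\xea", "ascii")))
```

The result of the fallback branch is only good enough for an informational field such as the lock/info file; it should not be relied on where the exact name matters.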

For this lock/info file, it isn't 100% critical to be correct, it is just a nicety.

Robert Collins (lifeless) wrote : Re: [Bug 256550] Re: locking fails with non-ascii characters in host/username/something (russian Vista)

On Mon, 2008-08-11 at 01:49 +0000, John A Meinel wrote:
>
> I'm sure that isn't your encoding on Russian Vista, though, so we
> would
> need to figure out what platform function we need to call to find the
> appropriate encoding of the socket.gethostname() function.

I wonder if we're getting an IDN string back or something similar?

-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.

John A Meinel (jameinel) wrote :

Provided Robert is talking about "punycode": https://designarchitecture.com:19638/docs/en_US/user/oh_user_overview_of_internationalized_domain_names_idn.htm#Translation_of_IDNs

Then no, we are not. I set my host name to:
samusµåƒ

And python socket.gethostname() returned:
'samus\xb5\xe5\x83'

Which can be decoded with ".decode('cp1252')" to give the right string. If I try .decode('iso-8859-1') it doesn't convert the ƒ character correctly (it exists in the cp1252 codepage, but *not* in iso-8859-1).
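For reference, the difference between the two decodings can be reproduced with the bytes quoted above. This is a Python 3 sketch (the session in this comment was Python 2, where the host name is a plain byte string); the variable names are only illustrative.

```python
# The bytes socket.gethostname() returned in the comment above.
raw = b"samus\xb5\xe5\x83"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")

# cp1252 maps 0x83 to U+0192 (the "ƒ" character).
print(repr(as_cp1252))
# iso-8859-1 maps 0x83 to the C1 control code U+0083 instead.
print(repr(as_latin1))
```

Both decodings agree on 0xb5 and 0xe5; only the last byte differs.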

Now, Alexander has actually done the work in win32utils.py to write the function "get_host_name()", which is able to use a Unicode-aware API on Windows (not win98), GetComputerNameW. It returns a proper unicode string:
>>> bzrlib.win32utils.get_host_name()
u'SAMUS\xb5\xc5\u0192'

Note, however, that 'socket.gethostname()' returns the lowercase form and GetComputerNameW() returns the upper-case form. I'm guessing GetComputerNameW is returning the NETBIOS name, which is always in upper case.

Now, according to MSDN, there is another function:
http://msdn.microsoft.com/en-us/library/ms724301(VS.85).aspx

GetComputerNameEx (I assume this will become GetComputerNameExW when appropriate).

If I manually hack together something with ctypes, and I do:

>>> ctypes.windll.kernel32.GetComputerNameExW(3, buf, n)
>>> buf.value
u'samus\xb5\xe5\u0192'

Now, I'm guessing on the COMPUTER_NAME_FORMAT parameter, because it is an enum, and they don't actually mention the mapping. But I'm thinking we might actually want '3' = ComputerNameDnsFullyQualified
http://msdn.microsoft.com/en-us/library/ms724224(VS.85).aspx

So it might just make sense to write an 'osutils.gethostname()' which returns a unicode string, and calls over into GetComputerNameExW on windows when available.
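A rough sketch of what such a helper might look like, written for Python 3 and not taken from the actual bzr patch. The function name get_host_name and the constant name are assumptions for illustration; the value 3 is the one guessed at above. On non-Windows platforms it simply falls back to socket.gethostname(), which already returns text on Python 3.

```python
import socket
import sys

# Assumed to be ComputerNameDnsFullyQualified, as guessed in the comment above.
COMPUTER_NAME_DNS_FULLY_QUALIFIED = 3


def get_host_name():
    """Return the host name as a text string (hypothetical helper)."""
    if sys.platform == "win32":
        import ctypes
        from ctypes import wintypes

        kernel32 = ctypes.windll.kernel32
        size = wintypes.DWORD(0)
        # First call with no buffer: it fails but stores the required
        # size (including the terminating NUL) in `size`.
        kernel32.GetComputerNameExW(
            COMPUTER_NAME_DNS_FULLY_QUALIFIED, None, ctypes.byref(size))
        buf = ctypes.create_unicode_buffer(size.value)
        if kernel32.GetComputerNameExW(
                COMPUTER_NAME_DNS_FULLY_QUALIFIED, buf, ctypes.byref(size)):
            return buf.value
    # Other platforms, or the Windows call failed.
    return socket.gethostname()


print(repr(get_host_name()))
```

The two-step call avoids guessing a buffer length; a fixed-size buffer would also work for typical names.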

John A Meinel (jameinel) wrote :

And I should mention that on other platforms it could use punycode. Simply doing:

hostname = socket.gethostname()
if hostname.startswith('xn--'):
  return hostname[4:].decode('punycode')

Just for reference, u'samus\xb5\xe5\u0192'.encode('punycode') == 'samus-ija05a89b'

I don't know if there is a "standard" way on other platforms to handle non-ascii domain names, *other* than punycode.
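A slightly fuller version of the punycode idea, as a Python 3 sketch. The function name decode_idn_label is hypothetical, and it only handles a single label (no dots), like the snippet above. Unlike that snippet, it returns the name unchanged when there is no 'xn--' prefix.

```python
def decode_idn_label(hostname):
    """Decode a single punycode label; pass other names through unchanged."""
    if hostname.startswith("xn--"):
        # Python 3's punycode codec decodes bytes, so convert the ASCII
        # text to bytes first.
        return hostname[4:].encode("ascii").decode("punycode")
    return hostname


# Round trip using the host name from the earlier comment.
original = "samus\xb5\xe5\u0192"
encoded = "xn--" + original.encode("punycode").decode("ascii")
print(encoded)
print(decode_idn_label(encoded) == original)
```

For full dotted domain names, Python's 'idna' codec applies the same transformation label by label.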

molchuvka (efilippov) wrote :

C:\Users\molchuvka\.VirtualBox>bzr whoami
molchuvka <molchuvka@ДНС-ПК>

computer name: ДНС-ПК
(cyrillic, cp1251)
user name: molchuvka
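The reported computer name is consistent with the traceback. As a Python 3 sketch, and assuming socket.gethostname() returned the lower-case form of the name as cp1251 bytes (the lower-casing was observed earlier in this thread; the encoding is the one reported here), the first byte is exactly the 0xe4 from the error message:

```python
# Lower-case form of the reported computer name (assumed; see above).
hostname = "днс-пк"
raw = hostname.encode("cp1251")
first_byte = raw[0]

# Decoding those bytes as ASCII fails the same way the lock code did.
try:
    raw.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(hex(first_byte))
print(error_message)
```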

molchuvka (efilippov) wrote :

I changed the machine name to ascii-only, the traceback disappeared.

Lukáš Lalinský (luks) wrote :

Related bug (if not a duplicate): https://bugs.launchpad.net/bzr/+bug/193089

Mark Hammond (mhammond) wrote :

There are a couple of paths here that can cause errors (I renamed a local VM to have an extended character):

win32utils.get_user_name_unicode should probably decode('mbcs') the result of os.environ["COMPUTERNAME"], as ctypes is unlikely to exist for python 2.4 builds. win32api.GetComputerNameEx(0) also will return unicode (raising NotImplementedError on win9x). BUT - that isn't our problem here (2.5 build with ctypes)

It turns out 2 other socket.gethostname() calls are the problem - I've come up with a patch - see http://bundlebuggy.aaronbentley.com/project/bzr/request/%3C026001c8ff5b%24b5c6ca70%2421545f50%24%40com.au%3E

John A Meinel (jameinel) wrote :

Just to mention, I don't think os.environ['COMPUTERNAME'] is going to be in 'mbcs' encoding. MBCS is effectively UTF-16, which is the filesystem encoding. We've found that environment variables are generally in OEM encoding, which for US Vista is CP1252.

Anyway, we require ctypes or pywin32 on win32. So if users don't have it (python24), they have to install it anyway. (We require it for access to the LockFileEx functionality to handle shared-locks.) So we know users are going to have one of those. (Current preference is for ctypes simply because it is bundled with 2.5.)

Did you mean "it turns out 2 socket.ge..." rather than "2 other"? I only see 2 places that you have changed.

Mark Hammond (mhammond) wrote :

FYI, the "mbcs" encoding simply uses WideCharToMultiByte/MultiByteToWideChar. IIUC, pretty much anything in windows which is natively implemented in Unicode will have used this function to implement the "ansi" version of the function (consoles being a confusing exception). I'm far from an expert in this, but still IIUC, cp1252 may not be able to represent all characters that the "mbcs" encoding can in some locales - indeed, the characters that can be represented by "mbcs" depend on the locale.
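The locale dependence can be illustrated without the Windows-only "mbcs" codec by using the explicit code pages as stand-ins for what the ANSI code page would be on a Western and a Russian system. This is a Python 3 sketch under that assumption; the helper name representable is hypothetical.

```python
def representable(text, codepage):
    """Return True if every character of `text` exists in `codepage`."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


for name in ("ДНС-ПК", "samusµåƒ"):
    for codepage in ("cp1251", "cp1252"):
        print(name, codepage, representable(name, codepage))
```

Each name fits only the code page of the locale it came from, which is why a single hard-coded code page cannot stand in for "mbcs".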

John A Meinel (jameinel) wrote :

This will be in bzr 1.7 (bzr.dev 3686)

Changed in bzr:
assignee: nobody → mhammond
milestone: none → 1.7
status: Triaged → Fix Released