UnicodeDecodeError when strerror is not ascii

Bug #273978 reported by mijutu
This bug affects 8 people
Affects       Status       Importance  Assigned to     Milestone
Bazaar        In Progress  Medium      Martin Packman
bzr (Ubuntu)  Confirmed    Undecided   Unassigned

Bug Description

bzr: ERROR: Generic bzr smart protocol error: Unprintable exception PermissionDenied: dict={'path':
               u'/var/bzr/projects/XXX', '_preformatted_string': None, 'extra': ": [Errno 13] Lupa ev\xc3\xa4tty:
               '/var/bzr/projects/XXX'"}, fmt='Permission denied: "%(path)s"%(extra)s', error=UnicodeDecodeError('ascii', ":
               [Errno 13] Lupa ev\xc3\xa4tty: '/var/bzr/projects/XXX'", 20, 21, 'ordinal not in range(128)')

"Lupa evätty" means "Permission denied"

Tags: easy unicode
Revision history for this message
John A Meinel (jameinel) wrote :

I'm pretty sure we need more context to be able to determine what is going on here.

This is either fixed, or it is likely to be a configuration issue, where the environment is telling us that your system is *in* ascii encoding, and thus we are unable to decode the string we were given.

Changed in bzr:
importance: Undecided → Medium
status: New → Incomplete
Revision history for this message
mijutu (mijutu) wrote : Re: [Bug 273978] Re: bzr wrongly assumes error messages written in utf8 to be ascii

On Thu, 2009-11-12 at 12:59 +0000, John A Meinel wrote:
> I'm pretty sure we need more context to be able to determine what is
> going on here.

mijutu@crc11-ett:~$ cd /tmp/
mijutu@crc11-ett:/tmp$ rm -rf test
mijutu@crc11-ett:/tmp$ mkdir test
mijutu@crc11-ett:/tmp$ chmod a-rx test
mijutu@crc11-ett:/tmp$ echo $LANG
fi_FI.UTF-8
mijutu@crc11-ett:/tmp$ bzr branch test/someproject newbranch
bzr: ERROR: Unprintable exception PermissionDenied: dict={'path':
u'/tmp/test/someproject/.bzr/branch-format', '_preformatted_string':
None, 'extra': ": [Errno 13] Lupa ev\xc3\xa4tty:
u'/tmp/test/someproject/.bzr/branch-format'"}, fmt='Permission denied:
"%(path)s"%(extra)s', error=UnicodeDecodeError('ascii', ": [Errno 13]
Lupa ev\xc3\xa4tty: u'/tmp/test/someproject/.bzr/branch-format'", 20,
21, 'ordinal not in range(128)')
mijutu@crc11-ett:/tmp$ LANG=C bzr branch test/someproject newbranch
bzr: ERROR: Permission denied:
"/tmp/test/someproject/.bzr/branch-format": [Errno 13] Permission
denied: u'/tmp/test/someproject/.bzr/branch-format'
mijutu@crc11-ett:/tmp$ python --version
Python 2.5.4
mijutu@crc11-ett:/tmp$ bzr --version
Bazaar (bzr) 1.16.1

Whoops, old bzr. Let's try again.

mijutu@crc11-ett:/tmp$ bzr branch test/someproject newbranch
bzr: ERROR: Unprintable exception PermissionDenied: dict={'path':
u'/tmp/test/someproject/.bzr/branch-format', '_preformatted_string':
None, 'extra': ": [Errno 13] Lupa ev\xc3\xa4tty:
u'/tmp/test/someproject/.bzr/branch-format'"}, fmt='Permission denied:
"%(path)s"%(extra)s', error=UnicodeDecodeError('ascii', ": [Errno 13]
Lupa ev\xc3\xa4tty: u'/tmp/test/someproject/.bzr/branch-format'", 20,
21, 'ordinal not in range(128)')
mijutu@crc11-ett:/tmp$ bzr --version
Bazaar (bzr) 2.0.2

http://wiki.python.org/moin/UnicodeDecodeError

Python seems to default to latin1 somewhere:

$ python
Python 2.5.4 (r254:67916, Sep 26 2009, 10:32:22)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> "aäa".decode("utf-8")
u'a\xe4a'

I have not mentioned latin1 anywhere (locale is utf-8), but still I see
"a\xe4a".
0xE4 is ä in latin1. Where does that come from?

"Lupa evätty" is latin1-compatible, so I guess this latin1 assumption is
not responsible for the error.

>>> "ä".decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)

Even though the error messages are not exactly the same, it seems that
somewhere in bzr the permission denied error message from libc is
converted to a Python unicode string on the assumption that it is ascii.
Instead of assuming ascii, bzr should check LC_MESSAGES and choose
encoding accordingly.

Revision history for this message
Martin Pool (mbp) wrote : Re: bzr wrongly assumes error messages written in utf8 to be ascii

> 0xE4 is ä in latin1. Where does that come from?

0xe4 is also ä in Unicode. Python is not defaulting to latin-1, it's printing a unicode string here.

I think what's happening here is that strerror is returning a utf-8 string, and that's put into the error's attribute apparently as a byte string, not converted to unicode. If it's actually still a byte string we'd have to interpret it in the error, which would be gross. If it's Unicode it may be Toshio's (?) point that error message templates should be unicode.
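
To illustrate the distinction drawn above, here is a minimal sketch. It assumes Python 3 syntax, whereas the thread uses Python 2, where the ASCII decode happens implicitly when a byte string is mixed with a unicode one. The variable names are illustrative and do not come from bzr.

```python
# "ä" is code point U+00E4 in Unicode; Latin-1 happens to use the same number,
# which is why the repr shows \xe4 without any Latin-1 default being involved.
text = u"Lupa ev\u00e4tty"
assert ord(text[7]) == 0xE4

# Encoded as UTF-8, that single character becomes the two bytes 0xC3 0xA4.
raw = text.encode("utf-8")
assert raw == b"Lupa ev\xc3\xa4tty"

# The start of the 'extra' value from the report, as a byte string.
extra = b": [Errno 13] " + raw

# Decoding as ASCII fails at the first non-ASCII byte.
failure = None
try:
    extra.decode("ascii")
except UnicodeDecodeError as exc:
    failure = (exc.encoding, exc.start, exc.end)

# Decoding with the encoding the bytes were written in succeeds.
decoded = extra.decode("utf-8")
```

The failing offset is byte 20, which matches the "20, 21" in the UnicodeDecodeError quoted in the bug description.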

Changed in bzr:
status: Incomplete → Confirmed
summary: - bzr wrongly assumes error messages written in utf8 to be ascii
+ UnicodeDecodeError when strerror is not ascii
Changed in bzr:
importance: Medium → Low
Revision history for this message
Daniel Clemente (n142857) wrote :

A simple way to reproduce this:
1. Use a particular locale, like Catalan: export LANG=ca_ES.UTF-8;
2. Run: mkdir cinc; cd cinc; bzr init .; chmod 000 .bzr; bzr log;

You get:
bzr: ERROR: Unprintable exception PermissionDenied: dict={'path': u'/n/cinc/.bzr/branch-format', '_preformatted_string': None, 'extra': ": [Errno 13] S\xe2\x80\x99ha denegat el perm\xc3\xads: u'/n/cinc/.bzr/branch-format'"}, fmt='Permission denied: "%(path)s"%(extra)s', error=UnicodeDecodeError('ascii', ": [Errno 13] S\xe2\x80\x99ha denegat el perm\xc3\xads: u'/n/cinc/.bzr/branch-format'", 14, 15, 'ordinal not in range(128)')

The error message was: S’ha denegat el permís

Martin Pool (mbp)
tags: added: easy unicode
Revision history for this message
Martin Pool (mbp) wrote :

I had a look at this with bialix and gz. On both linux and Windows, os.strerror (which gets put into the OSError etc) is a byte string in the current encoding.

We currently call locale.setlocale(locale.LC_ALL, '') which causes it to be set by the environment.

We would have the option to do setlocale(LC_MESSAGES, 'C') which would always give English OS error messages, which would avoid encoding bugs and also avoid variation in tests when running in non-English locales. The only question there is whether users would generally prefer to get English or i18n error messages. Possibly we could recommend that users manually set LC_MESSAGES=C if they prefer this, but this won't work on Windows on python2.5 and later.
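
A rough sketch of that option, assuming Python 3 and a platform where `locale.LC_MESSAGES` is defined (it is not on Windows, hence the guard). This is not bzr's actual startup code, only an illustration of pinning the message category after following the environment for everything else.

```python
import errno
import locale
import os

# Follow the environment for all categories, as described above.
try:
    locale.setlocale(locale.LC_ALL, "")
except locale.Error:
    pass  # the environment names a locale that is not installed

# Then pin only the message category to the untranslated "C" locale.
if hasattr(locale, "LC_MESSAGES"):
    locale.setlocale(locale.LC_MESSAGES, "C")

# OS error messages should now be plain English text.
message = os.strerror(errno.EACCES)
```

The trade-off is the one raised in the comment: users in non-English locales lose translated OS messages in exchange for predictable ASCII output.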

On Unix I think what we want to do is:

import locale, os
# Decode strerror's byte string using the message locale's encoding (Python 2).
enc = locale.getlocale(locale.LC_MESSAGES)[1]
print os.strerror(4).decode(enc, 'replace')

and that should give us a safe unicode version of the message. It will vary across OSs and may vary across python versions.

On Windows we can go through ctypes.windll.kernel32.GetACP() to tell us the codepage, or the second return value from locale.getdefaultlocale() should tell us the right encoding to use for error message byte strings.
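
The decoding step could be wrapped up roughly as follows. This is a hedged sketch, not the fix that landed in bzr: `decode_os_message` is a hypothetical helper name, and it substitutes `locale.getpreferredencoding(False)` for the `getlocale(LC_MESSAGES)` lookup suggested above so that the same code also runs where `LC_MESSAGES` is unavailable.

```python
import locale


def decode_os_message(raw, encoding=None):
    """Return a text version of an OS error message (hypothetical helper)."""
    if not isinstance(raw, bytes):
        return raw  # already text, as os.strerror returns on Python 3
    if encoding is None:
        # Assumption: the preferred encoding matches the one used for messages.
        encoding = locale.getpreferredencoding(False) or "ascii"
    try:
        # 'replace' guarantees a result even if the guess is wrong.
        return raw.decode(encoding, "replace")
    except LookupError:
        # Python does not know the codec name the platform reported.
        return raw.decode("ascii", "replace")


# Usage with the bytes from the report and an explicit encoding.
message = decode_os_message(b"Lupa ev\xc3\xa4tty", "utf-8")
```

Using the 'replace' error handler means a wrong encoding guess degrades to replacement characters rather than raising a second exception while reporting the first.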

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in bzr (Ubuntu):
status: New → Confirmed
Revision history for this message
Martin Packman (gz) wrote :

Same bug has low, medium, and high priority. I think low was probably most honest, but let's take the median.

Changed in bzr:
assignee: nobody → Martin Packman (gz)
importance: Low → Medium
status: Confirmed → In Progress
Revision history for this message
scrasnups (fchastanet) wrote :

Hello,

I was receiving this error message with TortoiseBzr, bzr 2.5.1 under Windows 7.
Restarting Pageant seems to solve the problem.

Hope it helps
