IPython - Enhanced Interactive Python

unicode bug - encoding input

Reported by Murkt on 2009-03-08
This bug affects 13 people

Affects           Status        Importance  Assigned to     Milestone
IPython           Invalid       High        Fernando Perez  0.11
ipython (Ubuntu)  Fix Released  Undecided   Unassigned

Nominated for 0.10 by mrk

Bug Description

Default Python shell:

>>> u'абвгд'
u'\u0430\u0431\u0432\u0433\u0434'

IPython 0.9.1:

>>> u'абвгд'
u'\xd0\xb0\xd0\xb1\xd0\xb2\xd0\xb3\xd0\xb4'
>>> 'абвгд'.decode('utf8')
u'\u0430\u0431\u0432\u0433\u0434'

sys.stdin.encoding is 'UTF-8'.
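
The garbled result above looks like the UTF-8 bytes of the input being treated as one character each. A minimal Python 3 sketch of that effect follows. It is an illustration only, not IPython's code, and it assumes Latin-1 as a stand-in for how Python 2 maps raw bytes inside a u'' literal when the source carries no coding declaration:

```python
# Illustration only: reproduce the garbled value from the report in Python 3.
text = "абвгд"

# What the terminal sends for this text when sys.stdin.encoding is UTF-8.
utf8_bytes = text.encode("utf-8")

# Assumption: each byte is mapped to the code point with the same value,
# which is what decoding as Latin-1 does.
garbled = utf8_bytes.decode("latin-1")

print(ascii(text))     # '\u0430\u0431\u0432\u0433\u0434'
print(ascii(garbled))  # '\xd0\xb0\xd0\xb1\xd0\xb2\xd0\xb3\xd0\xb4'

# Reversing the two steps recovers the intended text.
repaired = garbled.encode("latin-1").decode("utf-8")
```

The last line mirrors the 'абвгд'.decode('utf8') workaround shown in the report.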

How to fix: remove the line №2022 from IPython/iplib.py (for 0.9.1 release). Here it is:

--- a/iplib.py
+++ b/iplib.py
@@ -2019,7 +2019,6 @@
         # this allows execution of indented pasted code. It is tempting
         # to add '\n' at the end of source to run commands like ' a=1'
         # directly, but this fails for more complicated scenarios
-        source=source.encode(self.stdin_encoding)
         if source[:1] in [' ', '\t']:
             source = 'if 1:\n%s' % source

I didn't find any introduced bugs in a quick check.

Additionally, I checked ipython-wx and ipythonx; the latter doesn't have this bug.

description: updated
Sergey Kishchenko (voidwrk) wrote :

I confirm this bug. The attached patch fixed the issue for me.

Fernando Perez (fdo.perez) wrote :

That's indeed a bug, but the patch is removing a line that was put in there explicitly for some reason. So what I'd like to have, before committing this, is a set of tests in a file named test_unicode.py, that encapsulates all of the recent unicode work.

Unfortunately a lot of these unicode fixes have been made in a completely ad-hoc manner, as people report problems, but we don't have a centralized list of cases to check against. This may be a reasonable fix, for all I know, but I'm afraid that if we apply it we'll get back 10 old bugs again. I don't know, maybe not, but there's simply no way to be sure.

I'm one of the most ignorant of our bunch in unicode issues, blissfully living in the stupidity of the ascii world. It would be great if one of us who knows more about this stuff could at least write a set of simple unicode tests that catch many of the recently reported encoding problems. Jorgen, Ville, any chance you guys could take this up at some point? You know about it a lot more than I do...

The proposed patch does not work for me on win32 with or without pyreadline

sys.stdin.encoding == "cp1252"

Standard python:

c:\python>python
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> "åäö"
'\xe5\xe4\xf6'
>>> u"åäö"
u'\xe5\xe4\xf6'
>>>

IPython from trunk:

c:\python>ipython
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 0.9.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object'. ?object also works, ?? prints more.

In [1]: "åäö"
Out[1]: '\xe5\xe4\xf6'

In [2]: u"åäö"
Out[2]: u'\xe5\xe4\xf6'

In [3]:
Do you really want to exit ([y]/n)?

IPython with proposed change:

c:\python>ipython
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 0.9.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object'. ?object also works, ?? prints more.

In [1]: "åäö"
Out[1]: '\xc3\xa5\xc3\xa4\xc3\xb6'

In [2]: u"åäö"
Out[2]: u'\xe5\xe4\xf6'

In [3]:
Do you really want to exit ([y]/n)?
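
The two different Out[1] values above correspond to the same text encoded two ways. A short Python 3 sketch, assuming (as the transcripts suggest, not as something verified against IPython's source) that the unpatched path hands the compiler cp1252 bytes while the patched path ends up with UTF-8 bytes:

```python
# Illustration only: the same three characters in the two encodings involved.
text = "åäö"

as_cp1252 = text.encode("cp1252")  # matches the console encoding in the report
as_utf8 = text.encode("utf-8")     # what the byte string holds after the patch

print(as_cp1252)  # b'\xe5\xe4\xf6'
print(as_utf8)    # b'\xc3\xa5\xc3\xa4\xc3\xb6'
```

A console that expects cp1252 would show the six UTF-8 bytes as six characters, which would explain why the patch is reported as not working on win32.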

Rodrigo Senra (rsenra) wrote :

This bug is still alive and kicking.
The problem is in iplib.py: source=source.encode(self.stdin_encoding)

This is wrong whenever there is a unicode string in source.

A simple:

  x = u"ação"

with the offending line becomes:

 x = u'a\xc3\xa7\xc3\xa3o'

Notice that the encoding is done in place, and the u"" is kept after the encoding. This is wrong.
I have removed the line, and it is now working for me. I do not know enough of IPython internals to predict side effects. I hope this helps.
regards,
Rod Senra

INADA Naoki (songofacandy) wrote :

This is another patch that handles encoded byte string literals and unicode literals correctly.

        source=source.encode(self.stdin_encoding)
        if source[:1] in [u' ', u'\t']:
            source = u'if 1:\n%s' % source
+       source = '# coding: %s\n%s' % (self.stdin_encoding, source)
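
The idea behind this patch is the PEP 263 coding declaration: the source stays encoded, and its first line tells the compiler which encoding to use. A Python 3 sketch of the same technique, relying on compile() accepting bytes; the helper name compile_with_cookie is hypothetical and not part of IPython:

```python
def compile_with_cookie(source_text, encoding):
    """Encode source_text and prefix it with a PEP 263 coding declaration.

    Hypothetical helper for illustration; not an IPython function.
    """
    cookie = ("# coding: %s\n" % encoding).encode("ascii")
    source_bytes = cookie + source_text.encode(encoding)
    # compile() reads the declaration and decodes the remaining bytes with it.
    return compile(source_bytes, "<input>", "exec")


namespace = {}
exec(compile_with_cookie('x = "ação"', "cp1252"), namespace)
print(namespace["x"] == "ação")
```

One side effect worth checking with this approach is that the extra first line shifts the line numbers reported in tracebacks by one.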

Fernando Perez (fdo.perez) wrote :

Can anyone provide a set of *tests* that we can actually run
automatically for this? Honestly, until we have actual tests, this is
like playing whack-a-mole blind: the problems will just keep
resurfacing... What we need is a test file for unicode that can be
run reliably, by anyone, and that shows the various issues...

As I said earlier, it's quite possible that the various proposed fixes
work for *someone*, but without actual tests that we can include,
there's no way to know what they may break for someone else (as has
happened in the past).

Sorry to seem like a curmudgeon: I really appreciate people
contributing ideas and even code. But we need to fix these unicode
problems the right way, else we'll be hunting them forever.

Brian Granger (ellisonbg) wrote :

Definitely, I don't like playing whack-a-mole blind. These types of
bug fixes definitely need tests before fixes get committed.

Brian

On Tue, Apr 14, 2009 at 12:20 AM, Fernando Perez wrote:
> Can anyone provide a set of *tests* that we can actually run
> automatically for this? Honestly, until we have actual tests, this is
> like playing whack-a-mole blind: the problems will just keep
> resurfacing... What we need is a test file for unicode that can be
> run reliably, by anyone, and that shows the various issues...


Jörgen Stenarson wrote :

Fernando Perez wrote:
> Can anyone provide a set of *tests* that we can actually run
> automatically for this? Honestly, until we have actual tests, this is
> like playing whack-a-mole blind: the problems will just keep
> resurfacing... What we need is a test file for unicode that can be
> run reliably, by anyone, and that shows the various issues...
>
I agree, but part of the problem is getting the correct visual output in
the shell, and this may be difficult to check automatically. I have a
feeling that this problem is also platform dependent, making it necessary
to run the tests on several platforms as well to see that the bug has
been fixed.

/Jörgen

Fernando Perez (fdo.perez) wrote :

On Tue, Apr 14, 2009 at 11:16 AM, Jörgen Stenarson wrote:
> I agree, but part of the problem is getting the correct visual output in
> the shell, and this may be difficult to check automatically. I have a
> feeling that this problem is also platform dependent, making it necessary
> to run the tests on several platforms as well to see that the bug has
> been fixed.

Well, even if we have a special file we need to re-run by hand, that
would be better than little snippets as we have. At least the file
can be run by the test suite automatically and not crashing is a good
start. Core developers can then re-run it by hand (we can put an "if
__name__" main section at the bottom for this) to check visually.
This is basically what we are doing now with snippets all over the
mailing list, I'm just suggesting that unless all those checks are:

- collected in one file
- auto-executed

we'll never get anywhere reliable on these unicode problems. We can
then have a note to manually do

%run test_unicode

ourselves for the full visual verification.
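
A rough sketch of what such a test_unicode.py could look like, written in Python 3 for illustration. Everything here is an assumption: run_source is a hypothetical stand-in for the shell's compile-and-run step (it does not call into IPython), and the sample strings are simply the ones reported in this thread:

```python
# Sketch of a single-file unicode check: auto-executed, plus a manual mode.

# Inputs taken from the reports in this thread.
SAMPLES = ["абвгд", "åäö", "ação"]


def run_source(source_text):
    """Hypothetical stand-in for the shell's 'compile and run' step."""
    namespace = {}
    exec(compile(source_text, "<test_unicode>", "exec"), namespace)
    return namespace


def check_samples(samples=SAMPLES):
    """Return the samples whose literal does not survive a round trip."""
    failures = []
    for sample in samples:
        value = run_source("value = %r" % sample)["value"]
        if value != sample:
            failures.append(sample)
    return failures


def test_unicode_literals():
    # Collected automatically by runners that pick up test_* functions.
    assert check_samples() == []


if __name__ == "__main__":
    # Manual mode: print escaped forms for inspection. A real visual check
    # would print the raw strings to the terminal under test instead.
    for sample in SAMPLES:
        print(ascii(sample))
    print("failures:", check_samples())
```

The automatic part only proves that nothing crashes and that values round-trip; the manual section is where a developer would compare what the terminal actually shows.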

Cheers,

f

Changed in ipython:
assignee: nobody → fdo.perez
importance: Undecided → Medium
status: New → Confirmed
Damjan Georgievski (gdamjan) wrote :

I can confirm this bug and the solution given.

Now obviously the bug is in the *input* handling of ipython... how do you make test cases for that?

Andy Mikhailenko (neithere) wrote :

Confirming. With "UTF-8" in all cases, IPython prints screwed-up "unicode" strings, and this renders the program almost unusable.

Anyone got ideas about how to test this? I guess IPython developers possess a bit more knowledge of the immense innards of the package than reporters of the bug do, so users could expect at least some guidelines for writing tests, could they?

Maybe we should allow tuning the bug-related behaviour in user settings until the bug is finally fixed? This may also help with testing.

pawciobiel (pawciobiel) wrote :

Confirming.

core/iplib.py
2201
--- source=source.encode(self.stdin_encoding)

Apart from the above, shouldn't the input be decoded if it's not unicode?
(A similar issue was in python2.5/code.py.)
core/iplib.py
2332,2334d2330
< line = raw_input_original(prompt)
< if not isinstance(line, unicode):
< line = line.decode(self.stdin_encoding)
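
The decode-if-needed check in that snippet could be factored into a small helper. A Python 3 sketch of the same idea, where bytes and str play the roles of Python 2's str and unicode; the name to_text is hypothetical and does not come from IPython:

```python
def to_text(line, encoding):
    """Return line as text, decoding it first if it arrived as bytes.

    Hypothetical helper for illustration; 'encoding' would be the
    terminal's input encoding (stdin_encoding in the snippet above).
    """
    if isinstance(line, bytes):
        return line.decode(encoding)
    return line


print(ascii(to_text(b"\xd0\xb0\xd0\xb1", "utf-8")))  # '\u0430\u0431'
print(ascii(to_text("åäö", "cp1252")))               # '\xe5\xe4\xf6'
```

Text that is already decoded passes through untouched, so the helper can be applied to every input line regardless of where it came from.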

cheers,

INADA Naoki (songofacandy) wrote :

I managed to fix this bug on the Python side: http://bugs.python.org/issue5911
But even if the Python issue is fixed in Python 2.7, the problem will remain in Python 2.6 and lower.

Changed in ipython:
importance: Medium → High
milestone: none → 0.11
t0ster (tosters) wrote :

The patches worked for me; removing 'source=source.encode(self.stdin_encoding)' helped on Mac OS X 10.6.

Thanks

Thorsten Glaser (mirabilos) wrote :

I didn’t use the patch from LP: #290677 due to
http://bugs.python.org/issue5911
but wrote a workaround.

It may or may not touch all the places needed, and it may or may not break something unrelated.
I did search for a good place to do this, but can’t guarantee anything.
Feedback extremely welcome.

It at least fixes the two Trac things for me.

Fernando Perez (fdo.perez) wrote :

Please note that this bug tracker for ipython has been closed; we've moved all of our work to Github.

This bug is now tracked here:

http://github.com/ipython/ipython/issues/labels/unicode#issue/25

So I'm closing it on LP so further work continues on GH.

Many thanks for your patch. Unfortunately the RC for 0.10.1 is already out, and I simply don't have the resources right now to do the amount of testing of this patch it would deserve. But I've added a back link on GH to your comment/patch and we can continue the discussion there.

As I mention on GH, I'll be happy to push a 0.10.2 later if your patch proves to work well in more extensive testing.

Changed in ipython:
status: Confirmed → Invalid
Fernando Perez (fdo.perez) wrote :

Closed here because all future work will proceed on github:

http://github.com/ipython/ipython/issues/labels/unicode#issue/25

Thorsten Glaser (mirabilos) wrote :

I cannot use this github thing with *any* browser I have (Lynx, Links+ 2.x, Dillo, Opera 9.2x).
Please include me in further discussion, for example using eMail, if possible.

Changed in ipython (Ubuntu):
status: New → Confirmed
tags: added: patch
Thomas Kluyver (takluyver) wrote :

Anyone still watching this bug, note that it was fixed by IPython 0.11, released earlier this year. It will be some time before it appears in Ubuntu repositories, but you can install from PyPI or jtaylor's PPA here: https://launchpad.net/~jtaylor/+archive/ipython

Julian Taylor (jtaylor) on 2012-01-08
Changed in ipython (Ubuntu):
status: Confirmed → Fix Released