UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 58: invalid start byte

Bug #690110 reported by Aristotelis Mikropoulos on 2010-12-14
This bug affects 3 people
Affects: lxml
Importance: Undecided
Assigned to: Unassigned

Bug Description

I am trying to parse a document that includes a strange (Unicode?) character with lxml.etree.fromstring, and I get this error:

Traceback (most recent call last):
  File "lxml.etree.pyx", line 815, in lxml.etree._Element.text.__get__ (src/lxml/lxml.etree.c:33236)
  File "apihelpers.pxi", line 616, in lxml.etree._collectText (src/lxml/lxml.etree.c:15062)
  File "apihelpers.pxi", line 1280, in lxml.etree.funicode (src/lxml/lxml.etree.c:20049)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 58: invalid start byte

However, when I manually get the problematic text and decode it from utf8, I get the unicode character u'\ufffd'.
So, I guess lxml.etree.fromstring could do the same. Please add that functionality to fix this problem.
Thank you very much.

Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 2, 6, 0)
libxml used : (2, 7, 7)
libxml compiled : (2, 7, 6)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

I wonder how you get that character parsed anyway. Could you add a code
example that shows how you parse and how you get to the exception?

Stefan

from lxml.etree import parse
import urllib

usock = urllib.urlopen('http://users.auth.gr/~aamikrop/test.html')
data = usock.read()
data = data.decode('utf8') # it fails both with and without this line
parse(data)

Telofy (telofy) wrote :

The error occurs all the time for me. Whenever pages claim to be UTF-8 (in the HTTP header and the meta tags) but contain invalid characters, I get the above error. Here are a few ways to reproduce the bug:

In [1]: from lxml.html import parse
In [2]: root = parse('http://telofy.spline.de/foo/lxml-bug-690110.html').getroot()
In [3]: root.xpath('//br')[0].tail
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)

/tmp/gist-1496487/<ipython console> in <module>()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._Element.tail.__get__ (src/lxml/lxml.etree.c:36181)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._collectText (src/lxml/lxml.etree.c:16915)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree.funicode (src/lxml/lxml.etree.c:22016)()
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte

Or in case I one day forget what that file’s purpose is and delete it:

In [6]: import urllib2
In [7]: text = urllib2.urlopen('https://gist.github.com/raw/1496487/3290697d42023941296c2ba092b95642ba03c5ee/lxml-bug-690110.html').read()
In [8]: from lxml.html import document_fromstring
In [9]: root = document_fromstring(text)
In [10]: root.xpath('//br')[0].tail
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)

/tmp/gist-1496487/<ipython console> in <module>()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._Element.tail.__get__ (src/lxml/lxml.etree.c:36181)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._collectText (src/lxml/lxml.etree.c:16915)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree.funicode (src/lxml/lxml.etree.c:22016)()
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte

That gist can be found on GitHub [1].

What Python does, and what happened in the case of the original reporter, is replace the invalid character with the replacement character �. This is the same result you get from calling “text.decode('utf-8', errors='replace')” on the text from the second example above.
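
To illustrate the replacement behaviour, here is a minimal standard-library sketch. It uses Python 3 syntax (the sessions above are Python 2), and the sample bytes are made up for illustration; they are not taken from the test page. The cp1252 decode at the end is only a guess at what a 0x92 byte often means in practice, not something confirmed for the pages in this report.

```python
# Hypothetical sample: 0x92 is not a valid UTF-8 start byte.
raw = b"It\x92s broken"

# Strict decoding (the default) raises, as in the tracebacks above.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace' substitutes U+FFFD for the undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# Assumption: in Windows-1252, 0x92 is a right single quotation mark,
# so a mislabelled page might decode cleanly with that codec instead.
as_cp1252 = raw.decode("cp1252")
```

Replacing keeps the rest of the text readable but loses the original character, so decoding with the real source encoding is preferable when it is known.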

I don’t know Cython, but perhaps you can just add this “errors='replace'” in [2] and possibly in [3] (but I haven’t tested any of this).

Sorry for the bad formatting, but I don’t know which markup syntax I can use in such comments, if any.

[1] https://gist.github.com/1496487
[2] https://github.com/lxml/lxml/blob/c5c8cae024a543205c55e09af832c1bf528d2a0d/src/lxml/apihelpers.pxi#L1344
[3] https://github.com/lxml/lxml/blob/c5c8cae024a543205c55e09af832c1bf528d2a0d/src/lxml/apihelpers.pxi#L1332

Jakub Wilk (jwilk) wrote :

I attached a minimal test-case:

>>> from lxml.html import parse
>>> tree = parse('690110.html.gz')
>>> body = tree.getroot()[1]
>>> body.text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 897, in lxml.etree._Element.text.__get__ (src/lxml/lxml.etree.c:37022)
  File "apihelpers.pxi", line 691, in lxml.etree._collectText (src/lxml/lxml.etree.c:16626)
  File "apihelpers.pxi", line 1344, in lxml.etree.funicode (src/lxml/lxml.etree.c:21864)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
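
Note that this traceback gives a different reason ("unexpected end of data") than the earlier ones ("invalid start byte"). The sketch below, again in Python 3 syntax and using only the standard library, shows how the exception's attributes distinguish the two cases. The helper name describe_utf8_failure is hypothetical, and the single-byte inputs are stand-ins for the text that lxml hands to its decoder, which is an assumption about what happens internally.

```python
def describe_utf8_failure(raw):
    """Return (offending_byte, position, reason), or None if raw is valid UTF-8.

    Hypothetical helper for illustration only; not part of lxml.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return raw[exc.start], exc.start, exc.reason
    return None


# 0x92 is a continuation byte, so it cannot start a character.
stray = describe_utf8_failure(b"\x92")

# 0xf3 starts a four-byte sequence, but the input ends right after it.
truncated = describe_utf8_failure(b"\xf3")
```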

scoder (scoder) wrote :

The problems described by telofy and jwilk are duplicates of issue #1002581 and issue #898072 (not sure which is better), i.e. they use the HTML parser.

The original poster (amikrop) claimed to get this with the XML parser, which is an entirely different situation. I have no idea how to reproduce this with the XML parser.
