UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 58: invalid start byte

Bug #690110 reported by Aristotelis Mikropoulos
This bug affects 3 people
Affects: lxml
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone:

Bug Description

I am trying to parse a document that includes a strange (Unicode?) character with lxml.etree.fromstring, and I get this error:

Traceback (most recent call last):
  File "lxml.etree.pyx", line 815, in lxml.etree._Element.text.__get__ (src/lxml/lxml.etree.c:33236)
  File "apihelpers.pxi", line 616, in lxml.etree._collectText (src/lxml/lxml.etree.c:15062)
  File "apihelpers.pxi", line 1280, in lxml.etree.funicode (src/lxml/lxml.etree.c:20049)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 58: invalid start byte

However, when I manually get the problematic text and decode it from utf8, I get the unicode character u'\ufffd'.
So, I guess lxml.etree.fromstring could do the same. Please add that functionality to fix this problem.
Thank you very much.

Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 2, 6, 0)
libxml used : (2, 7, 7)
libxml compiled : (2, 7, 6)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

scoder (scoder) wrote :

I wonder how you get that character parsed anyway. Could you add a code
example that shows how you parse and how you get to the exception?

Stefan

Aristotelis Mikropoulos (amikrop) wrote :

from lxml.etree import parse
import urllib

usock = urllib.urlopen('http://users.auth.gr/~aamikrop/test.html')
data = usock.read()
data = data.decode('utf8')  # it fails both with and without this line
parse(data)

Telofy (telofy) wrote :

The error occurs all the time for me. Whenever pages claim to be UTF-8 (in the HTTP header and the meta tags) but contain invalid characters, I get the above error. Here are a few ways to reproduce the bug:

In [1]: from lxml.html import parse
In [2]: root = parse('http://telofy.spline.de/foo/lxml-bug-690110.html').getroot()
In [3]: root.xpath('//br')[0].tail
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)

/tmp/gist-1496487/<ipython console> in <module>()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._Element.tail.__get__ (src/lxml/lxml.etree.c:36181)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._collectText (src/lxml/lxml.etree.c:16915)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree.funicode (src/lxml/lxml.etree.c:22016)()
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte

Or in case I one day forget what that file’s purpose is and delete it:

In [6]: import urllib2
In [7]: text = urllib2.urlopen('https://gist.github.com/raw/1496487/3290697d42023941296c2ba092b95642ba03c5ee/lxml-bug-690110.html').read()
In [8]: from lxml.html import document_fromstring
In [9]: root = document_fromstring(text)
In [10]: root.xpath('//br')[0].tail
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)

/tmp/gist-1496487/<ipython console> in <module>()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._Element.tail.__get__ (src/lxml/lxml.etree.c:36181)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._collectText (src/lxml/lxml.etree.c:16915)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree.funicode (src/lxml/lxml.etree.c:22016)()
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte

That gist can be found on GitHub [1].

What happened in the case of the original reporter is that Python replaced the invalid byte with the replacement character � (U+FFFD). This is identical to the behavior of a “text.decode('utf-8', errors='replace')” following the second example above.
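To illustrate the decoding behavior described here, a minimal Python 3 sketch follows. It uses only the standard library and does not involve lxml; the sample byte string is made up for illustration. It also shows that byte 0x92 decodes to a right single quotation mark under Windows-1252, which is one plausible (but unconfirmed) origin of the byte in pages like these.

```python
# Hypothetical sample: ASCII text with a stray 0x92 byte in the middle.
raw = b"It\x92s broken"

# Strict UTF-8 decoding rejects 0x92, as in the tracebacks in this report.
try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc.reason

# errors="replace" substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# Under Windows-1252, 0x92 is U+2019 (right single quotation mark).
as_cp1252 = raw.decode("cp1252")

print(strict_error)
print(repr(replaced))
print(repr(as_cp1252))
```

Whether lxml should apply the same replacement internally is the open question of this report; the sketch only shows what the codec itself does.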

I don’t know Cython, but perhaps you can just add this “errors='replace'” in [2] and possibly in [3] (but I haven’t tested any of this).

Sorry for the bad formatting, but I don’t know which markup syntax I can use in such comments, if any.

[1] https://gist.github.com/1496487
[2] https://github.com/lxml/lxml/blob/c5c8cae024a543205c55e09af832c1bf528d2a0d/src/lxml/apihelpers.pxi#L1344
[3] https://github.com/lxml/lxml/blob/c5c8cae024a543205c55e09af832c1bf528d2a0d/src/lxml/apihelpers.pxi#L1332

Jakub Wilk (jwilk) wrote :

I attached a minimal test-case:

>>> from lxml.html import parse
>>> tree = parse('690110.html.gz')
>>> body = tree.getroot()[1]
>>> body.text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 897, in lxml.etree._Element.text.__get__ (src/lxml/lxml.etree.c:37022)
  File "apihelpers.pxi", line 691, in lxml.etree._collectText (src/lxml/lxml.etree.c:16626)
  File "apihelpers.pxi", line 1344, in lxml.etree.funicode (src/lxml/lxml.etree.c:21864)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
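For reference, this message ("unexpected end of data") differs from the "invalid start byte" in the earlier tracebacks because the two bytes fail in different ways. 0x92 is a continuation byte and cannot begin a UTF-8 sequence, while 0xf3 begins a four-byte sequence that is cut short. The Python 3 sketch below reproduces both reasons with the standard codec alone; the helper name is hypothetical, and the inputs are single bytes rather than the attached test case.

```python
def decode_failure_reason(data):
    """Return the UTF-8 decoder's reason for rejecting data, or None if it decodes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


# A lone continuation byte cannot start a sequence.
print(decode_failure_reason(b"\x92"))
# A lead byte of a four-byte sequence with nothing after it.
print(decode_failure_reason(b"\xf3"))
# The same lead byte followed by three valid continuation bytes decodes fine.
print(decode_failure_reason(b"\xf3\xb0\x80\x80"))
```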

scoder (scoder) wrote :

The problems described by telofy and jwilk are duplicates of issue #1002581 and issue #898072 (not sure which is better), i.e. they use the HTML parser.

The original poster (amikrop) claimed to get this with the XML parser, which is an entirely different situation. I have no idea how to reproduce this with the XML parser.
