Bug #690110 “UnicodeDecodeError: 'utf8' codec can't decode byte ...” : Bugs : lxml

Revision history for this message

scoder (scoder) wrote on 2010-12-14: Re: [Bug 690110] [NEW] UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 58: invalid start byte

#1

I wonder how you get that character parsed anyway. Could you add a code
example that shows how you parse and how you get to the exception?

Stefan

Aristotelis Mikropoulos (amikrop) wrote on 2010-12-14:

#2

from lxml.etree import parse
import urllib

usock = urllib.urlopen('http://users.auth.gr/~aamikrop/test.html')
data = usock.read()
data = data.decode('utf8') # it fails both with and without this line
parse(data)

Telofy (telofy) wrote on 2011-12-19:

#3

The error occurs all the time for me. Whenever pages claim to be UTF-8 (in the HTTP header and the meta tags) but contain invalid characters, I get the above error. Here are a few ways to reproduce the bug:

In [1]: from lxml.html import parse
In [2]: root = parse('http://telofy.spline.de/foo/lxml-bug-690110.html').getroot()
In [3]: root.xpath('//br')[0].tail
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)

/tmp/gist-1496487/<ipython console> in <module>()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._Element.tail.__get__ (src/lxml/lxml.etree.c:36181)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._collectText (src/lxml/lxml.etree.c:16915)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree.funicode (src/lxml/lxml.etree.c:22016)()
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte

Or in case I one day forget what that file’s purpose is and delete it:

In [6]: import urllib2
In [7]: text = urllib2.urlopen('https://gist.github.com/raw/1496487/3290697d42023941296c2ba092b95642ba03c5ee/lxml-bug-690110.html').read()
In [8]: from lxml.html import document_fromstring
In [9]: root = document_fromstring(text)
In [10]: root.xpath('//br')[0].tail
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)

/tmp/gist-1496487/<ipython console> in <module>()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._Element.tail.__get__ (src/lxml/lxml.etree.c:36181)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._collectText (src/lxml/lxml.etree.c:16915)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree.funicode (src/lxml/lxml.etree.c:22016)()
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte

That gist can be found on GitHub [1].

What Python does, and what happened in the case of the original reporter, is that it replaces the invalid character with the replacement character �. This is identical to the behavior of a “text.decode('utf-8', errors='replace')” following the second example above.

I don’t know Cython, but perhaps you can just add this “errors='replace'” in [2] and possibly in [3] (but I haven’t tested any of this).

Sorry for the bad formatting, but I don’t know which markup syntax I can use in such comments, if any.

[1] https://gist.github.com/1496487
[2] https://github.com/lxml/lxml/blob/c5c8cae024a543205c55e09af832c1bf528d2a0d/src/lxml/apihelpers.pxi#L1344
[3] https://github.com/lxml/lxml/blob/c5c8cae024a543205c55e09af832c1bf528d2a0d/src/lxml/apihelpers.pxi#L1332
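For reference, here is a minimal Python 3 sketch of the “errors='replace'” behaviour described above. It uses only the standard library and does not touch lxml. The sample bytes are made up for illustration, and the guess that the page was really Windows-1252 is an assumption, not something confirmed in this report.

```python
# Hypothetical sample: bytes that claim to be UTF-8 but contain 0x92,
# which is a right single quotation mark in Windows-1252.
raw = b"it\x92s broken"

# Strict decoding (the default) raises, as in the tracebacks above.
try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# errors="replace" substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# Assumption: decoding with the encoding the page probably meant
# recovers the intended character instead of losing it.
guessed = raw.decode("cp1252")

print(strict_error.reason)
print(repr(replaced))
print(repr(guessed))
```

Replacing keeps the rest of the text readable but loses the original character, so decoding with the real encoding is preferable whenever it can be determined.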

Jakub Wilk (jwilk) wrote on 2012-05-07:

#4

690110.html.gz (121 bytes, text/html)

I attached a minimal test-case:

>>> from lxml.html import parse
>>> tree = parse('690110.html.gz')
>>> body = tree.getroot()[1]
>>> body.text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 897, in lxml.etree._Element.text.__get__ (src/lxml/lxml.etree.c:37022)
  File "apihelpers.pxi", line 691, in lxml.etree._collectText (src/lxml/lxml.etree.c:16626)
  File "apihelpers.pxi", line 1344, in lxml.etree.funicode (src/lxml/lxml.etree.c:21864)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
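The two tracebacks in this thread give different reasons (“invalid start byte” for 0x92, “unexpected end of data” for 0xf3). The following Python 3 sketch shows where that difference comes from in the UTF-8 codec itself. The helper name `utf8_failure_reason` is hypothetical, and the byte strings are illustrative inputs rather than the contents of the attachment.

```python
def utf8_failure_reason(data):
    """Return the UnicodeDecodeError reason for data, or None if it decodes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return None

# 0x92 is a continuation byte, so it cannot start a character.
print(utf8_failure_reason(b"\x92"))

# 0xf3 starts a four-byte sequence, but no continuation bytes follow.
print(utf8_failure_reason(b"\xf3"))

# The same character as 0xf3 in Latin-1, correctly encoded as UTF-8.
print(utf8_failure_reason(b"\xc3\xb3"))
```

In both failing cases the text node holds a byte sequence that is not valid UTF-8, which matches the failure inside funicode shown in the tracebacks.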

scoder (scoder) wrote on 2013-04-23:

#5

The problems described by telofy and jwilk are duplicates of issue #1002581 and issue #898072 (not sure which is better), i.e. they use the HTML parser.

The original poster (amikrop) claimed to get this with the XML parser, which is an entirely different situation. I have no idea how to reproduce this with the XML parser.
