Comment 3 for bug 690110

Telofy (telofy) wrote :

The error occurs all the time for me. Whenever a page claims to be UTF-8 (in the HTTP header and the meta tags) but contains bytes that are not valid UTF-8, I get the above error. Here are a few ways to reproduce the bug:

In [1]: from lxml.html import parse
In [2]: root = parse('http://telofy.spline.de/foo/lxml-bug-690110.html').getroot()
In [3]: root.xpath('//br')[0].tail
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)

/tmp/gist-1496487/<ipython console> in <module>()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._Element.tail.__get__ (src/lxml/lxml.etree.c:36181)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._collectText (src/lxml/lxml.etree.c:16915)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree.funicode (src/lxml/lxml.etree.c:22016)()
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte

Or, in case I one day forget what that file is for and delete it, the same thing with a copy of the file from a gist:

In [6]: import urllib2
In [7]: text = urllib2.urlopen('https://gist.github.com/raw/1496487/3290697d42023941296c2ba092b95642ba03c5ee/lxml-bug-690110.html').read()
In [8]: from lxml.html import document_fromstring
In [9]: root = document_fromstring(text)
In [10]: root.xpath('//br')[0].tail
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)

/tmp/gist-1496487/<ipython console> in <module>()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._Element.tail.__get__ (src/lxml/lxml.etree.c:36181)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._collectText (src/lxml/lxml.etree.c:16915)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree.funicode (src/lxml/lxml.etree.c:22016)()
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte

That gist can be found on GitHub [1].

What happened in the case of the original reporter, and what Python does when asked to, is that the invalid byte is replaced with the replacement character � (U+FFFD). This is identical to the behavior of a “text.decode('utf-8', errors='replace')” following the second example above.
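
To illustrate the difference between strict and lenient decoding, here is a minimal sketch using only the standard library. The sample byte string is made up for illustration; only the offending byte 0x92 is taken from the tracebacks above. It does not involve lxml at all.

```python
# 0x92 is a curly apostrophe in Windows-1252, but it is not a valid
# UTF-8 start byte, which is what the tracebacks above complain about.
# The surrounding text is a made-up sample, not the content of the test page.
raw = b"It\x92s broken"

# Strict decoding (the default) raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors='replace' the invalid byte becomes U+FFFD instead.
lenient = raw.decode("utf-8", errors="replace")

print(strict_failed)
print(repr(lenient))
```

Decoding the downloaded text this way before handing it to lxml might serve as a workaround, though I have not tested that.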

I don’t know Cython, but perhaps you can just add “errors='replace'” in [2] and possibly in [3] (but I haven’t tested any of this).

Sorry for the bad formatting, but I don’t know which markup syntax I can use in such comments, if any.

[1] https://gist.github.com/1496487
[2] https://github.com/lxml/lxml/blob/c5c8cae024a543205c55e09af832c1bf528d2a0d/src/lxml/apihelpers.pxi#L1344
[3] https://github.com/lxml/lxml/blob/c5c8cae024a543205c55e09af832c1bf528d2a0d/src/lxml/apihelpers.pxi#L1332