HTML parser reinterprets Unicode StringIOs from UTF-8 to Latin-1
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Fix Released | Medium | scoder |
Bug Description
In certain environments, the lxml HTML parser corrupts Unicode strings by encoding them as UTF-8 and subsequently interpreting the bytes as a Latin-1 sequence. This is mostly only an issue if libxml's iconv support is missing, but there are also other encoding issues even when iconv is present.
To simulate an environment where iconv is missing, edit parser.pxi, and comment out the line "_UNICODE_ENCODING = enc". lxml has separate code paths when _UNICODE_ENCODING is NULL, and these code paths currently do not work. Note that there is already a test in the lxml test suite that fails when _UNICODE_ENCODING is NULL.
When _UNICODE_ENCODING is non-NULL, lxml lets libxml handle Unicode strings directly. When it is NULL, lxml encodes the string as UTF-8 and then passes the bytes to libxml, which decodes them with whatever encoding the user has specified, or "guesses" the encoding if none is specified (which, for me, means Latin-1). I have also found a corner case where it does this even if _UNICODE_ENCODING is non-NULL.
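The faulty round-trip described above can be sketched in plain Python, independently of lxml (shown here in Python 3 for clarity; the variable names are illustrative, not lxml internals): encoding a Unicode string as UTF-8 and then decoding those bytes as Latin-1 produces exactly the mojibake shown in the reproduction below.

```python
# Minimal sketch of the broken round-trip: encode as UTF-8, then
# decode the resulting bytes as Latin-1 (what happens when libxml2
# "guesses" the charset of the byte stream).
text = u'\u4f60\u597d'             # the Unicode string "你好"
utf8_bytes = text.encode('utf-8')  # b'\xe4\xbd\xa0\xe5\xa5\xbd'

garbled = utf8_bytes.decode('latin-1')

print(repr(garbled))  # 'ä½\xa0å¥½' -- six code points, one per UTF-8 byte
```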
TO REPRODUCE (on Python 2.6):
import StringIO
import lxml.html
HTMLSOURCE = u'<html>\u4f60\u597d</html>'
htmltree = lxml.html.parse(StringIO.StringIO(HTMLSOURCE)).getroot()
for node in htmltree:
    print repr(node.text)
    print node.text.encode('utf-8')
EXPECTED OUTPUT:
u'\u4f60\u597d'
你好
ACTUAL OUTPUT:
u'\xe4\xbd\xa0\xe5\xa5\xbd'
ä½ å¥½
Note that you can work around this by either:
- Having a <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> tag in the document, or
- Using lxml.etree with an lxml.etree.HTMLParser that has an explicit encoding (e.g. encoding='utf-8').
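The second workaround can be sketched as follows (assuming a reasonably current lxml; the variable names are mine). Giving the parser an explicit encoding means libxml2 never has to guess:

```python
# Sketch of the HTMLParser workaround: hand libxml2 UTF-8 bytes together
# with an explicit parser encoding, so no charset guessing takes place.
import io
from lxml import etree

html_bytes = u'<html><body>\u4f60\u597d</body></html>'.encode('utf-8')
parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse(io.BytesIO(html_bytes), parser)

body = tree.find('.//body')
print(repr(body.text))  # u'\u4f60\u597d' -- the text survives intact
```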
It looks like if you pass a byte string, and you don't do either of the above things, it will guess the charset as Latin-1 and interpret the bytes accordingly. Eventually, it decodes it back into a Unicode string using Latin-1. That isn't technically broken, but it's a bit weird.
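The reason the byte-string case is "weird but not broken" is that Latin-1 maps every byte value 0x00-0xFF to the code point of the same number, so decoding bytes as Latin-1 and encoding them back is always lossless. A quick sketch:

```python
# Latin-1 is a 1:1 byte<->code-point mapping, so a decode/encode
# round-trip through Latin-1 preserves arbitrary byte strings.
data = bytes(range(256))            # every possible byte value
round_tripped = data.decode('latin-1').encode('latin-1')
assert round_tripped == data        # lossless, even for non-text bytes
```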
What's broken is that if you pass a Unicode string, lxml encodes it as UTF-8 and then follows the above process, making the encoding and decoding steps internally inconsistent.
This seems to be a similar issue to Bug #898072 (read Unicode strings from a file, convert them to UTF-8, then let libxml "guess" the encoding on the way out). But I have looked at the code -- I can easily work around the issue for StringIO. It will be much harder to fix the general issue of reading from files (so I have filed a separate bug).
Part II: If the StringIO is not read from the very start, it takes a separate code path which has the same consequences, regardless of whether iconv is supported.
HTMLSOURCE = u'<html>\u4f60\u597d</html>'
sio = StringIO.StringIO(HTMLSOURCE)
sio.read(1)  # advance past the first character before parsing
htmltree = lxml.html.parse(sio).getroot()
for node in htmltree:
    print repr(node.text)
    print node.text.encode('utf-8')
I have a fix for both of these issues (with test cases) on Github:
https:/
Python : (2, 6, 5, 'final', 0)
lxml.etree : (2, 4, -300, 0)
libxml used : (2, 7, 6)
libxml compiled : (2, 7, 6)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
Note that this bug was originally reported on Google App Engine (where we don't currently support iconv):
http://
Hmm, yes, it's very unfortunate that libxml2 defaults to Latin-1 instead of UTF-8 for the HTML parser. If it wasn't for backwards compatibility, that would be the thing to change - but I doubt that it'd be easy to work around...
Also, Unicode file parsing needs a major overhaul all by itself. It's currently rather fragile. What I think should happen is that the whole encoding setup should be delayed until after the first data string has been read from the file (or maybe read the first data block earlier and keep it around). That would make it easier to react to the actual type of data returned by the file.