HTML parser reinterprets Unicode StringIOs from UTF-8 to Latin-1

Bug #1002581 reported by Matt Giuca
This bug affects 5 people
Affects: lxml
Status: Fix Released
Importance: Medium
Assigned to: scoder

Bug Description

In certain environments, the lxml HTML parser corrupts Unicode strings by encoding them as UTF-8 and subsequently interpreting the bytes as a Latin-1 sequence. This mainly happens when libxml's iconv support is missing, but there are other encoding issues even when iconv is present.

To simulate an environment where iconv is missing, edit parser.pxi, and comment out the line "_UNICODE_ENCODING = enc". lxml has separate code paths when _UNICODE_ENCODING is NULL, and these code paths currently do not work. Note that there is already a test in the lxml test suite that fails when _UNICODE_ENCODING is NULL.

When _UNICODE_ENCODING is non-NULL, lxml lets libxml handle Unicode strings directly. When it is NULL, lxml encodes the string as UTF-8, then passes the bytes to libxml, which decodes them with whatever encoding the user has specified, or "guesses" the encoding if none is specified (which, for me, means Latin-1). I have also found a corner case where this happens even if _UNICODE_ENCODING is non-NULL.
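
The effect of that NULL code path can be imitated without lxml at all. The following is only a sketch of the encode-then-misdecode sequence described above, not lxml or libxml code; the helper name simulate_misdecode is made up for illustration, and Latin-1 is assumed as the "guessed" encoding.

```python
def simulate_misdecode(text, guessed_encoding='latin-1'):
    """Encode text as UTF-8, then decode the bytes with another encoding.

    Hypothetical helper: it only imitates the effect described in this
    report and does not call into lxml or libxml.
    """
    utf8_bytes = text.encode('utf-8')            # what gets handed to the parser
    return utf8_bytes.decode(guessed_encoding)   # what the encoding guess yields


original = u'\u4f60\u597d'
corrupted = simulate_misdecode(original)

# Each CJK character takes three bytes in UTF-8, and each of those bytes
# comes back as a separate Latin-1 character.
assert corrupted == u'\xe4\xbd\xa0\xe5\xa5\xbd'
assert len(original) == 2 and len(corrupted) == 6
```

Pure ASCII input survives this round trip unchanged, which is probably why the problem goes unnoticed until non-ASCII text is parsed.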

TO REPRODUCE (on Python 2.6):

import StringIO
import lxml.html

HTMLSOURCE = u'<html><body>\u4f60\u597d</body></html>'

htmltree = lxml.html.parse(StringIO.StringIO(HTMLSOURCE)).getroot()
for node in htmltree:
    print repr(node.text)
    print node.text.encode('utf-8')

EXPECTED OUTPUT:

u'\u4f60\u597d'
你好

ACTUAL OUTPUT:

u'\xe4\xbd\xa0\xe5\xa5\xbd'
ä½ å¥½

Note that you can work around this by either:
- Having <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> in the HTML document, or
- Using lxml.etree with an lxml.etree.HTMLParser object, passing encoding='utf-8' to the HTMLParser constructor.
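
The second workaround might look roughly like this. It is a minimal sketch that assumes lxml is installed; it encodes the document explicitly and feeds the parser bytes through io.BytesIO (rather than the StringIO module used in this report) so that the same code also runs on Python 3. The variable names are illustrative.

```python
import io
from lxml import etree

HTMLSOURCE = u'<html><body>\u4f60\u597d</body></html>'

# Encode explicitly and tell the parser which encoding the bytes use,
# so that nothing is left to the parser's encoding guess.
parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse(io.BytesIO(HTMLSOURCE.encode('utf-8')), parser)

body_text = tree.getroot().find('body').text
assert body_text == u'\u4f60\u597d'
```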

It looks like if you pass a byte string, and you don't do either of the above things, it will guess the charset as Latin-1 and interpret the bytes accordingly. Eventually, it decodes it back into a Unicode string using Latin-1. That isn't technically broken, but it's a bit weird.

What's broken is that if you pass a Unicode string, lxml encodes it as UTF-8 and then follows the above process, making the encoding and decoding internally inconsistent.
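
Because the corruption is a plain re-decoding, text that has already been damaged this way can usually be recovered by reversing the two steps. The helper below is a hypothetical sketch (the name repair_mojibake is not part of lxml), and it assumes the wrong decoding really was Latin-1.

```python
def repair_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mistake, if that is what happened.

    Returns the text unchanged when it cannot be the product of that
    mistake (hypothetical helper, not an lxml API).
    """
    try:
        # Reverse the bad decode, then apply the decode that was intended.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


assert repair_mojibake(u'\xe4\xbd\xa0\xe5\xa5\xbd') == u'\u4f60\u597d'
assert repair_mojibake(u'\u4f60\u597d') == u'\u4f60\u597d'   # already correct
assert repair_mojibake(u'caf\xe9') == u'caf\xe9'             # genuine Latin-1 text
```

This is a heuristic: genuine text whose Latin-1 bytes happen to form valid UTF-8 would be altered as well, so it is better applied to known-bad output than blindly to everything.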

This seems to be a similar issue to Bug #898072 (read Unicode strings from a file, convert them to UTF-8, then let libxml "guess" the encoding on the way out). Having looked at the code, I can easily work around the issue for StringIO, but it will be much harder to fix the general issue of reading from files (so I have filed a separate bug).

Part II: If the StringIO is not read from the very start, it takes a separate code path which has the same consequences, regardless of whether iconv is supported.

import StringIO
import lxml.html

HTMLSOURCE = u'<html><body>\u4f60\u597d</body></html>'

# Prepend one character, then consume it so the parser starts mid-stream.
sio = StringIO.StringIO('x' + HTMLSOURCE)
sio.read(1)
htmltree = lxml.html.parse(sio).getroot()
for node in htmltree:
    print repr(node.text)
    print node.text.encode('utf-8')
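
One way to sidestep the partially-read case from the caller's side is to drain the rest of the stream and hand the parser UTF-8 bytes with an explicit encoding. The sketch below shows only the draining step using the standard library; remaining_as_utf8 is a hypothetical helper, and it assumes that any byte stream passed in is already UTF-8.

```python
import io


def remaining_as_utf8(stream):
    """Read the rest of stream and return it as a BytesIO of UTF-8 bytes.

    Hypothetical helper. Text is encoded as UTF-8; bytes are assumed to
    be UTF-8 already and are passed through untouched.
    """
    data = stream.read()
    if not isinstance(data, bytes):
        data = data.encode('utf-8')
    return io.BytesIO(data)


sio = io.StringIO(u'x<html><body>\u4f60\u597d</body></html>')
sio.read(1)  # consume the leading 'x', as in Part II
utf8_stream = remaining_as_utf8(sio)

# utf8_stream starts at position 0 and could then be parsed with an
# explicit encoding, e.g.
# lxml.etree.parse(utf8_stream, lxml.etree.HTMLParser(encoding='utf-8'))
assert utf8_stream.getvalue().startswith(b'<html>')
```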

I have a fix for both of these issues (with test cases) on Github:
https://github.com/mgiuca/lxml/tree/html-unicode

Python : (2, 6, 5, 'final', 0)
lxml.etree : (2, 4, -300, 0)
libxml used : (2, 7, 6)
libxml compiled : (2, 7, 6)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Note that this bug was originally reported on Google App Engine (where we don't currently support iconv):
http://code.google.com/p/googleappengine/issues/detail?id=7526

Tags: html unicode
scoder (scoder) wrote :

Hmm, yes, it's very unfortunate that libxml2 defaults to Latin-1 instead of UTF-8 for the HTML parser. If it wasn't for backwards compatibility, that would be the thing to change - but I doubt that it'd be easy to work around...

Also, Unicode file parsing needs a major overhaul all by itself. It's currently rather fragile. What I think should happen is that the whole encoding setup should be delayed until after the first data string has been read from the file (or maybe read the first data block earlier and keep it around). That would make it easier to react to the actual type of data returned by the file.
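
As a rough illustration of that idea (not how lxml implements it), the encoding decision can be postponed until the first block shows whether the file-like object returns text or bytes. Both function names and the block_size default are assumptions made for this sketch.

```python
import io


def peek_first_block(stream, block_size=4096):
    """Read the first block and report what kind of data the stream returns.

    Returns (kind, first_block) with kind being 'text' or 'bytes'. The
    caller keeps first_block around and feeds it to the parser before
    reading any further data.
    """
    first_block = stream.read(block_size)
    kind = 'bytes' if isinstance(first_block, bytes) else 'text'
    return kind, first_block


def choose_input_encoding(kind, user_encoding=None):
    """Pick the encoding to declare once the data type is known."""
    if kind == 'text':
        # Text gets encoded as UTF-8 before parsing, so declare UTF-8
        # instead of leaving the parser to guess.
        return 'utf-8'
    return user_encoding  # None means: let the parser detect it


kind, block = peek_first_block(io.StringIO(u'<html>\u4f60\u597d</html>'))
assert (kind, choose_input_encoding(kind)) == ('text', 'utf-8')
```

On Python 2, bytes is an alias for str, so a StringIO.StringIO holding a plain str would be reported as 'bytes' here.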

Changed in lxml:
importance: Undecided → Medium
status: New → Confirmed
Matt Giuca (mgiuca) wrote :

I agree that that solution would be ideal. Mine is a bit of a hack (as discussed on the commit, it is special-cased for StringIO and there isn't an obvious way to generalise it to all file-like objects that return Unicode strings). That is why I filed a separate bug for StringIO in particular (the more general bug is bug #898072).

Note: We no longer need this fix, as we have fixed our system so that it supports iconv. So it's possibly better to wait until someone actually encounters this problem and needs a fix instead of trying to rewrite code unnecessarily.

scoder (scoder) wrote :
Changed in lxml:
assignee: nobody → scoder (scoder)
milestone: none → 3.3
status: Confirmed → Fix Committed
scoder (scoder) wrote :

Fixed in lxml 3.3.0.

Changed in lxml:
status: Fix Committed → Fix Released