lxml

lxml.html.parse treats encoding as Latin1 in Python 3 when reading from unicode file-objects directly

Bug #898072 reported by lilydjwg on 2011-11-30

This bug affects 4 people

Affects	Status	Importance	Assigned to	Milestone
lxml	Confirmed	Low	Unassigned

Bug Description

In Python 3, when `lxml.html.parse` parses a file given by name or as a file object, lxml assumes the content is Latin1. However, when I provide a text-mode file object, reading from it already produces Unicode, so lxml should not get the encoding wrong. Reading from an `io.StringIO` works as expected, as does `lxml.etree.parse`.

Python : sys.version_info(major=3, minor=2, micro=2, releaselevel='final', serial=0)
lxml.etree : (2, 3, 2, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

lilydjwg (lilydjwg) wrote on 2011-11-30:

This code snippet will reproduce the bug (174 bytes, text/x-python)

lilydjwg (lilydjwg) wrote on 2011-11-30:

The HTML that causes the bug (38 bytes, text/html)

scoder (scoder) wrote on 2011-11-30:

I assume that your system's default encoding (that CPython uses for opening the file) is not Latin-1 and that the HTML page uses exactly that encoding? In that case, pass the encoding into the parser explicitly.

Rejecting this ticket, because lxml (or libxml2) cannot possibly know what encoding your file is encoded with if the file does not contain any information about the encoding.

Changed in lxml:
status:	New → Invalid

lilydjwg (lilydjwg) wrote on 2011-11-30:

@Stefan Behnel, I don't think lxml needs to know which encoding the file is in, because in Python 3.x, `open` handles this when opening in text mode. What lxml reads from that file object is encoding-independent: it's Unicode. lxml somehow encodes this into UTF-8, then takes it as Latin1.
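
The symptom described in this comment (text turned into UTF-8 bytes, then interpreted as Latin-1) can be illustrated with the standard library alone. This is only a sketch of the effect, not of lxml's or libxml2's code path, and the sample string is an arbitrary assumption rather than the content of the attached file.

```python
# Illustration only: reproduces the *symptom* with plain Python,
# not the actual lxml/libxml2 code path.
text = "中文"  # arbitrary sample text, not taken from the attachment

utf8_bytes = text.encode("utf-8")        # Unicode text encoded to UTF-8 bytes
misread = utf8_bytes.decode("latin-1")   # the same bytes interpreted as Latin-1

print(repr(misread))  # one garbled character per UTF-8 byte

# Latin-1 maps every byte to exactly one code point, so the damage
# can be undone as long as nothing else has modified the string.
repaired = misread.encode("latin-1").decode("utf-8")
```

Because each UTF-8 byte becomes one Latin-1 character, the misread string is longer than the original, which is the typical sign of this kind of mismatch.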

scoder (scoder) wrote on 2011-11-30:

No, Python 3.x does not magically "handle" this. It *guesses* the encoding, based on platform parameters. libxml2 guesses something different, which is just as good and simply happens to be the wrong assumption for this specific file. Just because the encoding of the file happens to be the same as the default encoding on your platform does not mean that that's the case for all files that you (or someone else) want to parse. So, doing what Python does would work in this specific case, but fail in others.
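
To make the "guess" concrete, here is a small standard-library sketch. It assumes the documented CPython behaviour that text-mode `open()` without an `encoding` argument falls back to the locale's preferred encoding; the temporary file and sample text are made up for illustration.

```python
import locale
import os
import tempfile

# The platform-dependent encoding that text-mode open() uses
# when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)
print("platform default:", default_encoding)

data = "caf\u00e9".encode("utf-8")  # sample bytes, known to be UTF-8

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

try:
    # Explicit encoding: the result does not depend on the platform.
    with open(path, encoding="utf-8") as f:
        explicit = f.read()
    # Binary mode: no decoding happens, the caller gets the raw bytes.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)
```

Opening the same file with a bare `open(path)` would decode it with `default_encoding`, which is only correct when the file happens to use that encoding.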

lilydjwg (lilydjwg) wrote on 2011-11-30:

I mean, Python 3.x will decode the file in the specified encoding or system default one. I did not pass `encoding='utf-8'` to the `open()` call just because that happens to be my system default (my fault), which will decode the attached html file correctly.

lxml (or libxml2) *needn't* guess at all about a *text file object*. Python already takes care of this, and lxml should not ignore this. Why set `encoding` of `htmlCtxtReadIO` to NULL (around parser.pxi:331) when you've encoded it in UTF-8 (around parser.pxi:380)?

scoder (scoder) wrote on 2011-12-01: Re: [Bug 898072] Re: lxml.html.parse treats encoding as Latin1 when reading from file-objects directly

> I mean, Python 3.x will decode the file in the specified encoding or
> system default one. I did not pass `encoding='utf-8'` to the `open()`
> call just because that happens to be my system default (my fault), which
> will decode the attached html file correctly.

Ok.

> lxml (or libxml2) *needn't* guess at all about a *text file object*.
> Python already takes care of this, and lxml should not ignore this.

Well, it doesn't really ignore it. The thing is that this is not a trivial 
part of the code. You can't change the encoding along the way, and libxml2 
either requires to know it before starting to parse, or it will try to 
detect it along the way, which, in the case of HTML, means that it needs to 
parse quite a bit of the document until it finds a <meta> tag (or maybe 
fails to find it and needs to guess...).

Encoding to UTF-8 is actually perfectly correct for XML, it just poses a 
problem for HTML, which lacks a proper way of declaring the encoding up-front.

In any case, the parser needs to read bytes at some point, whether they 
came in inside of a unicode object or a bytes object.

> Why
> set `encoding` of `htmlCtxtReadIO` to NULL (around parser.pxi:331) when
> you've encoded it in UTF-8 (around parser.pxi:380)?

At that point, lxml does not yet know that the data type returned by the 
file object will be unicode.

I do see that it's worth handling this better, also because there is some 
code overlap with the "feed parser" further down (line 1030+).

What I could imagine is that the _FileReaderContext would not 
initialise the parser context before it has read the first string from the 
file-like object. And it would raise an error if it sees that the file-like 
object ever returns a different type than what it returned as first result.

There is code in the string parsing section that currently supports parsing 
from the data buffer of unicode objects. However, that is something that 
generally needs revising for CPython 3.3, which has a different internal 
representation for the data buffer - or rather *four* of them, which makes 
it impossible to parse directly from the buffer step by step, as both the 
_FileReaderContext and the feed parser need to do. It currently works, but 
it's extremely inefficient because it recodes the internal buffer before 
parsing from it.

Lots of open tasks, little money to do them. And fixing or speeding up 
Unicode parsing is certainly not a priority because it's rather useless in 
practice. I'll still reopen the ticket, because the current way it's 
handled is far from optimal and yields incorrect results for HTML. But 
unless you want to provide a patch yourself, don't expect it to get closed 
any time soon.

My advice is to pass the encoding parameter into the HTMLParser() instance, 
instead of opening the file in unicode mode. That's always going to work, 
and it's also *substantially* more efficient in both Py2 and Py3. It will 
also be more efficient in Py3.3, because creating the Unicode string inside 
of CPython's file implementation requires two passes over the data there, 
including a copy to a potentially wider buffer, before libxml2 recodes it 
again into a third copy and parses from that.

Handling bytes is a much faster single copy inside of CPython, then libxml2 
does its second copy while decoding the buffer and parses from that. In 
your specific case of UTF-8 encoded data, it can even avoid the recoding 
step completely, because UTF-8 is the native encoding that libxml2 uses 
internally.

Do some benchmarks, you'll quickly see the difference.

(I guess I should put the above into a FAQ section ...)

Stefan
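
A minimal sketch of the advice above, assuming lxml is installed: the parser is told the encoding and is fed bytes instead of decoded text. The sample document and the use of `io.BytesIO` in place of a real file are assumptions made to keep the example self-contained.

```python
import io

from lxml import html  # third-party; the library this report is about

# Sample document without any charset declaration (made up for this sketch).
data = "<html><body><p>中文</p></body></html>".encode("utf-8")

# State the encoding on the parser and hand it bytes, not decoded text.
parser = html.HTMLParser(encoding="utf-8")
tree = html.parse(io.BytesIO(data), parser=parser)

paragraph = tree.findtext(".//p")
print(paragraph)
```

For a file on disk, the equivalent would be to open it in binary mode (`open(path, "rb")`) and pass that file object, or simply the file name, together with the same parser.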

Changed in lxml:
importance:	Undecided → Low
status:	Invalid → Confirmed
summary:	- lxml.html.parse treats encoding as Latin1 when reading from file-objects directly
	+ lxml.html.parse treats encoding as Latin1 when reading from unicode file-objects directly
summary:	- lxml.html.parse treats encoding as Latin1 when reading from unicode file-objects directly
	+ lxml.html.parse treats encoding as Latin1 in Python 3 when reading from unicode file-objects directly

Matt Giuca (mgiuca) wrote on 2012-05-22:

I've read over the code for this, and I'm pretty sure this is almost identical to bug #1002581 (which I just filed, with a patch). Unfortunately, I was only able to work around the issue for StringIO and not for the more general case of unicode files.

The problem is the _FileReaderContext class in parser.pxi. The _readDoc method calls libxml's htmlCtxtReadIO, passing the encoding (specified by the user, or NULL if unspecified). The copyToBuffer method is called on each chunk of the file (crucially, after htmlCtxtReadIO has been called), and it pulls the strings out of the file, checks if they are unicode and the encoding is None, and in that case, encodes them to UTF-8.

To fix it, you basically want to set the encoding to "UTF-8" regardless of what the user specified (if and only if it is a Unicode file). But it is going to be pretty hard to do that before calling htmlCtxtReadIO, because you can't generally know whether a file is going to produce Unicode strings until after you read some bytes from it. I worked around this for the special case of StringIO in my fix for bug #1002581.
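
The ordering problem can be sketched in plain Python: read the first chunk before choosing the encoding, and refuse a file object that later switches types (as suggested earlier in this thread). This is not lxml's implementation; `open_reader` and its parameters are hypothetical names used only for illustration.

```python
import io


def open_reader(fileobj, user_encoding=None, chunk_size=4096):
    """Pick the parser encoding only after seeing the first chunk.

    Hypothetical helper: returns (encoding, iterator of bytes chunks).
    """
    first = fileobj.read(chunk_size)
    is_text = isinstance(first, str)
    # Text chunks get encoded to UTF-8 below, so the parser must be told
    # UTF-8 regardless of what the user asked for.
    encoding = "UTF-8" if is_text else user_encoding

    def chunks():
        chunk = first
        while chunk:
            if isinstance(chunk, str) != is_text:
                raise TypeError("file object changed its return type while reading")
            yield chunk.encode("utf-8") if is_text else chunk
            chunk = fileobj.read(chunk_size)

    return encoding, chunks()


text_encoding, text_chunks = open_reader(io.StringIO("中文"), user_encoding="latin-1")
text_bytes = b"".join(text_chunks)

byte_encoding, byte_chunks = open_reader(io.BytesIO(b"abc"))
byte_bytes = b"".join(byte_chunks)
```

The point of the design is that the encoding is decided after the first `read()` but before any parser context would be created, which is exactly the step that is hard to reorder around `htmlCtxtReadIO`.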

Duplicates of this bug

Bug #1338546
