Comment 7 for bug 898072

scoder (scoder) wrote : Re: [Bug 898072] Re: lxml.html.parse treats encoding as Latin1 when reading from file-objects directly

> I mean, Python 3.x will decode the file in the specified encoding or
> system default one. I did not pass `encoding='utf-8'` to the `open()`
> call just because that happens to be my system default (my fault), which
> will decode the attached html file correctly.


> lxml (or libxml2) *needn't* guess at all about a *text file object*.
> Python already takes care of this, and lxml should not ignore this.

Well, it doesn't really ignore it. The thing is that this is not a trivial
part of the code. You can't change the encoding along the way, and libxml2
either needs to know it before it starts parsing, or it will try to
detect it along the way, which, in the case of HTML, means that it needs to
parse quite a bit of the document until it finds a <meta> tag (or maybe
fails to find one and needs to guess...).

Encoding to UTF-8 is actually perfectly correct for XML, it just poses a
problem for HTML, which lacks a proper way of declaring the encoding up-front.

In any case, the parser needs to read bytes at some point, whether they
came in inside of a unicode object or a bytes object.

> Why
> set `encoding` of `htmlCtxtReadIO` to NULL (around parser.pxi:331) when
> you've encoded it in UTF-8 (around parser.pxi:380)?

At that point, lxml does not yet know that the data type returned by the
file object will be unicode.

I do see that it's worth handling this better, also because there is some
code overlap with the "feed parser" further down (line 1030+).

What I could imagine is that the _FileReaderContext would not initialise
the parser context before it has read the first string from the file-like
object. It would then raise an error if the file-like object ever returns
a different type than the one it returned as its first result.
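
A rough pure-Python sketch of that idea, just to illustrate the control
flow. This is not lxml's actual _FileReaderContext (that lives in Cython);
the class name TypeCheckedReader and its attributes are made up for the
example:

```python
import io


class TypeCheckedReader:
    """Hypothetical wrapper: inspect the first chunk, then enforce its type."""

    def __init__(self, fileobj, chunk_size=32768):
        self._file = fileobj
        self._chunk_size = chunk_size
        # Read once up front, so the caller can pick the parser setup
        # (bytes vs. text) before any parsing starts.
        self._first = fileobj.read(chunk_size)
        self.data_type = type(self._first)
        if self.data_type not in (bytes, str):
            raise TypeError("read() must return bytes or str")

    @property
    def is_text(self):
        return self.data_type is str

    def chunks(self):
        chunk = self._first
        while chunk:
            # Refuse a file-like object that switches types halfway through.
            if type(chunk) is not self.data_type:
                raise TypeError("read() changed its return type")
            yield chunk
            chunk = self._file.read(self._chunk_size)


text_reader = TypeCheckedReader(io.StringIO("<p>text</p>"))
byte_reader = TypeCheckedReader(io.BytesIO(b"<p>bytes</p>"))
text_chunks = list(text_reader.chunks())
```

With something like this, the decision about the input encoding could be
made from `is_text` after the first read instead of before it.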

There is code in the string parsing section that currently supports parsing
from the data buffer of unicode objects. However, that is something that
generally needs revising for CPython 3.3, which has a different internal
representation for the data buffer - or rather *four* of them, which makes
it impossible to parse directly from the buffer step by step, as both the
_FileReaderContext and the feed parser need to do. It currently works, but
it's extremely inefficient because it recodes the internal buffer before
parsing from it.
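
To see the different internal representations, here is a small check that
assumes CPython 3.3 or later (other implementations may behave
differently). The helper name storage_width is made up; it only measures
how many bytes each additional character costs:

```python
import sys


def storage_width(ch, n=1000):
    # The fixed object header cancels out in the difference, which leaves
    # the per-character storage cost of the string's internal buffer.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


ascii_width = storage_width("a")            # pure ASCII
latin1_width = storage_width("\xe9")        # Latin-1 range
bmp_width = storage_width("\u20ac")         # rest of the BMP
astral_width = storage_width("\U0001F600")  # beyond the BMP
```

The width depends on the widest character in the string, so a parser
cannot assume a single buffer layout for all unicode objects.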

Lots of open tasks, little money to do them. And fixing or speeding up
Unicode parsing is certainly not a priority because it's rather useless in
practice. I'll still reopen the ticket, because the current way it's
handled is far from optimal and yields incorrect results for HTML. But
unless you want to provide a patch yourself, don't expect it to get closed
any time soon.

My advice is to pass the encoding parameter into the HTMLParser() instance,
instead of opening the file in unicode mode. That's always going to work,
and it's also *substantially* more efficient in both Py2 and Py3. It will
also be more efficient in Py3.3, because creating the Unicode string inside
of CPython's file implementation requires two passes over the data there,
including a copy to a potentially wider buffer, before libxml2 recodes it
again into a third copy and parses from that.
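
In code, that looks roughly like the following sketch. It uses
etree.HTMLParser(encoding=...) and etree.parse() from lxml; the helper
name parse_html_bytes and the sample document are made up, and a BytesIO
stands in for a file opened with open(path, 'rb'):

```python
import io

from lxml import etree

# Sample document, encoded as it would be stored on disk.
html_bytes = "<html><body><p>Grüße, 世界</p></body></html>".encode("utf-8")


def parse_html_bytes(binary_file, encoding):
    # Tell the parser the encoding up-front, so that libxml2 does not
    # have to look for a <meta> tag or guess.
    parser = etree.HTMLParser(encoding=encoding)
    return etree.parse(binary_file, parser)


# In real code: with open(path, "rb") as f: tree = parse_html_bytes(f, "utf-8")
tree = parse_html_bytes(io.BytesIO(html_bytes), "utf-8")
paragraph_text = tree.getroot().findtext(".//p")
```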

Handling bytes is a much faster single copy inside of CPython, then libxml2
does its second copy while decoding the buffer and parses from that. In
your specific case of UTF-8 encoded data, it can even avoid the recoding
step completely, because UTF-8 is the native encoding that libxml2 uses.

Do some benchmarks, you'll quickly see the difference.

(I guess I should put the above into a FAQ section ...)
