Comment 7 for bug 898072

scoder (scoder) wrote : Re: [Bug 898072] Re: lxml.html.parse treats encoding as Latin1 when reading from file-objects directly

> I mean, Python 3.x will decode the file in the specified encoding or
> system default one. I did not pass `encoding='utf-8'` to the `open()`
> call just because that happens to be my system default (my fault), which
> will decode the attached html file correctly.

Ok.

> lxml (or libxml2) *needn't* guess at all about a *text file object*.
> Python already takes care of this, and lxml should not ignore this.

Well, it doesn't really ignore it. The thing is that this is not a trivial
part of the code. You can't change the encoding along the way, and libxml2
either needs to know it before it starts parsing, or it will try to detect
it as it goes, which, in the case of HTML, means that it has to parse quite
a bit of the document until it finds a <meta> tag (or fails to find one and
has to guess...).

Encoding to UTF-8 is actually perfectly correct for XML; it just poses a
problem for HTML, which lacks a proper way of declaring the encoding up-front.
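
Just to illustrate the difference: an XML document can announce its encoding
in the very first bytes, whereas an HTML document can only hint at it in a
<meta> tag somewhere inside <head>:

    <?xml version="1.0" encoding="utf-8"?>
    <root>...</root>

versus

    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    ...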

In any case, the parser needs to read bytes at some point, whether they
arrive wrapped in a unicode object or in a bytes object.

> Why
> set `encoding` of `htmlCtxtReadIO` to NULL (around parser.pxi:331) when
> you've encoded it in UTF-8 (around parser.pxi:380)?

At that point, lxml does not yet know that the data type returned by the
file object will be unicode.

I do see that it's worth handling this better, also because there is some
code overlap with the "feed parser" further down (line 1030+).

What I could imagine happening is that the _FileReaderContext would not
initialise the parser context before it has read the first string from the
file-like object. And it would raise an error if it sees that the file-like
object ever returns a different type than what it returned as its first
result.
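
In rough Python terms, that idea would look something like this (the real
code is Cython in parser.pxi, and these names are invented just for the
sketch):

    class _FileReaderContext(object):
        def __init__(self, chunk_size=32768):
            self._chunk_size = chunk_size
            self._got_bytes = None   # unknown until the first read

        def _read_first_chunk(self, f):
            chunk = f.read(self._chunk_size)
            # Only now do we know whether the file returns bytes or
            # unicode, so only now would the libxml2 parser context be
            # set up (bytes => let libxml2 handle the encoding,
            # unicode => feed it UTF-8 and declare that explicitly).
            self._got_bytes = isinstance(chunk, bytes)
            return chunk

        def _read_next_chunk(self, f):
            chunk = f.read(self._chunk_size)
            if chunk and isinstance(chunk, bytes) != self._got_bytes:
                raise TypeError(
                    "file-like object returned a different string type "
                    "than on the first read")
            return chunk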

There is code in the string parsing section that currently supports parsing
from the data buffer of unicode objects. However, that code generally needs
revising for CPython 3.3, which uses a different internal representation for
the data buffer - or rather *four* of them - which makes it impossible to
parse directly from the buffer step by step, as both the _FileReaderContext
and the feed parser need to do. It currently works, but it's extremely
inefficient because it recodes the internal buffer before parsing from it.

Lots of open tasks, little money to do them. And fixing or speeding up
Unicode parsing is certainly not a priority because it's rather useless in
practice. I'll still reopen the ticket, because the current way it's
handled is far from optimal and yields incorrect results for HTML. But
unless you want to provide a patch yourself, don't expect it to get closed
any time soon.

My advice is to pass the encoding parameter into the HTMLParser() instance,
instead of opening the file in unicode mode. That's always going to work,
and it's also *substantially* more efficient in both Py2 and Py3. It will
remain more efficient in Py3.3 as well, because creating the Unicode string
inside of CPython's file implementation requires two passes over the data,
including a copy into a potentially wider buffer, before libxml2 recodes it
again into a third copy and parses from that.
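
Concretely, that means something like this (assuming UTF-8 encoded data;
the file name is just an example):

    from lxml import html

    parser = html.HTMLParser(encoding='utf-8')
    with open('page.html', 'rb') as f:   # binary mode, no decoding in Python
        tree = html.parse(f, parser=parser)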

Handling bytes, by contrast, takes a single, much faster copy inside of
CPython; libxml2 then makes its second copy while decoding the buffer and
parses from that. In your specific case of UTF-8 encoded data, it can even
avoid the recoding step completely, because UTF-8 is the native encoding
that libxml2 uses internally.

Do some benchmarks, you'll quickly see the difference.
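
Something along these lines will show it (a rough timeit sketch;
'page.html' is a placeholder for your file):

    import io, timeit
    from lxml import html

    parser = html.HTMLParser(encoding='utf-8')

    def parse_text():
        # decode in Python first, as in your report
        with io.open('page.html', encoding='utf-8') as f:
            html.parse(f)

    def parse_bytes():
        # pass raw bytes and give the encoding to the parser instead
        with open('page.html', 'rb') as f:
            html.parse(f, parser=parser)

    print(timeit.timeit(parse_text, number=100))
    print(timeit.timeit(parse_bytes, number=100))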

(I guess I should put the above into a FAQ section ...)

Stefan