lxml.html.parse treats encoding as Latin1 in Python 3 when reading from unicode file-objects directly

Bug #898072 reported by lilydjwg
This bug affects 4 people
Affects  Status     Importance  Assigned to  Milestone
lxml     Confirmed  Low         Unassigned

Bug Description

In Python 3, when `lxml.html.parse` is used to parse a file or a file object, lxml assumes the content is Latin1. But when I provide a text-mode file object, reading from it already produces Unicode, so lxml should not get the encoding wrong. Reading from an `io.StringIO` works as expected, as does `lxml.etree.parse`.

Python : sys.version_info(major=3, minor=2, micro=2, releaselevel='final', serial=0)
lxml.etree : (2, 3, 2, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Revision history for this message
scoder (scoder) wrote :

I assume that your system's default encoding (that CPython uses for opening the file) is not Latin-1 and that the HTML page uses exactly that encoding? In that case, pass the encoding into the parser explicitly.

Rejecting this ticket, because lxml (or libxml2) cannot possibly know what encoding your file is encoded with if the file does not contain any information about the encoding.

Changed in lxml:
status: New → Invalid
Revision history for this message
lilydjwg (lilydjwg) wrote :

@Stefan Behnel, I don't think lxml needs to know which encoding the file is in, because in Python 3.x, `open` handles this when opening in text mode. What lxml reads from that file object is encoding independent: it's unicode. lxml somehow encodes this into UTF-8, then takes it as Latin1.
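
The symptom described here (text encoded to UTF-8, then interpreted as Latin1) can be reproduced without lxml. A minimal sketch in plain Python, using a made-up sample string rather than the attached file:

```python
# Illustration only: UTF-8 bytes misread as Latin-1 produce mojibake.
text = "café"                          # made-up sample, not from the report

utf8_bytes = text.encode("utf-8")      # "é" becomes the two bytes C3 A9
misread = utf8_bytes.decode("latin-1") # each byte is taken as one character

print(misread)                         # → cafÃ©

# The damage is reversible as long as no bytes were dropped.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == text)                # → True
```

Non-ASCII characters turning into two-character sequences like this is the usual sign that UTF-8 data was read under a Latin1 assumption.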

Revision history for this message
scoder (scoder) wrote :

No, Python 3.x does not magically "handle" this. It *guesses* the encoding, based on platform parameters. libxml2 guesses something different, which is just as good and simply happens to be the wrong assumption for this specific file. Just because the encoding of the file happens to be the same as the default encoding on your platform does not mean that that's the case for all files that you (or someone else) want to parse. So, doing what Python does would work in this specific case, but fail in others.
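
The platform dependence of `open()` in text mode can be checked directly. The following is a small sketch using only the standard library; the file content is a made-up sample, and the printed default will differ between systems:

```python
import locale
import os
import tempfile

# Write a small UTF-8 file (sample content is made up for illustration).
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write("<p>café</p>".encode("utf-8"))
    path = tmp.name

try:
    # No encoding given: Python picks one from the platform settings.
    with open(path) as f:
        default_encoding = f.encoding

    # Encoding given: the choice no longer depends on the platform.
    with open(path, encoding="utf-8") as f:
        explicit_encoding = f.encoding
        content = f.read()
finally:
    os.remove(path)

print("platform preference:", locale.getpreferredencoding(False))
print("open() default     :", default_encoding)
print("open() explicit    :", explicit_encoding)
```

Only the explicit call is guaranteed to decode this file correctly on every machine.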

Revision history for this message
lilydjwg (lilydjwg) wrote :

I mean, Python 3.x will decode the file in the specified encoding or system default one. I did not pass `encoding='utf-8'` to the `open()` call just because that happens to be my system default (my fault), which will decode the attached html file correctly.

lxml (or libxml2) *needn't* guess at all about a *text file object*. Python already takes care of this, and lxml should not ignore this. Why set `encoding` of `htmlCtxtReadIO` to NULL (around parser.pxi:331) when you've encoded it in UTF-8 (around parser.pxi:380)?

Revision history for this message
scoder (scoder) wrote : Re: [Bug 898072] Re: lxml.html.parse treats encoding as Latin1 when reading from file-objects directly

> I mean, Python 3.x will decode the file in the specified encoding or
> system default one. I did not pass `encoding='utf-8'` to the `open()`
> call just because that happens to be my system default (my fault), which
> will decode the attached html file correctly.

Ok.

> lxml (or libxml2) *needn't* guess at all about a *text file object*.
> Python already takes care of this, and lxml should not ignore this.

Well, it doesn't really ignore it. The thing is that this is not a trivial
part of the code. You can't change the encoding along the way, and libxml2
either requires to know it before starting to parse, or it will try to
detect it along the way, which, in the case of HTML, means that it needs to
parse quite a bit of the document until it finds a <meta> tag (or maybe
fails to find it and needs to guess...).

Encoding to UTF-8 is actually perfectly correct for XML, it just poses a
problem for HTML, which lacks a proper way of declaring the encoding up-front.

In any case, the parser needs to read bytes at some point, whether they
came in inside of a unicode object or a bytes object.

> Why
> set `encoding` of `htmlCtxtReadIO` to NULL (around parser.pxi:331) when
> you've encoded it in UTF-8 (around parser.pxi:380)?

At that point, lxml does not yet know that the data type returned by the
file object will be unicode.

I do see that it's worth handling this better, also because there is some
code overlap with the "feed parser" further down (line 1030+).

What I could imagine happening is that the _FileReaderContext would not
initialise the parser context before it has read the first string from the
file-like object. And it would raise an error if it sees that the file-like
object ever returns a different type than what it returned as first result.
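
The idea in the previous paragraph can be sketched in plain Python. This is
not lxml's `_FileReaderContext`; `TypeCheckingReader` and its `encoding`
attribute are hypothetical names, and choosing "UTF-8" for text input is an
assumption made for the illustration:

```python
import io


class TypeCheckingReader:
    """Hypothetical wrapper: decide the encoding only after the first read."""

    def __init__(self, fileobj):
        self._file = fileobj
        self._first_type = None   # type of the first chunk, unknown until read
        self.encoding = None      # what a parser context would be set up with

    def read(self, size=-1):
        data = self._file.read(size)
        if not isinstance(data, (str, bytes)):
            raise TypeError("read() must return str or bytes")
        if self._first_type is None:
            # First chunk: this is where a parser context could be initialised.
            self._first_type = type(data)
            # Assumption: text is handed on as UTF-8 bytes, so declare that.
            self.encoding = "UTF-8" if isinstance(data, str) else None
        elif type(data) is not self._first_type:
            raise TypeError("file-like object changed its return type")
        # The parser consumes bytes either way.
        return data.encode("utf-8") if isinstance(data, str) else data


reader = TypeCheckingReader(io.StringIO("<p>café</p>"))
first_chunk = reader.read(4)
print(reader.encoding, first_chunk)
```

For a bytes source the sketch leaves `encoding` as None, so detection or a
user-supplied value would still apply there.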

There is code in the string parsing section that currently supports parsing
from the data buffer of unicode objects. However, that is something that
generally needs revising for CPython 3.3, which has a different internal
representation for the data buffer - or rather *four* of them, which makes
it impossible to parse directly from the buffer step by step, as both the
_FileReaderContext and the feed parser need to do. It currently works, but
it's extremely inefficient because it recodes the internal buffer before
parsing from it.

Lots of open tasks, little money to do them. And fixing or speeding up
Unicode parsing is certainly not a priority because it's rather useless in
practice. I'll still reopen the ticket, because the current way it's
handled is far from optimal and yields incorrect results for HTML. But
unless you want to provide a patch yourself, don't expect it to get closed
any time soon.

My advice is to pass the encoding parameter into the HTMLParser() instance,
instead of opening the file in unicode mode. That's always going to work,
and it's also *substantially* more efficient in both Py2 and Py3. It will
also be more efficient in Py3.3, because creating the Unicode string inside
of CPython's file implementation requires two passes over the data there,
including a copy to a potentially wider buffer, before libxml2 recodes it
again into a third copy ...
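
A minimal sketch of that advice, assuming the document is UTF-8 and carries
no <meta> charset declaration. The sample markup is made up, and an
`io.BytesIO` stands in for a file opened with `open(path, "rb")`:

```python
import io

import lxml.html

# Made-up sample: UTF-8 encoded HTML without any encoding declaration.
raw = "<html><body><p>café</p></body></html>".encode("utf-8")

# Tell the parser the encoding up-front and feed it bytes, not text.
parser = lxml.html.HTMLParser(encoding="utf-8")
tree = lxml.html.parse(io.BytesIO(raw), parser=parser)

paragraph_text = tree.getroot().findtext(".//p")
print(paragraph_text)
```

The encoding passed here has to match the actual file; it is stated by the
caller, not detected.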


Changed in lxml:
importance: Undecided → Low
status: Invalid → Confirmed
summary: - lxml.html.parse treats encoding as Latin1 when reading from file-objects
- directly
+ lxml.html.parse treats encoding as Latin1 when reading from unicode
+ file-objects directly
summary: - lxml.html.parse treats encoding as Latin1 when reading from unicode
- file-objects directly
+ lxml.html.parse treats encoding as Latin1 in Python 3 when reading from
+ unicode file-objects directly
Revision history for this message
Matt Giuca (mgiuca) wrote :

I've read over the code for this, and I'm pretty sure this is almost identical to bug #1002581 (which I just filed, with a patch). Unfortunately, I was only able to work around the issue for StringIO and not for the more general case of unicode files.

The problem is the _FileReaderContext class in parser.pxi. The _readDoc method calls libxml's htmlCtxtReadIO, passing the encoding (specified by the user, or NULL if unspecified). The copyToBuffer method is called on each chunk of the file (crucially, after htmlCtxtReadIO has been called), and it pulls the strings out of the file, checks if they are unicode and the encoding is None, and in that case, encodes them to UTF-8.

To fix it, you basically want to set the encoding to "UTF-8" regardless of what the user specified (if and only if it is a Unicode file). But it is going to be pretty hard to do that before calling htmlCtxtReadIO, because you can't generally know whether a file is going to produce Unicode strings until after you read some bytes from it. I worked around this for the special case of StringIO in my fix for bug #1002581.
