> I mean, Python 3.x will decode the file in the specified encoding or
> system default one. I did not pass `encoding='utf-8'` to the `open()`
> call just because that happens to be my system default (my fault), which
> will decode the attached html file correctly.

Ok.

> lxml (or libxml2) *needn't* guess at all about a *text file object*.
> Python already takes care of this, and lxml should not ignore this.

Well, it doesn't really ignore it. The thing is that this is not a
trivial part of the code. You can't change the encoding along the way,
and libxml2 either requires knowing it before it starts to parse, or it
will try to detect it along the way, which, in the case of HTML, means
that it needs to parse quite a bit of the document until it finds a meta
tag that declares the encoding (or fails to find one and has to
guess...).

Encoding to UTF-8 is actually perfectly correct for XML; it just poses a
problem for HTML, which lacks a proper way of declaring the encoding
up-front. In any case, the parser needs to read bytes at some point,
whether they arrived inside a unicode object or a bytes object.

> Why set `encoding` of `htmlCtxtReadIO` to NULL (around parser.pxi:331)
> when you've encoded it in UTF-8 (around parser.pxi:380)?

At that point, lxml does not yet know that the data type returned by the
file object will be unicode. I do see that it's worth handling this
better, also because there is some code overlap with the "feed parser"
further down (line 1030+). What I could imagine happening is that the
_FileReaderContext would not initialise the parser context until it has
read the first string from the file-like object, and that it would raise
an error if the file-like object ever returns a different type than what
it returned first.

There is code in the string parsing section that currently supports
parsing straight from the data buffer of unicode objects. However, that
is something that generally needs revising for CPython 3.3, which has a
different internal representation for the data buffer - or rather *four*
of them, which makes it impossible to parse directly from the buffer
step by step, as both the _FileReaderContext and the feed parser need to
do. It currently works, but it's extremely inefficient, because it
recodes the internal buffer before parsing from it.

Lots of open tasks, little money to do them. And fixing or speeding up
unicode parsing is certainly not a priority, because it's rather useless
in practice.

I'll still reopen the ticket, because the current handling is far from
optimal and yields incorrect results for HTML. But unless you want to
provide a patch yourself, don't expect it to get closed any time soon.

My advice is to pass the encoding parameter into the HTMLParser()
instance, instead of opening the file in unicode mode. That always
works, and it's also *substantially* more efficient, in both Py2 and
Py3. It will remain more efficient in Py3.3, because creating the
unicode string inside of CPython's file implementation requires two
passes over the data there, including a copy into a potentially wider
buffer, before libxml2 recodes it again into a third copy and parses
from that. Handling bytes is a much faster single copy inside of
CPython; libxml2 then makes its second copy while decoding the buffer
and parses from that. In your specific case of UTF-8 encoded data, it
can even avoid the recoding step completely, because UTF-8 is the native
encoding that libxml2 uses internally. Do some benchmarks, you'll
quickly see the difference.
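For example, here is a minimal sketch of both variants (the file name
"page.html" is just a placeholder for your own data):

    from lxml import etree

    # recommended: open in binary mode and declare the encoding on the
    # parser, so that libxml2 decodes the raw bytes itself
    parser = etree.HTMLParser(encoding='utf-8')
    with open('page.html', 'rb') as f:
        tree = etree.parse(f, parser)

    # slower (the variant discussed above): text mode makes CPython
    # decode the bytes into a unicode string, which lxml then re-encodes
    # to UTF-8 before libxml2 decodes it once more while parsing
    with open('page.html', encoding='utf-8') as f:
        tree = etree.parse(f, etree.HTMLParser())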
(I guess I should put the above into a FAQ section ...)

Stefan