lxml

Comment 6 for bug 898072

lilydjwg (lilydjwg) wrote on 2011-11-30: Re: lxml.html.parse treats encoding as Latin1 when reading from file-objects directly

What I mean is that Python 3.x decodes the file using the specified encoding, or the system default if none is given. I did not pass `encoding='utf-8'` to the `open()` call only because UTF-8 happens to be my system default (my fault), and that default decodes the attached HTML file correctly.
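To illustrate the point about `open()`, here is a minimal sketch using only the standard library. The sample string and the temporary file are made up for the demonstration and are not the attachment from this bug. It shows that a text-mode file object already returns decoded `str` data and records which encoding it used.

```python
import os
import tempfile

# Hypothetical sample content; not the HTML file attached to the bug.
text = "<p>中文 café</p>"

# Write the sample as UTF-8 bytes so the on-disk encoding is known.
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(text.encode("utf-8"))
    path = tmp.name

try:
    # Explicit encoding: Python decodes the bytes, read() returns str.
    with open(path, encoding="utf-8") as f:
        explicit_encoding = f.encoding
        explicit = f.read()

    # No encoding given: Python picks a default that depends on the
    # platform and locale. errors="replace" keeps this demo from raising
    # if that default happens not to be UTF-8.
    with open(path, errors="replace") as f:
        default_encoding = f.encoding
        implicit = f.read()
finally:
    os.remove(path)

print("explicit:", explicit_encoding)
print("default: ", default_encoding)
```

On a system whose default is UTF-8, both reads give the same text, which is why omitting `encoding='utf-8'` went unnoticed here.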

lxml (or libxml2) *needn't* guess at all for a *text file object*. Python has already handled the decoding, and lxml should not ignore that. Why pass NULL as the `encoding` argument of `htmlCtxtReadIO` (around parser.pxi:331) when the data has already been encoded as UTF-8 (around parser.pxi:380)?
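The argument can be sketched in plain Python. This is not lxml's code: `utf8_chunks` is a hypothetical helper written only to show the idea. Once text read from a file object has been re-encoded as UTF-8, the encoding of the resulting bytes is known, so a consumer has nothing to guess; treating those bytes as Latin-1 produces the kind of garbling described in this bug's title.

```python
import io


def utf8_chunks(text_file, chunk_size=4096):
    """Yield UTF-8 encoded chunks read from a text-mode file object.

    Hypothetical helper for illustration; not part of lxml.
    """
    while True:
        chunk = text_file.read(chunk_size)
        if not chunk:
            break
        if not isinstance(chunk, str):
            raise TypeError("expected a text-mode file object")
        # Each str chunk is whole code points, so every encoded chunk
        # is a complete, valid UTF-8 byte sequence.
        yield chunk.encode("utf-8")


source = io.StringIO("<p>中文 café</p>")
data = b"".join(utf8_chunks(source, chunk_size=4))

# The encoding is known because we produced the bytes ourselves.
declared_encoding = "utf-8"
roundtrip = data.decode(declared_encoding)

# Ignoring that knowledge and assuming Latin-1 garbles non-ASCII text.
misread = data.decode("latin-1")

print(roundtrip)
print(misread)
```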