lxml.html.parse treats encoding as Latin1 in Python 3 when reading from unicode file-objects directly
Bug #898072 reported by lilydjwg
This bug affects 4 people
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Confirmed | Low | Unassigned |
Bug Description
In Python 3, when `lxml.html.parse` is used to parse a file name or a file object, lxml assumes the content is Latin-1. However, when I pass a file object, reading from it already produces Unicode, so lxml should not apply an encoding guess at all. Reading from an `io.StringIO` works as expected, as does `lxml.etree.parse`.
Python : sys.version_
lxml.etree : (2, 3, 2, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
I assume that your system's default encoding (that CPython uses for opening the file) is not Latin-1 and that the HTML page uses exactly that encoding? In that case, pass the encoding into the parser explicitly.
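A minimal sketch of what passing the encoding explicitly could look like, assuming the page is UTF-8 and carries no charset declaration. The sample bytes and the helper name `parse_html_bytes` are made up for illustration; `lxml.html.HTMLParser(encoding=...)` and the `parser=` argument of `lxml.html.parse` are the lxml calls this relies on.

```python
import io

import lxml.html

# Hypothetical sample: UTF-8 encoded HTML with no <meta charset> declaration,
# so the parser has nothing in the document to detect the encoding from.
html_bytes = "<html><body><p>中文 café</p></body></html>".encode("utf-8")


def parse_html_bytes(source, encoding="utf-8"):
    """Parse a binary file object (or file name) with an explicit encoding.

    `encoding` is an assumption the caller makes about the input; the
    parser decodes the bytes with it instead of falling back to a default.
    """
    parser = lxml.html.HTMLParser(encoding=encoding)
    return lxml.html.parse(source, parser=parser)


tree = parse_html_bytes(io.BytesIO(html_bytes))
text = tree.getroot().text_content()
print(text)
```

With a real file, opening it in binary mode (`open(path, "rb")`) and handing that object to the helper keeps the decoding in the parser's hands rather than relying on the system default encoding.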
Rejecting this ticket, because lxml (or libxml2) cannot possibly know which encoding your file uses if the file does not contain any information about the encoding.