lxml

Comment 1 for bug 1002581

scoder (scoder) wrote on 2012-06-26:

Hmm, yes, it's very unfortunate that libxml2's HTML parser defaults to Latin-1 instead of UTF-8. If it weren't for backwards compatibility, that would be the thing to change, but I doubt that it'd be easy to work around...
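To illustrate why the Latin-1 default hurts, here is a minimal standard-library sketch of what happens when UTF-8 bytes get interpreted as Latin-1. It does not call libxml2 or lxml; the sample string and variable names are made up for the illustration.

```python
# Hypothetical sample: UTF-8 encoded HTML text with one non-ASCII character.
original = "café"
utf8_bytes = original.encode("utf-8")

# Decoding as Latin-1 never fails (every byte maps to a character),
# so the mistake is silent: the two-byte UTF-8 sequence becomes two characters.
misread = utf8_bytes.decode("latin-1")

# Decoding with the right encoding round-trips cleanly.
correct = utf8_bytes.decode("utf-8")
```

Here `misread` ends up one character longer than `original`, with no error raised. On the lxml side, passing an explicit `encoding` argument to `lxml.etree.HTMLParser` is the usual way to override the parser's default for a given document.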

Also, Unicode file parsing needs a major overhaul all by itself. It's currently rather fragile. What I think should happen is that the whole encoding setup should be delayed until after the first data string has been read from the file (or maybe the first data block should be read earlier and kept around). That would make it easier to react to the actual type of data returned by the file.
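The idea of delaying the encoding setup could be sketched roughly as follows. This is not lxml's actual implementation, only an illustration in plain Python; the function name `delayed_encoding_setup` and its parameters are hypothetical.

```python
import io
from itertools import chain

def delayed_encoding_setup(fileobj, declared_encoding=None, block_size=4096):
    """Read the first block, then decide how to treat the rest (hypothetical helper)."""
    first = fileobj.read(block_size)
    if isinstance(first, str):
        # The file returns already-decoded text, so a byte encoding
        # declared by the caller does not apply.
        kind, encoding = "text", None
    else:
        # The file returns raw bytes, so keep the caller's encoding (if any).
        kind, encoding = "bytes", declared_encoding
    # Keep the first block around and chain the remaining reads after it.
    # first[:0] is '' or b'', matching what read() returns at end of file.
    rest = iter(lambda: fileobj.read(block_size), first[:0])
    blocks = chain([first], rest) if first else iter(())
    return kind, encoding, blocks

# Usage: the same call handles text and byte sources.
kind_t, enc_t, blocks_t = delayed_encoding_setup(io.StringIO("<p>café</p>"), "latin-1", 4)
text_data = "".join(blocks_t)
kind_b, enc_b, blocks_b = delayed_encoding_setup(io.BytesIO(b"<p>x</p>"), "utf-8", 4)
byte_data = b"".join(blocks_b)
```

The point of the design is that the decision is made from the data the file actually returned, rather than from assumptions made before the first read.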