lxml

Comment 6 for bug 1240696

Dan Lecocq (q-dan) wrote on 2013-10-21:

It /seems/ to be an issue of pages declaring one charset while being stored in another. In particular, our crawler converts everything it sees to 'utf-8' based on the page's content-encoding header. But all the examples we found that exhibit this bug declare an encoding of Shift_JIS within the document, and the example provided in this ticket declares an encoding of 'GB2312'.
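For illustration only, here is a rough sketch of how such a mismatch could be spotted before the bytes reach the parser. It uses only the Python standard library and does not touch lxml; the helper names (`declared_charset`, `decodes_as`) and the sample page are made up for this example, and the regex is a crude stand-in for real charset sniffing.

```python
import re

# Crude pattern for a charset declaration in a <meta> tag (an assumption,
# not how lxml or libxml2 actually sniff the encoding).
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def declared_charset(raw):
    """Return the charset named near the top of the document, lowercased, or None."""
    match = META_CHARSET.search(raw[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def decodes_as(raw, encoding):
    """Return True if the bytes are a valid sequence in the given encoding."""
    try:
        raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return False
    return True


# Hypothetical page: declares Shift_JIS, but is stored as UTF-8,
# the way a crawler that re-encodes everything would store it.
page = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=Shift_JIS"></head>'
    "<body>日本語</body></html>"
)
stored = page.encode("utf-8")

declared = declared_charset(stored)
mismatch = declared is not None and not decodes_as(stored, declared)
print(declared, mismatch)
```

A document flagged this way could be parsed with the storage encoding passed explicitly, or have its in-document declaration rewritten first. Whether that avoids the crash reported here is untested.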

I've not contrived an example of this, but my suspicion is that these are cases where we're not providing enough content to be a valid document in the declared encoding. At least, this would jibe with our findings in valgrind, where we saw accesses to uninitialized memory when parsing these docs.
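To make "not enough content to be valid in the declared encoding" concrete, here is a small standard-library sketch. It is not a reproduction of the lxml bug; it only shows that a multi-byte sequence cut off mid-character is invalid under Shift_JIS or GB2312. The helper name `is_valid_in` and the sample strings are invented for the example.

```python
def is_valid_in(raw, encoding):
    """Return True if the bytes form a complete, valid sequence in the encoding."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Three double-byte characters in Shift_JIS, then the same bytes
# with the final trail byte missing.
complete_sjis = "日本語".encode("shift_jis")
truncated_sjis = complete_sjis[:-1]

# Same idea for GB2312.
complete_gb = "中文".encode("gb2312")
truncated_gb = complete_gb[:-1]

for label, raw, enc in [
    ("sjis complete", complete_sjis, "shift_jis"),
    ("sjis truncated", truncated_sjis, "shift_jis"),
    ("gb2312 complete", complete_gb, "gb2312"),
    ("gb2312 truncated", truncated_gb, "gb2312"),
]:
    print(label, is_valid_in(raw, enc))
```

A decoder that trusts the declared encoding and expects a trail byte after the last lead byte has nothing left to read at that point, which is at least consistent with the uninitialized reads seen in valgrind.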