Comment 5 for bug 1240696

Dan Lecocq (q-dan) wrote :

Agreed. And many of the pages that we crawl aren't even in the encoding they advertise in their headers. For most intents and purposes, we make a few guesses based on the encoding the headers / doc provide, and failing that, we just decode with that advertised encoding and the 'ignore' error handler.
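For illustration, the fallback described above might look roughly like the sketch below. This is not the crawler's actual code: the function name `decode_page`, the `guesses` list, and the UTF-8 default for unknown or missing encoding names are all assumptions made for the example.

```python
import codecs


def _is_known(enc):
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(enc)
        return True
    except LookupError:
        return False


def decode_page(raw, advertised, guesses=()):
    """Decode raw bytes, trying strict decodes first and a lossy one last.

    raw        -- page body as bytes
    advertised -- encoding named by the headers / document, or None
    guesses    -- extra encodings to try (hypothetical; pick your own)
    """
    candidates = [enc for enc in (advertised, *guesses) if enc]

    # Strict pass: accept the first encoding that decodes without error.
    for enc in candidates:
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue

    # Last resort: take the advertised encoding and drop undecodable bytes.
    # Falling back to UTF-8 for unknown names is an assumption of this sketch.
    lossy = advertised if advertised and _is_known(advertised) else "utf-8"
    return raw.decode(lossy, errors="ignore")
```

The lossy pass silently drops bytes, so the resulting text can be missing characters. That is the accuracy being traded away for robustness here.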

Clearly not the most /accurate/ solution, but robust enough for us. I may try to dig deeper into the exact cause, because this is a very unsatisfying solution -- it's nothing more than a stop-gap for many cases. We ran a lot of this through valgrind, which gave some unsettling warnings about conditional jumps depending on uninitialized memory, but libxml2 is not somewhere I'd like to spend much time debugging.

Incidentally, I did try to replicate this using just the libxml2 bindings directly and was unable to. That may be my lack of experience using those bindings directly, or that may be indicative of the problem actually living in lxml.

As a single reference point, this is the temporary fix we're pursuing, and we crawl about 10-20 million pages / day with this particular crawler.