Comment 12 for bug 1240696

Dan Lecocq (q-dan) wrote :

The nondeterminism here (a crash after k iterations on one run, but after m iterations on another) is typical of reading uninitialized memory. Further evidence: if you load a bad page once and repeatedly parse that same buffer, you can expect the same result every time, whereas re-reading the file on each iteration crashes pretty reliably:

###############################################
# If this works for the first run, it will continue working fine:
import os
from lxml import etree

with open('problem.html', 'rb') as fin:
    content = fin.read().decode('utf-8', 'ignore').encode('utf-8')

for i in xrange(1000):
    print i
    tree = etree.fromstring(content, etree.HTMLParser(recover=True))

###############################################
# This pretty reliably crashes:
import os
from lxml import etree

for i in xrange(1000):
    print i
    with open('problem.html', 'rb') as fin:
        content = fin.read().decode('utf-8', 'ignore').encode('utf-8')
    tree = etree.fromstring(content, etree.HTMLParser(recover=True))

###############################################

I think this property may have been mentioned in another report of the same bug, but I can't find it at the moment. It should be considered further evidence that uninitialized memory is being read somewhere.
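
To put numbers on the run-to-run variation, one option is to launch the crashing script several times in a subprocess and record how each run ended. The following is only a rough sketch: `run_trials` and the file name `crash.py` (standing in for the second script above) are names made up for illustration. It relies on the script printing its loop index, and on `-u` so that output still sitting in a buffer is not lost when the process dies. On POSIX, a negative exit code means the child was killed by that signal number.

```python
from __future__ import print_function

import subprocess
import sys


def run_trials(cmd, trials=5):
    """Run cmd `trials` times; return (exit_code, last_stdout_line) per run."""
    results = []
    for _ in range(trials):
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        out, _err = proc.communicate()
        lines = out.decode('utf-8', 'replace').splitlines()
        # The last line printed is the last loop index the child reached.
        last = lines[-1] if lines else None
        results.append((proc.returncode, last))
    return results


if __name__ == '__main__':
    # 'crash.py' is a hypothetical name for the crashing script above.
    cmd = [sys.executable, '-u', 'crash.py']
    for code, last in run_trials(cmd):
        print('exit code %r, last iteration printed: %r' % (code, last))
```

If the crash really comes from uninitialized memory, the last iteration printed would be expected to differ between runs of the second script, while the first script should either fail on iteration 0 or not at all.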

From valgrind (I'll try to get an instance of this with debugging symbols):

==15551== Conditional jump or move depends on uninitialised value(s)
==15551== at 0x7C61805: ??? (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.7.8)
==15551== by 0x7C61CDE: ??? (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.7.8)
==15551== by 0x7C657DF: ??? (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.7.8)
==15551== by 0x7C663AE: htmlParseDocument (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.7.8)
==15551== by 0x7C6901B: ??? (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.7.8)
==15551== by 0x74B0954: __pyx_f_4lxml_5etree_11_BaseParser__parseDoc (lxml.etree.c:88919)
==15551== by 0x7498BF3: __pyx_f_4lxml_5etree__parseDoc (lxml.etree.c:92370)
==15551== by 0x749C0BD: __pyx_f_4lxml_5etree__parseMemoryDocument (lxml.etree.c:93571)
==15551== by 0x7507DC3: __pyx_pw_4lxml_5etree_23fromstring (lxml.etree.c:63285)
==15551== by 0x497EA3: PyEval_EvalFrameEx (in /usr/bin/python2.7)
==15551== by 0x49F1BF: PyEval_EvalCodeEx (in /usr/bin/python2.7)
==15551== by 0x4A9080: PyRun_FileExFlags (in /usr/bin/python2.7)

I've tried (unsuccessfully) a few times now to replicate this using the Python API that libxml2 itself provides, tracing through the code from the invocation above and making the corresponding calls to libxml2 directly. Perhaps someone more familiar with the internals of lxml might have better luck?