A document turns lxml into a memory bomb (nondeterministic, but reproducible)

Bug #1240696 reported by Andraz on 2013-10-16
This bug affects 4 people
Affects    Status      Importance   Assigned to
libxml2    Confirmed   Medium
lxml                   Critical     Unassigned

Bug Description

The script below, together with the attached problematic HTML, makes the program consume all available memory. Pretty nasty, since the whole system becomes unresponsive.
It usually happens within a few hundred iterations of the loop. If it doesn't freeze the first time you run it, try a few more times. Stranger still, the problem does not happen if that read() is moved out of the loop.

reproduced on two different systems:

>>> print("%-20s: %s" % ('Python', sys.version_info))
Python : sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
>>> print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
lxml.etree : (3, 2, 3, 0)
>>> print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
libxml used : (2, 7, 8)
>>> print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
libxml compiled : (2, 7, 8)
>>> print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
libxslt used : (1, 1, 26)
>>> print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))
libxslt compiled : (1, 1, 26)

AND:

>>> print("%-20s: %s" % ('Python', sys.version_info))
Python : sys.version_info(major=2, minor=7, micro=3, releaselevel='candidate', serial=2)
>>> print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
lxml.etree : (3, 2, 3, 0)
>>> print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
libxml used : (2, 8, 0)
>>> print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
libxml compiled : (2, 8, 0)
>>> print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
libxslt used : (1, 1, 26)
>>> print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))
libxslt compiled : (1, 1, 26)

Test case to reproduce it:

#!/usr/bin/env python
import lxml.html
i = 0
while True:
        i += 1
        page_html = open("problem.html", "r").read()
        a = lxml.html.document_fromstring(page_html)
        print i

Andraz (andraz-tori) wrote :

Since this is a simple reproducible test case, should I resubmit it to the original bug #948627?

Dan Lecocq (q-dan) wrote :

I have been able to evince this with a handful of other pages that I can include, but the solution of using a `unicode` object described in https://bugs.launchpad.net/lxml/+bug/948627 seems to solve it:

This version evinces the bug:

```python
# This evinces the bug
import lxml.html
for i in xrange(100000):
    print i
    with open('problem.html') as fin:
        page = lxml.html.document_fromstring(fin.read())
```

But this version does /not/ have the bug. Not sure if it's supposed to be 'utf-8', but I guessed.

```python
# This does NOT evince the bug
import lxml.html
for i in xrange(100000):
    print i
    with open('1240696.html') as fin:
        page = lxml.html.document_fromstring(fin.read().decode('utf-8'))
```

I've been able to evince this behavior on two systems:

Ubuntu 12.04:
Python : sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree : (3, 2, 1, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Ubuntu 13.10:
Python : sys.version_info(major=2, minor=7, micro=5, releaselevel='final', serial=0)
lxml.etree : (3, 2, 3, 0)
libxml used : (2, 9, 1)
libxml compiled : (2, 9, 1)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

Oddly enough, I've not yet been able to reproduce this on a Mac:
Python : sys.version_info(major=2, minor=7, micro=1, releaselevel='final', serial=0)
lxml.etree : (3, 2, 3, 0)
libxml used : (2, 9, 1)
libxml compiled : (2, 9, 1)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 24)

Perhaps that might suggest the problematic code path.

Andraz (andraz-tori) wrote :

It would be a perfect fix if every page on the web were utf-8-decodable :)

Dan Lecocq (q-dan) wrote :

Agreed. And many of the pages that we crawl aren't even in the encoding they advertise in their headers. In practice, we make a few guesses based on the encoding the headers / doc declare, and failing that we just decode with that encoding using the 'ignore' error handler.
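A minimal sketch of that guess-then-fall-back decoding strategy (the function name and structure are illustrative, not the crawler's actual code):

```python
# Hypothetical sketch of the guess-then-fall-back decoding described
# above; names and call shape are illustrative, not the crawler's code.
def to_unicode(raw, candidate_encodings):
    """Try each candidate encoding; fall back to lossy UTF-8."""
    for enc in candidate_encodings:
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: drop undecodable bytes instead of failing.
    return raw.decode('utf-8', 'ignore')
```

As noted, this trades accuracy for robustness: bytes that fit no guessed encoding are simply discarded.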

Clearly not the most /accurate/ solution, but robust enough for us. I may try to dig deeper into the exact cause, because this is a very unsatisfying solution -- it's nothing more than a stop-gap for many cases. We ran a lot of this through valgrind which gave some unsettling warnings about conditional jumps using uninitialized memory, but libxml2 is not somewhere I'd like to spend much time debugging.

Incidentally, I did try to replicate this using just the libxml2 bindings directly and was unable to. That may be my lack of experience using those bindings directly, or that may be indicative of the problem actually living in lxml.

As a single reference point, this is the temporary fix we're pursuing, and we crawl about 10-20 million pages / day with this particular crawler.

Dan Lecocq (q-dan) wrote :

It /seems/ to be an issue of the pages declaring their own charset, but being stored as another. In particular, our crawler converts everything it sees to 'utf-8' based on its content-encoding header. But all the examples we found that evince this bug declare an encoding of Shift_JIS within the doc. And the example provided in this ticket provides an encoding of 'GB2312'.

I've not contrived an example of this, but my suspicion is that these are cases where we're not providing enough content to be a valid document in the declared encoding. At least, this would jibe with our findings in valgrind, where we were accessing uninitialized memory when parsing these docs.

Andraz (andraz-tori) wrote :

The example attached to this bug report (problem.html) does not declare any strange encoding in the HTML (actually, no encoding is declared at all)... so I don't think this is a plausible core issue.

Dan Lecocq (q-dan) wrote :

Unless I'm reading it wrong, I see a line:

'<meta http-equiv="content-type" content="text/html; charset=GB2312"></head><body><div class="ya_cnt1">'

And like the example provided in https://bugs.launchpad.net/lxml/+bug/948627, it is also not decodable subject to the provided encoding ('gb2312'):

>>> content.decode('gb2312')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 17027-17028: illegal multibyte sequence
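For anyone wanting to run the same check on their own pages, a small sketch (the regex is a deliberately naive way to find the declared charset; real documents need a proper parser):

```python
import re

# Illustrative check: pull the first declared charset out of the raw
# bytes with a naive regex, then see whether the document's bytes
# actually decode under that encoding.
def declared_charset(raw):
    m = re.search(br'charset=["\']?([\w-]+)', raw, re.I)
    return m.group(1).decode('ascii') if m else None

def decodes_as_declared(raw):
    enc = declared_charset(raw)
    if enc is None:
        return True  # nothing declared, nothing to contradict
    try:
        raw.decode(enc)
        return True
    except (UnicodeDecodeError, LookupError):
        return False
```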

Andraz (andraz-tori) wrote :

Ahhhh, I missed that one because it's not in the head! Sorry for the wrong info!

And this meta is in the middle of the <body> while it should be in the <head>!
Maybe that's what makes lxml go so badly astray?

Actually, if the problem is that simple, then a regex throwing out meta tags from inside the <body> might be enough as a workaround in my case too.

thank you so much!

(still would be great if lxml devs could somehow figure out root cause :)
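A rough sketch of that regex workaround (purely illustrative; regexes over real-world HTML are fragile by nature):

```python
import re

# Rough sketch of the workaround suggested above: strip any
# charset-bearing <meta> tag from the raw bytes before handing them
# to the parser, so the in-document declaration can't mislead it.
META_CHARSET = re.compile(br'<meta[^>]*charset[^>]*>', re.I)

def strip_charset_meta(raw):
    return META_CHARSET.sub(b'', raw)
```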

Dan Lecocq (q-dan) wrote :

No worries! In fact, it's in a second `<head>` element inside of an `<html>` element. Pretty malformed, to be sure. It seems that perhaps it's libxml2 that's not robust against mal-encoded content, because lxml doesn't deal directly with the internal charset declarations.

If so, that's unfortunate, because libxml2 is not a happy place to spend time debugging. I would like to be able to duplicate this using the `libxml2` python interface directly, which would enable us to legitimately file a bug with them and simultaneously prove that it's not in lxml. I took a stab at it and was unable, but perhaps I'll try again.

scoder (scoder) wrote :

I can reproduce the problem with the provided example (thanks!). What I see in gdb is that libxml2 keeps reporting the same parser error over and over again. Don't know why yet, nor why it works in some runs and then suddenly stops working.

If someone wants to take a look, set a breakpoint on the "__pyx_f_4lxml_5etree__forwardParserError" function and then investigate the struct passed as the "__pyx_v_error" argument. It refers to the same error all over: "Invalid char in CDATA 0x1".

Dan Lecocq (q-dan) wrote :

The lack of deterministic reproducibility (why it crashes after k iterations on one run, but m iterations on another run) is typical of reading uninitialized memory. Further evidence of this is the fact that if you load a bad page once and repeatedly parse it, you can expect the same result every time:

###############################################
# If this works for the first run, it will continue working fine:
import os
from lxml import etree

with open('problem.html', 'rb') as fin:
    content = fin.read().decode('utf-8', 'ignore').encode('utf-8')

for i in xrange(1000):
    print i
    tree = etree.fromstring(content, etree.HTMLParser(recover=True))

###############################################
# This pretty reliably crashes:
import os
from lxml import etree

for i in xrange(1000):
    print i
    with open('problem.html', 'rb') as fin:
        content = fin.read().decode('utf-8', 'ignore').encode('utf-8')
    tree = etree.fromstring(content, etree.HTMLParser(recover=True))

###############################################

I think this property may have been mentioned in another report of the same bug, but I can't find it at the moment. This should be considered further evidence that, somewhere, uninitialized memory is getting read.

From valgrind (I'll try to get an instance of this with debugging symbols):

==15551== Conditional jump or move depends on uninitialised value(s)
==15551== at 0x7C61805: ??? (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.7.8)
==15551== by 0x7C61CDE: ??? (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.7.8)
==15551== by 0x7C657DF: ??? (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.7.8)
==15551== by 0x7C663AE: htmlParseDocument (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.7.8)
==15551== by 0x7C6901B: ??? (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.7.8)
==15551== by 0x74B0954: __pyx_f_4lxml_5etree_11_BaseParser__parseDoc (lxml.etree.c:88919)
==15551== by 0x7498BF3: __pyx_f_4lxml_5etree__parseDoc (lxml.etree.c:92370)
==15551== by 0x749C0BD: __pyx_f_4lxml_5etree__parseMemoryDocument (lxml.etree.c:93571)
==15551== by 0x7507DC3: __pyx_pw_4lxml_5etree_23fromstring (lxml.etree.c:63285)
==15551== by 0x497EA3: PyEval_EvalFrameEx (in /usr/bin/python2.7)
==15551== by 0x49F1BF: PyEval_EvalCodeEx (in /usr/bin/python2.7)
==15551== by 0x4A9080: PyRun_FileExFlags (in /usr/bin/python2.7)

I've tried (unsuccessfully) a few times now to replicate this using the libxml2-provided python API, but perhaps someone more familiar with the internals of lxml might have better luck? I've tried tracing through the code from the invocation above and tried to replicate the calls to libxml2 directly.

scoder (scoder) wrote :

Ah, that makes sense. Then my guess is that libxml2's parser may end up reading beyond its input buffer in the case where a <script> tag is not terminated. Essentially, <script> content is parsed like CDATA, so the terminator is basically limited to a </script> tag or the end of the file.

Dan Lecocq (q-dan) wrote :

It seems unlikely to me that something like an unterminated script would cause it to choke. It certainly does appear that the tag is registered as unterminated, though:

examples = [
    '<html><head><title>Foo</title><script>This is an unterminated script',
    '<html><head><script>Mismatched close</head><body><p>Hello</p></body></html>'
]

from lxml import etree

for content in examples:
    tree = etree.fromstring(content, etree.HTMLParser(recover=True))
    print '%s => %s' % (content, tree.xpath('//script')[0].text)

The example provided here does appear to have an unterminated script, but none of the examples I have do (attaching them now).

scoder (scoder) wrote :

I ran all of your examples under valgrind and couldn't find any problems with them. No illegal memory access, no lockups, no memory leaks.

I used libxml2 2.9.1 and the latest lxml master.

Dan Lecocq (q-dan) wrote :

Are you running on Mac or Linux? Because I have not been able to reproduce this under Mac -- only Ubuntu 12.04 and 13.10.

scoder (scoder) wrote :

Ubuntu 13.10, but with a self-built libxml2 2.9.1.

So, on your side, you get lockups with all those files using the code you presented above? Could you run it under valgrind control?

I did this:

      valgrind --tool=memcheck --leak-check=no --num-callers=30 \
           --suppressions=LXML_SRC_DIR/valgrind-python.supp \
           python html_parse_test.py SOME_FILE.html

where html_parse_test.py is this:

"""
import lxml.html

import sys
if len(sys.argv) > 1:
    filename = sys.argv[1]
else:
    filename = "problem.html"

def test():
    i = 0
    while True:
        i += 1
        page_html = open(filename, "r").read()
        lxml.html.document_fromstring(page_html)
        print i

test()
"""

You'll get lots of output in the beginning which you can ignore. It only gets interesting right before the first "1" is printed, and then from that point onwards. If there's no stack trace that involves libxml2 before you kill it, valgrind is happy with it.

scoder (scoder) wrote :

This bug report seems to be related:

https://bugzilla.gnome.org/show_bug.cgi?id=706952

Andraz (andraz-tori) wrote :

scoder, here's the valgrind dump. It's full, search for "ITERATION 1".

==10560== Memcheck, a memory error detector
==10560== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==10560== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==10560== Command: python reprod.py
==10560==
==10560== Conditional jump or move depends on uninitialised value(s)
==10560== at 0x48DC68: PyObject_Free (obmalloc.c:969)
==10560== by 0x48E82C: _PyObject_DebugFreeApi (obmalloc.c:1483)
==10560== by 0x48E636: _PyMem_DebugFree (obmalloc.c:1405)
==10560== by 0x47DF36: dictresize (dictobject.c:665)
==10560== by 0x47E2FA: PyDict_SetItem (dictobject.c:793)
==10560== by 0x4A92F4: PyString_InternInPlace (stringobject.c:4747)
==10560== by 0x4A93B3: PyString_InternFromString (stringobject.c:4774)
==10560== by 0x4C270F: init_slotdefs (typeobject.c:6228)
==10560== by 0x4C2B4C: add_operators (typeobject.c:6382)
==10560== by 0x4BA8D6: PyType_Ready (typeobject.c:4012)
==10560== by 0x4BA804: PyType_Ready (typeobject.c:3976)
==10560== by 0x48C74A: _Py_ReadyTypes (object.c:2066)
==10560== by 0x5536F8: Py_InitializeEx (pythonrun.c:186)
==10560== by 0x553F11: Py_Initialize (pythonrun.c:377)
==10560== by 0x56F7CF: Py_Main (main.c:545)
==10560== by 0x418E63: main (python.c:23)
==10560==
==10560== Use of uninitialised value of size 8
==10560== at 0x48DC8C: PyObject_Free (obmalloc.c:969)
==10560== by 0x48E82C: _PyObject_DebugFreeApi (obmalloc.c:1483)
==10560== by 0x48E636: _PyMem_DebugFree (obmalloc.c:1405)
==10560== by 0x47DF36: dictresize (dictobject.c:665)
==10560== by 0x47E2FA: PyDict_SetItem (dictobject.c:793)
==10560== by 0x4A92F4: PyString_InternInPlace (stringobject.c:4747)
==10560== by 0x4A93B3: PyString_InternFromString (stringobject.c:4774)
==10560== by 0x4C270F: init_slotdefs (typeobject.c:6228)
==10560== by 0x4C2B4C: add_operators (typeobject.c:6382)
==10560== by 0x4BA8D6: PyType_Ready (typeobject.c:4012)
==10560== by 0x4BA804: PyType_Ready (typeobject.c:3976)
==10560== by 0x48C74A: _Py_ReadyTypes (object.c:2066)
==10560== by 0x5536F8: Py_InitializeEx (pythonrun.c:186)
==10560== by 0x553F11: Py_Initialize (pythonrun.c:377)
==10560== by 0x56F7CF: Py_Main (main.c:545)
==10560== by 0x418E63: main (python.c:23)
==10560==
==10560== Invalid read of size 4
==10560== at 0x48DC59: PyObject_Free (obmalloc.c:969)
==10560== by 0x48E82C: _PyObject_DebugFreeApi (obmalloc.c:1483)
==10560== by 0x48E636: _PyMem_DebugFree (obmalloc.c:1405)
==10560== by 0x47DF36: dictresize (dictobject.c:665)
==10560== by 0x47E2FA: PyDict_SetItem (dictobject.c:793)
==10560== by 0x4C2C45: add_operators (typeobject.c:6402)
==10560== by 0x4BA8D6: PyType_Ready (typeobject.c:4012)
==10560== by 0x48C792: _Py_ReadyTypes (object.c:2075)
==10560== by 0x5536F8: Py_InitializeEx (pythonrun.c:186)
==10560== by 0x553F11: Py_Initialize (pythonrun.c:377)
==10560== by 0x56F7CF: Py_Main (main.c:545)
==10560== by 0x418E63: main (python.c:23)
==10560== Address 0x63a9020 is 752 bytes inside a block of size 800 free'd
==10560== at 0x4C2A82E: free (in /usr/lib/valgr...

Andraz (andraz-tori) wrote :

Was the valgrind output useful to any degree?

[our servers bombed today because of this, so this reminded me of this thread :)]

scoder (scoder) wrote :

Ah, yes, thanks for the trace. It didn't show the names of the internal calls in libxml2, but it definitely looks like what you'd expect with the libxml2 bug I linked to. The problematic parts (the findings right before the first iteration) are also essentially the same as what Dan Lecocq presented before.

So, sadly, I'm pretty sure now that it's a problem that can only really be solved in libxml2.

scoder (scoder) wrote :

Not marking this "confirmed", as it's not a bug in lxml itself. Definitely "critical", though.

Changed in lxml:
importance: Undecided → Critical
status: New → Triaged
Changed in libxml2:
importance: Unknown → Medium
status: Unknown → New
Changed in libxml2:
status: New → Confirmed