infinite loop/memory consumption bug

Bug #948627 reported by Bryan Burns
This bug affects 4 people
Affects: lxml
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned
Milestone: none listed

Bug Description

Certain web pages, when parsed by lxml.html.fromstring(), send it into what looks like an infinite loop that consumes every byte of memory on the system. What's weird is that this behavior appears to be "random": sometimes the content parses fine, and other times it triggers the bug.

Here's an example of the bug in action:

In [2]: import lxml.html

In [3]: import urllib2

In [4]: lxml.html.fromstring(urllib2.urlopen("http://www.chujingdian.com/plus/recommend.php?aid=970").read())
Out[4]: <Element html at 0x109ffcdd0>

In [5]: lxml.html.fromstring(urllib2.urlopen("http://www.chujingdian.com/plus/recommend.php?aid=970").read())
Out[5]: <Element html at 0x10a023170>

In [6]: lxml.html.fromstring(urllib2.urlopen("http://www.chujingdian.com/plus/recommend.php?aid=970").read())
Out[6]: <Element html at 0x10a023470>

In [7]: lxml.html.fromstring(urllib2.urlopen("http://www.chujingdian.com/plus/recommend.php?aid=970").read())
^C^\Quit: 3

The first three times, the page is loaded and parsed fine. The fourth time, it triggers the bug.

It seems more likely to happen when loading "fresh" content from urllib2, but it still happens if you pass in the same string over and over, so the string contents are not what is changing. Here is an example of that:

In [1]: import lxml.html

In [2]: import urllib2

In [3]: data = urllib2.urlopen("http://www.chujingdian.com/plus/recommend.php?aid=970").read()

In [4]: lxml.html.fromstring(data)
Out[4]: <Element html at 0x10a660950>

In [5]: lxml.html.fromstring(data)
Out[5]: <Element html at 0x10a660cb0>

In [6]: lxml.html.fromstring(data)
Out[6]: <Element html at 0x10a660fb0>

In [7]: lxml.html.fromstring(data)
Out[7]: <Element html at 0x10a6832f0>

In [8]: lxml.html.fromstring(data)
Out[8]: <Element html at 0x10a6835f0>

In [9]: lxml.html.fromstring(data)
Out[9]: <Element html at 0x10a6838f0>

In [10]: lxml.html.fromstring(data)
Out[10]: <Element html at 0x10a683bf0>

In [11]: lxml.html.fromstring(data)
Out[11]: <Element html at 0x10a683ef0>

In [12]: lxml.html.fromstring(data)
Out[12]: <Element html at 0x10a686230>

In [13]: lxml.html.fromstring(data)
Out[13]: <Element html at 0x10a686530>

In [14]: lxml.html.fromstring(data)
Out[14]: <Element html at 0x10a686830>

In [15]: lxml.html.fromstring(data)
Out[15]: <Element html at 0x10a686b30>

In [16]: lxml.html.fromstring(data)
Out[16]: <Element html at 0x10a686e30>

In [17]: lxml.html.fromstring(data)
Out[17]: <Element html at 0x10a689170>

In [18]: lxml.html.fromstring(data)
Out[18]: <Element html at 0x10a660770>

In [19]: lxml.html.fromstring(data)
Out[19]: <Element html at 0x10a660b30>

In [20]: lxml.html.fromstring(data)
Out[20]: <Element html at 0x10a660e30>

In [21]: lxml.html.fromstring(data)
Out[21]: <Element html at 0x10a6831d0>

In [22]: lxml.html.fromstring(data)
^C^\Quit: 3

This time it took about 20 tries instead of the 4 above, but the bug still happens.

When the bug occurs, the fromstring() call never returns, and the Python process grows rapidly in size until all memory is consumed. Running strace on the process shows an endless sequence of brk() calls.

I have been able to reproduce this bug on both Ubuntu and OS X. Here is the version information for both systems:

Ubuntu
------------------
>>> print("%-20s: %s" % ('Python', sys.version_info))
Python : sys.version_info(major=2, minor=7, micro=2, releaselevel='final', serial=0)
>>> print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
lxml.etree : (2, 3, 0, 0)
>>> print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
libxml used : (2, 7, 8)
>>> print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
libxml compiled : (2, 7, 8)
>>> print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
libxslt used : (1, 1, 26)
>>> print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))
libxslt compiled : (1, 1, 26)

OSX
-------------------
In [3]: print("%-20s: %s" % ('Python', sys.version_info))
Python : sys.version_info(major=2, minor=7, micro=2, releaselevel='final', serial=0)

In [4]: print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
lxml.etree : (2, 3, 3, 0)

In [5]: print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
libxml used : (2, 7, 8)

In [6]: print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
libxml compiled : (2, 7, 8)

In [7]: print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
libxslt used : (1, 1, 26)

In [8]: print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))
libxslt compiled : (1, 1, 26)

For posterity, here are the contents of one of the pages that I've noticed causes this bug:

In [2]: urllib2.urlopen("http://www.chujingdian.com/plus/recommend.php?aid=970").read()
Out[2]: '<html>\r\n<head>\r\n<title>DedeCMS\xe6\x8f\x90\xe7\xa4\xba\xe4\xbf\xa1\xe6\x81\xaf</title>\r\n<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />\r\n<base target=\'_self\'/>\r\n<style>div{line-height:160%;}</style></head>\r\n<body leftmargin=\'0\' topmargin=\'0\' bgcolor=\'#FFFFFF\'>\r\n<center>\r\n<script>\r\n var pgo=0;\r\n function JumpUrl(){\r\n if(pgo==0){ location=\'javascript:history.go(-1);\'; pgo=1; }\r\n }\r\ndocument.write("<br /><div style=\'width:450px;padding:0px;border:1px solid #DADADA;\'><div style=\'padding:6px;font-size:12px;border-bottom:1px solid #DADADA;background:#DBEEBD url(/plus/img/wbg.gif)\';\'><b>DedeCMS \xe6\x8f\x90\xe7\xa4\xba\xe4\xbf\xa1\xe6\x81\xaf\xef\xbc\x81</b></div>");\r\ndocument.write("<div style=\'height:130px;font-size:10pt;background:#ffffff\'><br />");\r\ndocument.write("\xe6\x97\xa0\xe6\xb3\x95\xe6\x8a\x8a\xe6\x9c\xaa\xe7\x9f\xa5\xe6\x96\x87\xe6\xa1\xa3\xe6\x8e\xa8\xe8\x8d\x90\xe7\xbb\x99\xe5\xa5\xbd\xe5\x8f\x8b!");\r\ndocument.write("<br /><a href=\'javascript:history.go(-1);\'>\xe5\xa6\x82\xe6\x9e\x9c\xe4\xbd\xa0\xe7\x9a\x84\xe6\xb5\x8f\xe8\xa7\x88\xe5\x99\xa8\xe6\xb2\xa1\xe5\x8f\x8d\xe5\xba\x94\xef\xbc\x8c\xe8\xaf\xb7\xe7\x82\xb9\xe5\x87\xbb\xe8\xbf\x99\xe9\x87\x8c...</a><br/></div>");\r\nsetTimeout(\'JumpUrl()\',5000);</script>\r\n</center>\r\n</body>\r\n</html>\r\n'

Something that looks suspicious is that the <meta> tag advertises a charset of "gb2312", which appears to be incorrect:

In [3]: _.decode('gb2312')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
/Users/bryan/<ipython-input-3-4631a5ceb915> in <module>()
----> 1 _.decode('gb2312')

UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 30-31: illegal multibyte sequence
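
The same check can be scripted. Below is a minimal sketch, assuming Python 3 (the session above is Python 2.7); the helper names declared_charset and decodes_as and the shortened sample are made up for illustration and are not part of lxml. The escaped bytes in the page dump look like UTF-8, so the sketch compares the charset named in the markup against a UTF-8 decode of the same bytes.

```python
import re

# Rough pattern for a charset named in a <meta> tag; an illustrative
# heuristic, not a full HTML parser.
META_CHARSET = re.compile(rb'''charset=["']?([\w-]+)''', re.IGNORECASE)


def declared_charset(raw):
    """Return the charset named in the markup (lowercased), or None."""
    match = META_CHARSET.search(raw)
    return match.group(1).decode("ascii").lower() if match else None


def decodes_as(raw, encoding):
    """Return True if the raw bytes decode cleanly with the given encoding."""
    try:
        raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return False
    return True


# Shortened excerpt of the page quoted in this report.
sample = (
    b"<title>DedeCMS\xe6\x8f\x90\xe7\xa4\xba\xe4\xbf\xa1\xe6\x81\xaf</title>\r\n"
    b'<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />'
)

charset = declared_charset(sample)
print("declared charset:", charset)
print("decodes as declared charset:", decodes_as(sample, charset))
print("decodes as utf-8:", decodes_as(sample, "utf-8"))
```

A page whose declared charset fails to decode while UTF-8 succeeds matches the mismatch shown in the traceback above.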

I found a workaround for this bug in production: hand fromstring() a unicode object instead of a str object. It would still be nice for this bug to be fixed :)
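
A minimal sketch of that workaround, assuming Python 3, where bytes.decode() returns the text type that Python 2 calls unicode. The helper to_text and its candidate encoding list are hypothetical choices for illustration, not anything lxml provides; the idea is only to decode the bytes yourself so the parser never has to act on the incorrect <meta> charset.

```python
def to_text(raw, candidates=("utf-8", "gb2312")):
    """Decode raw page bytes to text before handing them to the parser.

    The candidate order is an assumption made for this sketch: UTF-8 is
    tried first because the page in this report is UTF-8 despite its
    gb2312 declaration.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep going, substituting U+FFFD for bad bytes.
    return raw.decode(candidates[0], errors="replace")


# Intended use with the data from the session above (requires lxml):
#   import lxml.html
#   tree = lxml.html.fromstring(to_text(data))
```

Whether this avoids the hang on every affected page is not something this report establishes; it only mirrors the workaround described above.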

scoder (scoder) wrote:

Cannot reproduce with current release (3.1.2).

Changed in lxml:
status: New → Incomplete

scoder (scoder) wrote:

Marking as a duplicate as the other bug has essential information in it.
