infinite loop/memory consumption bug
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Incomplete | Undecided | Unassigned |
Bug Description
Certain web pages, when parsed with lxml.html's fromstring(), cause the call to hang while the process consumes memory without bound.
Here's an example of the bug in action:
In [2]: import lxml.html
In [3]: import urllib2
In [4]: lxml.html.
Out[4]: <Element html at 0x109ffcdd0>
In [5]: lxml.html.
Out[5]: <Element html at 0x10a023170>
In [6]: lxml.html.
Out[6]: <Element html at 0x10a023470>
In [7]: lxml.html.
^C^\Quit: 3
The first three times, the page is loaded and parsed fine. The fourth call triggers the bug.
It seems more likely to happen when loading "fresh" content from urllib2, but it still happens if you pass in the same string over and over (so it's not the string contents that are changing). Here is an example of that:
In [1]: import lxml.html
In [2]: import urllib2
In [3]: data = urllib2.urlopen("http://
In [4]: lxml.html.
Out[4]: <Element html at 0x10a660950>
In [5]: lxml.html.
Out[5]: <Element html at 0x10a660cb0>
In [6]: lxml.html.
Out[6]: <Element html at 0x10a660fb0>
In [7]: lxml.html.
Out[7]: <Element html at 0x10a6832f0>
In [8]: lxml.html.
Out[8]: <Element html at 0x10a6835f0>
In [9]: lxml.html.
Out[9]: <Element html at 0x10a6838f0>
In [10]: lxml.html.
Out[10]: <Element html at 0x10a683bf0>
In [11]: lxml.html.
Out[11]: <Element html at 0x10a683ef0>
In [12]: lxml.html.
Out[12]: <Element html at 0x10a686230>
In [13]: lxml.html.
Out[13]: <Element html at 0x10a686530>
In [14]: lxml.html.
Out[14]: <Element html at 0x10a686830>
In [15]: lxml.html.
Out[15]: <Element html at 0x10a686b30>
In [16]: lxml.html.
Out[16]: <Element html at 0x10a686e30>
In [17]: lxml.html.
Out[17]: <Element html at 0x10a689170>
In [18]: lxml.html.
Out[18]: <Element html at 0x10a660770>
In [19]: lxml.html.
Out[19]: <Element html at 0x10a660b30>
In [20]: lxml.html.
Out[20]: <Element html at 0x10a660e30>
In [21]: lxml.html.
Out[21]: <Element html at 0x10a6831d0>
In [22]: lxml.html.
^C^\Quit: 3
This time it took about 20 tries instead of the 4 above, but the bug still occurs.
When the bug occurs, the fromstring() call never returns, and the Python process grows rapidly until all memory is consumed. Running strace on the process shows an endless sequence of brk() calls.
I have been able to reproduce this bug on both Ubuntu and OSX. Here are the version outputs for both:
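Because the hang happens inside the parser and the session above had to be killed with ^\, one way to reproduce it more safely is to run each attempt in a child interpreter with a timeout. The following is a minimal sketch (Python 3, standard library only), not part of the original report; the function name `run_snippet_with_timeout` and the default timeout are assumptions made for illustration.

```python
import subprocess
import sys


def run_snippet_with_timeout(snippet, timeout_seconds=10.0):
    """Run Python source code in a child interpreter.

    Returns True if the child exited within the timeout and False if it
    had to be killed. (Hypothetical helper, not from the report.)
    """
    try:
        subprocess.run(
            [sys.executable, "-c", snippet],
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before re-raising, so a runaway
        # parse cannot keep growing after the deadline.
        return False
    return True
```

The snippet passed in would contain the parse call itself, for example code that reads a saved copy of the page and hands it to fromstring(). A False result then indicates an attempt that hung, without taking down the interactive session.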
Ubuntu
------------------
>>> print("%-20s: %s" % ('Python', sys.version_info))
Python : sys.version_
>>> print("%-20s: %s" % ('lxml.etree', etree.LXML_
lxml.etree : (2, 3, 0, 0)
>>> print("%-20s: %s" % ('libxml used', etree.LIBXML_
libxml used : (2, 7, 8)
>>> print("%-20s: %s" % ('libxml compiled', etree.LIBXML_
libxml compiled : (2, 7, 8)
>>> print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_
libxslt used : (1, 1, 26)
>>> print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_
libxslt compiled : (1, 1, 26)
OSX
-------------------
In [3]: print("%-20s: %s" % ('Python', sys.version_info))
Python : sys.version_
In [4]: print("%-20s: %s" % ('lxml.etree', etree.LXML_
lxml.etree : (2, 3, 3, 0)
In [5]: print("%-20s: %s" % ('libxml used', etree.LIBXML_
libxml used : (2, 7, 8)
In [6]: print("%-20s: %s" % ('libxml compiled', etree.LIBXML_
libxml compiled : (2, 7, 8)
In [7]: print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_
libxslt used : (1, 1, 26)
In [8]: print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_
libxslt compiled : (1, 1, 26)
For posterity, here are the contents of one of the pages that I've noticed triggers this bug:
In [2]: urllib2.urlopen("http://
Out[2]: '<html>
One suspicious detail is that the <meta> tag advertises a charset of "gb2312", which appears to be incorrect:
In [3]: _.decode('gb2312')
-------
UnicodeDecodeError Traceback (most recent call last)
/Users/
----> 1 _.decode('gb2312')
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 30-31: illegal multibyte sequence
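The same check can be scripted so that pages whose bytes do not match their advertised charset are flagged before parsing. This is a rough sketch in Python 3 under stated assumptions: the regular expression is a simplification rather than a full HTML charset sniffer, and the names `declared_charset` and `first_decode_error` are hypothetical.

```python
import re

# Simplified pattern for charset=... inside a <meta> tag (an assumption;
# real-world charset sniffing has more cases than this handles).
META_CHARSET = re.compile(
    rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def declared_charset(raw):
    """Return the charset named in the page bytes, or None if none is found."""
    match = META_CHARSET.search(raw)
    return match.group(1).decode("ascii") if match else None


def first_decode_error(raw, encoding):
    """Return (start, end) of the first undecodable span, or None if clean."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return (exc.start, exc.end)
    return None
```

A non-None result from `first_decode_error(raw, declared_charset(raw))` corresponds to the "illegal multibyte sequence" error shown above. An unrecognized charset name raises LookupError, which this sketch does not handle.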
I figured out a workaround to this bug in production, which is to hand fromstring() a unicode object instead of a str object. It would still be nice for this bug to be fixed :)
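The workaround can be sketched roughly as follows. This is written for Python 3, where the report's "unicode object" corresponds to str and its "str object" corresponds to bytes. The report does not say how the unicode object was produced, so falling back to errors="replace" is an assumption, and the helper names `to_text` and `parse_page` are hypothetical.

```python
def to_text(raw, declared_encoding="gb2312"):
    """Decode page bytes to text before they reach the parser.

    Tries a strict decode first; if the bytes do not match the declared
    encoding, undecodable bytes become U+FFFD instead of raising.
    """
    try:
        return raw.decode(declared_encoding)
    except UnicodeDecodeError:
        return raw.decode(declared_encoding, errors="replace")


def parse_page(raw, declared_encoding="gb2312"):
    """Parse already-decoded text so the parser never sees the raw bytes."""
    import lxml.html  # third-party; imported lazily so to_text() works alone

    return lxml.html.fromstring(to_text(raw, declared_encoding))
```

Decoding up front means the parser does not have to act on the page's own, possibly wrong, charset declaration. The cost is that mis-encoded characters are replaced rather than recovered.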
Cannot reproduce with current release (3.1.2).