After parsing a certain rogue HTML file, all further parses of any data raise an exception
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
libxml2 | Fix Released | High | |
lxml | Fix Released | Medium | scoder |
Bug Description
After loading a certain file (attached), all further parses fail.
The test code is very simple (also in attachment):
import lxml.html
data = open("test.html", "r").read()
lxml.html.
print "First parse succeeds\
# now next line is where a crash happens (but probably caused by previous parse)
# any lxml html parsing from here on returns an error, no matter the input
lxml.html.
Exception thrown:
File "test.py", line 8, in <module>
lxml.
File "/usr/lib/
return document_
File "/usr/lib/
value = etree.fromstrin
File "lxml.etree.pyx", line 2532, in lxml.etree.
File "parser.pxi", line 1545, in lxml.etree.
File "parser.pxi", line 1424, in lxml.etree.
File "parser.pxi", line 938, in lxml.etree.
File "parser.pxi", line 539, in lxml.etree.
File "parser.pxi", line 625, in lxml.etree.
File "parser.pxi", line 574, in lxml.etree.
lxml.etree.
Tested with lxml 2.2.8 on Python 2.6 and lxml 2.2.4 on Python 2.5.
>>> import sys
>>> from lxml import etree
>>> print("%-20s: %s" % ('Python', sys.version_info))
Python : (2, 6, 6, 'final', 0)
>>> print("%-20s: %s" % ('lxml.etree', etree.LXML_
lxml.etree : (2, 2, 8, 0)
>>> print("%-20s: %s" % ('libxml used', etree.LIBXML_
libxml used : (2, 7, 7)
>>> print("%-20s: %s" % ('libxml compiled', etree.LIBXML_
libxml compiled : (2, 7, 7)
>>> print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_
libxslt used : (1, 1, 26)
>>> print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_
libxslt compiled : (1, 1, 26)
The real issue is that one rogue input causes all _further_ parses to fail, no matter what they are given. It seems to corrupt some global state in lxml.
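The symptom described above (one input breaking every later parse) can be written as a small regression check. This is only a sketch: `survives_bad_input` is a hypothetical helper name, and the standard library's `xml.etree.ElementTree.fromstring` is used as a stand-in parser so the snippet runs without lxml or the attached file.

```python
import xml.etree.ElementTree as ET


def survives_bad_input(parse, bad_input, good_input):
    """Return True if `parse` still handles good_input after seeing bad_input."""
    try:
        parse(bad_input)
    except Exception:
        pass  # the first parse may fail or succeed; only the second one matters
    try:
        parse(good_input)
    except Exception:
        return False
    return True


# Stand-in parser from the standard library. Each call builds a fresh parser,
# so no state is expected to leak from the malformed input to the next call.
result = survives_bad_input(ET.fromstring, "<a><b></a>", "<root/>")
print(result)
```

To check an affected lxml installation, one would presumably pass `lxml.html.fromstring` as `parse`, the contents of the attached file as `bad_input`, and any small HTML string as `good_input`.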
Changed in libxml2:
importance: Unknown → High
status: Unknown → New
Changed in libxml2:
status: New → Fix Released
Thanks for the report and the test case. It turns out that this is a problem in libxml2, which fails to reset a fatal error flag in the parser context before the next parser run. It's easy to work around in lxml.etree, so I've committed a quick fix.
diff -r 9ce32c6e84f4 src/lxml/parser.pxi
--- a/src/lxml/parser.pxi Wed Oct 20 20:01:33 2010 +0200
+++ b/src/lxml/parser.pxi Thu Oct 21 19:41:31 2010 +0200
@@ -504,6 +504,7 @@
         if self._c_ctxt is not NULL:
             if self._c_ctxt.html:
                 htmlparser.htmlCtxtReset(self._c_ctxt)
+                self._c_ctxt.disableSAX = 0 # work around bug in libxml2
             elif self._c_ctxt.spaceTab is not NULL or \
                     _LIBXML_VERSION_INT >= 20629: # work around bug in libxml2
                 xmlparser.xmlClearParserCtxt(self._c_ctxt)
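The mechanism behind the fix can be illustrated with a toy model in plain Python. This is not libxml2 or lxml code: `ToyParserContext`, its `reset` method, the `clear_fatal_flag` switch, and the `FATAL` marker are all made up for illustration. Only the idea is taken from the comment and diff above: a reused parser context keeps a fatal error flag (`disableSAX`) across runs unless it is cleared explicitly.

```python
class ToyParserContext:
    """Hypothetical stand-in for a reusable parser context (not libxml2 code)."""

    def __init__(self):
        # Set to 1 once a fatal error is seen, loosely mirroring ctxt.disableSAX.
        self.disable_sax = 0

    def reset(self, clear_fatal_flag):
        # Models the reset between parser runs. With clear_fatal_flag=False
        # the flag survives the reset, which is the behaviour reported here.
        if clear_fatal_flag:
            self.disable_sax = 0

    def parse(self, data):
        if self.disable_sax:
            raise ValueError("parser disabled by an earlier fatal error")
        if "FATAL" in data:
            # The parse still returns a result, as in the report, where the
            # first parse succeeds, but the flag is left set.
            self.disable_sax = 1
        return "parsed %d characters" % len(data)


def second_parse_works(clear_fatal_flag):
    ctxt = ToyParserContext()
    ctxt.parse("<p>FATAL</p>")      # rogue input: succeeds, leaves the flag set
    ctxt.reset(clear_fatal_flag)    # reset before the next parser run
    try:
        ctxt.parse("<p>hello</p>")  # harmless input
    except ValueError:
        return False
    return True


print(second_parse_works(clear_fatal_flag=False))  # flag not cleared
print(second_parse_works(clear_fatal_flag=True))   # flag cleared, as in the workaround
```

In this model the second parse fails only when the reset leaves the flag untouched, which corresponds to the one-line workaround of setting `disableSAX` back to 0 after `htmlCtxtReset`.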