After parsing a certain rogue HTML file, all further parses of any data raise an exception

Bug #661890 reported by Andraz
This bug affects 1 person
Affects   Status         Importance   Assigned to   Milestone
libxml2   Fix Released   High
lxml      Fix Released   Medium       scoder

Bug Description

After loading a certain file (attached), all further parses fail.

The test code is very simple (also attached):

import lxml.html
data = open("test.html", "r").read()
lxml.html.fromstring(data)

print "First parse succeeds\n------------------\n-----------------\n"
# now next line is where a crash happens (but probably caused by previous parse)
# any lxml html parsing from here on returns an error, no matter the input
lxml.html.fromstring("<html></html>") # Here exception is thrown every time

Exception thrown:
  File "test.py", line 8, in <module>
    lxml.html.fromstring("<html></html>")
  File "/usr/lib/python2.5/site-packages/lxml/html/__init__.py", line 601, in fromstring
    return document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/lib/python2.5/site-packages/lxml/html/__init__.py", line 511, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48634)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72245)
  File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71106)
  File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67875)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 574, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64657)
lxml.etree.XMLSyntaxError: line 213793: htmlParseEntityRef: expecting ';'

Tested with lxml 2.2.8 on Python 2.6 and lxml 2.2.4 on Python 2.5.

>>> import sys
>>> from lxml import etree
>>> print("%-20s: %s" % ('Python', sys.version_info))
Python              : (2, 6, 6, 'final', 0)
>>> print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
lxml.etree          : (2, 2, 8, 0)
>>> print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
libxml used         : (2, 7, 7)
>>> print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
libxml compiled     : (2, 7, 7)
>>> print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
libxslt used        : (1, 1, 26)
>>> print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))
libxslt compiled    : (1, 1, 26)

The real issue is that one rogue input causes all _further_ parses to fail. It somehow corrupts lxml's global state.
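
The failure mode described here (one bad input poisoning every later parse) can be checked generically. The sketch below is a minimal, hypothetical helper, `later_parses_survive`, that feeds a parser one suspect input and then a known-good input, and reports whether the second parse still works. The helper name and the use of `xml.etree.ElementTree` as a stand-in parser are assumptions for illustration; with lxml one would presumably pass `lxml.html.fromstring` instead.

```python
import xml.etree.ElementTree as ET


def later_parses_survive(parse, suspect, known_good):
    """Return True if parse(known_good) still works after parse(suspect).

    `parse` is any callable taking a document string. Errors raised for the
    suspect input are ignored; only the follow-up parse matters.
    """
    try:
        parse(suspect)
    except Exception:
        pass  # the suspect input is allowed to fail on its own
    try:
        parse(known_good)
    except Exception:
        return False  # state leaked from the first parse into the second
    return True


# Stand-in demonstration with the standard library parser (an assumption:
# the original report concerns lxml.html.fromstring, not ElementTree).
ok = later_parses_survive(ET.fromstring, "<a><b></a>", "<html></html>")
print(ok)
```

Passing the parse function as an argument keeps the check independent of any one library, so the same helper can be pointed at whichever parser is under suspicion.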

Andraz (andraz-tori) wrote :
scoder (scoder) wrote :

Thanks for the report and the test case. It turns out that this is a problem in libxml2, which fails to reset a fatal error flag in the parser context at the next parser run. It's easy to work around in lxml.etree, so I've committed a quick fix.

diff -r 9ce32c6e84f4 src/lxml/parser.pxi
--- a/src/lxml/parser.pxi Wed Oct 20 20:01:33 2010 +0200
+++ b/src/lxml/parser.pxi Thu Oct 21 19:41:31 2010 +0200
@@ -504,6 +504,7 @@
         if self._c_ctxt is not NULL:
             if self._c_ctxt.html:
                 htmlparser.htmlCtxtReset(self._c_ctxt)
+                self._c_ctxt.disableSAX = 0 # work around bug in libxml2
             elif self._c_ctxt.spaceTab is not NULL or \
                     _LIBXML_VERSION_INT >= 20629: # work around bug in libxml2
                 xmlparser.xmlClearParserCtxt(self._c_ctxt)
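
To make the mechanism concrete, here is a toy model in Python of what the patch works around. It is not libxml2 or lxml code: `ToyContext`, its `reset` method and the `clear_flag` switch are hypothetical names, and only the `disableSAX` flag name is taken from the diff. The model assumes, as described above, that a fatal error sets the flag, that the reset step forgets to clear it, and that a stale flag makes the next run fail.

```python
class ToyContext(object):
    """Toy stand-in for a reusable parser context (hypothetical)."""

    def __init__(self):
        self.disableSAX = 0  # flag name borrowed from the diff above

    def reset(self, clear_flag):
        # Models the reset step; the buggy variant leaves the flag alone.
        if clear_flag:
            self.disableSAX = 0  # the one-line workaround

    def parse(self, text, clear_flag):
        self.reset(clear_flag)
        if self.disableSAX:
            raise ValueError("stale fatal-error flag from an earlier run")
        if "&bad" in text:
            self.disableSAX = 1      # a fatal error flips the flag...
            return "recovered tree"  # ...while recovery still yields a result
        return "tree"


def second_parse_ok(clear_flag):
    """Parse a rogue input, then a trivial one, on the same context."""
    ctx = ToyContext()
    ctx.parse("<p>&bad</p>", clear_flag)  # first parse "succeeds"
    try:
        ctx.parse("<html></html>", clear_flag)  # the parse that failed in the report
    except ValueError:
        return False
    return True


print(second_parse_ok(clear_flag=False))  # without the workaround
print(second_parse_ok(clear_flag=True))   # with the workaround
```

The point of the model is only that the context is reused between runs, so any flag the reset step misses carries over to unrelated inputs.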

Changed in lxml:
assignee: nobody → Stefan Behnel (scoder)
importance: Undecided → Medium
status: New → Fix Committed
Changed in libxml2:
importance: Unknown → High
status: Unknown → New
Changed in libxml2:
status: New → Fix Released
scoder (scoder) wrote :

A workaround for older libxml2 versions was released in lxml 2.3; the underlying bug is fixed in libxml2 2.7.8.
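
Based on the version numbers in this comment, a script could decide whether its environment is still exposed. The sketch below assumes only the two thresholds stated here (lxml 2.3 carrying the workaround, libxml2 2.7.8 carrying the fix). `is_affected` is a hypothetical helper name, and the tuples are expected in the form printed by `etree.LXML_VERSION` and `etree.LIBXML_VERSION` in the report. It does not try to distinguish lxml 2.3 pre-releases.

```python
def is_affected(lxml_version, libxml_version):
    """Return True if neither the lxml workaround nor the libxml2 fix is present."""
    has_workaround = tuple(lxml_version[:2]) >= (2, 3)
    has_upstream_fix = tuple(libxml_version[:3]) >= (2, 7, 8)
    return not (has_workaround or has_upstream_fix)


# Versions from the original report: lxml 2.2.8 with libxml2 2.7.7.
print(is_affected((2, 2, 8, 0), (2, 7, 7)))
```

Either condition is enough on its own, since the lxml workaround covers older libxml2 builds and a fixed libxml2 does not need the workaround.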

Changed in lxml:
milestone: none → 2.3
status: Fix Committed → Fix Released