etree.iterparse() raises lxml.etree.XMLSyntaxError: None

Bug #1185701 reported by Tojaj on 2013-05-30
This bug affects 12 people
Affects:
lxml (importance: Medium, assigned to: scoder)
Fedora (status: Unknown, importance: Unknown)
Ubuntu (importance: Undecided, assigned to: Unassigned)

Bug Description

Using etree.iterparse() on valid XML throws an exception that carries no description (XMLSyntaxError: None).

Python : sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree : (3, 2, 1, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Combination:
libxml2.x86_64 0:2.7.8-7.fc17
libxml2-python.x86_64 0:2.7.8-7.fc17
python-lxml.x86_64 0:2.3.5-1.fc17
== WORKS FINE ==

Combination:
libxml2.x86_64 0:2.7.8-7.fc17
libxml2-python.x86_64 0:2.7.8-7.fc17
python-lxml.x86_64 0:3.2.1-1.fc17
== WORKS FINE ==

But when I use the current libxml2 and libxml2-python:
libxml2.x86_64 0:2.7.8-9.fc17
libxml2-python.x86_64 0:2.7.8-9.fc17
python-lxml.x86_64 0:3.2.1-1.fc17
== ERROR ==

How reproducible:

XML:
====
<?xml version="1.0" encoding="UTF-8"?>
<metadata>
<foo>
  <bar>a</bar>
</foo>
<foo>
  <bar>b</bar>
</foo>
</metadata>

Reproducer:
===========
#!/usr/bin/python
from lxml import etree
for element in etree.iterparse(open("xml.xml")):
    print element[0], element[1].tag

Actual results:
end bar
end foo
end bar
end foo
end metadata
Traceback (most recent call last):
  File "./reproducer.py", line 3, in <module>
    for element in etree.iterparse(open("xml.xml")):
  File "iterparse.pxi", line 484, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:113793)
  File "iterparse.pxi", line 537, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:114367)
  File "parser.pxi", line 627, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:84362)
lxml.etree.XMLSyntaxError: None

Expected results:
end bar
end foo
end bar
end foo
end metadata

Note:
My original bug report at RedHat bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=874546

Piet Delport (pjdelport) wrote :

I'm experiencing the same problem here: etree.iterparse() throws an empty XMLSyntaxError, after the final "end" event for the document's root element is emitted.

Versions:

* Ubuntu 13.04 (current updates applied)
* python2.7 2.7.4-2ubuntu3

* libxml2 2.9.0+dfsg1-4ubuntu4.1
* libxslt1.1 1.1.27-1ubuntu2
* python-lxml 3.1.0-1

I tested with the latest lxml from PyPI (3.2.1), in a virtualenv, built against the same system libraries above, with the same result.

Piet Delport (pjdelport) wrote :

Addendum, for completeness:

* python-libxml2 2.9.0+dfsg1-4ubuntu4.1

Piet Delport (pjdelport) wrote :

Relevant traceback with lxml 3.1.0:

Traceback (most recent call last):
[...]
  File "test.py", line 10, in elems
    for (state, elem) in context:
  File "iterparse.pxi", line 484, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:112869)
  File "iterparse.pxi", line 537, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:113442)
  File "parser.pxi", line 607, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:83577)
lxml.etree.XMLSyntaxError: None

Relevant traceback with lxml 3.2.1 (same lxml version & line numbers as the traceback Tomas posted):

Traceback (most recent call last):
[...]
  File "test.py", line 10, in elems
    for (state, elem) in context:
  File "iterparse.pxi", line 484, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:114974)
  File "iterparse.pxi", line 537, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:115556)
  File "parser.pxi", line 627, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:85308)
lxml.etree.XMLSyntaxError: None

Piet Delport (pjdelport) wrote :

For the problem of XMLSyntaxError being raised on the next iteration right after the document root closes, the following workaround seems to avoid it:

    for (state, elem) in etree.iterparse(f):
        [...process...]

        # Work around bug #1185701 by bailing out after the end of the document root.
        if elem.getparent() is None:
            break
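
For anyone who wants to try the workaround in isolation, here is a minimal self-contained sketch. It uses the XML from the bug description, reads it from an in-memory BytesIO instead of a file, and is written for Python 3; the helper name iter_until_root is made up for this example.

```python
from io import BytesIO

from lxml import etree

# The sample document from the bug description.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<metadata>
<foo>
  <bar>a</bar>
</foo>
<foo>
  <bar>b</bar>
</foo>
</metadata>
"""


def iter_until_root(source):
    """Yield (event, tag) pairs and stop once the root element has ended.

    iter_until_root is a hypothetical helper name, not part of lxml.
    """
    for event, elem in etree.iterparse(source):
        yield event, elem.tag
        # The root element is the only one without a parent, so its "end"
        # event is the last event we need. Stopping here means the iterator
        # is never advanced again, which is where the error was raised.
        if elem.getparent() is None:
            break


events = list(iter_until_root(BytesIO(XML)))
for event, tag in events:
    print(event, tag)
```

On an unaffected lxml this prints the same five "end" lines as the expected results in the description; the early break should simply be harmless there.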

zjt9C9hsLy4b7z4N (yli) wrote :

That won't work if a tag argument is passed, for example:

for event, element in etree.iterparse(f, tag = 'html'):

Artem Korzhenevskiy (azurlay) wrote :

Another workaround:

def process_file(file_path):  # wrapper added so that `yield` is valid
    content = etree.iterparse(open(file_path, 'rb'), tag='my_tag')
    while True:
        try:
            event, elem = next(content)
            result = elem  # placeholder: process element here
            yield result
        except (etree.XMLSyntaxError, StopIteration):
            break
    del content

The error is raised at the end of the file, so all elements have already been processed by then.

Piet Delport (pjdelport) wrote :

Artem: That version will silence all legitimate XMLSyntaxErrors too, though.

yli: That's true, although it's easy to work around that by using a guard instead:

    for (state, elem) in etree.iterparse(f):
        if elem.tag == 'foo':
            [...process...]

        # Work around bug #1185701 by bailing out after the end of the document root.
        if elem.getparent() is None:
            break
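
On the first point (not silencing legitimate errors), one possible middle ground is to swallow XMLSyntaxError only once the root element's "end" event has already been seen, and to re-raise it otherwise. The following is a rough sketch under that assumption, written for Python 3; iter_tag_text and the GOOD/BAD sample documents are invented for illustration.

```python
from io import BytesIO

from lxml import etree


def iter_tag_text(source, tag):
    """Yield the text of every element named `tag` (hypothetical helper).

    An XMLSyntaxError is ignored only if it arrives after the root element
    has been closed, which is the situation described in this bug.
    """
    root_closed = False
    context = etree.iterparse(source)  # default: "end" events only
    while True:
        try:
            _event, elem = next(context)
        except StopIteration:
            return
        except etree.XMLSyntaxError:
            if root_closed:
                return  # error after the end of the root: treat as bug #1185701
            raise  # error before the root closed: a real problem
        if elem.tag == tag:
            yield elem.text
        if elem.getparent() is None:
            root_closed = True


GOOD = b"<metadata><foo><bar>a</bar></foo><foo><bar>b</bar></foo></metadata>"
BAD = b"<metadata><foo><bar>a</bar></foo>"  # root element never closed

good_texts = list(iter_tag_text(BytesIO(GOOD), "bar"))

try:
    list(iter_tag_text(BytesIO(BAD), "bar"))
    bad_raised = False
except etree.XMLSyntaxError:
    bad_raised = True
```

A known limitation of this approach: genuine errors that occur after the root has closed (for instance, extra content following the root element) would be swallowed as well.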

scoder (scoder) wrote :

I can reproduce this, but only when I open the file myself and pass the file object to iterparse().

This raises an exception:

from lxml import etree
for element in etree.iterparse(open("test.xml")):
    print element[0], element[1].tag

This works:

from lxml import etree
for element in etree.iterparse("test.xml"):
    print element[0], element[1].tag

The latter is more straightforward anyway, so it provides a reasonable work-around for now, I guess.

This might have been caused by the changes to how iterparse() closes input files, made in lxml 2.3/2.3.1.
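
A self-contained way to exercise the path-based variant is sketched below for Python 3. It writes a small document to a temporary directory first; the helper name end_tags and the shortened sample document are assumptions made for this example, and whether passing a path avoids the error depends on the library versions involved.

```python
import os
import tempfile

from lxml import etree


def end_tags(path):
    """Return the tags of all "end" events, letting lxml open the file.

    end_tags is a hypothetical helper; the point is only that a file name,
    not an already opened file object, is handed to iterparse().
    """
    return [elem.tag for _event, elem in etree.iterparse(path)]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.xml")
    with open(path, "wb") as f:
        f.write(b"<metadata><foo><bar>a</bar></foo></metadata>")
    tags = end_tags(path)

print(tags)
```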

Changed in lxml:
assignee: nobody → scoder (scoder)
importance: Undecided → Medium
status: New → Confirmed

Artem Korzhenevskiy (azurlay) wrote :

scoder: I tried passing a filename to etree.iterparse(), but the error is still raised.

scoder (scoder) wrote :

Changed in lxml:
status: Confirmed → Fix Committed

Jiri Popelka (jpopelka) wrote :

That commit seems to fix it here. Thank you !

scoder (scoder) wrote :

Fixed in lxml 3.2.2.

Changed in lxml:
status: Fix Committed → Fix Released

I still have the same problem. I cloned the nova repo last week, so I think I have the latest version of lxml. Am I right?

scoder (scoder) wrote :

No idea what the "nova repo" is, but you can ask lxml for its version by printing "lxml.etree.__version__".
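
For reference, a short sketch that prints the same version information as listed in the bug description, plus a check against the release that contains the fix (3.2.2, per the comment above). The function name version_report and the has_fix variable are made up for this example; the version attributes themselves are provided by lxml.etree.

```python
from lxml import etree


def version_report():
    """Collect lxml and library versions (version_report is a hypothetical name)."""
    return {
        "lxml.etree": etree.LXML_VERSION,
        "libxml used": etree.LIBXML_VERSION,
        "libxml compiled": etree.LIBXML_COMPILED_VERSION,
        "libxslt used": etree.LIBXSLT_VERSION,
        "libxslt compiled": etree.LIBXSLT_COMPILED_VERSION,
    }


report = version_report()
for name, version in report.items():
    print("%-16s: %r" % (name, version))
print("lxml.etree.__version__:", etree.__version__)

# LXML_VERSION is a tuple of integers, so it can be compared directly.
has_fix = etree.LXML_VERSION >= (3, 2, 2, 0)
print("includes the 3.2.2 fix:", has_fix)
```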

I am getting a similar error in the xml parser:

DEBUG:root:Result:
<indexBrowserTreeViewResponse>
  <data>
    <type>G</type>
    <leaf>false</leaf>
    <nodeName>/</nodeName>
    <path>/</path>
    <children>
      <indexBrowserTreeNode>
        <type>G</type>
        <leaf>false</leaf>
        <nodeName>ubuntubuilderbase</nodeName>
        <path>/ubuntubuilderbase/</path>
        <repositoryId>builder</repositoryId>
        <locallyAvailable>false</locallyAvailable>
        <artifactTimestamp>0</artifactTimestamp>
      </indexBrowserTreeNode>
    </children>
    <repositoryId>builder</repositoryId>
    <locallyAvailable>false</locallyAvailable>
    <artifactTimestamp>0</artifactTimestamp>
  </data>
</indexBrowserTreeViewResponse>
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nathan/develop/nexus_download.py", line 128, in <module>
    nexus_parse_tree(nexus_builder_repo_response, file=True, debug=True)
  File "/home/nathan/develop/nexus_download.py", line 114, in nexus_parse_tree
    nexus_xml_tree = etree.iterparse(BytesIO(nexus_xml_string_or_file), my_parser) # Parse file.
  File "iterparse.pxi", line 105, in lxml.etree.iterparse.__init__ (src/lxml/lxml.etree.c:129924)
  File "parser.pxi", line 1508, in lxml.etree.XMLPullParser.__init__ (src/lxml/lxml.etree.c:103447)
  File "parser.pxi", line 813, in lxml.etree._BaseParser._collectEvents (src/lxml/lxml.etree.c:96793)
TypeError: 'lxml.etree.XMLParser' object is not iterable

scoder (scoder) wrote :

That's unrelated to this ticket. It's caused by incorrect usage of iterparse(). Please read the documentation. A parser object cannot be passed into iterparse(), and the second argument of iterparse() does not do what the author of the code apparently thinks it does.
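
To illustrate: the second positional argument of iterparse() is events, a sequence of event names, which would explain the "object is not iterable" TypeError above. Parser options are passed as keyword arguments to iterparse() itself. A minimal sketch for Python 3, using a trimmed-down version of the response posted above (the choice of the path tag and of remove_blank_text is only an example):

```python
from io import BytesIO

from lxml import etree

# Trimmed-down version of the response from the comment above.
response = (
    b"<indexBrowserTreeViewResponse><data>"
    b"<path>/</path>"
    b"<children><indexBrowserTreeNode>"
    b"<path>/ubuntubuilderbase/</path>"
    b"</indexBrowserTreeNode></children>"
    b"</data></indexBrowserTreeViewResponse>"
)

paths = []
# `events` is a sequence of event names; options that would otherwise be set
# on an XMLParser (here remove_blank_text) are given as keywords instead.
for event, elem in etree.iterparse(
    BytesIO(response), events=("end",), tag="path", remove_blank_text=True
):
    paths.append(elem.text)

print(paths)
```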

Changed in ubuntu:
status: New → Fix Released
Changed in fedora:
importance: Unknown → Undecided
status: Unknown → New
status: New → Fix Released
importance: Undecided → Unknown
status: Fix Released → Unknown