etree.iterparse() raises lxml.etree.XMLSyntaxError: None

Bug #1185701 reported by Tojaj on 2013-05-30
This bug affects 12 people
Affects   Status         Importance   Assigned to
lxml      Fix Released   Medium       scoder
Fedora    Fix Released   Undecided
Ubuntu    Fix Released   Undecided    Unassigned

Bug Description

Using etree.iterparse() on valid XML throws a weird exception with no description.

Python : sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree : (3, 2, 1, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Combination:
libxml2.x86_64 0:2.7.8-7.fc17
libxml2-python.x86_64 0:2.7.8-7.fc17
python-lxml.x86_64 0:2.3.5-1.fc17
== WORKS FINE ==

Combination:
libxml2.x86_64 0:2.7.8-7.fc17
libxml2-python.x86_64 0:2.7.8-7.fc17
python-lxml.x86_64 0:3.2.1-1.fc17
== WORKS FINE ==

But when I use the current libxml2 and libxml2-python:
libxml2.x86_64 0:2.7.8-9.fc17
libxml2-python.x86_64 0:2.7.8-9.fc17
python-lxml.x86_64 0:3.2.1-1.fc17
== ERROR ==

How reproducible:

XML:
====
<?xml version="1.0" encoding="UTF-8"?>
<metadata>
<foo>
  <bar>a</bar>
</foo>
<foo>
  <bar>b</bar>
</foo>
</metadata>

Reproducer:
===========
#!/usr/bin/python
from lxml import etree
for element in etree.iterparse(open("xml.xml")):
    print element[0], element[1].tag

Actual results:
end bar
end foo
end bar
end foo
end metadata
Traceback (most recent call last):
  File "./reproducer.py", line 3, in <module>
    for element in etree.iterparse(open("xml.xml")):
  File "iterparse.pxi", line 484, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:113793)
  File "iterparse.pxi", line 537, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:114367)
  File "parser.pxi", line 627, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:84362)
lxml.etree.XMLSyntaxError: None

Expected results:
end bar
end foo
end bar
end foo
end metadata

Note:
My original bug report at RedHat bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=874546

Description of problem:
Using etree.iterparse() on valid XML throws a weird exception with no description.

Version-Release number of selected component (if applicable):
Version: 2.3.5
Release: 1.fc17

How reproducible:

XML:
====
<?xml version="1.0" encoding="UTF-8"?>
<metadata>
<foo>
  <bar>a</bar>
</foo>
<foo>
  <bar>b</bar>
</foo>
</metadata>

Reproducer:
===========
#!/usr/bin/python
from lxml import etree
for element in etree.iterparse(open("xml.xml")):
    print element[0], element[1].tag

Actual results:
end bar
end foo
end bar
end foo
end metadata
Traceback (most recent call last):
  File "./reproducer.py", line 3, in <module>
    for element in etree.iterparse(open("xml.xml")):
  File "iterparse.pxi", line 491, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:103790)
  File "iterparse.pxi", line 543, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:104333)
  File "parser.pxi", line 601, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:79743)
lxml.etree.XMLSyntaxError: None

Expected results:
end bar
end foo
end bar
end foo
end metadata

Downgrading libxml2, libxml2-devel and libxml2-python from 0:2.7.8-9.fc17 to 0:2.7.8-7.fc17 solves the problem!

Thanks for the investigation Tomas.

I can confirm this on F18 too.
I see the problem with libxml2-2.9.0-1.fc18 and libxml2-2.9.0-0rc1.fc18,
downgrading to libxml2-2.8.0-2.fc18 makes it work again.

I also noticed that if I first read the XML file and pass its contents to iterparse() wrapped in a StringIO, instead of passing the file object, it works OK.
I mean
from StringIO import StringIO
f = open("xml.xml")
xml = f.read()
for element in etree.iterparse(StringIO(xml)):
instead of
for element in etree.iterparse(open("xml.xml")):

Daniel, any idea?
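
The in-memory workaround described in this comment could look roughly like the sketch below on Python 3. It is only an illustration: the sample document is the one from the report, the helper name events_from_memory and the temporary file are hypothetical, and the fallback to the standard library's xml.etree.ElementTree is an assumption added so the sketch runs even where lxml is not installed.

```python
import io
import os
import tempfile

try:
    from lxml import etree  # the library this report is about
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs.
    import xml.etree.ElementTree as etree

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<metadata>
<foo>
  <bar>a</bar>
</foo>
<foo>
  <bar>b</bar>
</foo>
</metadata>
"""


def events_from_memory(path):
    """Read the whole file, then let iterparse() consume an in-memory buffer."""
    with open(path, "rb") as handle:
        data = handle.read()
    return [(event, elem.tag) for event, elem in etree.iterparse(io.BytesIO(data))]


# Write the sample document to a temporary file (hypothetical location).
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(SAMPLE)

try:
    events = events_from_memory(tmp.name)
finally:
    os.remove(tmp.name)

for event, tag in events:
    print(event, tag)
```

The trade-off is that the whole document is held in memory, which gives up much of the benefit of incremental parsing for large files.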

Hum, no idea ... we have had errors reported for parsing from
memory string, but that was for very large documents and you're
seeing the opposite on a small document instead
http://git.gnome.org/browse/libxml2/commit/?id=153cf15905cf4ec080612ada6703757d10caba1e

you don't seem to be doing actual validation here (just
well-formedness checking) so that should not be the
validation error fixed there:
http://git.gnome.org/browse/libxml2/commit/?id=6c91aa384f48ff6d406553a6dd47fd556c1ef2e6

I tried to put a breakpoint in libxml2 main routine which
concentrates all error reports:

(gdb) b __xmlRaiseError
Breakpoint 1 at 0x33d7835890: file error.c, line 459.
(gdb) c
Continuing.

>>> for element in etree.iterparse(open("tst.xml")):
...     print element[0], element[1].tag
...
end bar
end foo
end bar
end foo
end metadata
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "iterparse.pxi", line 491, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:103790)
  File "iterparse.pxi", line 543, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:104333)
  File "parser.pxi", line 601, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:79743)
lxml.etree.XMLSyntaxError: None
>>>

so no I don't know what is going on there,
the last chunk of
http://git.gnome.org/browse/libxml2/diff/parser.c?id=6c91aa384f48ff6d406553a6dd47fd556c1ef2e6

may, however, fix the stray parse error with the reader that you experienced,
but the problem was present in older releases, so I doubt it's this.

Daniel

Hello,
I am experiencing the same problem with lxml.

A valid XML file fails to parse, while converting it to a string using the method described here fixes the issue.

I would also like to mention that the problem reproduces only in my working environment; other people on my team don't experience it.

Can you please tell me if you have found any other solution? Is this bug planned to be fixed?

Thank you in advance,
J.

The error does not seem to be raised by libxml2, otherwise my breakpoint in
__xmlRaiseError would have been hit. It seems to me that the libxml2
update exposed an error in lxml; reassigning to python-lxml.

Daniel


I have tried debugging it with python-lxml-2.3.5-1.fc17

apparently hitting line 601 of
https://github.com/lxml/lxml/blob/master/src/lxml/parser.pxi

elif ctxt.lastError.message is not NULL:
...
raise XMLSyntaxError(message, code, line, column)

end bar
end foo
end bar
end foo
end metadata
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "iterparse.pxi", line 491, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:103790)
  File "iterparse.pxi", line 543, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:104333)
  File "parser.pxi", line 601, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:79743)
lxml.etree.XMLSyntaxError: None
>>>
Program received signal SIGINT, Interrupt.
0x000000360b8ea9d3 in __select_nocancel ()
    at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) b _raiseParseError
Function "_raiseParseError" not defined.
Make breakpoint pending on future shared library load? (y or [n]) n
(gdb) l src/lxml/lxml.etree.c:79743
79738 __Pyx_GOTREF(__pyx_t_8);
79739 __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
79740 __Pyx_DECREF(((PyObject *)__pyx_t_6)); __pyx_t_6 = 0;
79741 __Pyx_Raise(__pyx_t_8, 0, 0, 0);
79742 __Pyx_DECREF(__pyx_t_8); __pyx_t_8 = 0;
79743 {__pyx_filename = __pyx_f[3]; __pyx_lineno = 601; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
79744 }
79745 __pyx_L3:;
79746
79747 __pyx_r = 0;
(gdb) src/lxml/lxml.etree.c:79700
Undefined command: "src". Try "help".
(gdb) l src/lxml/lxml.etree.c:79700
79695 __Pyx_GIVEREF(__pyx_t_4);
79696 PyTuple_SET_ITEM(__pyx_t_8, 3, __pyx_t_7);
79697 __Pyx_GIVEREF(__pyx_t_7);
79698 __pyx_t_5 = 0;
79699 __pyx_t_4 = 0;
79700 __pyx_t_7 = 0;
79701 __pyx_t_7 = PyObject_Call(__pyx_t_6, ((PyObject *)__pyx_t_8), NULL); if (unlikely(!__pyx_t_7)) {__pyx_filename = __pyx_f[3]; __pyx_lineno = 599; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
79702 __Pyx_GOTREF(__pyx_t_7);
79703 __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;
79704 __Pyx_DECREF(((PyObject *)__pyx_t_8)); __pyx_t_8 = 0;
(gdb)
79705 __Pyx_Raise(__pyx_t_7, 0, 0, 0);
79706 __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
79707 {__pyx_filename = __pyx_f[3]; __pyx_lineno = 599; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
79708 goto __pyx_L3;
79709 }
79710 /*else*/ {
79711
79712 /* "/builddir/build/BUILD/lxml-2.3.5/src/lxml/parser.pxi":601
79713 * raise XMLSyntaxError(message, code, line, column)
79714 * else:
(gdb)
79715 * raise XMLSyntaxError(None, xmlerror.XML_ERR_INTERNAL_ERROR, 0, 0) # <<<<<<<<<<<<<<
79716 *
79717 * cdef xmlDoc* _handleParseResult(_ParserContext context,
79718 */
79719 __pyx_t_7 = __Pyx_GetName(__pyx_m, __pyx_n_s__XMLSyntaxError); if (unlikely(!__pyx_t_7)) {__pyx_filename = __pyx_f[3]; __pyx_lineno = 601; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
79720 __Pyx_GOTREF(__pyx_t_7);
79721 __pyx_t_8 = PyInt_FromLong(XML_ERR_INTERNAL_ERROR); if (unlikely(!__pyx_t_8)) {__pyx_filename = __pyx_f[3]; __pyx_lineno = 601; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
79722 __Pyx_GOTREF(__p...


I have encountered this bug (Ubuntu box, lxml 3.x version). Besides that,
I encountered another bug which seems to be related: elem.getnext()
returns None despite elem having a sibling beneath it. Unfortunately,
I encountered this while using a big and private XML file which I
cannot share.

The issue with getnext() seems related to this one because the workaround
suggested in this ticket (using StringIO instead of a file object) solved
both of my issues.

python-lxml-3.2.0-1.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/python-lxml-3.2.0-1.fc18

python-lxml-3.2.0-1.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/python-lxml-3.2.0-1.fc17

python-lxml-3.2.0-1.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/python-lxml-3.2.0-1.fc19

Still exists with python-lxml-3.2.0-1.fc19.x86_64

Has anyone reported this upstream? I don't have the time/experience to debug this myself but I'm certainly willing to pull in patches that are destined for upstream.

Package python-lxml-3.2.0-1.fc18:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing python-lxml-3.2.0-1.fc18'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-7875/python-lxml-3.2.0-1.fc18
then log in and leave karma (feedback).

python-lxml-3.2.1-1.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/python-lxml-3.2.1-1.fc17

python-lxml-3.2.1-1.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/python-lxml-3.2.1-1.fc19

python-lxml-3.2.1-1.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/python-lxml-3.2.1-1.fc18

python-lxml-3.2.1-1.fc18 has been pushed to the Fedora 18 stable repository. If problems still persist, please make note of it in this bug report.

python-lxml-3.2.1-1.fc17 has been pushed to the Fedora 17 stable repository. If problems still persist, please make note of it in this bug report.

python-lxml-3.2.1-1.fc19 has been pushed to the Fedora 19 stable repository. If problems still persist, please make note of it in this bug report.

Hi, I just tested python-lxml-3.2.1-1 and the problem still persists.

Combination:
libxml2.x86_64 0:2.7.8-7.fc17
libxml2-python.x86_64 0:2.7.8-7.fc17
python-lxml.x86_64 0:2.3.5-1.fc17
== WORKS FINE ==

Combination:
libxml2.x86_64 0:2.7.8-7.fc17
libxml2-python.x86_64 0:2.7.8-7.fc17
python-lxml.x86_64 0:3.2.1-1.fc17
== WORKS FINE ==

But when I update the libxml2 and libxml2-python:
libxml2.x86_64 0:2.7.8-9.fc17
libxml2-python.x86_64 0:2.7.8-9.fc17
python-lxml.x86_64 0:3.2.1-1.fc17
== ERROR ==

How reproducible:

XML:
====
<?xml version="1.0" encoding="UTF-8"?>
<metadata>
<foo>
  <bar>a</bar>
</foo>
<foo>
  <bar>b</bar>
</foo>
</metadata>

Reproducer:
===========
#!/usr/bin/python
from lxml import etree
for element in etree.iterparse(open("xml.xml")):
    print element[0], element[1].tag

Actual results:
end bar
end foo
end bar
end foo
end metadata
Traceback (most recent call last):
  File "./reproducer.py", line 3, in <module>
    for element in etree.iterparse(open("xml.xml")):
  File "iterparse.pxi", line 484, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:113793)
  File "iterparse.pxi", line 537, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:114367)
  File "parser.pxi", line 627, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:84362)
lxml.etree.XMLSyntaxError: None

(Note: line numbers slightly changed since my first report)

Expected results:
end bar
end foo
end bar
end foo
end metadata

Thomas, please report this upstream to the lxml developers so that this can get fixed. I do not have the time nor knowledge to fix bugs like this. I'll be AFK next week so if you want to get an updated package into testing you'll need to work with the upstream developers quickly to get a patch.

Details about the mailing list can be found here: http://lxml.de/mailinglist/

I've just reported the bug upstream: https://bugs.launchpad.net/lxml/+bug/1185701

Piet Delport (pjdelport) wrote :

I'm experiencing the same problem here: etree.iterparse() throws an empty XMLSyntaxError, after the final "end" event for the document's root element is emitted.

Versions:

* Ubuntu 13.04 (current updates applied)
* python2.7 2.7.4-2ubuntu3

* libxml2 2.9.0+dfsg1-4ubuntu4.1
* libxslt1.1 1.1.27-1ubuntu2
* python-lxml 3.1.0-1

I tested with the latest lxml from PyPI (3.2.1), in a virtualenv, built against the same system libraries above, with the same result.

Piet Delport (pjdelport) wrote :

Addendum, for completeness:

* python-libxml2 2.9.0+dfsg1-4ubuntu4.1

Piet Delport (pjdelport) wrote :

Relevant traceback with lxml 3.1.0:

Traceback (most recent call last):
[...]
  File "test.py", line 10, in elems
    for (state, elem) in context:
  File "iterparse.pxi", line 484, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:112869)
  File "iterparse.pxi", line 537, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:113442)
  File "parser.pxi", line 607, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:83577)
lxml.etree.XMLSyntaxError: None

Relevant traceback with lxml 3.2.1 (same lxml version & line numbers as the traceback Tomas posted):

Traceback (most recent call last):
[...]
  File "test.py", line 10, in elems
    for (state, elem) in context:
  File "iterparse.pxi", line 484, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:114974)
  File "iterparse.pxi", line 537, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:115556)
  File "parser.pxi", line 627, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:85308)
lxml.etree.XMLSyntaxError: None

Piet Delport (pjdelport) wrote :

For the problem of raising XMLSyntaxError on the next iteration right after the document root closes, the following workaround seems to step around the problem:

    for (state, elem) in etree.iterparse(f):
        [...process...]

        # Work around bug #1185701 by bailing out after the end of the document root.
        if elem.getparent() is None:
            break

zjt9C9hsLy4b7z4N (yli) wrote :

It won't work if there is a tag parameter such as

for event, element in etree.iterparse(f, tag = 'html'):

Artem Korzhenevskiy (azurlay) wrote :

Another workaround:

content = etree.iterparse(open(file_path, 'rb'), tag='my_tag')
while True:
    try:
        event, elem = content.next()
        # process element
        yield result
    except (etree.XMLSyntaxError, StopIteration):
        break
del content

The error is raised at the end of the file, so all elements have already been processed successfully.

Piet Delport (pjdelport) wrote :

Artem: That version will silence all legitimate XMLSyntaxErrors too, though.

yli: That's true, although it's easy to work around that by using a guard instead:

    for (state, elem) in etree.iterparse(f):
        if elem.tag == 'foo':
            [...process...]

        # Work around bug #1185701 by bailing out after the end of the document root.
        if elem.getparent() is None:
            break
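
A variant of the same "stop once the root has closed" idea can be sketched without getparent() by counting start and end events. This is not code from the thread: the helper name texts_until_root_closes and the compact sample document are hypothetical, and the standard-library fallback is an assumption so the sketch runs without lxml.

```python
import io

try:
    from lxml import etree  # the library this report is about
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs.
    import xml.etree.ElementTree as etree

SAMPLE = b"<metadata><foo><bar>a</bar></foo><foo><bar>b</bar></foo></metadata>"


def texts_until_root_closes(source, wanted_tag):
    """Collect the text of matching elements, then stop when the root ends."""
    depth = 0
    texts = []
    for event, elem in etree.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if elem.tag == wanted_tag:
            texts.append(elem.text)
        if depth == 0:
            # The root element just closed: never ask the parser for more
            # events, which is the step where the empty error was reported.
            break
    return texts


bar_texts = texts_until_root_closes(io.BytesIO(SAMPLE), "bar")
print(bar_texts)
```

Like the guard above, this filters on the tag inside the loop rather than through the tag argument, so the end event of the root element is still seen.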

scoder (scoder) wrote :

I can reproduce this, however, only when I take the indirection of opening the file myself.

This raises an exception:

from lxml import etree
for element in etree.iterparse(open("test.xml")):
    print element[0], element[1].tag

This works:

from lxml import etree
for element in etree.iterparse("test.xml"):
    print element[0], element[1].tag

The latter is more straightforward anyway, so it provides a reasonable work-around for now, I guess.

This might have been introduced by the changes about closing input files in iterparse(), introduced in lxml 2.3/2.3.1.
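
On Python 3, the "pass the file name" form from this comment could be sketched as follows. The helper name end_tags and the temporary file are hypothetical, the standard-library fallback is an assumption so the sketch runs without lxml, and whether this form avoids the error on an affected installation is only what this comment reports.

```python
import os
import tempfile

try:
    from lxml import etree  # the library this report is about
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs.
    import xml.etree.ElementTree as etree


def end_tags(path):
    """Hand iterparse() the path itself so the library opens and closes the file."""
    return [elem.tag for _event, elem in etree.iterparse(path)]


# Write a small document to a temporary file (hypothetical location).
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(b"<metadata><foo><bar>a</bar></foo></metadata>")

try:
    tags = end_tags(tmp.name)
finally:
    os.remove(tmp.name)

print(tags)
```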

Changed in lxml:
assignee: nobody → scoder (scoder)
importance: Undecided → Medium
status: New → Confirmed

This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 17 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora, you are encouraged to change the
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Artem Korzhenevskiy (azurlay) wrote :

scoder: I tried passing a filename to etree.iterparse(), but the error is still raised.

scoder (scoder) wrote :
Changed in lxml:
status: Confirmed → Fix Committed

Jiri Popelka (jpopelka) wrote :

That commit seems to fix it here. Thank you !

scoder (scoder) wrote :

Fixed in lxml 3.2.2.

Changed in lxml:
status: Fix Committed → Fix Released

I still have the same problem. I cloned the nova repo last week, so I think I have the latest version of lxml. Am I right?

scoder (scoder) wrote :

No idea what the "nova repo" is, but you can ask lxml for its version by printing "lxml.etree.__version__".

The fix for this went out a while ago, not sure why the bug never got closed.
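
A small sketch of that version check, which also prints the library versions in the same form as the table at the top of this report. The helper name lxml_versions is hypothetical, and the comparison against (3, 2, 2) only reflects the release named in this thread as containing the fix.

```python
def lxml_versions():
    """Return lxml and libxml2/libxslt version tuples, or None without lxml."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        "lxml.etree": etree.LXML_VERSION,
        "libxml used": etree.LIBXML_VERSION,
        "libxml compiled": etree.LIBXML_COMPILED_VERSION,
        "libxslt used": etree.LIBXSLT_VERSION,
        "libxslt compiled": etree.LIBXSLT_COMPILED_VERSION,
    }


versions = lxml_versions()
if versions is None:
    print("lxml is not installed")
else:
    for name, value in versions.items():
        print(name, ":", value)
    # The thread states the fix was released in lxml 3.2.2.
    print("includes the fix:", versions["lxml.etree"] >= (3, 2, 2))
```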

I am getting a similar error in the xml parser:

DEBUG:root:Result:
<indexBrowserTreeViewResponse>
  <data>
    <type>G</type>
    <leaf>false</leaf>
    <nodeName>/</nodeName>
    <path>/</path>
    <children>
      <indexBrowserTreeNode>
        <type>G</type>
        <leaf>false</leaf>
        <nodeName>ubuntubuilderbase</nodeName>
        <path>/ubuntubuilderbase/</path>
        <repositoryId>builder</repositoryId>
        <locallyAvailable>false</locallyAvailable>
        <artifactTimestamp>0</artifactTimestamp>
      </indexBrowserTreeNode>
    </children>
    <repositoryId>builder</repositoryId>
    <locallyAvailable>false</locallyAvailable>
    <artifactTimestamp>0</artifactTimestamp>
  </data>
</indexBrowserTreeViewResponse>
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nathan/develop/nexus_download.py", line 128, in <module>
    nexus_parse_tree(nexus_builder_repo_response, file=True, debug=True)
  File "/home/nathan/develop/nexus_download.py", line 114, in nexus_parse_tree
    nexus_xml_tree = etree.iterparse(BytesIO(nexus_xml_string_or_file), my_parser) # Parse file.
  File "iterparse.pxi", line 105, in lxml.etree.iterparse.__init__ (src/lxml/lxml.etree.c:129924)
  File "parser.pxi", line 1508, in lxml.etree.XMLPullParser.__init__ (src/lxml/lxml.etree.c:103447)
  File "parser.pxi", line 813, in lxml.etree._BaseParser._collectEvents (src/lxml/lxml.etree.c:96793)
TypeError: 'lxml.etree.XMLParser' object is not iterable

scoder (scoder) wrote :

That's unrelated to this ticket. It's due to incorrect usage of iterparse(). Please read the documentation. Passing a parser into iterparse() is not possible and the second argument of iterparse() is not for what the author of the code apparently thinks it is.
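
For comparison, a minimal sketch of a call shape that avoids that TypeError: the second argument of iterparse() is the sequence of event names, passed here by keyword, and no parser object is involved. In lxml, parser options are given to iterparse() itself as keyword arguments. The trimmed response document and the helper name node_names are hypothetical, and the standard-library fallback is an assumption so the sketch runs without lxml.

```python
import io

try:
    from lxml import etree  # the library this report is about
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs.
    import xml.etree.ElementTree as etree

# Trimmed version of the response shown in the comment above.
RESPONSE = b"""<indexBrowserTreeViewResponse>
  <data>
    <children>
      <indexBrowserTreeNode>
        <nodeName>ubuntubuilderbase</nodeName>
        <path>/ubuntubuilderbase/</path>
      </indexBrowserTreeNode>
    </children>
  </data>
</indexBrowserTreeViewResponse>
"""


def node_names(xml_bytes):
    """Collect the nodeName of every indexBrowserTreeNode element."""
    names = []
    # events is a sequence of event names, not a parser instance.
    for _event, elem in etree.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "indexBrowserTreeNode":
            names.append(elem.findtext("nodeName"))
    return names


names = node_names(RESPONSE)
print(names)
```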

Changed in ubuntu:
status: New → Fix Released
Changed in fedora:
importance: Unknown → Undecided
status: Unknown → New
status: New → Fix Released
importance: Undecided → Unknown
status: Fix Released → Unknown
Changed in fedora:
importance: Unknown → Undecided
status: Unknown → Fix Released