I implemented a C-only SAX parser (based on a libxml2 example found online),
and it managed to traverse the entire XML file.
It only printed the matched end tags of the entries I wanted, and didn't
build a document.
But it parsed the entire file with no problems.
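Roughly, the idea was the following (this is just a Python sketch using the
stdlib xml.sax rather than my actual libxml2 C code, and "entry" stands in
for the real tag name): react to end-element events only, never build a tree.

```python
import io
import xml.sax


class EndTagPrinter(xml.sax.ContentHandler):
    """Print matching end tags as they stream past; build no document."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def endElement(self, name):
        # Only react to the end tag we care about; everything else is skipped.
        if name == self.tag:
            self.count += 1
            print(name)


handler = EndTagPrinter("entry")
xml.sax.parse(io.BytesIO(b"<root><entry/><other/><entry/></root>"), handler)
```

Since nothing is retained between events, memory use stays flat regardless
of file size, which is why this made a useful cross-check against iterparse.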
On Thu, 14 Mar 2024 at 16:45, scoder <email address hidden> wrote:
> Thanks for the report. That's a very large file indeed.
>
> Could you check if there is anything else in the error log at the time
> when it fails?
>
> print(exception.error_log)
>
> should give you the current list of logged errors. It might be that lxml
> misguesses the reason for the error and somehow reports a non-fatal
> memory related error (however that might happen) instead of an actual
> parsing failure.
>
> It might also be that you are hitting some other resource limit in
> libxml2. Have you tried stream-parsing the document with libxml2's
> xmllint tool?
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/2057780
>
> Title:
> lxml.etree.XMLSyntaxError: Memory allocation failed - but no memory
> used
>
> Status in lxml:
> New
>
> Bug description:
> Python : sys.version_info(major=3, minor=10, micro=8,
> releaselevel='final', serial=0)
> lxml.etree : (5, 1, 0, 0)
> libxml used : (2, 12, 3)
> libxml compiled : (2, 12, 3)
> libxslt used : (1, 1, 39)
> libxslt compiled : (1, 1, 39)
>
>
> I am parsing a very large XML file (500 GB) using lxml.etree.iterparse
> like this. The individual records are not very large. This runs for
> about an hour, memory not getting close to 100 MB while it runs. The
> machine has hundreds of gigabytes of memory, and it's mostly not
> utilised while this ran.
>
>
> with open(largexmlfilepath, "rb") as xmlstream:
>     for _, elem in lxml.etree.iterparse(  # pylint: disable=c-extension-no-member
>         xmlstream, events=("end",), remove_blank_text=True, tag=tagname
>     ):
>         # Don't actually try to do anything
>         assert elem
>         elem.clear()
>         while elem.getprevious() is not None:
>             del elem.getparent()[0]
>
>
> After about 4,931,000 records processed, I get this:
> ...
> File "src/lxml/iterparse.pxi", line 210, in
> lxml.etree.iterparse.__next__
> File "src/lxml/iterparse.pxi", line 195, in
> lxml.etree.iterparse.__next__
> File "src/lxml/iterparse.pxi", line 230, in
> lxml.etree.iterparse._read_more_events
> File "src/lxml/parser.pxi", line 1432, in lxml.etree._FeedParser.feed
> File "src/lxml/parser.pxi", line 609, in
> lxml.etree._ParserContext._handleParseResult
> File "src/lxml/parser.pxi", line 618, in
> lxml.etree._ParserContext._handleParseResultDoc
> File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult
> File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError
> File "/mnt/docker/All_Mammalia.xml", line -1924384678
> lxml.etree.XMLSyntaxError: Memory allocation failed
>
> It is very reproducible. It seems to fail exactly at the same place.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/lxml/+bug/2057780/+subscriptions
>
>