lxml.etree.XMLSyntaxError: Memory allocation failed - but no memory used

Bug #2057780 reported by Hans-Henrik Stærfeldt
This bug affects 1 person
Affects: lxml | Status: New | Importance: Undecided | Assigned to: Unassigned | Milestone: (none)

Bug Description

Python : sys.version_info(major=3, minor=10, micro=8, releaselevel='final', serial=0)
lxml.etree : (5, 1, 0, 0)
libxml used : (2, 12, 3)
libxml compiled : (2, 12, 3)
libxslt used : (1, 1, 39)
libxslt compiled : (1, 1, 39)

I am parsing a very large (500 GB) XML file using lxml.etree.iterparse as shown below.
The individual records are not very large. This runs for about an hour, with memory
usage never getting close to 100 MB. The machine has hundreds of gigabytes of
memory, and it's mostly unused while this runs.

with open(largexmlfilepath, "rb") as xmlstream:
    for _, elem in lxml.etree.iterparse(  # pylint: disable=c-extension-no-member
        xmlstream, events=("end",), remove_blank_text=True, tag=tagname
    ):
        # Don't actually try to do anything
        assert elem
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

After about 4,931,000 records have been processed, I get this:
...
  File "src/lxml/iterparse.pxi", line 210, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 195, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 230, in lxml.etree.iterparse._read_more_events
  File "src/lxml/parser.pxi", line 1432, in lxml.etree._FeedParser.feed
  File "src/lxml/parser.pxi", line 609, in lxml.etree._ParserContext._handleParseResult
  File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError
  File "/mnt/docker/All_Mammalia.xml", line -1924384678
lxml.etree.XMLSyntaxError: Memory allocation failed

It is very reproducible. It seems to fail exactly at the same place.

Revision history for this message
Hans-Henrik Stærfeldt (bombmandk) wrote :

I should clarify: this error message has also been observed when processing other, unrelated large XML sources,
so it is unlikely that the XML source itself is the issue.

Revision history for this message
scoder (scoder) wrote :

Thanks for the report. That's a very large file indeed.

Could you check if there is anything else in the error log at the time when it fails?

    print(exception.error_log)

should give you the current list of logged errors. It might be that lxml misguesses the reason for the error and somehow reports a non-fatal memory related error (however that might happen) instead of an actual parsing failure.
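For illustration, a minimal self-contained sketch (my own toy input, not the reporter's file) of capturing the parser's error log from the exception as suggested above, assuming lxml is installed:

```python
import lxml.etree

try:
    # Deliberately malformed input: mismatched end tag
    lxml.etree.fromstring(b"<root><unclosed></root>")
except lxml.etree.XMLSyntaxError as exception:
    # error_log holds the list of errors libxml2 recorded during parsing
    print(exception.error_log)
```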

It might also be that you are hitting some other resource limit in libxml2. Have you tried stream-parsing the document with libxml2's xmllint tool?
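The xmllint check can be run roughly like this (the path is the one from the report; xmllint must be installed, and `--stream` uses libxml2's streaming reader without building a tree):

```shell
# --noout suppresses output; only well-formedness errors are reported
xmllint --stream --noout /mnt/docker/All_Mammalia.xml
```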

Revision history for this message
Hans-Henrik Stærfeldt (bombmandk) wrote : Re: [Bug 2057780] Re: lxml.etree.XMLSyntaxError: Memory allocation failed - but no memory used

I implemented a C-only SAX parser (from a libxml2 example online) and it
managed to traverse the entire XML file.
It only printed the matched end tags of the entries I wanted, and didn't
build a document.
But it parsed the entire file with no problems.
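In Python, a comparable no-tree traversal can be sketched with lxml's parser-target interface (my own illustration; `EndTagCounter` and the `rec` tag are made up, not from the thread):

```python
import io
import lxml.etree

class EndTagCounter:
    """Minimal parser target: counts matching end tags, builds no tree."""
    def __init__(self, tag):
        self.tag = tag
        self.count = 0
    def start(self, tag, attrib):
        pass
    def end(self, tag):
        if tag == self.tag:
            self.count += 1
    def data(self, text):
        pass
    def close(self):
        # parse() returns whatever close() returns when a target is used
        return self.count

parser = lxml.etree.XMLParser(target=EndTagCounter("rec"))
count = lxml.etree.parse(io.BytesIO(b"<r><rec/><rec/></r>"), parser)
print(count)  # 2
```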


Revision history for this message
Hans-Henrik Stærfeldt (bombmandk) wrote :

print(exception.error_log) gives

/mnt/docker/All_Mammalia.xml:-1924384678:38:FATAL:PARSER:ERR_NO_MEMORY: Memory allocation failed
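An editorial aside, not from the thread: the negative line number in this log entry is consistent with a 32-bit signed line counter wrapping around. Interpreted as an unsigned 32-bit value it corresponds to roughly 2.37 billion lines, which is plausible for a 500 GB file:

```python
# The line number from the error log, reinterpreted under the
# assumption (not confirmed in the thread) of 32-bit signed overflow.
reported = -1924384678
unwrapped = reported % 2**32  # value as an unsigned 32-bit counter
print(unwrapped)              # 2370582618
# A value this large cannot be represented in a signed 32-bit int:
assert unwrapped > 2**31 - 1
```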

Revision history for this message
scoder (scoder) wrote :

Thanks. Does it change anything if you replace the "elem.clear()" in your code with "del elem[:]" ?
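For context, a minimal sketch (my illustration, not from the thread) of how the two cleanup idioms differ: `del elem[:]` removes only the children, while `elem.clear()` also drops attributes and text:

```python
import lxml.etree

root = lxml.etree.fromstring(b"<r><a x='1'>text<b/></a></r>")
a = root[0]

del a[:]  # removes children only; attributes and text survive
assert a.get("x") == "1" and a.text == "text" and len(a) == 0

a.clear()  # also drops attributes, text, and tail
assert a.get("x") is None and a.text is None
```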

Revision history for this message
Hans-Henrik Stærfeldt (bombmandk) wrote :

I tried out del elem[:]; same result:

[hzys@ip-10-79-219-18 build]( entrez-xml-parser-fail)$ python memorybug.py
Testing lxml.etree.iterparse with cleanup
count: 4931000/4931000 eta:0:00:00 mem:31.52M
/mnt/docker/All_Mammalia.xml:-1924384678:38:FATAL:PARSER:ERR_NO_MEMORY: Memory allocation failed
Traceback (most recent call last):
  File "/develop/hzys/bdas/etl/entrez/mammals/elastic/build/memorybug.py", line 101, in <module>
    raise excpt
  File "/develop/hzys/bdas/etl/entrez/mammals/elastic/build/memorybug.py", line 98, in <module>
    test_lxml_iterate(argparser().parse_args())
  File "/develop/hzys/bdas/etl/entrez/mammals/elastic/build/memorybug.py", line 49, in test_lxml_iterate
    for _, elem in lxml.etree.iterparse(  # pylint: disable=c-extension-no-member
  File "src/lxml/iterparse.pxi", line 208, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 193, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 228, in lxml.etree.iterparse._read_more_events
  File "src/lxml/parser.pxi", line 1451, in lxml.etree._FeedParser.feed
  File "src/lxml/parser.pxi", line 624, in lxml.etree._ParserContext._handleParseResult
  File "src/lxml/parser.pxi", line 633, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 743, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 672, in lxml.etree._raiseParseError
  File "/mnt/docker/All_Mammalia.xml", line -1924384678
lxml.etree.XMLSyntaxError: Memory allocation failed


Revision history for this message
Hans-Henrik Stærfeldt (bombmandk) wrote :

As you can see, the RSS memory was only 31 MB, so it was not accumulating memory.


Revision history for this message
Hans-Henrik Stærfeldt (bombmandk) wrote :

Is there anything else I can do to help understand what the problem is?

Revision history for this message
scoder (scoder) wrote :

Is there anything special about the XML format? Like a large number of different tag names, large text sections, many entities, many namespaces, … anything unusual besides the large file size?

Could you try passing the option "huge_tree=True" into iterparse()?
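For reference, huge_tree can be passed straight through iterparse(); a minimal self-contained sketch (my own toy input, not the reporter's data):

```python
import io
import lxml.etree

xml = b"<root>" + b"<rec>x</rec>" * 3 + b"</root>"
count = 0
# huge_tree=True relaxes several of libxml2's internal hard limits
for _, elem in lxml.etree.iterparse(
    io.BytesIO(xml), events=("end",), tag="rec", huge_tree=True
):
    count += 1
    elem.clear()
print(count)  # 3
```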

Since libxml2 parse errors are only processed after terminating and returning from the parser, the (Python) stack trace doesn't help much in understanding where the original error occurred inside of libxml2's parser.

Revision history for this message
Hans-Henrik Stærfeldt (bombmandk) wrote :

I already tried running with "huge_tree=True"; it had the same problem.

I have tracked down the records at which the problem occurs.

The attached file is an excerpt of the data file that is being processed.
I put in some comments

# ( blah blah blah )

The file contains the last record to be successfully processed, and I have also cut out the records that follow, for debugging. The data is public; no secrecy needed.

Remember that the machine has approx. 500 GB of free memory.
