Comment 13 for bug 2057780

scoder (scoder) wrote :

Coming back to this, it's still difficult to guess what might cause this. I don't see anything suspicious in your XML example.

One issue I've seen with older libxml2 versions was that specific cut positions in the chunked data reading could lead to parse errors. Haven't seen this with 2.12.x, but you never know. You could change your code to use the XMLPullParser instead of iterparse. It's just a slightly different style, but it would allow you to control exactly what chunks you pass into the parser.

https://lxml.de/parsing.html#incremental-event-parsing

It's basically just:

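from lxml import etree

parser = etree.XMLPullParser(events=['end'])
# xmlstream: your input file opened in binary mode; chunk_size: bytes per read
while (data := xmlstream.read(chunk_size)):
    parser.feed(data)
    for _, element in parser.read_events():
        ...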

For your file size, I'd try a somewhat large chunk size. The default in iterparse is just 32 KiB, so it goes back asking for new data quite frequently.
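To check whether the result depends on where the chunks are cut, something like the following rough sketch could help. It runs the same pull-parser loop with two different chunk sizes and compares the outcome. The function name `count_elements`, the `item` tag, the sample data and the 1 MiB default are all made up for illustration, not taken from your code. The fallback to the standard library's ElementTree, which offers the same `XMLPullParser` interface, is only there so the sketch runs without lxml installed. To investigate the actual problem you'd want the lxml import to succeed.

```python
import io

try:
    from lxml import etree
except ImportError:  # fallback only so the sketch runs without lxml
    import xml.etree.ElementTree as etree


def count_elements(xmlstream, tag, chunk_size=1024 * 1024):
    """Count closed <tag> elements, feeding chunk_size bytes at a time.

    Hypothetical helper for illustration; the 1 MiB default is an assumption.
    """
    parser = etree.XMLPullParser(events=["end"])
    count = 0
    while (data := xmlstream.read(chunk_size)):
        parser.feed(data)
        for _, element in parser.read_events():
            if element.tag == tag:
                count += 1
                element.clear()  # drop the element's content once handled
    parser.close()  # signals end of input to the parser
    return count


# Made-up sample document: 1000 small <item> elements.
sample = b"<root>" + b"<item>x</item>" * 1000 + b"</root>"

# A tiny chunk size forces cuts in the middle of tags; a large one reads it all at once.
small = count_elements(io.BytesIO(sample), "item", chunk_size=7)
large = count_elements(io.BytesIO(sample), "item", chunk_size=1 << 20)
```

If both runs agree on your real file, the cut positions are probably not the culprit. If they differ, or only one of them raises a parse error, that would point towards the chunking.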