Serialisation error when writing a large file (> 2.5 GB) with xmlfile

Bug #1570388 reported by Charlie_X
This bug affects 2 people
Affects: lxml
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: none listed

Bug Description

No idea what's causing this but I'd guess it's memory related. It's been reproduced with lxml 3.6.0 on other systems.

Python : sys.version_info(major=3, minor=4, micro=4, releaselevel='final', serial=0)
lxml.etree : (3, 6, 0, 0)
libxml used : (2, 9, 2)
libxml compiled : (2, 9, 2)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

The code to reproduce the bug can be found at https://bitbucket.org/snippets/openpyxl/j6kBG, with two changes: set the dataframe to random.rd(500000, 100) and run only the openpyxl_stream() function in __main__. This requires around 2 GB of RAM and takes about 15 minutes before failing.

Tracebacks like

http://paste.ofcode.org/hCyBA6Z9n8EXaw7b9tNZxu and https://bpaste.net/show/32450eac0481

Revision history for this message
Frank Steggink (fsteggink) wrote :

I'm experiencing a similar problem. Unfortunately the links to the tracebacks don't work anymore, so I'll post mine:

Traceback (most recent call last):
  File "/usr/local/bin/stetl", line 4, in <module>
    __import__('pkg_resources').run_script('Stetl==1.3', 'stetl')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 739, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1494, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/Stetl-1.3-py2.7.egg/EGG-INFO/scripts/stetl", line 41, in <module>
    main()
  File "/usr/local/lib/python2.7/dist-packages/Stetl-1.3-py2.7.egg/EGG-INFO/scripts/stetl", line 32, in main
    etl.run()
  File "/usr/local/lib/python2.7/dist-packages/Stetl-1.3-py2.7.egg/stetl/etl.py", line 159, in run
    chain.run()
  File "/usr/local/lib/python2.7/dist-packages/Stetl-1.3-py2.7.egg/stetl/chain.py", line 174, in run
    packet = self.first_comp.process(packet)
  File "/usr/local/lib/python2.7/dist-packages/Stetl-1.3-py2.7.egg/stetl/component.py", line 218, in process
    packet = self.next.process(packet)
  File "/usr/local/lib/python2.7/dist-packages/Stetl-1.3-py2.7.egg/stetl/component.py", line 218, in process
    packet = self.next.process(packet)
  File "/usr/local/lib/python2.7/dist-packages/Stetl-1.3-py2.7.egg/stetl/component.py", line 218, in process
    packet = self.next.process(packet)
  File "/usr/local/lib/python2.7/dist-packages/Stetl-1.3-py2.7.egg/stetl/component.py", line 204, in process
    packet = self.invoke(packet)
  File "/etl/bgt/stetlbgt/subfeaturehandler.py", line 146, in invoke
    xf.flush()
  File "src/lxml/serializer.pxi", line 1347, in lxml.etree.xmlfile.__exit__
  File "src/lxml/serializer.pxi", line 1685, in lxml.etree._IncrementalFileWriter._close
  File "src/lxml/serializer.pxi", line 1691, in lxml.etree._IncrementalFileWriter._handle_error
  File "src/lxml/serializer.pxi", line 199, in lxml.etree._raiseSerialisationError
lxml.etree.SerialisationError: unknown error -2055577339

Yes, I'm still using Python 2.7, and I know I need to upgrade asap, but unfortunately I'm not able to do that at this moment due to other priorities. This error still occurs in the latest version of lxml (version 4.4.2), and also in version 3.7.

I also did some investigation, and I suspect this error is caused by the fact that libxml returns a signed 32-bit value when calling the function xmlOutputBufferClose. See http://www.xmlsoft.org/html/libxml-xmlIO.html#xmlOutputBufferClose.

My local test file, which I monitored, was slightly larger than 2.1 GB, or 2^31 bytes. The error code was a negative number whose magnitude was just below 2^31. So I decided to sum the two (negating the sign of the error code, of course), and the sum was exactly 2^32.
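
The arithmetic can be checked with a short sketch. It assumes the byte count is truncated to a signed 32-bit integer, and it derives an implied file size from the error code in the traceback above; the actual size of the test file is not stated in this report, so that number is an assumption. The helper name as_int32 is made up for illustration.

```python
def as_int32(n):
    """Reinterpret the low 32 bits of n as a signed 32-bit integer
    (hypothetical helper, mimics a C int wrapping around)."""
    return (n + 2**31) % 2**32 - 2**31

# Error code taken from the traceback in this comment.
error_code = -2055577339

# Assumption: the error code is the wrapped byte count, so the
# real size is the error code plus 2^32.
implied_size = 2**32 + error_code

# The implied size lies above 2^31 and wraps to the reported code.
assert implied_size > 2**31
assert as_int32(implied_size) == error_code

# Negating the error code and adding the size gives exactly 2^32.
assert implied_size + (-error_code) == 2**32

print(implied_size)
```

Under that assumption the implied size is a little over 2.2 GB, which fits a file slightly larger than 2^31 bytes.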

I have no hope that this error will ever be fixed in lxml, since that would have major complications (or hopefully they're working on a 64 bit port). But I hope it will be feasible for lxml to have this error fixed.

The file which I'm eventually generating is valid XML, so as a workaround I decided to catch the etree.Serialisat...


Frank Steggink (fsteggink) wrote :

With "I have no hope that this error will ever be fixed in lxml, since that would have major complications" I actually mean "I have no hope that this error will ever be fixed in libxml, since that would have major complications". My apologies.

scoder (scoder) wrote :

This should improve the situation:
https://github.com/lxml/lxml/commit/cfceec54a8d5b684e2572b02addf0adf5e786f2f
While it's still possible that the serialised data size hits exactly -1 on integer wrap-around, it's much less likely than hitting a negative number.
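
A rough model of that trade-off, not lxml's actual code: it assumes the earlier check treated any negative return value as an error and that the changed check treats only -1 as one, which is how the comment above reads. The function names are hypothetical.

```python
def as_int32(n):
    """Wrap a byte count the way a signed 32-bit C int would
    (hypothetical helper for this sketch)."""
    return (n + 2**31) % 2**32 - 2**31

def flagged_if_negative(byte_count):
    # Assumed earlier behaviour: any negative result is an error.
    return as_int32(byte_count) < 0

def flagged_if_minus_one(byte_count):
    # Assumed behaviour after the change: only -1 is an error.
    return as_int32(byte_count) == -1

small = 10 * 1024**2        # 10 MiB, no wrap-around
large = 2**32 - 2055577339  # size implied by the traceback's error code
unlucky = 2**32 - 1         # the one size per 2^32 that wraps to -1

for size in (small, large, unlucky):
    print(size, flagged_if_negative(size), flagged_if_minus_one(size))
```

In this model every size between 2^31 and 2^32 - 1 bytes trips the negative check, while only a size that wraps to exactly -1 trips the stricter one.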

I think it might be possible to further improve this by matching up more closely with what libxml2 does in its xmlOutputBufferClose() and looking at the buffer error state before calling it. PR welcome for that. Probably worth a helper function that wraps the call.

Changed in lxml:
importance: Undecided → Medium
status: New → Confirmed
Charlie_X (charlie) wrote :

Only ever had the error once but nice to know it's not been forgotten!

scoder (scoder) wrote :

I released 4.5.1 with this fix, but will leave the ticket open since I don't consider it complete.
