Elements cannot be pickled

Bug #736708 reported by Joshua Hopp
This bug affects 4 people

Bug Description

_Element objects should provide __getstate__, __setstate__, __reduce__ methods in order to work with pickle (http://docs.python.org/library/pickle.html).

Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 2, 6, 0)
libxml used : (2, 7, 7)
libxml compiled : (2, 7, 6)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Joshua Hopp (joshuahopp) wrote :
scoder (scoder) wrote :

This has been discussed on the mailing list several times. Please read the archives for the reasons why this is not trivial.

scoder (scoder)
Changed in lxml:
importance: Undecided → Wishlist
status: New → Triaged
Marcin Raczyński (marc1nr) wrote :

It is worse with pickle.HIGHEST_PROTOCOL: pickling succeeds, but unpickling fails with a misleading error:

>>> from lxml import etree
>>> import pickle
>>> etree.__version__
>>> pickled = pickle.dumps(etree.Element('x'), protocol=pickle.HIGHEST_PROTOCOL)
>>> pickle.loads(pickled)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "src/lxml/etree.pyx", line 1131, in lxml.etree._Element.__repr__
  File "src/lxml/etree.pyx", line 981, in lxml.etree._Element.tag.__get__
  File "src/lxml/apihelpers.pxi", line 19, in lxml.etree._assertValidNode
AssertionError: invalid Element proxy at 140260172089392

Marcin Raczyński (marc1nr) wrote :

I used Python 2.7.15rc1

Marcin Raczyński (marc1nr) wrote :

One of the ugly consequences (a mysterious exception when using multiprocessing with lxml): https://stackoverflow.com/questions/29570715/how-to-fix-lxml-assertion-error

Marcin Raczyński (marc1nr) wrote :

From https://bugs.python.org/issue34894 (by Serhiy Storchaka):

lxml.etree classes don't implement any of the pickling-related methods: __reduce__, __reduce_ex__, __getstate__, __setstate__, __getnewargs__, __getnewargs_ex__. But they are extension classes that contain state invisible to Python. As a result, they are pickled as empty classes, which leads to an unexpected error while unpickling. Python 3 detects such cases and raises an exception while pickling. This change was not backported to 2.7 for compatibility reasons.

The only way to fix this issue in 2.7 is to implement pickle-related methods (e.g. __getstate__ or __reduce__) in lxml.etree classes. They should either raise an exception, which prevents pickling these objects, or implement pickling support.
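
To illustrate the first option (refusing to pickle), here is a minimal sketch in plain Python. NativeBackedProxy and try_pickle are hypothetical names invented for this example; they are not lxml code, and the class only stands in for an extension type whose real state lives outside Python.

```python
import pickle


class NativeBackedProxy:
    """Hypothetical stand-in for an extension type with state hidden from Python."""

    __slots__ = ("_handle",)

    def __init__(self, handle):
        # Imagine this is a pointer into a C-level tree that pickle cannot see.
        self._handle = handle

    def __reduce__(self):
        # Fail loudly at pickling time instead of producing an empty,
        # unusable object at unpickling time.
        raise TypeError("cannot pickle %r object" % type(self).__name__)


def try_pickle(obj):
    """Return the pickled bytes, or the TypeError raised while pickling."""
    try:
        return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    except TypeError as exc:
        return exc


result = try_pickle(NativeBackedProxy(0x1234))
```

With this in place the error surfaces in pickle.dumps, at the point where the mistake is made, rather than later as an invalid proxy.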

scoder (scoder) wrote :

Pickling is not trivial. Why? Because the object representation of an XML tree depends on more than just the serialised data. The element classes used to represent the tree in memory depend on the element lookup, which is configured through the parser or document. That lookup cannot be easily pickled as its state can be arbitrarily large and it might even be user implemented, i.e. code instead of data.

Thus, while serialising the tree to XML is obviously trivial, unpickling a serialised tree into the expected object representation is hard. That can make a pickle/unpickle cycle surprising for users, since it would depend on the current state of the program and not just the serialised data.

One could argue that this is as good as it gets, and might be good enough. I would consider a PR that implements a simple object-agnostic pickle/unpickle mechanism for arbitrary Elements.
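
As a rough illustration of what such an object-agnostic mechanism could look like from user code, the sketch below registers a reduce function through the standard copyreg module. This is an assumption about one possible approach, not the implementation proposed for lxml; _reduce_element and _rebuild_element are hypothetical helper names. It shows the limitation described above: the tree is rebuilt with the default parser, so custom element classes from a lookup are not restored.

```python
import copyreg
import pickle

from lxml import etree


def _rebuild_element(xml_bytes):
    # Parse with the default parser: any custom element lookup that was
    # active for the original tree is not restored here.
    return etree.fromstring(xml_bytes)


def _reduce_element(element):
    # Serialise only the element and its subtree. The tail text is left out
    # because it could not follow the root element of a new document.
    return _rebuild_element, (etree.tostring(element, with_tail=False),)


# etree.Element is a factory function; etree._Element is the actual class.
# copyreg matches the exact type, so subclasses are not covered by this call.
copyreg.pickle(etree._Element, _reduce_element)

root = etree.fromstring("<a x='1'><b>text</b></a>")
clone = pickle.loads(pickle.dumps(root, protocol=pickle.HIGHEST_PROTOCOL))
```

The unpickled element is always the root of a new document, even if the original was a child inside a larger tree.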

Changed in lxml:
status: Triaged → Confirmed
scoder (scoder) wrote :

Note that Elements from lxml.objectify support pickling, so that could serve as an example.
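
A short round trip with lxml.objectify, as a sketch of the behaviour mentioned above. The XML content and the variable names are made up for the example.

```python
import pickle

from lxml import objectify

# Build a small objectified tree and send it through pickle.
root = objectify.fromstring("<root><answer>42</answer></root>")
restored = pickle.loads(pickle.dumps(root))
```

The restored object is a separate tree parsed again by objectify, so typed access such as restored.answer.pyval works as it does on the original.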
