parsed XML raises lxml.etree.SerialisationError: IO_ENCODER on lxml.etree.tostring() call

Bug #400588 reported by Mateusz Korniak
This bug affects 3 people
Affects: lxml
Status: Fix Released
Importance: Medium
Assigned to: scoder
Milestone: 3.4

Bug Description

When trying to dump parsed XML to a string, lxml raises the exception shown in [1].
Testcase attached.

[1]
$ python test.py
lxml.etree: (2, 2, 0, 0)
libxml used: (2, 7, 3)
libxml compiled: (2, 7, 3)
libxslt used: (1, 1, 24)
libxslt compiled: (1, 1, 24)
Got: lxml_elem: <Element nokaut at 81827fc>
Counted 13797 tags: 'offer', string.count()ed: 27602
Traceback (most recent call last):
  File "test.py", line 28, in <module>
    lxml_txt = lxml.etree.tostring(lxml_elem)
  File "lxml.etree.pyx", line 2625, in lxml.etree.tostring (src/lxml/lxml.etree.c:52048)
  File "serializer.pxi", line 124, in lxml.etree._tostring (src/lxml/lxml.etree.c:83724)
  File "serializer.pxi", line 149, in lxml.etree._raiseSerialisationError (src/lxml/lxml.etree.c:84012)
lxml.etree.SerialisationError: IO_ENCODER

scoder (scoder) wrote :

It is quite possible that the recovery mode leaves broken content in the tree here. However, recovering from broken data is not guaranteed to succeed, so I won't give a high priority to this problem.

Changed in lxml:
importance: Undecided → Wishlist
status: New → Confirmed
scoder (scoder) wrote :

I think I changed my mind on this. It's very annoying that this only happens at serialisation time and not right during parsing. The problem is that (for some hard-to-understand reason) libxml2 switches to Latin-1 decoding when it encounters UTF-8 errors, so the tree ends up with a mixed encoding in this case.

My proposal would be to treat this case as a hard error and raise an exception, even if the user asked the parser to "recover". There might be a way to post-process the tree (all text content, all names) to recover more gracefully from this, but even that would only be a half-baked solution that cannot provide a "correct" result, simply because there is no correct result in the case of illegal input.

scoder (scoder) wrote :

Bug #1322781 has an additional test case that uses the HTML parser.

scoder (scoder) wrote :
Changed in lxml:
assignee: nobody → scoder (scoder)
importance: Wishlist → Medium
status: Confirmed → Fix Committed
Szépe Viktor (szepe.viktor) wrote :

I was switching to lxml because it is said to be the most fault-tolerant.

scoder (scoder) wrote :

There isn't really anything you can do if your document has encoding problems. That is, there is no right way to process the input, so you will almost certainly lose some of your data. And recovering isn't really straightforward in this case either (not given the behaviour of libxml2's parser).

I agree that the new behaviour isn't ideal, but I consider it better to raise an error immediately instead of silently setting up an illegal state that will eventually trigger incorrect behaviour. On top of that, improvements are certainly welcome.

Szépe Viktor (szepe.viktor) wrote :

This simple function can replace or ignore invalid characters:
https://docs.python.org/2/library/stdtypes.html#str.encode
Is it possible to implement it in lxml?
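
For reference, a minimal sketch of the error handlers that question refers to, assuming Python 3 and UTF-8 input. Since the document arrives as bytes, the relevant call is `bytes.decode()` rather than `str.encode()`; the sample bytes and the helper name `decode_with` are made up for illustration:

```python
# Sample input: 0xE9 is 'é' in Latin-1 but not a valid UTF-8 sequence here.
raw = b"caf\xe9 ok"


def decode_with(handler):
    """Decode the sample bytes as UTF-8 using the given error handler."""
    return raw.decode("utf-8", errors=handler)


replaced = decode_with("replace")  # invalid byte becomes U+FFFD
ignored = decode_with("ignore")    # invalid byte is silently dropped

try:
    decode_with("strict")          # the default handler raises instead
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Both lenient handlers discard information: `"replace"` at least leaves a visible marker where data was lost, while `"ignore"` hides the damage.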

Szépe Viktor (szepe.viktor) wrote :

Or: Is it possible to run encode() on the document before giving it to lxml?

scoder (scoder) wrote :

Changing the decoding layer would be possible but isn't trivial. I'd like to avoid that for now.

Passing an already decoded Unicode string into the parser works, though.
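
A rough sketch of that workaround, assuming Python 3, UTF-8 input, and an installed lxml. The helper names `to_parser_text` and `parse_leniently` are hypothetical and not part of lxml. The XML declaration is stripped because lxml refuses `str` input that still carries an encoding declaration:

```python
import re

# Leading XML declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>
_XML_DECLARATION = re.compile(r"^\s*<\?xml\s[^>]*\?>")


def to_parser_text(raw, encoding="utf-8"):
    """Decode raw bytes leniently and drop the XML declaration.

    Undecodable bytes become U+FFFD, so some data is lost.
    """
    text = raw.decode(encoding, errors="replace")
    # lxml rejects str input that still declares an encoding.
    return _XML_DECLARATION.sub("", text, count=1)


def parse_leniently(raw, encoding="utf-8"):
    """Parse bytes that may contain invalid sequences for `encoding`."""
    from lxml import etree  # imported lazily; requires lxml to be installed

    return etree.fromstring(to_parser_text(raw, encoding))
```

With this, something like `lxml.etree.tostring(parse_leniently(data))` should no longer hit the mixed-encoding state described above, because the parser only ever sees text that has already been decoded. The damaged characters are still gone for good.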

scoder (scoder) wrote :

Marking as done to get this ticket closed. Fix was released in lxml 3.4.0.

Changed in lxml:
milestone: none → 3.4
status: Fix Committed → Fix Released