parsed XML raises lxml.etree.SerialisationError: IO_ENCODER on lxml.etree.tostring() call

Bug #400588 reported by Mateusz Korniak on 2009-07-17
This bug affects 3 people
Affects: lxml
Importance: Medium
Assigned to: scoder

Bug Description

When trying to dump parsed XML to a string, it raises the exception shown below [1].
Testcase attached.

[1]
$ python test.py
lxml.etree: (2, 2, 0, 0)
libxml used: (2, 7, 3)
libxml compiled: (2, 7, 3)
libxslt used: (1, 1, 24)
libxslt compiled: (1, 1, 24)
Got: lxml_elem: <Element nokaut at 81827fc>
Counted 13797 tags: 'offer', string.count()ed: 27602
Traceback (most recent call last):
  File "test.py", line 28, in <module>
    lxml_txt = lxml.etree.tostring(lxml_elem)
  File "lxml.etree.pyx", line 2625, in lxml.etree.tostring (src/lxml/lxml.etree.c:52048)
  File "serializer.pxi", line 124, in lxml.etree._tostring (src/lxml/lxml.etree.c:83724)
  File "serializer.pxi", line 149, in lxml.etree._raiseSerialisationError (src/lxml/lxml.etree.c:84012)
lxml.etree.SerialisationError: IO_ENCODER

Mateusz Korniak (matkor) wrote :
scoder (scoder) wrote :

It is quite possible that the recovery mode leaves broken content in the tree here. However, recovering from broken data is not guaranteed to succeed, so I won't give a high priority to this problem.

Changed in lxml:
importance: Undecided → Wishlist
status: New → Confirmed
scoder (scoder) wrote :

I think I changed my mind on this. It's very annoying that this only happens at serialisation time and not right during parsing. The problem is that (for some hard-to-understand reason) libxml2 switches to Latin-1 decoding when it encounters UTF-8 errors, so the tree ends up with a mixed encoding in this case.

My proposal would be to treat this case as a hard error and raise an exception, even if the user asked the parser to "recover". There might be a way to post-process the tree (all text content, all names) to recover more gracefully from this, but even that would only be a half-baked solution that cannot provide a "correct" result, simply because there is no correct result in the case of illegal input.
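
To illustrate why a mixed-encoding tree has no correct reading, here is a minimal plain-Python sketch that mimics the fallback described above. It is not libxml2's actual code; `decode_with_latin1_fallback` is a hypothetical helper, and the exact switch-over point inside libxml2 is an assumption.

```python
def decode_with_latin1_fallback(raw: bytes) -> str:
    """Decode as UTF-8; on the first error, decode the rest as Latin-1.

    Hypothetical helper that only imitates the described behaviour.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Everything before exc.start is valid UTF-8.
        head = raw[:exc.start].decode("utf-8")
        # Latin-1 maps every byte to a character, so this never fails,
        # but any valid UTF-8 after the error turns into mojibake.
        tail = raw[exc.start:].decode("latin-1")
        return head + tail


word = "żółw".encode("utf-8")
# One stray Latin-1 byte (0xE9) sits between two valid UTF-8 runs.
raw = word + b" caf\xe9 " + word
text = decode_with_latin1_fallback(raw)
```

In `text`, the first `żółw` survives, but the second one is decoded byte by byte as Latin-1 and comes out garbled, even though its bytes were valid UTF-8.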

scoder (scoder) wrote :

Bug #1322781 has an additional test case that uses the HTML parser.

scoder (scoder) wrote :
Changed in lxml:
assignee: nobody → scoder (scoder)
importance: Wishlist → Medium
status: Confirmed → Fix Committed
Szépe Viktor (szepe.viktor) wrote :

I was switching to lxml because it is said to be the most fault-tolerant.

scoder (scoder) wrote :

There isn't really anything you can do if your document has encoding problems. Meaning, there is no right way to process the input, so you will almost certainly lose some of your data. And recovering isn't really straightforward either in this case (not given the behaviour of libxml2's parser).

I agree that the new behaviour isn't ideal, but I consider it better to raise an error immediately instead of silently setting up an illegal state that will eventually trigger incorrect behaviour. On top of that, improvements are certainly welcome.
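
The same fail-fast idea can be applied on the caller's side before the data reaches the parser. A minimal sketch, assuming the document is meant to be UTF-8 (Python 3); `require_valid_utf8` is a hypothetical helper and not part of lxml:

```python
def require_valid_utf8(raw: bytes) -> bytes:
    """Return raw unchanged, or raise ValueError naming the first bad byte."""
    try:
        raw.decode("utf-8")  # strict decoding, used only as a validity check
    except UnicodeDecodeError as exc:
        bad = raw[exc.start:exc.end]
        raise ValueError(
            f"invalid UTF-8 at byte offset {exc.start}: {bad!r}"
        ) from exc
    return raw


try:
    require_valid_utf8(b"<a>caf\xe9</a>")
    message = None
except ValueError as err:
    message = str(err)
```

Reporting the byte offset makes it easier to find and repair the offending spot in the source file.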

Szépe Viktor (szepe.viktor) wrote :

This simple function can replace or ignore invalid characters:
https://docs.python.org/2/library/stdtypes.html#str.encode
Is it possible to implement it in lxml?

Szépe Viktor (szepe.viktor) wrote :

Or: Is it possible to run encode() on the document before giving it to lxml?

scoder (scoder) wrote :

Changing the decoding layer would be possible but isn't trivial. I'd like to avoid that for now.

Passing a readily decoded Unicode string into the parser works, though.
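
A minimal sketch of that workaround, assuming Python 3 and UTF-8 as the intended encoding. The helper names are hypothetical. `bytes.decode` accepts the same `errors` handlers ("replace", "ignore") as the `str.encode` function linked above. lxml rejects Unicode strings that carry an encoding declaration, so the sketch strips a leading XML declaration before parsing.

```python
import re

# Matches a leading XML declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>
_XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>")


def decode_leniently(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, replacing undecodable sequences with U+FFFD."""
    return raw.decode(encoding, errors="replace")


def strip_xml_declaration(text: str) -> str:
    """Remove a leading XML declaration, if there is one."""
    return _XML_DECL.sub("", text, count=1)


def parse_leniently(raw: bytes, encoding: str = "utf-8"):
    """Decode first, then hand the Unicode string to lxml."""
    # Imported here so the helpers above stay usable without lxml installed.
    from lxml import etree

    text = strip_xml_declaration(decode_leniently(raw, encoding))
    return etree.fromstring(text)
```

With `errors="replace"` the damaged characters are still lost, but the loss is explicit (U+FFFD markers) and the resulting tree can be serialised again.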

scoder (scoder) wrote :

Marking as done to get this ticket closed. Fix was released in lxml 3.4.0.

Changed in lxml:
milestone: none → 3.4
status: Fix Committed → Fix Released