invalid UTF-8 characters cause error

Bug #1322781 reported by Szépe Viktor
This bug affects 1 person
Affects   Status    Importance   Assigned to   Milestone
lxml      Triaged   Undecided    scoder

Bug Description

The failing code is part of an email corrector:

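from lxml import etree

# msg is the email message being corrected; charset is its declared charset
payload = msg.get_payload(decode=True)
parser = etree.HTMLParser(encoding=str(charset))
dom_tree = etree.fromstring(payload, parser)
# the next call fails
etree.dump(dom_tree, pretty_print=True)
# the next call fails as well
output = etree.tostring(dom_tree, pretty_print=True, method='html')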

  File "/usr/local/lib/python2.6/dist-packages/pythonfilter/gyogyito2.py", line 149, in check_htmlonly
    etree.dump(dom_tree, pretty_print=True)
  File "lxml.etree.pyx", line 3070, in lxml.etree.dump (src/lxml/lxml.etree.c:68729)
  File "lxml.etree.pyx", line 3157, in lxml.etree.tostring (src/lxml/lxml.etree.c:69346)
  File "serializer.pxi", line 135, in lxml.etree._tostring (src/lxml/lxml.etree.c:114380)
  File "serializer.pxi", line 195, in lxml.etree._raiseSerialisationError (src/lxml/lxml.etree.c:115052)
lxml.etree.SerialisationError: IO_ENCODER

The message (msg) is available at:
https://gist.githubusercontent.com/szepeviktor/9a82c38754cf1e83dcc8/raw/8590189a449b577df1ec7ef465cf60ed53919bab/uni3.orig

Please point me to where I should begin debugging. I think the � characters are the cause.
Thank you!
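
One possible starting point for debugging is to check whether the raw payload actually decodes in the declared charset, and at which byte offsets it does not. The following is a minimal sketch using only the standard library, written for Python 3 rather than the Python 2.6 used in this report. The function name find_invalid_bytes is hypothetical and not part of lxml.

```python
def find_invalid_bytes(payload, charset):
    """Return (offset, bytes) pairs for byte runs that do not decode in charset.

    Hypothetical helper for illustration; payload is a bytes object such as
    the result of msg.get_payload(decode=True).
    """
    bad = []
    pos = 0
    while pos < len(payload):
        try:
            payload[pos:].decode(charset)
            break  # the rest of the payload decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice, so shift them
            start = pos + exc.start
            end = max(pos + exc.end, start + 1)
            bad.append((start, payload[start:end]))
            pos = end  # continue scanning after the offending bytes
    return bad


if __name__ == "__main__":
    sample = b"ok \xff\xfe end"
    for offset, raw in find_invalid_bytes(sample, "utf-8"):
        print("offset %d: %r" % (offset, raw))
```

If the list is non-empty, the payload contains bytes that are invalid for the declared charset. Decoding with payload.decode(charset, "replace") substitutes U+FFFD for such bytes; whether passing the resulting text to the parser avoids the serialisation error above has not been verified here.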

Szépe Viktor (szepe.viktor) wrote :

Debian squeeze 32 bit

Python 2.6.6 (r266:84292, Dec 27 2010, 00:02:40)
[GCC 4.4.5] on linux2

pip freeze|grep lxml
lxml==3.3.5

scoder (scoder) wrote :

Works for me using latest lxml master and libxml2 2.9.1.
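
Because the behaviour differs between lxml 3.3.5 and a current master build against libxml2 2.9.1, it may help to compare the exact library versions on both machines. The sketch below collects them; describe_environment is a hypothetical helper name, and the code assumes only the version constants that lxml.etree exposes (LXML_VERSION, LIBXML_VERSION, LIBXML_COMPILED_VERSION). It reports lxml as None when the package is not installed.

```python
import sys


def describe_environment():
    """Collect version strings relevant when comparing lxml behaviour.

    Hypothetical helper for illustration only.
    """
    info = {"python": sys.version.split()[0]}
    try:
        from lxml import etree
    except ImportError:
        info["lxml"] = None  # lxml is not installed in this environment
        return info
    info["lxml"] = ".".join(str(n) for n in etree.LXML_VERSION)
    # libxml2 version loaded at runtime vs. the one lxml was built against
    info["libxml2 (runtime)"] = ".".join(str(n) for n in etree.LIBXML_VERSION)
    info["libxml2 (compiled)"] = ".".join(
        str(n) for n in etree.LIBXML_COMPILED_VERSION
    )
    return info


if __name__ == "__main__":
    for name, value in sorted(describe_environment().items()):
        print("%s: %s" % (name, value))
```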

Changed in lxml:
assignee: nobody → scoder (scoder)
status: New → Triaged

scoder (scoder) wrote :

Issue #400588 covers the more general problem, so let's move the discussion over there.

Szépe Viktor (szepe.viktor) wrote :

OK. You can close this one.
