UnicodeEncodeError when parsing XML with non-ASCII (Unicode) tag names

Bug #1416339 reported by petrikoz on 2015-01-30
This bug affects 1 person

Bug Description


Python : sys.version_info(major=2, minor=7, micro=6, releaselevel='final', serial=0)
lxml.etree : (3, 4, 1, 0)
libxml used : (2, 9, 1)
libxml compiled : (2, 9, 1)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

When parsing a UTF-8 encoded XML file whose tag names contain non-ASCII characters, I get a UnicodeEncodeError.

Code illustrating the bug:

>>> import urllib2
>>> from lxml import etree
>>> xml_file = urllib2.urlopen('https://gist.githubusercontent.com/petrikoz/b5307616bab6247a958a/raw/81a52e2079986e57ab227e72e01b2e7b6f8d97ee/lxml-bug.xml')
>>> root = etree.XML(xml_file.read())
>>> root
<repr(<lxml.etree._Element at 0x7f59f0733488>) failed: UnicodeEncodeError: 'ascii' codec can't encode characters in position 9-25: ordinal not in range(128)>

Why does this happen?
Why doesn't lxml return a Unicode representation of an element whose tag name contains non-ASCII characters?

scoder (scoder) wrote :

My guess is that it's a Py2.x problem. IIRC, repr() is expected to return a byte string in Py2.x, and lxml returns a unicode string. Python then tries to encode it to a byte string with the default ASCII codec and fails. So the error happens outside of lxml, inside of Python. This has been fixed in Python 3.x, which properly supports (and in fact requires) a unicode text string as the result of repr().

Given that the tag name may not be representable with an ASCII encoded byte string (and clearly is not in this case), there isn't really a correct way to do this. I mean, lxml could return something like "unprintable tag name" for non-ascii tag names in repr(), but that wouldn't really be satisfactory... Although it would still be better than letting Python raise an exception.
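
For illustration only, here is a minimal sketch of one possible ASCII-safe fallback, assuming the non-ASCII characters are simply escaped rather than replaced by a fixed placeholder. The helper name `ascii_safe_tag_repr` and the sample tag are hypothetical (the original sample file is not reproduced here), and this is not necessarily what lxml does or will do.

```python
def ascii_safe_tag_repr(tag, address=0):
    """Hypothetical helper: build an ASCII-only repr-like string for a tag name."""
    # 'backslashreplace' turns every non-ASCII character into a \uXXXX escape,
    # so the result can always be encoded with the ASCII codec.
    escaped = tag.encode('ascii', 'backslashreplace').decode('ascii')
    return '<Element %s at 0x%x>' % (escaped, address)


# Made-up Cyrillic tag name standing in for the one in the report.
tag = u'\u0442\u0435\u0433'

# A strict ASCII encode is roughly what Py2.x attempts on a unicode repr() result.
try:
    tag.encode('ascii')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

safe = ascii_safe_tag_repr(tag, 0x7f59f0733488)
print(strict_ok)
print(safe)
```

The escaped form keeps the tag name recoverable, at the cost of being harder to read than the original characters.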

Changed in lxml:
assignee: nobody → scoder (scoder)
importance: Undecided → Low
status: New → Triaged

petrikoz (po-zelenin) wrote :

Ok. Thank you for your answer.

I agree that it would be better to let Python raise an exception.

Olli Pottonen (olli-pottonen) wrote :

I think this is fixed on master on GitHub, see https://github.com/lxml/lxml/pull/159.
But the sample XML is not available anymore, so I could not test it to find out for certain.
