Unicode Emoji raise etree.XMLSyntaxError at etree.fromstring()

Bug #1538213 reported by Minho Ryang on 2016-01-26
This bug affects 4 people
Affects: lxml
Importance: Undecided
Assigned to: Unassigned

Bug Description

OS X 10.11.2(15C50)
Python : sys.version_info(major=3, minor=5, micro=0, releaselevel='final', serial=0)
lxml.etree : (3, 5, 0, 0)
libxml used : (2, 9, 2)
libxml compiled : (2, 9, 2)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

I want U+1F576 Sunglasses!
But this test.py doesn't work.

```python
#!/usr/bin/env python3
import sys
from lxml import html, etree

print("%-20s: %s" % ('Python', sys.version_info))
print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))

uni = "<p>Unicode! \U0001F576 Sunglasses!</p>"
#t = html.fragment_fromstring(uni) # XXX: lxml.etree.ParserError: Document is empty
t = etree.fromstring(uni, parser=etree.XMLParser(encoding='unicode'))
print("B", etree.tostring(t))
print("U", etree.tostring(t, encoding='unicode'))
```

```pytb
Traceback (most recent call last):
  File "test.py", line 14, in <module>
    t = etree.fromstring(uni, parser=etree.XMLParser(encoding='unicode'))
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:82934)
  File "src/lxml/parser.pxi", line 1819, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:124533)
  File "src/lxml/parser.pxi", line 1700, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:122964)
  File "src/lxml/parser.pxi", line 1040, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:116705)
  File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:110510)
  File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:112276)
  File "src/lxml/parser.pxi", line 613, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:111124)
```
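A possible workaround, sketched here rather than confirmed anywhere in this report: the traceback goes through `_parseUnicodeDoc`, the path taken for `str` input, so one thing to try is encoding the text to UTF-8 bytes and parsing those with the default parser. Dropping the `encoding='unicode'` parser argument is part of that assumption; whether this avoids the error on the affected OS X setups is untested.

```python
from lxml import etree

uni = "<p>Unicode! \U0001F576 Sunglasses!</p>"

# Assumption: feeding UTF-8 bytes keeps lxml off the str-specific
# parsing path (_parseUnicodeDoc) that shows up in the traceback.
data = uni.encode("utf-8")

# Default parser, no encoding override.
t = etree.fromstring(data)

print("B", etree.tostring(t))
print("U", etree.tostring(t, encoding="unicode"))
```

If the emoji survives the round trip, `t.text` should compare equal to the text inside the original `<p>` element.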

Blake Winton (bwinton) wrote :

I'm running into the same problem, but can verify that the same code works with Python 2.7.10, which should hopefully help narrow it down a little… :)

David D Lowe (flimm) wrote :

I also experience this bug with these version numbers, although the error message is a bit more helpful.

Python : sys.version_info(major=3, minor=6, micro=1, releaselevel='final', serial=0)
lxml.etree : (3, 7, 3, 0)
libxml used : (2, 9, 4)
libxml compiled : (2, 9, 4)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)
Traceback (most recent call last):
  File "hi.py", line 14, in <module>
    t = etree.fromstring(uni, parser=etree.XMLParser(encoding='unicode'))
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:79010)
  File "src/lxml/parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:118341)
  File "src/lxml/parser.pxi", line 1729, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:116899)
  File "src/lxml/parser.pxi", line 1063, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:110886)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105109)
  File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106817)
  File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105671)
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range, line 1, column 2
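One way to read "Char 0x0 ... line 1, column 2" (a guess, not something established in this thread): if the parser were handed a UTF-16 or UTF-32 buffer but decoded it as a single-byte or UTF-8 stream, the byte right after `<` would be 0x00. The stdlib-only sketch below just prints the leading bytes of the test string in each encoding; the helper name `leading_bytes` is made up for illustration.

```python
snippet = "<p>Unicode! \U0001F576 Sunglasses!</p>"


def leading_bytes(text, encoding, count=4):
    """Return the first `count` bytes of `text` in the given encoding."""
    return text.encode(encoding)[:count]


utf8_head = leading_bytes(snippet, "utf-8")
utf16_head = leading_bytes(snippet, "utf-16-le")
utf32_head = leading_bytes(snippet, "utf-32-le")

# The emoji itself needs four bytes in UTF-8.
emoji_utf8 = "\U0001F576".encode("utf-8")

for name, head in [("utf-8", utf8_head),
                   ("utf-16-le", utf16_head),
                   ("utf-32-le", utf32_head)]:
    print("%-10s: %r" % (name, head))
print("%-10s: %r" % ("emoji", emoji_utf8))
```

In both wide encodings the second byte is a NUL, which would line up with the reported column if the buffer's encoding were misdetected.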

scoder (scoder) wrote :

Example code works for me on Linux.

Changed in lxml:
status: New → Triaged

scoder (scoder) wrote :

Closing, can't reproduce.

Changed in lxml:
status: Triaged → Invalid