lxml

Parsing a Chinese string that starts with `<` raises ParserError

Bug #1374250 reported by wonderfuly on 2014-09-26

This bug affects 1 person

Affects   Status      Importance   Assigned to   Milestone
lxml      Confirmed   Low          Unassigned

Bug Description

>>> from lxml.html import fromstring
>>> fromstring('<你')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 634, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 535, in document_fromstring
    "Document is empty")
ParserError: Document is empty

你 is a Chinese character; its Unicode representation is '\u4f60':

>>> from lxml.html import fromstring
>>> fromstring('<\u4f60')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 634, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 535, in document_fromstring
    "Document is empty")
ParserError: Document is empty

So it seems like the combination `<\u` is the issue.
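One way to test that guess is to feed the parser a few variants and see which ones raise. The sketch below is illustrative only: `probe` and the sample list are hypothetical, and no particular outcome is claimed because results may differ between lxml and libxml2 versions. Note that in Python 2 a `\u` escape is only decoded inside a `u'...'` literal, so the second session above may have passed a literal backslash rather than 你; the samples include both forms for that reason.

```python
from lxml import etree, html


def probe(samples):
    """Map each sample to the name of the exception it raises, or None."""
    results = {}
    for text in samples:
        try:
            html.fromstring(text)
        except (etree.ParserError, etree.XMLSyntaxError) as exc:
            # XMLSyntaxError is caught as a precaution; the report only
            # shows ParserError.
            results[text] = type(exc).__name__
        else:
            results[text] = None
    return results


# Hypothetical inputs: the character itself, the literal backslash form,
# a digit after "<", a well-formed tag, and the character without "<".
samples = [u"<\u4f60", u"<\\u4f60", u"<1", u"<p>", u"\u4f60"]

for text, outcome in probe(samples).items():
    print(repr(text), "->", outcome or "parsed")
```

If the ASCII-only variants fail in the same way, the trigger is more likely `<` followed by a character that cannot start a tag name than anything specific to Chinese text.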

Version Info:

Python : sys.version_info(major=2, minor=7, micro=6, releaselevel='final', serial=0)
lxml.etree : (3, 4, 0, 0)
libxml used : (2, 9, 0)
libxml compiled : (2, 9, 0)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

scoder (scoder) wrote on 2014-12-06:

The exception is actually correct: there is no document to parse here.

However, given that the parser tries to recover from parse errors, it can be argued that it should return a document regardless, i.e. it should create an empty tag and return that.

Patches welcome.
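
Until such a patch exists, the recovery behaviour suggested above can be approximated in user code. The following is a minimal sketch, not lxml's own behaviour: `fromstring_or_empty` and its `fallback_tag` parameter are hypothetical names, the empty `<span>` is an arbitrary choice, and catching `XMLSyntaxError` alongside the `ParserError` from the traceback is a precaution rather than something the report confirms.

```python
from lxml import etree, html


def fromstring_or_empty(text, fallback_tag="span"):
    """Parse an HTML fragment; return an empty element if nothing parses.

    Hypothetical helper: the name and the fallback tag are assumptions
    made for illustration only.
    """
    try:
        return html.fromstring(text)
    except (etree.ParserError, etree.XMLSyntaxError):
        # The parser produced no document at all, so build an empty element.
        return html.Element(fallback_tag)


element = fromstring_or_empty(u"<\u4f60")
print(element.tag, len(element))
```

Callers that need to tell "no document" apart from a genuinely empty element should keep catching the exception themselves instead of using a wrapper like this.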

Changed in lxml:
importance:	Undecided → Low
status:	New → Confirmed
