Parsing a Chinese string that starts with `<` raises ParserError

Bug #1374250 reported by wonderfuly on 2014-09-26
This bug affects 1 person
Affects: lxml
Status: Confirmed
Importance: Low
Assigned to: Unassigned

Bug Description

>>> from lxml.html import fromstring
>>> fromstring('<你')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 634, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 535, in document_fromstring
    "Document is empty")
ParserError: Document is empty

你 is a Chinese character; its Unicode representation is '\u4f60':

>>> from lxml.html import fromstring
>>> fromstring('<\u4f60')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 634, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 535, in document_fromstring
    "Document is empty")
ParserError: Document is empty

So it seems like the combination `<\u` is the issue.

Version Info:

Python : sys.version_info(major=2, minor=7, micro=6, releaselevel='final', serial=0)
lxml.etree : (3, 4, 0, 0)
libxml used : (2, 9, 0)
libxml compiled : (2, 9, 0)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

scoder (scoder) wrote :

The exception is actually correct: there is no document to parse here.

However, given that the parser tries to recover from parse errors, it can be argued that it should return a document regardless, i.e. it should create an empty tag and return that.

Patches welcome.
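
Until such a patch exists, the fallback described above can be approximated on the caller's side. The following is only a minimal sketch, not a change to lxml itself: `parse_or_empty` and the `fallback_tag` parameter are hypothetical names introduced for illustration, and whether an input like `<你` still raises depends on the installed lxml and libxml2 versions.

```python
from lxml import etree
import lxml.html


def parse_or_empty(text, fallback_tag="span"):
    """Parse HTML, returning an empty element instead of raising.

    Hypothetical caller-side helper: it mimics the "create an empty tag
    and return that" behaviour suggested in the comment above.
    """
    try:
        return lxml.html.fromstring(text)
    except etree.ParserError:
        # Raised with "Document is empty" when the parser produces no root.
        return lxml.html.Element(fallback_tag)


# Well-formed input is parsed as usual.
print(parse_or_empty("<p>hi</p>").tag)

# Input from this report: either a parsed element or the empty fallback,
# depending on the parser version in use.
print(parse_or_empty(u"<\u4f60").tag)
```

Returning an empty element keeps callers that expect an element working, at the cost of hiding the fact that nothing was parsed. Code that needs to tell the two cases apart should catch `ParserError` directly instead.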

Changed in lxml:
importance: Undecided → Low
status: New → Confirmed