empty document causes strange parse error (memory pointer issue?)

Bug #761215 reported by James William Pye on 2011-04-14
Affects: lxml
Status: Confirmed
Importance: Low
Assigned to: Unassigned

Bug Description

b*jwp@torch:clients 0$ python lxmlv.py
Python : sys.version_info(major=2, minor=7, micro=1, releaselevel='final', serial=0)
lxml.etree : (2, 3, -99, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 3)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 24)

Python 2.7.1 (r271:86832, Jan 19 2011, 15:23:13)
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.html
>>> lxml.html
<module 'lxml.html' from '/pluto/python/lib/python2.7/site-packages/lxml/html/__init__.py'>
>>> lxml.html.fromstring
<function fromstring at 0x1005e36e0>
>>> fs=lxml.html.fromstring
>>> fs
<function fromstring at 0x1005e36e0>
>>> fs(b'')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/pluto/python/lib/python2.7/site-packages/lxml/html/__init__.py", line 634, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/pluto/python/lib/python2.7/site-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2740, in lxml.etree.fromstring (src/lxml/lxml.etree.c:52793)
  File "parser.pxi", line 1556, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79602)
  File "parser.pxi", line 1435, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78449)
  File "parser.pxi", line 943, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75099)
  File "parser.pxi", line 547, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71467)
  File "parser.pxi", line 628, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72340)
  File "parser.pxi", line 579, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71851)
lxml.etree.XMLSyntaxError: None
>>> fs(b'<x/>')
<Element x at 0x1005ced10>
>>> fs(b'')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/pluto/python/lib/python2.7/site-packages/lxml/html/__init__.py", line 634, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/pluto/python/lib/python2.7/site-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2740, in lxml.etree.fromstring (src/lxml/lxml.etree.c:52793)
  File "parser.pxi", line 1556, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79602)
  File "parser.pxi", line 1435, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78449)
  File "parser.pxi", line 943, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75099)
  File "parser.pxi", line 547, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71467)
  File "parser.pxi", line 628, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72340)
  File "parser.pxi", line 577, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71815)
lxml.etree.XMLSyntaxError: line 1: Tag x invalid

Olli Pottonen (olli-pottonen) wrote :

Is this really a bug? An XML document must contain a root node, and I suppose an HTML document must as well, so an empty string is not a valid XML/HTML document. Did I miss something?

Of course, you can argue that an empty document should cause an informative, useful error, not a strange one.
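
For anyone hitting this in the meantime, a caller-side guard can produce the informative error. This is only a minimal sketch of a workaround, not a fix in lxml: the helper name parse_html_or_fail and its parser argument are made up for illustration, and treating whitespace-only input as empty is an assumption.

```python
def parse_html_or_fail(data, parser=None):
    """Parse HTML, rejecting empty input with an explicit message.

    Hypothetical helper: `parser` is any callable taking the raw input;
    it defaults to lxml.html.fromstring, the function from the report.
    """
    # Assumption: whitespace-only input counts as an empty document too.
    if not data or not data.strip():
        raise ValueError("cannot parse an empty document")
    if parser is None:
        # Imported lazily so the guard itself does not depend on lxml.
        from lxml.html import fromstring as parser
    return parser(data)
```

With this, parse_html_or_fail(b'') raises ValueError with a fixed message on every call, whatever was parsed before it.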

scoder (scoder) wrote :

The problem is that the parser fails without providing an error message, so the last error message of a different run "leaks" into the next parser run. It's more like two problems in one, actually.
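
To make the two problems concrete, here is a toy model in plain Python. It is not lxml's code: ToyParser, ToyParseError, and the reset_log flag are invented for illustration, and it assumes the successful parse of b'<x/>' is what left "Tag x invalid" behind as a recoverable complaint.

```python
class ToyParseError(Exception):
    """Stand-in for XMLSyntaxError in this toy model."""


class ToyParser:
    """Toy model of a parser whose error state outlives a single run."""

    def __init__(self):
        self.last_error = None  # shared between parse() calls

    def parse(self, data, reset_log=False):
        if reset_log:
            # Clearing the state per run stops the leak (problem 2).
            self.last_error = None
        if not data:
            # Problem 1: the failure records no message of its own,
            # so whatever is left in last_error gets reported.
            raise ToyParseError(str(self.last_error))
        if data == b"<x/>":
            # Assumed: parsing succeeds but logs a recoverable complaint.
            self.last_error = "line 1: Tag x invalid"
        return data
```

In this model the first parse(b'') reports "None", and the same call after parse(b'<x/>') reports "line 1: Tag x invalid", which mirrors the session in the report. Passing reset_log=True only brings back "None"; the empty-document case still needs a message of its own.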

Changed in lxml:
status: New → Triaged
status: Triaged → Confirmed
importance: Undecided → Low