empty document causes strange parse error (memory pointer issue?)

Bug #761215 reported by James William Pye
This bug affects 1 person
Affects: lxml
Status: Confirmed
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

b*jwp@torch:clients 0$ python lxmlv.py
Python : sys.version_info(major=2, minor=7, micro=1, releaselevel='final', serial=0)
lxml.etree : (2, 3, -99, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 3)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 24)

Python 2.7.1 (r271:86832, Jan 19 2011, 15:23:13)
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.html
>>> lxml.html
<module 'lxml.html' from '/pluto/python/lib/python2.7/site-packages/lxml/html/__init__.py'>
>>> lxml.html.fromstring
<function fromstring at 0x1005e36e0>
>>> fs=lxml.html.fromstring
>>> fs
<function fromstring at 0x1005e36e0>
>>> fs(b'')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/pluto/python/lib/python2.7/site-packages/lxml/html/__init__.py", line 634, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/pluto/python/lib/python2.7/site-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2740, in lxml.etree.fromstring (src/lxml/lxml.etree.c:52793)
  File "parser.pxi", line 1556, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79602)
  File "parser.pxi", line 1435, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78449)
  File "parser.pxi", line 943, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75099)
  File "parser.pxi", line 547, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71467)
  File "parser.pxi", line 628, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72340)
  File "parser.pxi", line 579, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71851)
lxml.etree.XMLSyntaxError: None
>>> fs(b'<x/>')
<Element x at 0x1005ced10>
>>> fs(b'')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/pluto/python/lib/python2.7/site-packages/lxml/html/__init__.py", line 634, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/pluto/python/lib/python2.7/site-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2740, in lxml.etree.fromstring (src/lxml/lxml.etree.c:52793)
  File "parser.pxi", line 1556, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79602)
  File "parser.pxi", line 1435, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78449)
  File "parser.pxi", line 943, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75099)
  File "parser.pxi", line 547, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71467)
  File "parser.pxi", line 628, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72340)
  File "parser.pxi", line 577, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71815)
lxml.etree.XMLSyntaxError: line 1: Tag x invalid

Olli Pottonen (olli-pottonen) wrote :

Is this really a bug? An XML document must contain a root node, and I suppose an HTML document must as well, so an empty string is not a valid XML/HTML document. Did I miss something?

Of course, you can argue that an empty document should cause an informative, useful error, not a strange one.
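
Until the library reports this case clearly, a caller could guard against it. The following is a minimal sketch, not part of lxml: the helper name `parse_html_or_fail` is hypothetical, and treating whitespace-only input as empty is an assumption of this sketch. The parse function is passed in as an argument, so `lxml.html.fromstring` can be supplied by the caller.

```python
def parse_html_or_fail(data, parse):
    """Reject empty input with a clear message, then delegate to `parse`.

    `parse` is any callable that accepts the raw document, for example
    lxml.html.fromstring. Both the helper name and the whitespace rule
    are assumptions of this sketch, not lxml behaviour.
    """
    # bytes and str both support strip(), so either input type works.
    if data is None or not data.strip():
        raise ValueError("cannot parse an empty document")
    return parse(data)
```

With lxml installed, usage would look like `parse_html_or_fail(b'', lxml.html.fromstring)`, which raises `ValueError` before the parser is ever called.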

scoder (scoder) wrote :

The problem is that the parser fails without providing an error message, so the last error message of a different run "leaks" into the next parser run. It's more like two problems in one, actually.
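
The leak can be illustrated with a toy model. This is only a sketch of the mechanism described above, not lxml's actual implementation: `ToyParser`, `ToyParseError`, and the `reset_log` flag are hypothetical names, and the assumption that the successful `<x/>` parse is what recorded "Tag x invalid" is inferred from the transcript in the bug description.

```python
class ToyParseError(Exception):
    """Stand-in for a parser's syntax error (hypothetical)."""


class ToyParser:
    """Toy parser whose error log outlives a single parse run."""

    def __init__(self, reset_log=False):
        self.error_log = []          # shared by every run of this parser
        self.reset_log = reset_log   # True models clearing the log per run

    def parse(self, data):
        if self.reset_log:
            self.error_log.clear()
        if data == b"":
            # The failure itself records no message, so whatever is
            # already in the log (possibly from an earlier run) is used.
            last = self.error_log[-1] if self.error_log else None
            raise ToyParseError(last)
        if data == b"<x/>":
            # Assumed: a recoverable warning is logged, parsing succeeds.
            self.error_log.append("line 1: Tag x invalid")
        return "parsed"


def run(parser):
    """Replay the session from the report: empty, <x/>, empty."""
    messages = []
    for doc in (b"", b"<x/>", b""):
        try:
            parser.parse(doc)
        except ToyParseError as exc:
            messages.append(str(exc))
    return messages


leaky = run(ToyParser())
fixed = run(ToyParser(reset_log=True))
```

In this model `leaky` reproduces the two messages from the report (`None`, then the stale `Tag x invalid`), while `fixed` yields `None` both times. Clearing the log removes the leak but still leaves the second problem, the missing message for an empty document.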

Changed in lxml:
status: New → Triaged
status: Triaged → Confirmed
importance: Undecided → Low