HTML parser errors on leading whitespace

Bug #690319 reported by Rich Schumacher on 2010-12-14
This bug affects 2 people
Affects: lxml
Status: Fix Released
Importance: Medium
Assigned to: scoder
Milestone: 3.2

Bug Description

lxml.html.fromstring() throws a TypeError when parsing HTML with leading whitespace.

Test script:
import traceback
import lxml.html

html = """

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema"/>
<head>
    <title>test</title>
</head>
"""

try:
    lxml.html.fromstring(html)
except Exception:
    traceback.print_exc()
else:
    print("Parsed successfully")

That will result in the following output:
Traceback (most recent call last):
  File "parse.py", line 14, in <module>
    lxml.html.fromstring(html)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/lxml/html/__init__.py", line 636, in fromstring
    if (len(body) == 1 and (not body.text or not body.text.strip())
TypeError: object of type 'NoneType' has no len()

I believe the fix should be as easy as calling strip() on the input, as the following line works:
lxml.html.fromstring(html.strip())
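
For code that has to run on an affected lxml release, the workaround above could be wrapped in a small caller-side helper. This is only a sketch: `parse_html` and its `parser` parameter are hypothetical names introduced here, not part of lxml, and it says nothing about how the library itself should handle the input.

```python
def parse_html(markup, parser=None):
    """Strip surrounding whitespace, then hand the markup to a parser.

    `parse_html` and `parser` are hypothetical names for this sketch.
    By default the markup goes to lxml.html.fromstring, as in the report.
    """
    if parser is None:
        # Imported lazily so the helper can be exercised with a stub parser.
        import lxml.html
        parser = lxml.html.fromstring
    # str.strip() and bytes.strip() both drop leading and trailing whitespace.
    return parser(markup.strip())
```

Called as `parse_html(html)`, this is equivalent to the `lxml.html.fromstring(html.strip())` line above.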

Version information:
Python : sys.version_info(major=2, minor=7, micro=1, releaselevel='final', serial=0)
lxml.etree : (2, 2, 8, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

scoder (scoder) wrote :

Thanks for the report. I don't think that calling .strip() is the right fix; this needs some more investigation (as in "why does it hold a None value anyway?").
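
The traceback shows `len(body)` being called while `body` is `None`. As a generic illustration of that failure mode (not lxml's actual code, and not the eventual fix), the sketch below uses the standard library's `xml.etree.ElementTree`, where `find()` returns `None` when no matching child exists. `body_has_single_child` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def body_has_single_child(root):
    """Return True if root has a <body> child with exactly one child element."""
    body = root.find("body")  # None when there is no <body> child
    if body is None:
        # Calling len(body) here would raise
        # "TypeError: object of type 'NoneType' has no len()".
        return False
    return len(body) == 1
```

A guard like this only avoids the crash; it does not explain why the parsed document ends up without a body element in the first place, which is the open question here.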

Changed in lxml:
status: New → Triaged
scoder (scoder) wrote :

Crash is fixed here:

https://github.com/lxml/lxml/commit/7b7958e175f0218cea58d4f42644f8ee07437f2e

The handling of whitespace at the beginning of input data should still be improved.

Changed in lxml:
assignee: nobody → Stefan Seelmann (2-ubuntu-d)
importance: Undecided → Medium
scoder (scoder) wrote :
Changed in lxml:
assignee: Stefan Seelmann (2-ubuntu-d) → scoder (scoder)
status: Triaged → Fix Committed
scoder (scoder) wrote :

Fixed in lxml 3.2.0.
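
Callers that must support both older and newer lxml releases might gate a workaround on the installed version. A minimal sketch, assuming the version is available as a tuple such as `lxml.etree.LXML_VERSION` (the `(2, 2, 8, 0)` value in the version information above); `needs_whitespace_workaround` is a hypothetical name, and the 3.2.0 cutoff is taken from the statement above.

```python
def needs_whitespace_workaround(lxml_version):
    """Return True for lxml versions older than 3.2.0.

    lxml_version is a version tuple, e.g. (2, 2, 8, 0).
    Only major, minor and micro are compared.
    """
    return tuple(lxml_version[:3]) < (3, 2, 0)
```

For example, `needs_whitespace_workaround((2, 2, 8, 0))` is true for the version in this report.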

Changed in lxml:
status: Fix Committed → Fix Released
scoder (scoder) on 2013-04-28
Changed in lxml:
milestone: none → 3.2