HTML parser errors on leading whitespace

Bug #690319 reported by Rich Schumacher
This bug affects 2 people
Affects: lxml
Status: Fix Released
Importance: Medium
Assigned to: scoder

Bug Description

lxml.html.fromstring() throws a TypeError when parsing HTML with leading whitespace.

Test script:
import traceback
import lxml.html

html = """

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema"/>
<head>
    <title>test</title>
</head>
"""

try:
    lxml.html.fromstring(html)
except Exception:
    traceback.print_exc()
else:
    print("Parsed successfully")

That will result in the following output:
Traceback (most recent call last):
  File "parse.py", line 14, in <module>
    lxml.html.fromstring(html)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/lxml/html/__init__.py", line 636, in fromstring
    if (len(body) == 1 and (not body.text or not body.text.strip())
TypeError: object of type 'NoneType' has no len()

I believe the fix should be as easy as calling strip() on the input, as the following line works:
lxml.html.fromstring(html.strip())
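
A minimal sketch of that workaround as a reusable helper. The name `parse_stripped` and its `parse` parameter are hypothetical, introduced only for illustration; in practice `parse` would be `lxml.html.fromstring`. It uses `lstrip()` rather than `strip()` because only the leading whitespace matters here, and it does nothing about the underlying cause of the error.

```python
def parse_stripped(markup, parse):
    """Parse markup after removing leading whitespace.

    `parse` is any callable that accepts markup, for example
    lxml.html.fromstring. It is passed in by the caller so that
    this sketch does not import lxml itself.
    """
    # lstrip() works on both str and bytes input.
    return parse(markup.lstrip())


# Usage (assumes lxml is installed):
#   import lxml.html
#   root = parse_stripped(html, lxml.html.fromstring)
```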

Version information:
Python : sys.version_info(major=2, minor=7, micro=1, releaselevel='final', serial=0)
lxml.etree : (2, 2, 8, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

scoder (scoder) wrote :

Thanks for the report. I don't think that calling .strip() is the right fix; this needs some more investigation (as in "why does it hold a None value anyway?").
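
For illustration only, the failure mode in the traceback (calling len() on a lookup result that turned out to be None) can be reproduced with the standard library's ElementTree. This is not lxml's internal code, and the actual cause and fix inside lxml.html may differ; the helper `has_single_child` is a hypothetical name for a guarded version of the check.

```python
import xml.etree.ElementTree as ET

# A document with no <body> element, loosely analogous to the report.
doc = ET.fromstring("<html><head><title>test</title></head></html>")

# find() returns None when no matching child exists.
body = doc.find("body")

try:
    len(body)
except TypeError as exc:
    # Same class of error as in the traceback above.
    message = str(exc)


def has_single_child(element):
    """Guarded check: a missing element never counts as a single child."""
    return element is not None and len(element) == 1
```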

Changed in lxml:
status: New → Triaged

scoder (scoder) wrote :

Crash is fixed here:

https://github.com/lxml/lxml/commit/7b7958e175f0218cea58d4f42644f8ee07437f2e

The handling of whitespace at the beginning of input data should still be improved.

Changed in lxml:
assignee: nobody → Stefan Seelmann (2-ubuntu-d)
importance: Undecided → Medium

scoder (scoder) wrote :
Changed in lxml:
assignee: Stefan Seelmann (2-ubuntu-d) → scoder (scoder)
status: Triaged → Fix Committed

scoder (scoder) wrote :

Fixed in lxml 3.2.0.
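
Code that must also run against older releases could apply the strip workaround only when needed. The sketch below assumes the version tuple comes from `lxml.etree.LXML_VERSION` (the same value shown under "Version information" above); `FIXED_IN` and `needs_workaround` are hypothetical names used for illustration.

```python
FIXED_IN = (3, 2, 0)  # release named in this report


def needs_workaround(lxml_version):
    """Return True if the given lxml version tuple predates the fix."""
    return tuple(lxml_version) < FIXED_IN


# Usage (assumes lxml is installed):
#   from lxml import etree
#   if needs_workaround(etree.LXML_VERSION):
#       html = html.lstrip()
```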

Changed in lxml:
status: Fix Committed → Fix Released
scoder (scoder)
Changed in lxml:
milestone: none → 3.2