HTML parser errors on leading whitespace

Bug #690319 reported by Rich Schumacher
This bug affects 2 people
Affects: lxml
Status: Fix Released
Importance: Medium
Assigned to: scoder (scoder)
Milestone: 3.2

Bug Description

lxml.html.fromstring() throws a TypeError when parsing HTML with leading whitespace.

Test script:
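import traceback
import lxml.html

# Input with blank lines before the DOCTYPE
html = """

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "">
<html xmlns:fb="" xmlns:og=""/>
"""

try:
    lxml.html.fromstring(html)
    print("Parsed successfully")
except Exception:
    # Print the full traceback instead of hiding the failure
    traceback.print_exc()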

That will result in the following output:
Traceback (most recent call last):
  File "", line 14, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/lxml/html/", line 636, in fromstring
    if (len(body) == 1 and (not body.text or not body.text.strip())
TypeError: object of type 'NoneType' has no len()
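
The last frame calls len() on body, so body must have been None at that point. As a rough illustration only (using the standard library's xml.etree.ElementTree as a stand-in, not lxml's actual source, and a hypothetical helper name), an element lookup that finds nothing returns None, and an explicit guard avoids the TypeError:

```python
import xml.etree.ElementTree as ET


def body_child_count(root):
    """Return the number of children of <body>, or None if there is no <body>.

    Hypothetical helper for illustration; this is not lxml's code.
    """
    body = root.find("body")  # find() returns None when nothing matches
    if body is None:
        # Without this guard, len(body) would raise:
        # TypeError: object of type 'NoneType' has no len()
        return None
    return len(body)


with_body = ET.fromstring("<html><body><p>hi</p></body></html>")
without_body = ET.fromstring("<html/>")

print(body_child_count(with_body))     # 1
print(body_child_count(without_body))  # None
```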

I believe the fix should be as easy as calling strip() on the input, as the following line works:

lxml.html.fromstring(html.strip())

Version information:
Python : sys.version_info(major=2, minor=7, micro=1, releaselevel='final', serial=0)
lxml.etree : (2, 2, 8, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

scoder (scoder) wrote :

Thanks for the report. I don't think that calling .strip() is the right fix; this needs some more investigation (as in "why does it hold a None value anyway?").

Changed in lxml:
status: New → Triaged
scoder (scoder) wrote :

Crash is fixed here:

The handling of whitespace at the beginning of input data should still be improved.
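
Until the library handles this itself, a caller can remove the leading whitespace before parsing, which the reporter found avoids the error. A minimal sketch of that workaround, assuming leading whitespace is the only trigger; the helper names are hypothetical, and this is a caller-side stopgap rather than the upstream fix:

```python
def strip_leading_whitespace(html):
    """Drop leading whitespace from str or bytes input (hypothetical helper)."""
    return html.lstrip()


def fromstring_lenient(html):
    """Parse HTML after removing leading whitespace.

    Caller-side workaround sketch, not the change made in lxml itself.
    """
    import lxml.html  # imported lazily so the helper above has no dependency

    return lxml.html.fromstring(strip_leading_whitespace(html))
```

Only the leading side is stripped, so trailing content is passed to the parser unchanged.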

Changed in lxml:
assignee: nobody → Stefan Seelmann (2-ubuntu-d)
importance: Undecided → Medium
scoder (scoder) wrote :
Changed in lxml:
assignee: Stefan Seelmann (2-ubuntu-d) → scoder (scoder)
status: Triaged → Fix Committed
scoder (scoder) wrote :

Fixed in lxml 3.2.0.

Changed in lxml:
status: Fix Committed → Fix Released
scoder (scoder)
Changed in lxml:
milestone: none → 3.2
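
To tell whether an installation already includes the fix, its version tuple can be compared against the release named in this report. A small sketch with a hypothetical helper name, using the same tuple format as the "Version information" listing:

```python
FIXED_IN = (3, 2, 0)  # release named in this report as containing the fix


def has_fix(version):
    """Return True if a version tuple is at or past the fixed release."""
    return tuple(version[:3]) >= FIXED_IN


print(has_fix((2, 2, 8, 0)))  # False (the version from the report)
print(has_fix((3, 2, 0, 0)))  # True
```

With lxml installed, lxml.etree.LXML_VERSION provides a tuple of this shape.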