lxml

HTML parser errors on leading whitespace

Bug #690319 reported by Rich Schumacher on 2010-12-14

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	lxml	Fix Released	Medium	scoder	lxml 3.2

Bug Description

lxml.html.fromstring() throws a TypeError when parsing HTML with leading whitespace.

Test script:
import traceback
import lxml.html

html = """

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema"/>
<head>
<title>test</title>
</head>
"""

try:
    lxml.html.fromstring(html)
except Exception:
    # print_exc() reads the active exception itself; its first argument is a frame limit
    traceback.print_exc()
else:
    print("Parsed successfully")

That will result in the following output:
Traceback (most recent call last):
  File "parse.py", line 14, in <module>
    lxml.html.fromstring(html)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/lxml/html/__init__.py", line 636, in fromstring
    if (len(body) == 1 and (not body.text or not body.text.strip())
TypeError: object of type 'NoneType' has no len()

I believe the fix should be as easy as calling strip() on the input, as the following line works:
lxml.html.fromstring(html.strip())
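
A minimal sketch of that caller-side workaround for affected versions. The helper name fromstring_stripped and its parse parameter are hypothetical, introduced only for illustration; this is not part of lxml and not a proposed patch:

```python
def fromstring_stripped(html, parse=None):
    """Hypothetical helper: strip surrounding whitespace, then parse.

    `parse` defaults to lxml.html.fromstring; it is a parameter only so
    the stripping step can be exercised without lxml installed.
    """
    if parse is None:
        import lxml.html  # imported lazily, only when the default parser is needed
        parse = lxml.html.fromstring
    # str.strip() and bytes.strip() both remove leading and trailing whitespace
    return parse(html.strip())
```

With the test script above, fromstring_stripped(html) would make the same call as lxml.html.fromstring(html.strip()).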

Version information:
Python : sys.version_info(major=2, minor=7, micro=1, releaselevel='final', serial=0)
lxml.etree : (2, 2, 8, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

scoder (scoder) wrote on 2012-09-29:

Thanks for the report. I don't think that calling .strip() is the right fix; this needs some more investigation (as in "why does it hold a None value anyway?").

Changed in lxml:
status:	New → Triaged

scoder (scoder) wrote on 2013-04-28:

Crash is fixed here:

https://github.com/lxml/lxml/commit/7b7958e175f0218cea58d4f42644f8ee07437f2e

The handling of whitespace at the beginning of input data should still be improved.

Changed in lxml:
assignee:	nobody → Stefan Seelmann (2-ubuntu-d)
importance:	Undecided → Medium

scoder (scoder) wrote on 2013-04-28:

Whitespace handling is fixed here:

https://github.com/lxml/lxml/commit/aa847aee79d2b8889688ef163e3c36228191ba64

Changed in lxml:
assignee:	Stefan Seelmann (2-ubuntu-d) → scoder (scoder)
status:	Triaged → Fix Committed

scoder (scoder) wrote on 2013-04-28:

Fixed in lxml 3.2.0.
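
To check whether an installation already includes this release, one option is to compare the version tuple that lxml exposes as lxml.etree.LXML_VERSION (the same four-part tuple shown under "Version information" above). A minimal sketch; both function names are hypothetical:

```python
def has_whitespace_fix(version):
    """Return True if a version tuple such as (2, 2, 8, 0) is at least 3.2.0."""
    # Compare only major, minor and micro; the fourth element is ignored
    return tuple(version[:3]) >= (3, 2, 0)


def installed_lxml_has_fix():
    """Check the installed lxml, assuming it can be imported."""
    from lxml import etree  # imported lazily so the comparison above works standalone
    return has_whitespace_fix(etree.LXML_VERSION)
```

For the version in the original report, has_whitespace_fix((2, 2, 8, 0)) returns False.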

Changed in lxml:
status:	Fix Committed → Fix Released

scoder (scoder) on 2013-04-28

Changed in lxml:
milestone:	none → 3.2


Duplicates of this bug

Bug #686808
