HTML parser errors on leading whitespace
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
lxml | Fix Released | Medium | scoder | | |
Bug Description
lxml.html.
Test script:
import traceback
import lxml.html
html = """
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://
<html xmlns:fb="http://
<head>
<title>
</head>
"""
try:
    lxml.
except Exception, e:
    traceback.
else:
    print "Parsed successfully"
That will result in the following output:
Traceback (most recent call last):
  File "parse.py", line 14, in <module>
    lxml.
  File "/opt/local/
    if (len(body) == 1 and (not body.text or not body.text.strip())
TypeError: object of type 'NoneType' has no len()
I believe the fix should be as easy as calling strip() on the input, as the following line works:
lxml.html.
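A minimal sketch of that workaround, assuming the parsing call is passed in as a callable. The exact `lxml.html.` call is truncated in this report and is not reproduced here; `parse_stripped` and its `parse` parameter are hypothetical names for illustration, not part of lxml.

```python
def parse_stripped(markup, parse):
    """Strip surrounding whitespace from markup, then hand it to parse.

    `parse` is any callable that takes a string, for example a parsing
    function from lxml.html (an assumption: the report's exact call is
    truncated, so it is not named here).
    """
    return parse(markup.strip())


if __name__ == "__main__":
    # Stand-in parser so the sketch runs without lxml installed:
    # it simply returns the first character it receives.
    first_char = parse_stripped("\n<!DOCTYPE HTML>\n<html></html>\n",
                                lambda s: s[0])
    print(first_char)
```

This only sidesteps the symptom on the caller's side; it does not explain why the parser ends up without a usable body element.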
Version information:
Python : sys.version_
lxml.etree : (2, 2, 8, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
Changed in lxml:
milestone: none → 3.2
Thanks for the report. I don't think that calling .strip() is the right fix; this needs some more investigation (as in "why does it hold a None value anyway?").
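The traceback shows `len(body)` being evaluated while `body` is `None`. The sketch below reproduces that failure mode in isolation and shows what a guard around the visible part of the condition could look like. `Node` and `single_child_no_text` are hypothetical stand-ins, not lxml's code and not the fix that was eventually released.

```python
class Node(list):
    """Hypothetical stand-in for an element: a list of children plus text."""

    def __init__(self, children=(), text=None):
        super(Node, self).__init__(children)
        self.text = text


def single_child_no_text(body):
    """Mirror the visible part of the traceback's condition, with a None guard."""
    if body is None:
        # No body element was found. Without this guard, len(body) raises
        # TypeError: object of type 'NoneType' has no len()
        return False
    return len(body) == 1 and (not body.text or not body.text.strip())
```

A guard like this avoids the `TypeError`, but it still leaves open the question raised above: why no body element is present for input with leading whitespace.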