improve support for whitespace normalisation and pretty-printing of HTML
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Confirmed | Wishlist | Unassigned |
Bug Description
```
from lxml import etree, html
s = '''<div>
<span>text</span>
</div>'''
tree1 = etree.fromstring(s, parser=
tree2 = html.fromstring(s, parser=
print etree.tostring(
print etree.tostring(
```
I would have expected the result to be the same in both cases: the original string with all whitespace stripped (so that it can be pretty-printed correctly).
Instead, the result is as expected for etree.fromstring:
<div>
but not for html.fromstring:
<div>
<span>
</div>
Python : (2, 6, 8, 'final', 0)
lxml.etree : (3, 2, 5, 0)
libxml used : (2, 8, 0)
libxml compiled : (2, 8, 0)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)
This is expected. HTML allows text content inside <div> tags, so letting the parser discard it would alter the content. Essentially, when you request blank text removal for XML, you either provide a DTD that defines where whitespace is relevant and where it can safely be discarded, or you get an automatic heuristic. The HTML parser is smarter in that it knows where whitespace is relevant, so it only discards it where it isn't (the key term here is "ignorable whitespace"). Observe:
>>> et.tostring( et.fromstring( '<html> \n\n \n<body>\n \n <table> \n <tr> \n \n <td>\n \n </td> \n </tr> \n \n </table> \n </body> \n\n\n </html>', p))
'<html><body>\n \n <table><tr><td>\n \n </td> \n </tr></table></body></html>'
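The difference between the two parsers can be reproduced with a minimal sketch along the following lines. It assumes that blank text removal was requested through the `remove_blank_text=True` parser option on both parsers (the parser arguments are cut off in the report above, so this is a guess), and it uses Python 3 syntax. The variable names are illustrative only.

```python
from lxml import etree, html

s = '''<div>
<span>text</span>
</div>'''

# Assumption: both parsers are asked to drop blank text nodes.
xml_parser = etree.XMLParser(remove_blank_text=True)
html_parser = html.HTMLParser(remove_blank_text=True)

xml_tree = etree.fromstring(s, parser=xml_parser)
html_tree = html.fromstring(s, parser=html_parser)

# Without a DTD, the XML parser falls back to its heuristic and drops the
# whitespace-only text nodes around <span>.
xml_out = etree.tostring(xml_tree)
print(xml_out)  # b'<div><span>text</span></div>'

# The HTML parser only drops whitespace it considers ignorable, so the
# result depends on the element context and on the libxml2 version in use.
html_out = etree.tostring(html_tree)
print(html_out)
```

The HTML output is deliberately left without an expected value, since it is decided by libxml2 rather than by lxml.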
I do admit that the removal may appear somewhat arbitrary in selected spots, but it's generally ok and safe. For example, removing the whitespace in this case would be a bad idea, as it would merge the two words:
<p> <span>some</span> <em>text</em></p>
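To illustrate why those spaces matter, here is a small sketch that runs the snippet through the XML parser with `remove_blank_text=True`. That option is not shown in this comment and is used here as an assumption about how blank text removal was requested. Without a DTD, the XML heuristic treats the whitespace-only nodes as blank, and the two words end up adjacent.

```python
from lxml import etree

snippet = '<p> <span>some</span> <em>text</em></p>'

# Parse once as-is and once with blank text removal requested.
kept = etree.fromstring(snippet)
stripped = etree.fromstring(
    snippet, parser=etree.XMLParser(remove_blank_text=True))

# itertext() yields all text nodes in document order.
kept_text = ''.join(kept.itertext())
stripped_text = ''.join(stripped.itertext())

print(repr(kept_text))      # ' some text'
print(repr(stripped_text))  # 'sometext'
```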
Not a bug. (And if it was a bug, it wouldn't be one in lxml since the parsing is done by libxml2...)