improve support for whitespace normalisation and pretty-printing of HTML

Bug #1285625 reported by Xavier (Open ERP)
This bug affects 1 person
Affects: lxml
Status: Confirmed
Importance: Wishlist
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

```
from lxml import etree, html

s = '''<div>
<span>text</span>
</div>'''

tree1 = etree.fromstring(s, parser=etree.XMLParser(remove_blank_text=True))
tree2 = html.fromstring(s, parser=html.HTMLParser(remove_blank_text=True))

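# print() with a single argument works on both Python 2 and Python 3
print(etree.tostring(tree1))  # XML parser result
print(etree.tostring(tree2))  # HTML parser result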
```
I would have expected the result to be the same in both cases: the original string with all whitespace stripped (so it can be pretty-printed correctly).

Instead, the result is as expected for etree.fromstring:
    <div><span>text</span></div>

but not for html.fromstring:
    <div>
    <span>text</span>
    </div>

Python : (2, 6, 8, 'final', 0)
lxml.etree : (3, 2, 5, 0)
libxml used : (2, 8, 0)
libxml compiled : (2, 8, 0)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

Revision history for this message
scoder (scoder) wrote :

This is expected. HTML allows text content inside of <div> tags, so letting the parser discard it would alter the content. Essentially, when requesting blank text removal for XML, you have to either provide a DTD that defines where whitespace is relevant and where it can safely be discarded, or you will get an automatic heuristic. The HTML parser is smarter in that it knows where whitespace is relevant, so it will only discard it when it isn't (key term here is "ignorable whitespace"). Observe:

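>>> import lxml.etree as et
>>> p = et.HTMLParser(remove_blank_text=True)  # assumed: `p` is not defined in the original comment
>>> et.tostring(et.fromstring('<html>\n\n \n<body>\n \n <table> \n <tr> \n \n <td>\n \n </td> \n </tr> \n \n </table> \n </body> \n\n\n </html>', p))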
'<html><body>\n \n <table><tr><td>\n \n </td> \n </tr></table></body></html>'

I do admit that the removal may appear somewhat arbitrary in selected spots, but it's generally ok and safe. For example, removing the whitespace in this case would be a bad idea, as it would merge the two words:

    <p><span>some</span> <em>text</em></p>

Not a bug. (And if it was a bug, it wouldn't be one in lxml since the parsing is done by libxml2...)
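
To make the point about the `<p>` example concrete, here is a minimal sketch (not part of the original comment) that removes the whitespace-only text after `</span>` by hand and compares the text content before and after. The helper name `demo_inline_whitespace` is made up for illustration.

```python
from lxml import html


def demo_inline_whitespace():
    """Show that the single space between two inline elements is content."""
    p = html.fromstring('<p><span>some</span> <em>text</em></p>')
    before = str(p.text_content())
    # The space lives in the tail of <span>; dropping it merges the words.
    p[0].tail = None
    after = str(p.text_content())
    return before, after


before, after = demo_inline_whitespace()
print(repr(before))  # 'some text'
print(repr(after))   # 'sometext'
```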

Changed in lxml:
status: New → Invalid
Revision history for this message
Xavier (Open ERP) (xmo-deactivatedaccount) wrote :

> The HTML parser is smarter in that it knows where whitespace is relevant, so it will only discard it when it isn't (key term here is "ignorable whitespace").

In that case, surely sending `remove_blank_text` to HTMLParser should be at least a warning if not an error, and pretty printing through html.tostring should be much the same?

Revision history for this message
scoder (scoder) wrote :

> surely sending `remove_blank_text` to HTMLParser should be at least a warning if not an error, and pretty printing through html.tostring should be much the same?

No, why? It's perfectly reasonable to use the HTMLParser with "remove_blank_text=True" (e.g. to save memory) as it requests the removal of whitespace-only sections that do not contribute to the content of the document. Similarly, pretty printing documents should not alter their content, so it only adds (whitespace) text where it does not break anything.

If you have a specific use case where you need a specific way of formatting a document, you are free to implement that. That's so easy that it's not worth making lxml cater to everyone's needs. The FAQ also has a couple of notes on it. I guess that section could use some comments regarding HTML specifically.

Revision history for this message
Xavier (Open ERP) (xmo-deactivatedaccount) wrote :

> No, why? It's perfectly reasonable to use the HTMLParser with "remove_blank_text=True" (e.g. to save memory) as it requests the removal of whitespace-only sections that do not contribute to the content of the document.

Which is a tiny minority of them, and I'd expect they're generally in the least interesting section of a document (outside the document body).

> Similarly, pretty printing documents should not alter their content, so it only adds (whitespace) text where it does not break anything.

From what I can see, the restriction is much tighter than that: adding whitespace where it does not break anything would allow lxml a much greater range of motion, as it could expand or contract just about any existing whitespace sequence in the document. In fact, looking at this very page we're discussing the issue on, even on the few elements where interspersed whitespace is irrelevant it manages to not do much of a job:

* this is the current page, fetched via curl: http://pygments.org/demo/272990/ (it is full of extraneous whitespace and the indentation is a mess, up to and including <head>)

* this is a dump after parsing the page with remove_blank_text: http://pygments.org/demo/272996/ (whitespace has been removed from between <head> elements entirely, and a very limited number of sequences have been removed from the body, some of which seems somewhat incoherent: why between <ul> and <li> but not between </li> and </ul>?)

* this is a dump with pretty_print=True: http://pygments.org/demo/273002/ as far as I can see, this merely added newlines but *no indentation* in (a subset of) the few places where remove_blank_text had previously removed whitespace

Can the third version really be considered "pretty printed" when it makes few things better and many worse, especially in the page body? (compare table#affected-software in the first and third links)

(for reference/comparison, here's "tidy": http://pygments.org/demo/273013/)

Note: I'm not saying it's a big problem, or easy to fix (it definitely is not), but spending an hour trying to understand what mistakes I was making in my invocation of lxml's pretty printing, before realising I hadn't made any mistake and it was just not doing anything of interest, was a bit frustrating.

Revision history for this message
scoder (scoder) wrote :

I agree that the behaviour is not "perfect". However, it's not lxml doing it but libxml2, in both cases: parsing and serialising. And I'm not going to reimplement libxml2's parser or serialiser in lxml just to improve the situation.

If you want to write up some generally usable functions that a) remove all ignorable whitespace from a (parsed) in-memory HTML tree and b) inject indentation and/or c) normalise the whitespace in all possible places where it improves the pretty printing experience when the tree gets serialised, then please do. I'll happily add them as a new feature to lxml.html.

summary: - remove_blank_text has no effect on html.HTMLParser
+ improve support for whitespace normalisation and pretty-printing of HTML
Changed in lxml:
importance: Undecided → Wishlist
status: Invalid → Confirmed