Comment 5 for bug 1285625

Xavier (Open ERP) (xmo-deactivatedaccount) wrote : Re: remove_blank_text has no effect on html.HTMLParser

> No, why? It's perfectly reasonable to use the HTMLParser with "remove_blank_text=True" (e.g. to save memory) as it requests the removal of whitespace-only sections that do not contribute to the content of the document.

Which is a tiny minority of them, and I'd expect they're generally in the least interesting section of a document (outside the document body).

> Similarly, pretty printing documents should not alter their content, so it only adds (whitespace) text where it does not break anything.

From what I can see, the restriction is much tighter than that: adding whitespace wherever it does not break anything would give lxml much more latitude, as it could expand or contract just about any existing whitespace sequence in the document. In fact, looking at the very page we're discussing this issue on, even on the few elements where interspersed whitespace is irrelevant it does not do much of a job:

* this is the current page, fetched via curl: http://pygments.org/demo/272990/ it is full of extraneous whitespace, and the indentation is a mess up to and including <head>

* this is a dump after parsing the page with remove_blank_text: http://pygments.org/demo/272996/ whitespace has been removed from between <head> elements entirely, and a very limited number of sequences have been removed from the body (some of which seems somewhat incoherent: why between <ul> and <li> but not between </li> and </ul>?)

* this is a dump with pretty_print=True: http://pygments.org/demo/273002/ as far as I can see, this merely added newlines but *no indentation* in (a subset of) the few places where remove_blank_text had previously removed whitespace

Can the third version really be considered "pretty printed" when it makes few things better and many worse, especially in the page body? (compare table#affected-software in the first and third links)

(for reference/comparison, here's "tidy": http://pygments.org/demo/273013/)
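For anyone wanting to reproduce the comparison locally, here is a minimal sketch of roughly how the three dumps can be produced. The `SAMPLE` markup and the `dump` helper are made up for illustration (they stand in for the fetched bug page and are not part of lxml), and the exact whitespace you get back may vary with the lxml/libxml2 version:

```python
from lxml import html

# Hypothetical sample markup, standing in for the page fetched with curl.
SAMPLE = """<html>
  <head>
    <title>demo</title>
  </head>
  <body>
    <ul>
      <li>one</li>
      <li>two</li>
    </ul>
  </body>
</html>"""


def dump(source, remove_blank_text=False, pretty_print=False):
    """Parse `source` as HTML and serialise it back to a string.

    `dump` is an illustrative helper, not an lxml function.
    """
    # remove_blank_text is a parser option; pretty_print is a serialiser option.
    parser = html.HTMLParser(remove_blank_text=remove_blank_text)
    root = html.fromstring(source, parser=parser)
    return html.tostring(root, pretty_print=pretty_print, encoding="unicode")


plain = dump(SAMPLE)                                   # first link: as parsed
stripped = dump(SAMPLE, remove_blank_text=True)        # second link
pretty = dump(SAMPLE, remove_blank_text=True, pretty_print=True)  # third link

for label, text in (("plain", plain), ("stripped", stripped), ("pretty", pretty)):
    print("--- %s ---" % label)
    print(text)
```

Diffing the three outputs shows which whitespace-only text nodes the parser actually dropped and where the serialiser was willing to add newlines.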

Note: I'm not saying it's a big problem, or easy to fix (it definitely is not that), but spending an hour trying to understand what mistakes I was making in my invocation of lxml's pretty printing, before realising I hadn't made any mistake and it was just not doing anything of interest, was a bit frustrating.