Bug #1285625 “improve support for whitespace normalisation and p...” : Bugs : lxml

scoder (scoder) wrote on 2014-02-27:

#1

This is expected. HTML allows text content inside of <div> tags, so letting the parser discard it would alter the content. Essentially, when requesting blank text removal for XML, you have to either provide a DTD that defines where whitespace is relevant and where it can safely be discarded, or you will get an automatic heuristic. The HTML parser is smarter in that it knows where whitespace is relevant, so it will only discard it when it isn't (key term here is "ignorable whitespace"). Observe:

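>>> from lxml import etree as et  # assumed setup, not shown in this comment
>>> p = et.HTMLParser(remove_blank_text=True)  # assumed definition of p
>>> et.tostring(et.fromstring('<html>\n\n \n<body>\n \n <table> \n <tr> \n \n <td>\n \n </td> \n </tr> \n \n </table> \n </body> \n\n\n </html>', p))
'<html><body>\n \n <table><tr><td>\n \n </td> \n </tr></table></body></html>'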
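To compare the XML heuristic with the HTML parser's behaviour side by side, here is a minimal sketch. The sample markup and the `roundtrip` helper are made up for illustration, and the exact HTML output may differ between libxml2 versions, so it is printed rather than stated.

```python
from lxml import etree

# Hypothetical sample documents, not taken from the bug report.
XML_SOURCE = "<root>\n  <a>x</a>\n  <b> </b>\n</root>"
HTML_SOURCE = (
    "<html>\n<head>\n<title>t</title>\n</head>\n"
    "<body>\n<div> </div>\n</body>\n</html>"
)


def roundtrip(source, parser):
    """Parse `source` with `parser` and serialise the tree back to bytes."""
    return etree.tostring(etree.fromstring(source, parser))


# XML without a DTD: libxml2 falls back to its heuristic.
xml_out = roundtrip(XML_SOURCE, etree.XMLParser(remove_blank_text=True))

# HTML: only whitespace the parser considers ignorable is dropped.
html_kept = roundtrip(HTML_SOURCE, etree.HTMLParser())
html_stripped = roundtrip(HTML_SOURCE, etree.HTMLParser(remove_blank_text=True))

print(xml_out)
print(html_kept)
print(html_stripped)
```

Comparing `html_kept` with `html_stripped` shows which whitespace-only sections the HTML parser actually treats as ignorable for a given input.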

I do admit that the removal may appear somewhat arbitrary in selected spots, but it's generally ok and safe. For example, removing the whitespace in this case would be a bad idea, as it would merge the two words:
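A minimal sketch of the kind of case meant here. The markup is a made-up illustration (an assumption, not necessarily the example originally posted): the only thing separating the two words is a whitespace-only text node between two inline elements.

```python
from lxml import etree

# Hypothetical markup: a single space sits between </b> and <i>.
SOURCE = "<p><b>hello</b> <i>world</i></p>"

parser = etree.HTMLParser(remove_blank_text=True)
root = etree.fromstring(SOURCE, parser)

# Concatenate all text in document order to see whether the space survived.
text = "".join(root.itertext())
print(repr(text))
```

If that space were discarded, the text would read as one word, which is why the parser keeps it even with `remove_blank_text=True`.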

Not a bug. (And if it was a bug, it wouldn't be one in lxml since the parsing is done by libxml2...)

Changed in lxml:
status:	New → Invalid

Xavier (Open ERP) (xmo-deactivatedaccount) wrote on 2014-02-28:

#2

> The HTML parser is smarter in that it knows where whitespace is relevant, so it will only discard it when it isn't (key term here is "ignorable whitespace").

In that case, surely sending `remove_blank_text` to HTMLParser should be at least a warning if not an error, and pretty printing through html.tostring should be much the same?

scoder (scoder) wrote on 2014-02-28:

#3

> surely sending `remove_blank_text` to HTMLParser should be at least a warning if not an error, and pretty printing through html.tostring should be much the same?

No, why? It's perfectly reasonable to use the HTMLParser with "remove_blank_text=True" (e.g. to save memory) as it requests the removal of whitespace-only sections that do not contribute to the content of the document. Similarly, pretty printing documents should not alter their content, so it only adds (whitespace) text where it does not break anything.

If you have a specific use case where you need a specific way of formatting a document, you are free to implement that. That's so easy that it's not worth making lxml cater to everyone's needs. The FAQ also has a couple of notes on it. I guess that section could use some comments regarding HTML specifically.

scoder (scoder) wrote on 2014-02-28:

#4

Doc updates:

https://github.com/lxml/lxml/commit/6ce136f18edd678d8cda00d1731f36508c2b106b

https://github.com/lxml/lxml/commit/b8269b96f197774f568030fd7b6964e428285a8e

Xavier (Open ERP) (xmo-deactivatedaccount) wrote on 2014-02-28:

#5

> No, why? It's perfectly reasonable to use the HTMLParser with "remove_blank_text=True" (e.g. to save memory) as it requests the removal of whitespace-only sections that do not contribute to the content of the document.

Those are a small minority of the whitespace-only sections, and I'd expect they are generally in the least interesting part of a document (outside the document body).

> Similarly, pretty printing documents should not alter their content, so it only adds (whitespace) text where it does not break anything.

From what I can see, the restriction is much tighter than that: adding whitespace wherever it does not break anything would give lxml far more room, as it could expand or contract just about any existing whitespace sequence in the document. In fact, looking at the very page we're discussing this issue on, it doesn't do much of a job even on the few elements where interspersed whitespace is irrelevant:

* this is the current page, fetched via curl: http://pygments.org/demo/272990/ (it is full of extraneous whitespace, and the indentation is a mess up to and including <head>)

* this is a dump after parsing the page with remove_blank_text: http://pygments.org/demo/272996/ (whitespace has been removed from between <head> elements entirely, and a very limited number of sequences have been removed from the body, some of which seems somewhat incoherent: why between <ul> and <li> but not between </li> and </ul>?)

* this is a dump with pretty_print=True: http://pygments.org/demo/273002/ (as far as I can see, this merely added newlines, but *no indentation*, in a subset of the few places where remove_blank_text had previously removed whitespace)
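The comparison in the list above can be reproduced locally along these lines. This is a sketch with a small made-up document standing in for the fetched page; the outputs are printed rather than stated, since the exact serialisation depends on the libxml2 version.

```python
from lxml import html

# Hypothetical stand-in for the downloaded page.
SOURCE = (
    "<html><head><title>t</title></head>"
    "<body><ul><li>one</li><li>two</li></ul></body></html>"
)

# Parse with blank-text removal, as in the second dump.
parser = html.HTMLParser(remove_blank_text=True)
doc = html.fromstring(SOURCE, parser=parser)

# Serialise the same tree with and without pretty printing.
plain = html.tostring(doc, encoding="unicode")
pretty = html.tostring(doc, encoding="unicode", pretty_print=True)

print(plain)
print(pretty)
```

Diffing `plain` against `pretty` makes it easy to see exactly where the serialiser was willing to add whitespace.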

Can the third version really be considered "pretty printed" when it makes few things better and many worse, especially in the page body? (compare table#affected-software in the first and third links)

(for reference/comparison, here's "tidy": http://pygments.org/demo/273013/)

Note: I'm not saying it's a big problem, or an easy one (it definitely is not that), but it was a bit frustrating to spend an hour trying to understand what mistakes I was making in my invocation of lxml's pretty printing before realising that I hadn't made any mistake and it was just not doing anything of interest.