lxml

Bug #1285625
Comment #1

scoder (scoder) wrote on 2014-02-27: Re: remove_blank_text has no effect on html.HTMLParser

This is expected. HTML allows text content inside <div> tags, so letting the parser discard whitespace there would alter the content. Essentially, when you request blank text removal for XML, you either have to provide a DTD that defines where whitespace is relevant and where it can safely be discarded, or you get an automatic heuristic. The HTML parser is smarter in that it knows where whitespace is relevant, so it only discards whitespace where it is not (the key term here is "ignorable whitespace"). Observe:

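>>> from lxml import etree as et  # assumption: et is lxml.etree
>>> p = et.HTMLParser(remove_blank_text=True)  # assumption: the parser setup named in the bug title
>>> et.tostring(et.fromstring('<html>\n\n \n<body>\n \n <table> \n <tr> \n \n <td>\n \n </td> \n </tr> \n \n </table> \n </body> \n\n\n </html>', p))
'<html><body>\n \n <table><tr><td>\n \n </td> \n </tr></table></body></html>'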

I do admit that the removal may appear somewhat arbitrary in selected spots, but it's generally ok and safe. For example, removing the whitespace in this case would be a bad idea, as it would merge the two words:
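
A minimal sketch of such a case, using Python's standard-library xml.etree.ElementTree rather than lxml or libxml2. The markup and the strip_blank_text helper are hypothetical; they only illustrate the effect of dropping whitespace between inline elements, not how libxml2 decides which whitespace is ignorable.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two inline elements separated by a single space.
SNIPPET = "<p><b>Hello</b> <i>world</i></p>"


def strip_blank_text(root):
    """Naively drop every whitespace-only text or tail node (illustration only)."""
    for el in root.iter():
        if el.text is not None and not el.text.strip():
            el.text = None
        if el.tail is not None and not el.tail.strip():
            el.tail = None
    return root


original = ET.fromstring(SNIPPET)
stripped = strip_blank_text(ET.fromstring(SNIPPET))

print("".join(original.itertext()))  # Hello world
print("".join(stripped.itertext()))  # Helloworld
```

The single space is stored as the tail of the <b> element. Once it is removed, the rendered text changes, which is why whitespace in such a position cannot be treated as ignorable.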

Not a bug. (And if it was a bug, it wouldn't be one in lxml since the parsing is done by libxml2...)