Beautiful Soup

Bug #1868861
Comment #4

Leonard Richardson (leonardr) wrote on 2020-04-04:

Geoffrey, thanks for taking the time to report this issue.

get_text() is a bit of a blunt instrument, but it does know to filter out comments and other 'strings' that are not generally considered part of a document's text.
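A minimal sketch of that filtering, assuming the html.parser backend and a made-up one-line document. The comment stays in the tree as a Comment object, but get_text() leaves it out:

```python
from bs4 import BeautifulSoup, Comment

# Illustrative markup only; not the document from the bug report.
soup = BeautifulSoup("<p>Hello<!-- hidden --> world</p>", "html.parser")

# get_text() joins the ordinary strings and skips the comment.
text = soup.get_text()

# The comment is still present in the tree and can be found by type.
comments = [s for s in soup.descendants if isinstance(s, Comment)]

print(repr(text))
print(comments)
```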

In the example document you gave, html.parser parses the contents of the <style> tag as an HTML comment, but lxml parses it as a CSS stylesheet that for some reason starts with "<!--". lxml's behavior comes from the HTML5 spec (https://html.spec.whatwg.org/#the-style-element); html.parser doesn't treat the <style> tag differently from any other tag.

So in one sense this is a problem of a difference between parsers. However, the underlying problem is parser-independent. Per the HTML5 spec, nothing inside a <style> tag should be considered "text" -- it's a stylesheet. What the contents of the <style> tag look like is irrelevant. A document that looks like "<style>This is text.</style>" actually contains no text, just a broken stylesheet.
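Until get_text() handles this itself, one possible workaround is to remove <style> elements from the tree before extracting text. This is only a sketch: it uses the existing find_all() and decompose() methods, and the sample markup is invented for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the <style> contents look like prose but are a (broken) stylesheet.
html = (
    "<html><head><style>This is text.</style></head>"
    "<body><p>Real text.</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Remove every <style> element, together with its contents, from the tree.
for style in soup.find_all("style"):
    style.decompose()

text = soup.get_text()
print(repr(text))
```

Note that decompose() modifies the tree, so this only suits callers who no longer need the stylesheet afterwards.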

There are a lot of places where get_text() comes up short, and this is another one. get_text() is not very reliable, but it is very popular, so this is always a dilemma for me.

The best solution may be to parse the contents of the <style> tag as a special type of string which is ignored by get_text() the same way a Comment or ProcessingInstruction is. That way you can find special strings of that type and treat them specially if you need to for any reason; it's not just about hiding that string from get_text().
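A rough sketch of that idea done by hand, not a description of how Beautiful Soup implements it. StylesheetString is a hypothetical class name introduced here, and the sketch assumes get_text() skips NavigableString subclasses it does not recognize, the way it skips Comment:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


class StylesheetString(NavigableString):
    """Hypothetical marker type for strings found inside a <style> tag."""


# Illustrative markup only.
soup = BeautifulSoup(
    "<style>p { color: red }</style><p>Real text.</p>", "html.parser"
)

# Re-wrap each string inside <style> in the marker type.
for style in soup.find_all("style"):
    strings = [d for d in style.descendants if isinstance(d, NavigableString)]
    for s in strings:
        s.replace_with(StylesheetString(str(s)))

# get_text() now passes over the stylesheet ...
text = soup.get_text()

# ... but the stylesheet is still in the tree and can be located by type.
stylesheets = [str(d) for d in soup.descendants if isinstance(d, StylesheetString)]

print(repr(text))
print(stylesheets)
```

Unlike deleting the <style> tag, this keeps the stylesheet available to code that wants to inspect it.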