Beautiful Soup

Bug #1768330
Comment #21

Leonard Richardson (leonardr) wrote on 2024-01-07:

That list of HTML block elements is taken from HTMLTreeBuilder, where there's a comment saying it comes from the HTML spec. But that list must be pretty old because that language is not in the HTML spec anymore. It looks like the concept of elements being intrinsically "block" or "inline" has been replaced by a CSS concept called "formatting context" that I don't currently understand. (https://www.w3.org/TR/CSS2/visuren.html#normal-flow)

So in the worst case, as you say, rendering the content in a CSS-aware way would seem to be necessary to see which text nodes are relevant. That's definitely off the table. As an approximation, something that uses notions from the current HTML spec such as "flow content" and "phrasing content" might work in most situations. (https://html.spec.whatwg.org/#kinds-of-content)
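To make that approximation concrete, here is a minimal sketch, not anything Beautiful Soup does today. It uses the standard library's html.parser rather than Beautiful Soup itself, and the PHRASING set is a deliberately partial, hand-picked subset of the spec's phrasing content list. The class and function names are hypothetical. Text inside phrasing elements stays on the same line as its neighbors; any other element starts a new line.

```python
from html.parser import HTMLParser

# Partial, illustrative subset of the elements the HTML spec classifies as
# phrasing content. This is an assumption for the sketch, not the full list.
PHRASING = frozenset({
    "a", "abbr", "b", "cite", "code", "em", "i", "kbd", "mark", "q",
    "s", "small", "span", "strong", "sub", "sup", "u", "var",
})


class PhrasingAwareTextExtractor(HTMLParser):
    """Hypothetical extractor: non-phrasing tags end the current run of text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._runs = [[]]  # each inner list collects the strings of one run

    def _break(self):
        # Start a new run, but only if the current one has collected anything.
        if self._runs[-1]:
            self._runs.append([])

    def handle_starttag(self, tag, attrs):
        if tag not in PHRASING:
            self._break()

    def handle_endtag(self, tag):
        if tag not in PHRASING:
            self._break()

    def handle_data(self, data):
        self._runs[-1].append(data)

    def text(self):
        # Collapse whitespace inside each run and drop runs that end up empty.
        runs = (" ".join("".join(run).split()) for run in self._runs)
        return "\n".join(run for run in runs if run)


def extract_text(markup):
    parser = PhrasingAwareTextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


print(extract_text("<div><p>Hello <b>bold</b> world</p><p>Second<i>ly</i></p></div>"))
```

This ignores CSS entirely, so an element styled with display: inline or display: block will be misclassified, and it makes no attempt to skip the contents of script or style elements.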

However it ends up working, I would implement the text extraction algorithm as a kind of strategy pattern, similar to the one used to choose the markup parser. I am very interested in providing a way for people to plug in their own text extraction algorithms, and not very interested in supporting any particular algorithm indefinitely.