I've marked bug 1882067 as a duplicate of this issue, although they're not directly related, because I think they come from the same place: a desire to use Beautiful Soup as a text preprocessor that can strip away "useless" markup.
In this case the concern is that some of the "useless" markup isn't so useless -- it conveys conceptual separations that are lost when you just extract all the text. In the case of 1882067, the concern is that some of the *text* is useless -- it's just whitespace and newlines that won't render in a web browser and ought to be collapsed for reading.
The challenge in both cases is distinguishing the "useless" stuff from the "useful" stuff.
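To make the two concerns concrete, here is a minimal sketch. It assumes the `bs4` package is installed and uses the built-in `html.parser` builder. The sample markup and the `collapse_whitespace` helper are made up for illustration and are not part of Beautiful Soup.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample: two paragraphs, one inline tag, one hard line break
# inside the text that a browser would render as a single space.
markup = "<p>One <b>bold</b> word</p><p>Two\n   lines</p>"
soup = BeautifulSoup(markup, "html.parser")

# This issue: the boundary between the two <p> tags disappears.
plain = soup.get_text()
print(repr(plain))      # 'One bold wordTwo\n   lines'

# A separator restores that boundary, but it is inserted between *every*
# string, so the inline <b> tag gets split off as well.
separated = soup.get_text(separator="|")
print(repr(separated))  # 'One |bold| word|Two\n   lines'


def collapse_whitespace(text):
    """Illustrative helper: squeeze each run of whitespace into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Bug 1882067: whitespace that would not render gets collapsed here, but the
# same treatment would also flatten text where whitespace matters (<pre>).
collapsed = collapse_whitespace(plain)
print(repr(collapsed))  # 'One bold wordTwo lines'
```

Neither step knows which tags or which whitespace carry meaning. The separator treats `<b>` the same as `<p>`, and the regular expression treats every run of whitespace the same. That is the distinction that would have to be made somewhere.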