Make it possible to consolidate the parse tree merging adjacent NavigableStrings

Bug #1697296 reported by Akkana Peck on 2017-06-11
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Beautiful Soup
Wishlist
Unassigned

Bug Description

prettify() sometimes inserts incorrect whitespace that isn't actually there in the HTML.

For instance, in this example, which comes from trying to use BS to clean up HTML edited by someone using gmail:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''<h2>Hello <font color="#ff0000">W</font>orld</h2><p>Hi there</p>''', "lxml")
>>> f = soup.find("font")
>>> f.replaceWithChildren()
<font color="#ff0000"></font>
>>>
>>> soup
<html><body><h2>Hello World</h2><p>Hi there</p></body></html>

Looks right, the font tag was removed.

>>> soup.prettify()
u'<html>\n <body>\n <h2>\n Hello\n W\n orld\n </h2>\n <p>\n Hi there\n </p>\n </body>\n</html>'

Oops! Now there's incorrect whitespace between W and orld that wasn't there before prettify().

So using prettify() has changed the way the HTML renders, adding extra whitespace that shouldn't be there, even in the middle of words.

Leonard Richardson (leonardr) wrote :

Thanks for filing this issue. I don't know if I'm going to change this behavior, but I'll explain what you're seeing.

In the original document, there are three relevant elements: the <font> tag, the string "W" beneath the <font> tag, and the string "orld" after the font tag.

After calling replaceWithChildren() there are two relevant elements: the string "W" and the string "orld". Bumping "W" up a level doesn't merge it with "orld": there are still two elements. prettify() separates each element with whitespace, which leads to the seemingly extraneous whitespace you're seeing.

prettify() is designed to make the structure of the document obvious, so given that this is an unexpected result the best solution would be to change the structure rather than hide it. That is, merge "W" and "orld" when the opportunity arises, so that it is as though "World" had been in the original document. This seems reasonable -- this is what an HTML parser does when it encounters "W" and then "orld". But an HTML parser only has to faithfully represent a document as it is served. Beautiful Soup has to allow arbitrary operations on a document, including moving strings around, while avoiding unintended side effects.

I'm concerned about the side effects. "Merge two strings if they ever end up next to each other" is a superficially appealing strategy, it could easily lead to one string picking up other strings as it's moved around a document, creating a huge dustball-like string that wrecks the document. Whereas now the problem is that the "pretty" representation of a document exposes a non-pretty fact about its internal structure.

For this reason I'm inclined not to implement this change, but I'm interested in hearing your thoughts.

summary: - prettify() inserts incorrect whitespace between text nodes
+ Putting two NavigableStrings next to each other doesn't merge them

Thanks for the thoughtful response. I understand your concern about unintended consequences -- you're right, there could be some -- and in writing this response, I keep switching sides and convincing myself one way or the other.

In the end, though, I keep coming back to this: if you have a page showing on your browser, and you read in the source with BS, print it with prettify() and view the result in another browser tab, shouldn't the tabs look identical? If they don't, then prettify becomes merely a debugging tool, but you can't trust it as a prettyprinter for final HTML output since it might change the document.

One other, slightly more technical argument: whitespace nodes, even if they're not tags, are part of the structure of a document as far as browsers are concerned -- at least, in the gecko layout/DOM engine, the only browser code I'm familiar with, whitespace nodes are nodes inline with text nodes and treated similarly to them. Of course, the whole point of prettify() is to add whitespace, so it will necessarily change that aspect of the document's structure which maybe makes that argument moot.

I do understand your reservations, though. And probably what I should do in the long run is take the quickie prettyprinter I hacked up when I first encountered this bug, clean it up and use it as a module so I don't need to worry about prettify(). I prefer not to have newlines around inline nodes like <b> anyway, so it probably makes the most sense to have my own customizable version.

Leonard Richardson (leonardr) wrote :

I think this is a reasonable feature request if it's presented as something you can explicitly do to 'smooth out' a parse tree rather than as something that's supposed to happen automatically.

summary: - Putting two NavigableStrings next to each other doesn't merge them
+ Make it possible to consolidate the parse tree merging adjacent
+ NavigableStrings
tags: added: feature
Changed in beautifulsoup:
status: New → Confirmed
Changed in beautifulsoup:
importance: Undecided → Wishlist
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers