html5lib linkage issue

Bug #1809910 reported by Isaac Muse
This bug affects 1 person
Affects         Status        Importance  Assigned to  Milestone
Beautiful Soup  Fix Released  Undecided   Unassigned

Bug Description

This html5lib linkage issue was found during testing. A merge request is already open to fix it, but I wanted to provide a simplified reproduction that details the problem; this report will be linked to the merge request.

This is the simple case that breaks:

<div><table id="1"><tr><td>Here's a nested table:<table id="2"><tr><td>foo</td></tr></table></td></div>
<div>This tag contains nothing but whitespace: <b> </b></div>

The resulting tree links well enough to display:

>>> soup
<html><head></head><body><div><div>This tag contains nothing but whitespace: <b> </b></div><table id="1"><tbody><tr><td>Here's a nested table:<table id="2"><tbody><tr><td>foo</td></tr></tbody></table></td>\n\n</tr></tbody></table></div></body></html>

>>> soup.b
<b> </b>

However, the links are not sound:

>>> soup.b.next_element
<table id="1"><tbody><tr><td>Here's a nested table:<table id="2"><tbody><tr><td>foo</td></tr></tbody></table></td>\n\n</tr></tbody></table>

The next_element **should** be ' ', the b tag's content. These problems can go unnoticed and usually manifest when performing an extraction that assumes (and frankly requires) good linkage to do things properly.
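A check like the one below can surface this kind of breakage before an extraction trips over it. It is only a minimal sketch: `Node` is a hypothetical toy stand-in for a parsed element, and `document_order` and `find_broken_links` are illustrative helper names, not part of Beautiful Soup. The toy tree is also simpler than the document above. The idea is that following `.contents` depth-first gives the true document order, so every `next_element` should point at the next node in that order.

```python
class Node:
    """Toy stand-in for a parsed element (hypothetical, for illustration only)."""

    def __init__(self, name, children=()):
        self.name = name
        self.contents = list(children)
        self.next_element = None


def document_order(node):
    """Yield node and all of its descendants depth-first, following .contents."""
    yield node
    for child in getattr(node, "contents", []):
        yield from document_order(child)


def find_broken_links(root):
    """Return (node, expected, actual) triples where next_element
    disagrees with the depth-first document order."""
    nodes = list(document_order(root))
    broken = []
    for current, expected in zip(nodes, nodes[1:] + [None]):
        actual = current.next_element
        if actual is not expected:
            broken.append((current, expected, actual))
    return broken


# Simplified model of the symptom: <b>'s next_element skips its own text.
text = Node("' '")
b = Node("b", [text])
table = Node("table")
div = Node("div", [b, table])

div.next_element = b
b.next_element = table      # unsound: should be the text node
text.next_element = table
table.next_element = None

broken = find_broken_links(div)
for node, expected, actual in broken:
    print(node.name, "-> expected", expected.name, "but got", actual.name)
```

Because the helpers only rely on `.contents` and `.next_element`, the same walk should in principle work on a real parsed tree, though that is an assumption that has not been verified against every Beautiful Soup version.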

The merge request at https://code.launchpad.net/~facelessuser/beautifulsoup/html5lib-fix fixes this by being more aggressive about ensuring sound linkage when building the tree.
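To illustrate what "sound linkage" means in practice, here is a rough sketch that rebuilds the forward and backward links of a toy tree from its `.contents` alone. This is not the code in the merge request, and `Node` and `relink` are hypothetical names used only for this example. It simply shows the invariant a tree builder needs to maintain: each node's `next_element` is the next node in depth-first order, and `previous_element` is its mirror.

```python
class Node:
    """Toy stand-in for a parsed element (hypothetical, for illustration only)."""

    def __init__(self, name, children=()):
        self.name = name
        self.contents = list(children)
        self.next_element = None
        self.previous_element = None


def relink(root):
    """Rewrite next_element/previous_element so they follow depth-first order."""
    previous = None
    stack = [root]
    while stack:
        node = stack.pop()
        node.previous_element = previous
        node.next_element = None
        if previous is not None:
            previous.next_element = node
        previous = node
        # Push children in reverse so the first child is visited next.
        stack.extend(reversed(getattr(node, "contents", [])))
    return root


text = Node("' '")
b = Node("b", [text])
table = Node("table")
div = Node("div", [b, table])
b.next_element = table      # stale link, as in the report

relink(div)
chain = []
node = div
while node is not None:
    chain.append(node.name)
    node = node.next_element
print(chain)
```

A full relink after the fact is the bluntest possible approach; a real tree builder would more plausibly repair links incrementally as nodes are inserted or moved.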

Leonard Richardson (leonardr) wrote:

Resolved by Isaac's code in revision 483.

Changed in beautifulsoup:
status: New → Fix Committed

Leonard Richardson (leonardr) wrote:

Once again, this is excellent work, not just in fixing the tree builder but in coming up with a worst-case HTML document.

Isaac Muse (facelessuser) wrote:

No problem.

Feel free to make it an even worse case going forward :). I was looking for something that stressed the linkage in ways that hadn't already been exposed, and that snippet was ridiculous enough that it seemed to work nicely. I'm sure we could throw some other cases into it as well.

Hopefully the new code will handle any such cases, but it doesn't hurt to put them explicitly in the "worst-case" document to prevent future breakage.

Leonard Richardson (leonardr) wrote:

In 4.7.0 release.

Changed in beautifulsoup:
status: Fix Committed → Fix Released