html5lib tree builder creates an inconsistent tree when reparenting tags

Bug #1189267 reported by Robert Bitel
This bug affects 1 person
Affects: Beautiful Soup
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

It seems we enter an infinite loop when iterating over all strings (Tag._all_strings). I'm not exactly sure what is going on, but it seems to happen when two identical descendants appear one right after another.

I have hacked together a fix, but I will not share it because I know for sure it is wrong. It seems like the iterator is not detecting the last element.

I've attached the broken HTML file. If you need any other information, let me know.

Thanks

Robert Bitel (lonecow) wrote :
Leonard Richardson (leonardr) wrote :

I can't duplicate this problem with html.parser, lxml, or html5lib. What version of Beautiful Soup are you using, what parser are you telling it to use, and what code are you running?

If you're not using Beautiful Soup 4.2.1, you may be encountering bug 1182089, which has been fixed.

Robert Bitel (lonecow) wrote : Re: [Bug 1189267] Re: Infinate loop in Tag._all_strings
  • Test.py (704 bytes, application/octet-stream; name="Test.py")

I'm using Beautiful Soup 4.2.1, which I installed with pip in a virtual
environment.
I was using the html5lib parser.

Attached is sample code.

On Sun, Jun 9, 2013 at 6:57 PM, Leonard Richardson <email address hidden> wrote:

> I can't duplicate this problem with html.parser, lxml, or html5lib. What
> version of Beautiful Soup are you using, what parser are you telling it
> to use, and what code are you running?
>
> If you're not using Beautiful Soup 4.2.1, you may be encountering bug
> 1182089, which has been fixed.

Leonard Richardson (leonardr) wrote : Re: Infinate loop in Tag._all_strings

I can duplicate the error now with the following minimal markup:

<p><em>foo</p>
<p>bar<a></a></em></p>
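
A bounded way to try this markup is sketched below. It is only a sketch and assumes bs4 and html5lib are installed; the helper name bounded_strings and the cap LIMIT are made up for illustration. It reads the public soup.strings generator through itertools.islice, so an affected version stops at the cap instead of hanging.

```python
from itertools import islice

try:
    from bs4 import BeautifulSoup, FeatureNotFound
except ImportError:  # Beautiful Soup is not installed
    BeautifulSoup = None

# The minimal markup from the comment above.
MARKUP = "<p><em>foo</p>\n<p>bar<a></a></em></p>"
LIMIT = 1000  # arbitrary cap (an assumption), far above the few strings in MARKUP


def bounded_strings(markup, parser="html5lib", limit=LIMIT):
    """Return at most limit + 1 strings, or None if the parser is unavailable."""
    if BeautifulSoup is None:
        return None
    try:
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:  # the requested parser is not installed
        return None
    # islice stops the generator after limit + 1 items, even if it never ends.
    return list(islice(soup.strings, limit + 1))


strings = bounded_strings(MARKUP)
if strings is None:
    print("bs4 or html5lib is not available")
elif len(strings) > LIMIT:
    print("iteration did not terminate within the cap; the tree is likely inconsistent")
else:
    print("iteration terminated with", len(strings), "strings")
```

Running the same helper with parser="html.parser" gives a quick comparison against a tree builder that does not reparent tags.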

Leonard Richardson (leonardr) wrote :

The html5lib tree builder has had numerous problems caused by html5lib's tendency to rearrange the tree during parsing. At this point the easy problems have been fixed, and problems like this happen because the tree builder calls Beautiful Soup API methods while the tree is in an inconsistent state. The only solution I can see is to rewrite the html5lib tree builder to manipulate the tree directly, rather than calling Beautiful Soup API methods.
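
To illustrate why an inconsistent state can make iteration loop forever, here is a toy model. It is not Beautiful Soup's code: the Node class and the link, walk, and has_cycle helpers are hypothetical, and only the idea of chaining elements in document order through a next_element pointer is borrowed from the library. If a reparenting step leaves one such pointer stale, following the chain never reaches the end.

```python
class Node:
    """Toy stand-in for a parse-tree element (not Beautiful Soup's real class)."""

    def __init__(self, name):
        self.name = name
        self.next_element = None  # next node in document order


def link(nodes):
    """Chain the nodes in document order through next_element."""
    for current, following in zip(nodes, nodes[1:]):
        current.next_element = following
    nodes[-1].next_element = None


def walk(start):
    """Yield node names along the chain, raising if a node is seen twice."""
    seen = set()
    node = start
    while node is not None:
        if node in seen:
            raise RuntimeError(f"cycle detected at {node.name!r}")
        seen.add(node)
        yield node.name
        node = node.next_element


def has_cycle(start):
    """Return True if following next_element from start revisits a node."""
    try:
        for _ in walk(start):
            pass
    except RuntimeError:
        return True
    return False


p1, em, foo, p2, bar = (Node(n) for n in ["p1", "em", "foo", "p2", "bar"])
link([p1, em, foo, p2, bar])
consistent = list(walk(p1))  # the chain ends at bar

# Simulate a reparenting step that leaves a stale pointer:
# bar now points back at em instead of ending the chain.
bar.next_element = em
print(has_cycle(p1))
```

A naive iterator without the seen set would spin forever on the second chain, which is the symptom reported in this bug.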

I don't understand html5lib's parsing algorithm very well, which makes this a very big task for which I don't have time right now. I have made a start in the branch attached to this bug.

summary: - Infinate loop in Tag._all_strings
+ html5lib tree builder creates an inconsistent tree when reparenting tags
Changed in beautifulsoup:
status: New → Confirmed
Changed in beautifulsoup:
status: Confirmed → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released