Tag object loses some information with html5lib on invalid HTML

Bug #1629825 reported by Hara Hidekazu
This bug affects 2 people
Affects: Beautiful Soup
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

I ran into a problem where a Tag object loses some information when invalid HTML is parsed with html5lib. Once that information is lost, I can no longer operate on it as a normal BS4 object.

When I build soup_html5lib from html1 (see the code below), soup_html5lib.div works, but soup_html5lib.div.find_all('img') fails.
I compared soup_lxml and soup_html5lib to find the difference.

I found that 'parser_class' is None on the Tag object produced with html5lib.

I know html1 is not correctly structured HTML, but the Tag object is still created.

I'm not sure whether this is a bug in BS4, in html5lib, or somewhere else.

I just expect not to get a traceback when operating on the object.
If there is an option to avoid it, please let me know.

Thanks for reading.

-----
from bs4 import BeautifulSoup
html1 = '''
<div><a href="http://www.theborneo.com/B16092503.jpg"><img src="http://www.theborneo.com/B16092503.jpg" /><p><noscript><img src="http://www.theborneo.com/B16092503.jpg" /></noscript></a> The woman mourning</p></div>
'''
html2 = '''
<div><a href="http://www.theborneo.com/B16092503.jpg"><img src="http://www.theborneo.com/B16092503.jpg" /><p><noscript><img src="http://www.theborneo.com/B16092503.jpg" /></noscript></p></a> The woman mourning</div>
'''

soup_lxml = BeautifulSoup(html1, 'lxml')
soup_lxml.div.find_all('img')

soup_html5lib = BeautifulSoup(html1, 'html5lib')
soup_html5lib.div.find_all('img')
...Traceback...
AttributeError: 'NoneType' object has no attribute 'next_element'
...

vars(soup_html5lib.div)
...
 'parser_class': None,
...
-----
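
To make the difference between html1 and html2 concrete, here is a minimal sketch that uses only Python's standard-library html.parser (not Beautiful Soup, lxml or html5lib) to report end tags that do not close the innermost open element. NestingChecker and find_misnested are hypothetical names introduced for this illustration. The check is a plain stack comparison, not the HTML5 tree-construction algorithm that html5lib implements, so it only shows where the markup is misnested, not how any parser repairs it.

```python
from html.parser import HTMLParser

# The two snippets from the report above.
html1 = '''
<div><a href="http://www.theborneo.com/B16092503.jpg"><img src="http://www.theborneo.com/B16092503.jpg" /><p><noscript><img src="http://www.theborneo.com/B16092503.jpg" /></noscript></a> The woman mourning</p></div>
'''
html2 = '''
<div><a href="http://www.theborneo.com/B16092503.jpg"><img src="http://www.theborneo.com/B16092503.jpg" /><p><noscript><img src="http://www.theborneo.com/B16092503.jpg" /></noscript></p></a> The woman mourning</div>
'''

# Void elements never take an end tag, so they are never pushed on the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class NestingChecker(HTMLParser):
    """Hypothetical helper: records end tags that do not match the
    innermost element that is still open."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open element names
        self.problems = []    # (end tag seen, innermost open tag) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <img ... /> open nothing.
        pass

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            return
        innermost = self.open_tags[-1] if self.open_tags else None
        self.problems.append((tag, innermost))
        # Drop the most recent matching open tag, if any, so that a
        # single mistake is reported only once.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break


def find_misnested(markup):
    """Return a list of (end_tag, innermost_open_tag) mismatches."""
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems


print(find_misnested(html1))  # [('a', 'p')]: </a> arrives while <p> is open
print(find_misnested(html2))  # []: every end tag closes the innermost element
```

In html1 the </a> end tag appears while the <p> element is still open, which is the misnesting each parser has to repair in its own way. In html2 the tags are closed in order.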

On my system:
- Ubuntu 14.04.5 LTS
- Python 2.7.6
Installed via pip:
- ipython (1.2.1)
- beautifulsoup4 (4.5.1)
- lxml (3.3.3)
- html5lib (0.999999999)

Leonard Richardson (leonardr) wrote :

Fixed in revision 434.

Changed in beautifulsoup:
status: New → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released