Tag object loses information when html5lib parses malformed HTML
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Beautiful Soup | Fix Released | Undecided | Unassigned | |
Bug Description
I ran into a problem where a Tag object loses some information when html5lib parses malformed HTML. After that, I can no longer operate on it as a BS4 object.
With the soup_html5lib object built from html1 below, soup_html5lib.div works, but this fails: soup_html5lib.
I compared soup_lxml and soup_html5lib to find the difference.
I found that 'parser_class' is lost (None) in the Tag object produced with html5lib.
I know html1 is structurally invalid HTML, but the Tag object is still created.
I'm not sure whether this is a bug in BS4, in html5lib, or somewhere else.
I just expect to avoid the Traceback when operating on the object.
If there is an option to avoid it, please let me know.
Thanks for reading.
-----
from bs4 import BeautifulSoup
html1 = '''
<div><a href="http://
'''
html2 = '''
<div><a href="http://
'''
soup_lxml = BeautifulSoup(
soup_lxml.
soup_html5lib = BeautifulSoup(
soup_html5lib.
...Traceback...
AttributeError: 'NoneType' object has no attribute 'next_element'
...
vars(soup_
...
'parser_class': None,
...
-----
On my system:
- Ubuntu 14.04.5 LTS
- Python 2.7.6
From pip:
- ipython (1.2.1)
- beautifulsoup4 (4.5.1)
- lxml (3.3.3)
- html5lib (0.999999999)
Changed in beautifulsoup:
status: Fix Committed → Fix Released
Fixed in revision 434.