Backend engine problem: HTML recognition error in lxml
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Invalid | Undecided | Unassigned |
Bug Description
I encountered a problem when extracting data from HTML with BS4. I can choose either lxml or html5lib as the backend engine, but they give me different answers. I know this is not a problem in Beautiful Soup itself, but it seems valuable to record the difference so others are aware of it. Here is my code snippet:
#------
#!/usr/bin/env python
# coding: utf-8
from bs4 import BeautifulSoup

# Class names and URLs were truncated in the original report;
# "..." marks the elided parts.
html_doc = """
<ul class="list-main-...">
<li>
<i class="cell maincell">
<p class="title"><a target="_blank" href="http://..."></a></p>
</i>
</li>
</ul>
"""

soup = BeautifulSoup(html_doc, 'lxml')  # or 'html5lib'
for ul in soup.find_all('ul', class_='list-main-...'):
    for li in ul.find_all('li'):
        print(li)
        for i in li.find_all('i'):
            print(i)
#------
With lxml, the <p> tag is moved out of the scope of the <i>, when it should actually stay inside it:
<li>
<i class="cell maincell">
</i><p class="title"><a href="http://..." target="_blank"></a></p>
</li>
==================
<i class="cell maincell">
</i>
With html5lib, it gives me the correct answer:
<li>
<i class="cell maincell">
<p class="title"><a href="http://..." target="_blank"></a></p>
</i>
</li>
==================
<i class="cell maincell">
<p class="title"><a href="http://..." target="_blank"></a></p>
</i>
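Stripped of the page-specific markup, the difference boils down to how each parser treats a <p> that opens inside an <i>. A minimal sketch of the reproduction (this assumes both the lxml and html5lib packages are installed alongside bs4):

```python
# Minimal reproduction of the parser difference (a sketch; the
# document below is hypothetical but triggers the same behavior
# as the report above).
from bs4 import BeautifulSoup

doc = "<i><p>title</p></i>"

for parser in ("lxml", "html5lib"):
    soup = BeautifulSoup(doc, parser)
    # lxml (libxml2) closes the <i> before opening the <p>, so the
    # <p> ends up as a sibling of <i>; html5lib keeps the <p> nested
    # inside <i>, preserving the original structure.
    print(parser, soup.find("i"))
```

Only the lxml tree shows an empty <i>; in the html5lib tree the <p> is still inside it.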
I have not reviewed how BS4 connects to its backends, so I can only assume this is a flaw in lxml.
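For comparing how every installed parser handles a document, Beautiful Soup ships a diagnostic helper that runs the markup through each available backend and prints the resulting trees side by side (output depends on which parsers are installed):

```python
# bs4's built-in diagnostic utility: parses the given markup with
# every parser available on this system (html.parser, lxml,
# html5lib, ...) and prints each resulting tree for comparison.
from bs4.diagnose import diagnose

diagnose("<i><p>title</p></i>")
```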
Thanks for taking the time to post an example of the differences between parsers where people can see it. Since as you mention this isn't a bug in Beautiful Soup, I'm going to close this issue.