Backend engine problem, html recognition error of lxml

Bug #1609627 reported by caesar0301
This bug affects 1 person
Affects         Status    Importance   Assigned to   Milestone
Beautiful Soup  Invalid   Undecided    Unassigned

Bug Description

I ran into a problem when parsing HTML with BS4. I switched the backend parser between lxml and html5lib, and the two give different results. I know this is not a problem in BS itself, but it is worth recording the difference so that others are aware of it. Here is my code snippet:

#------------------CODE STARTED-----------------------
#!/usr/bin/env python
#coding: utf-8

html_doc = """
<ul class="list-main-icnset">
<li>
<i class="cell maincell">
 <p class="title"><a target="_blank" href="http://www.itjuzi.com/company/42904"><span>hello</span></a></p>
</i>
</li>
</ul>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "lxml")

# Print each <li>, then each <i> nested inside it.
for ul in soup.find_all('ul', class_='list-main-icnset'):
    for li in ul.find_all('li'):
        print(li)
        for i in li.find_all('i'):
            print("==================")
            print(i)
#---------------------CODE ENDED----------------------

With lxml, the <p> tag ends up outside the <i> element (still inside the <li>), although it should be nested inside the <i>:

<li>
<i class="cell maincell">
</i><p class="title"><a href="http://www.itjuzi.com/company/42904" target="_blank"><span>hello</span></a></p>
</li>
==================
<i class="cell maincell">
</i>

With html5lib, I get the correct result:

<li>
<i class="cell maincell">
 <p class="title"><a href="http://www.itjuzi.com/company/42904" target="_blank"><span>hello</span></a></p>
</i>
</li>
==================
<i class="cell maincell">
 <p class="title"><a href="http://www.itjuzi.com/company/42904" target="_blank"><span>hello</span></a></p>
</i>

I did not review how BS4 connects to its backends, so I can only assume this is a flaw in lxml.
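
To make the difference easier to reproduce, here is a minimal sketch that parses the same markup with each backend and reports whether the <p> is still nested inside the <i>. The helper names `p_is_inside_i` and `compare_parsers` are made up for this illustration and are not part of BS4. Parsers that are not installed are reported as None. A possible explanation (an assumption, not confirmed in this report) is that lxml's underlying HTML parser applies HTML 4 style rules, where a block-level <p> closes an open inline <i>, while html5lib follows the HTML5 parsing algorithm, which keeps the <i> open.

```python
from bs4 import BeautifulSoup, FeatureNotFound

HTML_DOC = """
<ul class="list-main-icnset">
<li>
<i class="cell maincell">
 <p class="title"><a target="_blank" href="http://www.itjuzi.com/company/42904"><span>hello</span></a></p>
</i>
</li>
</ul>
"""


def p_is_inside_i(markup, parser):
    """Return True if the first <i> contains a <p>, False if it does not,
    or None if the requested parser is not installed."""
    try:
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the named parser is unavailable.
        return None
    i_tag = soup.find("i")
    return i_tag is not None and i_tag.find("p") is not None


def compare_parsers(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Map each parser name to the nesting result for the given markup."""
    return {name: p_is_inside_i(markup, name) for name in parsers}


if __name__ == "__main__":
    for name, nested in compare_parsers(HTML_DOC).items():
        print(name, "-> <p> inside <i>:", nested)
```

Based on the outputs above, I would expect False for lxml and True for html5lib. Results may vary with the installed parser versions.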

caesar0301 (caesar0301)
description: updated
Leonard Richardson (leonardr) wrote :

Thanks for taking the time to post an example of the differences between parsers where people can see it. Since, as you mention, this isn't a bug in Beautiful Soup, I'm going to close this issue.

Changed in beautifulsoup:
status: New → Invalid