Beautiful Soup

Backend engine problem, html recognition error of lxml

Bug #1609627 reported by caesar0301 on 2016-08-04

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Beautiful Soup	Invalid	Undecided	Unassigned

Bug Description

I encountered a problem when extracting html using BS4. I choose backend engine between lxml and html5lib. But they give me different answers. I know its not the problem of BS. Yet it is valuable to record this difference to let others be aware. Here is my code snippet,

#------------------CODE STARTED-----------------------
#!/usr/bin/env python
#coding: utf-8

html_doc = """
<ul class="list-main-icnset">
<li>

<a target="_blank" href="http://www.itjuzi.com/company/42904">hello</a>

</li>
</ul>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "lxml")

for ul in soup.find_all('ul', class_ = 'list-main-icnset'):
for li in ul.find_all('li'):
  print li
  for i in li.find_all('i'):
   print("==================")
   print i
#---------------------CODE ENDED----------------------

With lxml, it gives me the tag out of the scope of <li>, actually it should be within the scope:

<li>

<a href="http://www.itjuzi.com/company/42904" target="_blank">hello</a>
</li>
==================

With html5lib, it gives me correct answer:

<li>

<a href="http://www.itjuzi.com/company/42904" target="_blank">hello</a>

</li>
==================

<a href="http://www.itjuzi.com/company/42904" target="_blank">hello</a>

I did not review the implementation of BS4 how it connects with backends. I could just assume it is the flaws of lxml.

See original description

caesar0301 (caesar0301) on 2016-08-04

description:	updated
description:	updated

Revision history for this message

Leonard Richardson (leonardr) wrote on 2016-12-10:

Thanks for taking the time to post an example of the differences between parsers where people can see it. Since as you mention this isn't a bug in Beautiful Soup, I'm going to close this issue.

Changed in beautifulsoup:
status:	New → Invalid

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.