Can't resolve code correctly

Bug #1635514 reported by slink
This bug affects 1 person
Affects: Beautiful Soup
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

Beautiful Soup can't resolve this markup correctly. Please see below:
<div id="rptAccidentList_ctl00_ltrDiscount" class="list_pro_price_gift" style="text-align:center"><div class="list_pro_discount" style="display:none" ></div>fee:87.15元&nbsp;&nbsp;&nbsp;&nbsp;rate:35%</div>

Code:
feestr = item.find('div', class_='list_pro_price_gift')
print(feestr)

Actual output:
<div class="list_pro_price_gift" id="rptAccidentList_ctl00_ltrDiscount" style="text-align:center"></div>

Expected output:
<div class="list_pro_price_gift" id="rptAccidentList_ctl00_ltrDiscount" style="text-align:center"><div class="list_pro_discount" style="display:none" ></div>fee:87.15元&nbsp;&nbsp;&nbsp;&nbsp;rate:35%</div>

Leonard Richardson (leonardr) wrote:

It sounds like you passed some HTML into Beautiful Soup and you're surprised at the way the underlying parser handled it. When this happens to me, I follow the principles I described in the "Errors when parsing a document" section of the Beautiful Soup documentation:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#errors-when-parsing-a-document

The first thing I consider is, "should I be using a different parser?" I use the diagnose() function to show me how different parsers handle the HTML you provided in your bug report.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#diagnose

I used this code to run the diagnose() function on the markup you provided with this bug report:

from bs4.diagnose import diagnose
diagnose("""<div id="rptAccidentList_ctl00_ltrDiscount" class="list_pro_price_gift" style="text-align:center"><div class="list_pro_discount" style="display:none" ></div>fee:87.15元&nbsp;&nbsp;&nbsp;&nbsp;rate:35%</div>""")

Here's the output I see:

Diagnostic running on Beautiful Soup 4.5.0
Python version 2.7.10 (default, Oct 14 2015, 16:09:02)
[GCC 5.2.1 20151010]
Found lxml version 3.4.4.0
Found html5lib version 0.999

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<div class="list_pro_price_gift" id="rptAccidentList_ctl00_ltrDiscount" style="text-align:center">
 <div class="list_pro_discount" style="display:none">
 </div>
 fee:87.15ĺ
ƒ    rate:35%
</div>
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
 <head>
 </head>
 <body>
  <div class="list_pro_price_gift" id="rptAccidentList_ctl00_ltrDiscount" style="text-align:center">
   <div class="list_pro_discount" style="display:none">
   </div>
   fee:87.15å…ƒ    rate:35%
  </div>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
 <body>
  <div class="list_pro_price_gift" id="rptAccidentList_ctl00_ltrDiscount" style="text-align:center">
   <div class="list_pro_discount" style="display:none">
   </div>
   fee:87.15ĺ
ƒ    rate:35%
  </div>
 </body>
</html>

All three of the HTML parsers put the <div> with class="list_pro_discount" inside the <div> with class="list_pro_price_gift". In other words, I can't reproduce your problem.

This tells me that the markup you posted with this bug report isn't the whole story. There's something _elsewhere_ in the document you parsed which confuses whichever HTML parser you're telling Beautiful Soup to use.

As I mention in the "Errors when parsing a document" section of the documentation, this is almost never a problem with Beautiful Soup. It's much more common that the HTML parser you told Beautiful Soup to use is seeing an ambiguous HTML document and making a decision about how to handle it, a decision that doesn't match what you were expecting. In this case you have two options: either tell Beautiful Soup to use a different parser, or change your code to work with the way your chosen parser actually parses the document.
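As a rough sketch of the first option, the snippet below parses the markup from this report with each parser that happens to be installed and runs the same find() call against each result. The helper name find_with_each_parser and the MARKUP constant are made up for illustration and are not part of Beautiful Soup. With the real document, you would pass in the full page instead of the fragment, since the fragment alone parses the same way everywhere.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# The fragment from the bug report (assumption: the real page contains more
# markup around it, which is what would actually trigger the difference).
MARKUP = (
    '<div id="rptAccidentList_ctl00_ltrDiscount" class="list_pro_price_gift" '
    'style="text-align:center"><div class="list_pro_discount" '
    'style="display:none" ></div>fee:87.15元&nbsp;&nbsp;&nbsp;&nbsp;rate:35%</div>'
)


def find_with_each_parser(markup, parsers=("html.parser", "html5lib", "lxml")):
    """Hypothetical helper: run the same find() under every available parser."""
    results = {}
    for name in parsers:
        try:
            # Naming the parser explicitly avoids depending on whichever
            # parser Beautiful Soup would pick by default on this machine.
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser library is not installed; skip it.
            continue
        tag = soup.find("div", class_="list_pro_price_gift")
        results[name] = None if tag is None else str(tag)
    return results


if __name__ == "__main__":
    for name, found in find_with_each_parser(MARKUP).items():
        print(name, "->", found)
```

If the outputs differ between parsers on the full page, that points at the parser's handling of the surrounding markup rather than at the find() call.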

Although you didn't mention which parser you told Beaut...


Changed in beautifulsoup:
status: New → Invalid