cant resolve code correctly
Bug #1635514 reported by slink
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Invalid | Undecided | Unassigned |
Bug Description
Can't resolve code correctly; please see below:
<div id="rptAccident
type code:
feestr = item.find('div', class_=
print(feestr)
return:
<div class="
expect return:
<div class="
Changed in beautifulsoup:
status: New → Invalid
It sounds like you passed some HTML into Beautiful Soup and you're surprised at the way the underlying parser handled it. When this happens to me, I follow the principles I described in the "Errors when parsing a document" section of the Beautiful Soup documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#errors-when-parsing-a-document
The first thing I consider is, "Should I be using a different parser?" I used the diagnose() function to see how different parsers handle the HTML you provided in your bug report.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#diagnose
I used this code to run the diagnose() function on the markup you provided with this bug report:
from bs4.diagnose import diagnose
diagnose("""<div id="rptAccidentList_ctl00_ltrDiscount" class="list_pro_price_gift" style="text-align:center"><div class="list_pro_discount" style="display:none"></div>fee:87.15元&nbsp;rate:35%</div>""")
Here's the output I see:
Diagnostic running on Beautiful Soup 4.5.0
Python version 2.7.10 (default, Oct 14 2015, 16:09:02)
[GCC 5.2.1 20151010]
Found lxml version 3.4.4.0
Found html5lib version 0.999
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<div class="list_pro_price_gift" id="rptAccidentList_ctl00_ltrDiscount" style="text-align:center">
<div class="list_pro_discount" style="display:none">
</div>
fee:87.15ĺ
rate:35%
</div>
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
<head>
</head>
<body>
<div class="list_pro_price_gift" id="rptAccidentList_ctl00_ltrDiscount" style="text-align:center">
<div class="list_pro_discount" style="display:none">
</div>
fee:87.15å…ƒ rate:35%
</div>
</body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
<body>
<div class="list_pro_price_gift" id="rptAccidentList_ctl00_ltrDiscount" style="text-align:center">
<div class="list_pro_discount" style="display:none">
</div>
fee:87.15ĺ
rate:35%
</div>
</body>
</html>
All three of the HTML parsers put the <div> with class="list_pro_discount" inside the <div> with class="list_pro_price_gift". In other words, I can't reproduce your problem.
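The same nesting check can also be done in code rather than by reading the diagnose() output. This is only a rough sketch: it assumes the stray spaces in the pasted markup are rendering artifacts of this page, the names MARKUP and inner_is_nested are made up for illustration, and any parser that isn't installed is simply recorded as None.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Markup from the bug report, with the page's stray spaces removed (an assumption).
MARKUP = (
    '<div id="rptAccidentList_ctl00_ltrDiscount" class="list_pro_price_gift" '
    'style="text-align:center">'
    '<div class="list_pro_discount" style="display:none"></div>'
    'fee:87.15元&nbsp;rate:35%</div>'
)


def inner_is_nested(parser_name):
    """Return True if the discount <div> is a direct child of the price_gift <div>."""
    soup = BeautifulSoup(MARKUP, parser_name)
    outer = soup.find("div", class_="list_pro_price_gift")
    inner = soup.find("div", class_="list_pro_discount")
    return outer is not None and inner is not None and inner.parent is outer


results = {}
for name in ("html.parser", "html5lib", "lxml"):
    try:
        results[name] = inner_is_nested(name)
    except FeatureNotFound:
        # This parser isn't installed here, so there is nothing to compare.
        results[name] = None

print(results)
```

If every installed parser reports True for the snippet on its own, the difference you're seeing has to come from the rest of the document.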
This tells me that the markup you posted with this bug report isn't the whole story. There's something _elsewhere_ in the document you parsed which confuses whichever HTML parser you're telling Beautiful Soup to use.
As I mention in the "Errors when parsing a document" section of the documentation, this is almost never a problem with Beautiful Soup. It's much more common that the HTML parser you told Beautiful Soup to use is seeing an ambiguous HTML document and making a decision about how to handle it, a decision that doesn't match what you were expecting. In this case you have two options: either tell Beautiful Soup to use a different parser, or change your code to work with the way your chosen parser actually parses the document.
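For the first option, the parser is chosen by the second argument to the BeautifulSoup constructor. Here is a minimal sketch of trying parsers in a preferred order; the helper name parse_with_first_available and the order shown are illustrative assumptions, not a recommendation for your document.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_first_available(markup, parsers=("html5lib", "lxml", "html.parser")):
    """Parse markup with the first parser in `parsers` that is installed.

    Returns a (parser_name, soup) pair so the caller knows which parser
    actually built the tree.
    """
    for name in parsers:
        try:
            return name, BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Beautiful Soup raises this when the named parser is missing.
            continue
    raise RuntimeError("none of the requested parsers is installed")


used, soup = parse_with_first_available("<p>hi</p>")
print(used, soup.p.get_text())
```

Naming the parser explicitly also keeps the result the same from one machine to the next, since Beautiful Soup otherwise picks whichever parser it considers best among those installed.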
Although you didn't mention which parser you told Beaut...