BS4 doesn't find text in a link when the content also contains markup
Bug #1492166 reported by Mathieu Clabaut
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Won't Fix | Undecided | Unassigned |
Bug Description
With beautifulsoup4-
The following snippet:
import re
from bs4 import BeautifulSoup
dom = BeautifulSoup("<a href='/'><i class='
print(dom.
print(dom.
res = dom.find("a")
print(res.contents)
print(res.text)
displays:
[]
[]
[<i class="icon"></i>, 'tobefound']
tobefound
whereas the first two print calls should display the whole "<a>" element.
The same test works as expected with BS3:
[u'tobefound']
[u'tobefound']
[<i class="icon"></i>, u'tobefound']
tobefound
This is a documented change between BS3 and BS4:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#id18
"If you pass one of the find* methods both string and a tag-specific argument like name, Beautiful Soup will search for tags that match your tag-specific criteria and whose Tag.string matches your value for string. It will not find the strings themselves. Previously, Beautiful Soup ignored the tag-specific arguments and looked for strings."