Bug #1645513 “Change find() behavior when searching for both a t...” : Bugs : Beautiful Soup

Leonard Richardson (leonardr) wrote on 2016-12-11:

#1

The behavior you're seeing is by design.

"Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string."

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument

So you're searching for a tag with a special .string. How does .string work?

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string

"If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child."

"If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None."

The first <a> tag contains more than one thing ("Juggernaut" and a tag that contains "Store"), so its .string is defined to be None.

The second <a> tag contains one thing, a tag, which contains one thing, "menu", so its .string is defined to be "menu".
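The two rules quoted above can be checked with a small sketch. The markup is an assumption for illustration: the `<span>` wrappers stand in for whatever nested tags the reporter's page actually used.

```python
from bs4 import BeautifulSoup

# Assumed markup: the <span> tags are illustrative, not copied from the report.
html = (
    '<a href="https://store.jtsstrength.com" itemprop="url">'
    "Juggernaut <span>Store</span></a>"
    '<a class="nav-opener" href="#" id="showMenu"><span>menu</span></a>'
)
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

# Two children (a string and a tag), so .string is ambiguous and is None.
first_string = first.string
# A single child tag with its own .string, so the parent inherits it.
second_string = second.string

# Combining a tag name with string= compares against each tag's .string.
by_juggernaut = soup.find_all("a", string="Juggernaut")
by_menu = soup.find_all("a", string="menu")

print(first_string, second_string, by_juggernaut, by_menu)
```

Under these assumptions only the second search finds anything, because the first `<a>` tag has no single `.string` to compare against.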

Changed in beautifulsoup:
status:	New → Invalid

Jim Simon (jksimoniii) wrote on 2016-12-12:

#2

Thanks for the great clarification Leonard!

I apologize for the misunderstanding. Is there somewhere I can open a feature request to possibly address this use case? I'm struggling to achieve my first use case from above.

Leonard Richardson (leonardr) wrote on 2016-12-12:

#3

This issue can be the location of the feature request, but you're not the first person to ask for this. Issue 1366856 tracks a previous request to change the behavior of the 'find' methods when both a tag and a string are provided. I rejected that request for reasons I lay out in comment #2 on that issue. If you have a different idea of what you'd like Beautiful Soup to do, I'll treat it as a new request.

It looks like your use case is similar to https://bugs.launchpad.net/beautifulsoup/+bug/1518409, a second request to change the behavior of a search that provides both tag and string.

Maybe find_all("div", string="foo") should find a <div> tag if the <div> tag has a .string of "foo", _or_ it has a child that is the string "foo", _or_ it has a child with a .string of "foo". This would probably satisfy the person who filed issue 1518409, while avoiding the problems of the solution proposed in issue 1366856. It would break backwards compatibility, but probably not so badly that we couldn't put it in a feature release. Let me know if this is what you had in mind.
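One possible reading of the three-way rule above can be sketched today with a function filter. `proposed_match` is a hypothetical helper, not part of Beautiful Soup, and the markup is invented for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def proposed_match(tag, name, text):
    """Hypothetical filter implementing the three-way rule discussed above."""
    if tag.name != name:
        return False
    # Case 1: today's behavior, the tag's own .string equals the text.
    if tag.string == text:
        return True
    for child in tag.children:
        # Case 2: a direct child is exactly the string.
        if isinstance(child, NavigableString) and child == text:
            return True
        # Case 3: a direct child tag has a matching .string.
        if isinstance(child, Tag) and child.string == text:
            return True
    return False


# Invented markup covering each case, plus one non-match.
html = (
    '<div id="a">foo</div>'
    '<div id="b">foo<b>bar</b></div>'
    '<div id="c"><p>foo</p><p>x</p></div>'
    '<div id="d">bar</div>'
)
soup = BeautifulSoup(html, "html.parser")

current_ids = [t["id"] for t in soup.find_all("div", string="foo")]
proposed_ids = [
    t["id"] for t in soup.find_all(lambda t: proposed_match(t, "div", "foo"))
]
print(current_ids, proposed_ids)
```

The comparison is an exact match, so surrounding whitespace in a text node would prevent a hit; a real implementation would have to decide how to handle that.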

Leonard Richardson (leonardr) wrote on 2016-12-12:

#4

Issue 1366856 is at https://bugs.launchpad.net/beautifulsoup/+bug/1366856; I thought it would be auto-linked and it wasn't.

Jim Simon (jksimoniii) wrote on 2016-12-14:

#5

Thanks again!

Your performance concerns are valid, and I plan to construct a similar solution in my application to achieve the case presented.

Leonard Richardson (leonardr) wrote on 2016-12-14:

#6

I've reopened this issue while I consider the feature request.

Changed in beautifulsoup:
status:	Invalid → Confirmed

Jim Simon (jksimoniii) wrote on 2016-12-16:

#7

Great, thanks! My expectation (which other issues alluded to) is that all "things" in a tag that aren't tags themselves are returned, concatenated, as the .string property. A simple example is as follows:

<parent>
 Hello,
 <tag1>
 beautiful
 </tag1>
 World!
</parent>

.find_all(name='parent', string="Hello") == ["Hello, World!"]

Let me know if this aligns with what you think, or if I can contribute in any way!
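The expectation described in this comment can be approximated in application code. This is a minimal sketch, assuming whitespace should be stripped and the pieces joined with single spaces; `own_text` is a hypothetical helper, not a Beautiful Soup API.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def own_text(tag):
    """Hypothetical helper: join the strings that are direct children of tag."""
    parts = (
        child.strip()
        for child in tag.children
        if isinstance(child, NavigableString)
    )
    return " ".join(part for part in parts if part)


html = """<parent>
 Hello,
 <tag1>
 beautiful
 </tag1>
 World!
</parent>"""
soup = BeautifulSoup(html, "html.parser")
parent = soup.find("parent")

# Stock behavior: more than one child, so .string is None.
stock_string = parent.string
# Sketch of the requested behavior: only the tag's own strings, concatenated.
joined = own_text(parent)
matches = [t for t in soup.find_all("parent") if "Hello" in own_text(t)]
print(stock_string, joined, len(matches))
```

Filtering in a list comprehension keeps the stock `find_all` semantics untouched while giving the caller the concatenated view.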

Leonard Richardson (leonardr) on 2017-05-07

summary:

- Strange Behavior w/ find_all (name=str, string=str)
+ Change find() behavior when searching for both a tag and a string

Leonard Richardson (leonardr) wrote on 2018-07-18:

#8

Issue 1713129 is a request that find(name, string) find a tag that matches 'name' and which *contains* a string which matches 'string'.
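Until something like that exists, the "contains" behavior can be approximated with a function filter. This is only a sketch: `find_containing` is a hypothetical helper, and the `<span>` in the markup is an assumption.

```python
from bs4 import BeautifulSoup


def find_containing(soup, name, text):
    """Hypothetical helper: first `name` tag whose descendant text contains `text`."""
    return soup.find(lambda tag: tag.name == name and text in tag.get_text())


# Assumed markup: the nested <span> is illustrative.
html = (
    '<a href="https://store.jtsstrength.com" itemprop="url">'
    "Juggernaut <span>Store</span></a>"
    '<a class="nav-opener" href="#" id="showMenu"><span>menu</span></a>'
)
soup = BeautifulSoup(html, "html.parser")

stock = soup.find("a", string="Store")          # compares against .string only
found = find_containing(soup, "a", "Store")     # substring of all descendant text
missing = find_containing(soup, "a", "nothing")
print(stock, found, missing)
```

Calling `get_text()` on every candidate tag walks its whole subtree, so this is slower than the stock comparison on large documents.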

Leonard Richardson (leonardr) on 2018-07-19

tags:

added: feature

Leonard Richardson (leonardr) on 2018-07-21

Changed in beautifulsoup:
importance:	Undecided → Wishlist

Isaac Muse (facelessuser) wrote on 2019-01-04:

#9

This may or may not affect this feature request, but I thought I'd point out that this is possible with the new "select" API's ":contains()" pseudo-class. It was originally proposed as an official CSS selector, but in the end it was rejected. Because it is so useful, some custom CSS libraries (jQuery, lxml's cssselect, etc.) have opted to implement it anyway for scripting purposes. As of Beautiful Soup 4.7.0, the included soupsieve library, which provides the "select" and "select_one" functionality, allows for this:

>>> import bs4
>>> html_var1 = "<a href=\"https://store.jtsstrength.com\" itemprop=\"url\">Juggernaut Store</a>"
>>> html_var2 = "<a class=\"nav-opener\" href=\"#\" id=\"showMenu\">menu</a>"
>>> bs4.BeautifulSoup(html_var1, 'html.parser').select("a:contains(Juggernaut)")
[<a href="https://store.jtsstrength.com" itemprop="url">Juggernaut Store</a>]
>>> bs4.BeautifulSoup(html_var2, 'html.parser').select("a:contains(menu)")
[<a class="nav-opener" href="#" id="showMenu">menu</a>]

Keep in mind that ":contains()" will search all the descendant text nodes of an element for the containing text, just as stated in the originally proposed CSS spec. See below:

>>> bs4.BeautifulSoup(html_var1, 'html.parser').select("*:contains(Store)")
[<a href="https://store.jtsstrength.com" itemprop="url">Juggernaut Store</a>, Store]

Notice that two elements are returned because both the "<a>" tag and the tag nested inside it contain "Store". So the more specific your selector, the better the results.

Using broad selectors like "*:contains()" can be very expensive, but using something like "p:contains()" would be much less expensive as it would only execute on "<p>" tags.

Now, Beautiful Soup may or may not still want to change "find_all"'s behavior in light of this information, but I thought I would point this out for others viewing this issue in the future. "select" and "select_one" may be able to provide the desired functionality.
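A note for readers on newer releases: Soup Sieve 2.1 renamed this pseudo-class to ":-soup-contains()", keeping ":contains()" as a deprecated alias, and added ":-soup-contains-own()", which only looks at an element's own text nodes. A minimal sketch, assuming a nested `<span>` in the markup (the tag name is illustrative):

```python
from bs4 import BeautifulSoup

# Assumed markup: the <span> is illustrative.
html = (
    '<a href="https://store.jtsstrength.com" itemprop="url">'
    "Juggernaut <span>Store</span></a>"
)
soup = BeautifulSoup(html, "html.parser")

# Matches any element with "Store" somewhere in its descendant text.
any_depth = soup.select('*:-soup-contains("Store")')
# Matches only elements whose own, direct text nodes contain "Store".
own_only = soup.select('*:-soup-contains-own("Store")')

any_depth_names = [tag.name for tag in any_depth]
own_only_names = [tag.name for tag in own_only]
print(any_depth_names, own_only_names)
```

The "own" variant avoids the double match described above when only the innermost element is wanted.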

jnns (jnns) wrote on 2023-11-15:

#10

Thank you Isaac. I just wanted to let you know that your post is very helpful.
