Change find() behavior when searching for both a tag and a string

Bug #1645513 reported by Jim Simon on 2016-11-29
This bug affects 2 people
Affects: Beautiful Soup
Status: Confirmed
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Requirements
  - Python 3.5.2
  - beautifulsoup4==4.5.1

I'm trying to call find_all with both name and string values, and I'm seeing what I believe are inconsistent results. Can someone please verify whether this is a bug, or clarify if I'm misunderstanding the functionality?

In the example below, I expect each find_all call to return 1 match. The first one does not.

from bs4 import BeautifulSoup

html_var1 = '<a href="https://store.jtsstrength.com" itemprop="url">Juggernaut <span class="red-category">Store</span></a>'
html_var2 = '<a class="nav-opener" href="#" id="showMenu"><span>menu</span></a>'

BeautifulSoup(html_var1, 'html.parser').find_all(
  name='a',
  string='Juggernaut'
)
Output []

BeautifulSoup(html_var2, 'html.parser').find_all(
  name='a',
  string='menu'
)

Output ["menu"]

Leonard Richardson (leonardr) wrote :

The behavior you're seeing is by design.

"Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string."

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument

So you're searching for a tag with a special .string. How does .string work?

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string

"If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child."

"If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None."

The first <a> tag contains more than one thing ("Juggernaut" and a <span> tag that contains "Store"), so its .string is defined to be None.

The second <a> tag contains one thing, a <span> tag, which contains one thing, "menu", so its .string is defined to be "menu".
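A minimal sketch of the .string rule described above, using the two snippets from the report. The variable names `first` and `second` are only illustrative.

```python
from bs4 import BeautifulSoup

html_var1 = ('<a href="https://store.jtsstrength.com" itemprop="url">'
             'Juggernaut <span class="red-category">Store</span></a>')
html_var2 = '<a class="nav-opener" href="#" id="showMenu"><span>menu</span></a>'

first = BeautifulSoup(html_var1, "html.parser").a
second = BeautifulSoup(html_var2, "html.parser").a

# Two children (a text node and a <span>), so .string is None.
print(len(first.contents), first.string)

# One child, a <span> whose only child is "menu", so .string is inherited.
print(len(second.contents), second.string)
```

Because find_all(name='a', string=...) compares against this .string value, the first tag can never match while it has more than one child.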

Changed in beautifulsoup:
status: New → Invalid

Jim Simon (jksimoniii) wrote :

Thanks for the great clarification Leonard!

I apologize for the misunderstanding. Is there somewhere I can open a feature request to possibly address this use case? I'm struggling to achieve my first use case from above.

Leonard Richardson (leonardr) wrote :

This issue can be the location of the feature request, but you're not the first person to ask for this. Issue 1366856 tracks a previous request to change the behavior of the 'find' methods when both a tag and a string are provided. I rejected that request for reasons I lay out in comment #2 on that issue. If you have a different idea of what you'd like Beautiful Soup to do, I'll treat it as a new request.

It looks like your use case is similar to https://bugs.launchpad.net/beautifulsoup/+bug/1518409, a second request to change the behavior of a search that provides both tag and string.

Maybe find_all("div", string="foo") should find a <div> tag if the <div> tag has a .string of "foo", _or_ it has a child that is the string "foo", _or_ it has a child with a .string of "foo". This would probably satisfy the person who filed issue 1518409, while avoiding the problems of the solution proposed in issue 1366856. It would break backwards compatibility but probably not so bad we couldn't put it in a feature release. Let me know if this is what you had in mind.
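The three conditions floated above can be sketched today with a function filter. This is only an illustration of the idea, not Beautiful Soup's implementation: `matches_proposed_rule` is a hypothetical helper, the sample markup is made up, and exact string equality is an assumption.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def matches_proposed_rule(tag, text):
    """Hypothetical helper: True if `tag` meets any of the three conditions."""
    # 1. The tag's own .string equals the text (the existing behavior).
    if tag.string == text:
        return True
    for child in tag.children:
        # 2. A direct child is a string equal to the text.
        if isinstance(child, NavigableString) and child == text:
            return True
        # 3. A direct child tag has a .string equal to the text.
        if isinstance(child, Tag) and child.string == text:
            return True
    return False


# Made-up markup: the first <div> meets condition 2, the second condition 3.
html = ("<div>foo<b>bar</b></div>"
        "<div><i>baz</i><b>foo</b></div>"
        "<div>nothing</div>")
soup = BeautifulSoup(html, "html.parser")

# find_all() accepts a function that receives each tag and returns a bool.
found = soup.find_all(
    lambda tag: tag.name == "div" and matches_proposed_rule(tag, "foo")
)
print(len(found))

# Existing behavior for comparison: no <div> has a .string of "foo".
print(soup.find_all("div", string="foo"))
```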

Leonard Richardson (leonardr) wrote :

Issue 1366856 is at https://bugs.launchpad.net/beautifulsoup/+bug/1366856; I thought it would be auto-linked and it wasn't.

Jim Simon (jksimoniii) wrote :

Thanks again!

Your performance concerns are valid, and I plan to construct a similar solution in my application to achieve the case presented.

Leonard Richardson (leonardr) wrote :

I've reopened this issue while I consider the feature request.

Changed in beautifulsoup:
status: Invalid → Confirmed

Jim Simon (jksimoniii) wrote :

Great! Thanks. My expectation (which other issues alluded to) is that all "things" that exist in a tag and aren't tags themselves are returned, concatenated, as the .string property. A simple example is as follows:

<parent>
  Hello,
  <tag1>
    beautiful
  </tag1>
  World!
</parent>

.find_all(name='parent', string="Hello") == ["Hello, World!"]

Let me know if this aligns with what you think, or if I can contribute in any way!
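A rough sketch of that expectation, written as application code rather than as a change to .string. The helper `own_text` is hypothetical, and both the whitespace normalization and the substring match are assumptions about the intended behavior.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

markup = """
<parent>
  Hello,
  <tag1>
    beautiful
  </tag1>
  World!
</parent>
"""


def own_text(tag):
    """Hypothetical helper: join a tag's direct string children, skipping child tags."""
    parts = (child.strip() for child in tag.children
             if isinstance(child, NavigableString))
    return " ".join(part for part in parts if part)


soup = BeautifulSoup(markup, "html.parser")
parent = soup.find("parent")

# The built-in .string is None here because <parent> has several children.
print(parent.string)

# Collect the concatenated direct text of every <parent> that mentions "Hello".
matches = [own_text(tag) for tag in soup.find_all("parent")
           if "Hello" in own_text(tag)]
print(matches)
```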

summary: - Strange Behavior w/ find_all (name=str, string=str)
+ Change find() behavior when searching for both a tag and a string

Leonard Richardson (leonardr) wrote :

Issue 1713129 is a request that find(name, string) find a tag that matches 'name' and which *contains* a string which matches 'string'.
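Until such a change lands, the "contains" reading can be approximated with a function filter. The helper name `find_containing` is hypothetical and not part of Beautiful Soup, and treating "contains" as a substring test on get_text() is an assumption.

```python
from bs4 import BeautifulSoup


def find_containing(soup, name, text):
    """Hypothetical helper: tags called `name` whose combined text contains `text`."""
    return soup.find_all(
        lambda tag: tag.name == name and text in tag.get_text()
    )


html_var1 = ('<a href="https://store.jtsstrength.com" itemprop="url">'
             'Juggernaut <span class="red-category">Store</span></a>')
soup = BeautifulSoup(html_var1, "html.parser")

# The <a> tag matches even though its .string is None.
links = find_containing(soup, "a", "Juggernaut")
print(links)

# The <span> holds only "Store", so it does not match.
spans = find_containing(soup, "span", "Juggernaut")
print(spans)
```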

tags: added: feature
Changed in beautifulsoup:
importance: Undecided → Wishlist

Isaac Muse (facelessuser) wrote :

This may or may not affect this feature request, but I thought I'd point out that this is possible with the new "select" API's ":contains()" pseudo-class. It is a pseudo-class that was originally proposed as an official CSS selector but was rejected in the end. Because of its usefulness, some custom CSS libraries (jQuery, lxml's cssselect, etc.) have opted to implement it anyway for scripting purposes. As of Beautiful Soup 4.7.0, the included soupsieve library, which provides "select" and "select_one" functionality, allows for this:

>>> html_var1 = "<a href=\"https://store.jtsstrength.com\" itemprop=\"url\">Juggernaut <span class=\"red-category\">Store</span></a>"
>>> html_var2 = "<a class=\"nav-opener\" href=\"#\" id=\"showMenu\"><span>menu</span></a>"
>>> bs4.BeautifulSoup(html_var1, 'html.parser').select("a:contains(Juggernaut)")
[<a href="https://store.jtsstrength.com" itemprop="url">Juggernaut <span class="red-category">Store</span></a>]
>>> bs4.BeautifulSoup(html_var2, 'html.parser').select("a:contains(menu)")
[<a class="nav-opener" href="#" id="showMenu"><span>menu</span></a>]

Keep in mind that ":contains()" will search all the descendant text nodes of an element for the containing text, just as stated in the originally proposed CSS spec. See below:

>>> bs4.BeautifulSoup(html_var1, 'html.parser').select("*:contains(Store)")
[<a href="https://store.jtsstrength.com" itemprop="url">Juggernaut <span class="red-category">Store</span></a>, <span class="red-category">Store</span>]

Notice that two elements are returned because both "<a>" and "<span>" contain "Store". So the more specific your selector, the better the results.

Using broad selectors like "*:contains()" can be very expensive, but using something like "p:contains()" would be much less expensive as it would only execute on "<p>" tags.
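A small sketch of the difference between a broad and a narrow selector on the first snippet. It assumes Beautiful Soup 4.7.0 or later with soupsieve installed; newer soupsieve releases spell this pseudo-class ":-soup-contains()" and may emit a deprecation warning for the ":contains()" form used here.

```python
from bs4 import BeautifulSoup

html_var1 = ('<a href="https://store.jtsstrength.com" itemprop="url">'
             'Juggernaut <span class="red-category">Store</span></a>')
soup = BeautifulSoup(html_var1, "html.parser")

# Broad: every element is tested, and both <a> and <span> contain "Store".
broad = soup.select("*:contains(Store)")
print([tag.name for tag in broad])

# Narrow: only <span> elements are tested, so only the <span> is returned.
narrow = soup.select("span:contains(Store)")
print([tag.name for tag in narrow])
```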

Now, Beautiful Soup may or may not still want to change "find_all"'s behavior in light of this information, but I thought I would point this out for others viewing this issue in the future. "select" and "select_one" may be able to provide the desired functionality.
