BS4 doesn't find text in link when content also contains markup

Bug #1492166 reported by Mathieu Clabaut
This bug affects 1 person
Affects: Beautiful Soup
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

With beautifulsoup4-4.4.0, the following snippet:

import re
from bs4 import BeautifulSoup

dom = BeautifulSoup("<a href='/'><i class='icon'></i>tobefound</a>",
                    "html.parser")
print(dom.find_all("a", text="tobefound"))
print(dom.find_all("a", text=re.compile("tobefound")))
res = dom.find("a")
print(res.contents)
print(res.text)

displays:

[]
[]
[<i class="icon"></i>, 'tobefound']
tobefound

whereas the first two print calls should display the whole <a> element.

The same test works as expected with BS3:

[u'tobefound']
[u'tobefound']
[<i class="icon"></i>, u'tobefound']
tobefound

Mathieu Clabaut (mathieu.clabaut) wrote :
Leonard Richardson (leonardr) wrote :

This is a documented change between BS3 and BS4:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#id18

"If you pass one of the find* methods both string and a tag-specific argument like name, Beautiful Soup will search for tags that match your tag-specific criteria and whose Tag.string matches your value for string. It will not find the strings themselves. Previously, Beautiful Soup ignored the tag-specific arguments and looked for strings."
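A minimal sketch of the behaviour described in that passage, run against the markup from the report. It uses `string=`, the newer spelling of the `text=` argument; the variable names are only illustrative. It also shows one possible workaround, assuming it is acceptable to search for the text node first and then step up to its parent:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a href='/'><i class='icon'></i>tobefound</a>", "html.parser")

# String-only search: returns the matching text nodes themselves.
strings = soup.find_all(string="tobefound")

# Tag name plus string: returns <a> tags whose .string equals the value.
# Here <a> has two children, so its .string is None and nothing matches.
tags = soup.find_all("a", string="tobefound")

# Possible workaround: walk from each matching text node up to its parent tag.
parents = [s.parent for s in strings if s.parent.name == "a"]
```

Whether the parent lookup is appropriate depends on how deeply the text may be nested in the real markup; here the string is a direct child of the <a> tag.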

Changed in beautifulsoup:
status: New → Won't Fix
Mathieu Clabaut (mathieu.clabaut) wrote :

Thanks for the documentation pointer.

I still have some comments on the subject. I'm ready to accept that it is a feature, not a bug, but given that the documentation also says "Tag.string now operates recursively. If tag A contains a single tag B and nothing else, then A.string is the same as B.string", I would expect BeautifulSoup("<a href='/'><i class='icon'></i>tobefound</a>", "html.parser").find_all("a", text="tobefound") to at least find the <a> tag. Instead it finds nothing.

If the observed behaviour is indeed intended, I'd be very thankful if you could show me a way to find this <a> tag with a 'text' criterion in BS4.

Leonard Richardson (leonardr) wrote :

BeautifulSoup("<a href='/'><i class='icon'>tobefound</i></a>", "html.parser").find_all("a", text="tobefound") will find the <a> tag. In this case the <a> tag contains nothing but an <i> tag, which contains nothing but a string. In this case a.string inherits i.string, which is the string you're looking for.

You have <a><i class='icon'></i>tobefound</a>. In this case the <a> tag has two children (an <i> tag and a string), so .string is None.
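The difference between the two markups can be checked directly. This is a small sketch using the two snippets from this thread; the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

# <a> contains only an <i>, which contains only a string.
nested = BeautifulSoup("<a href='/'><i class='icon'>tobefound</i></a>", "html.parser").a
# <a> contains an empty <i> followed by a string: two children.
mixed = BeautifulSoup("<a href='/'><i class='icon'></i>tobefound</a>", "html.parser").a

nested_string = nested.string  # inherited from the single <i> child
mixed_string = mixed.string    # None, because <a> has more than one child
mixed_text = mixed.get_text()  # concatenation of all descendant strings
```

Note that get_text() still returns the text in the second case, which is why res.text printed "tobefound" in the original report even though the search found nothing.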

To match the <a> tag I recommend defining a function to perform the exact match you want, and pass that function into find_all() (as per http://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function):

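from bs4 import BeautifulSoup

def a_with_tobefound_in_children(tag):
    # Match <a> tags that have the exact string "tobefound" as a direct child.
    return tag.name == 'a' and "tobefound" in tag.children

BeautifulSoup("<a href='/'><i class='icon'></i>tobefound</a>", "html.parser").find_all(a_with_tobefound_in_children)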
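A variation on the same idea, as a sketch rather than a recommendation from the maintainer: instead of checking direct children, it compares the tag's combined text via get_text(). The helper name `a_with_text` and the second <a> tag in the sample markup are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

def a_with_text(expected):
    """Build a matcher for <a> tags whose combined text equals `expected`."""
    def matcher(tag):
        # get_text() joins every descendant string, however deeply nested.
        return tag.name == "a" and tag.get_text(strip=True) == expected
    return matcher

soup = BeautifulSoup(
    "<a href='/'><i class='icon'></i>tobefound</a><a href='/x'>other</a>",
    "html.parser",
)
matches = soup.find_all(a_with_text("tobefound"))
```

Unlike the direct-children check, this would also match when the text is wrapped in further inline tags, which may or may not be what you want.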

Mathieu Clabaut (mathieu.clabaut) wrote :

Ok.

I had overlooked the "a single tag and nothing else" part of the documentation. I understand now.
Thank you very much for taking the time to shed some light on how it works!
