Beautiful Soup

BS4 doesn't find text in link when content also contains markup

Bug #1492166 reported by Mathieu Clabaut on 2015-09-04

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Beautiful Soup	Won't Fix	Undecided	Unassigned

Bug Description

With beautifulsoup4-4.4.0:

The following snippet:

import re
from bs4 import BeautifulSoup

dom = BeautifulSoup("<a href='/'><i></i>tobefound</a>",
                    "html.parser")
print(dom.find_all("a", text="tobefound"))
print(dom.find_all("a", text=re.compile("tobefound")))
res = dom.find("a")
print(res.contents)
print(res.text)

displays:

[]
[]
[<i></i>, 'tobefound']
tobefound

whereas the first two print calls should display the whole "<a>" markup.

The same test works as expected with BS3:

[u'tobefound']
[u'tobefound']
[<i></i>, u'tobefound']
tobefound

Mathieu Clabaut (mathieu.clabaut) wrote on 2015-09-04:

Attachment: Example of failure (292 bytes, text/x-python)

Leonard Richardson (leonardr) wrote on 2015-09-28:

This is a documented change between BS3 and BS4:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#id18

"If you pass one of the find* methods both string and a tag-specific argument like name, Beautiful Soup will search for tags that match your tag-specific criteria and whose Tag.string matches your value for string. It will not find the strings themselves. Previously, Beautiful Soup ignored the tag-specific arguments and looked for strings."

Changed in beautifulsoup:
status:	New → Won't Fix

Mathieu Clabaut (mathieu.clabaut) wrote on 2015-09-29:

Thanks for the documentation pointer.

I still have some comments on the subject. I'm ready to accept that this is a feature, not a bug, but given that the documentation also says: "Tag.string now operates recursively. If tag A contains a single tag B and nothing else, then A.string is the same as B.string", I would expect BeautifulSoup("<a href='/'><i></i>tobefound</a>", "html.parser").find_all("a", text="tobefound") to at least find the <a> tag. Instead, it finds nothing.

If the observed behaviour is indeed intended, I'd be very thankful if you could show me a way to find this <a> tag with a 'text' criterion in BS4.

Leonard Richardson (leonardr) wrote on 2015-09-29:

BeautifulSoup("<a href='/'><i>tobefound</i></a>", "html.parser").find_all("a", text="tobefound") will find the <a> tag. In this case the <a> tag contains nothing but an <i> tag, which contains nothing but a string. In this case a.string inherits i.string, which is the string you're looking for.

You have <a><i></i>tobefound</a>. In this case the <a> tag has two children--an <i> tag and a string--so .string is undefined.
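
The two cases described above can be checked side by side. This is a minimal sketch, assuming the markup in question is an <a> tag holding an empty <i> tag followed by the string; the variable names are illustrative only. It uses the string= keyword, which the Beautiful Soup documentation describes as the newer name (4.4.0 onward) for the text= argument used in this thread.

```python
from bs4 import BeautifulSoup

# Case 1: <a> contains only an <i> tag, which contains only a string.
nested = BeautifulSoup("<a href='/'><i>tobefound</i></a>", "html.parser")
# Case 2 (assumed markup): <a> has two children, an empty <i> and a string.
mixed = BeautifulSoup("<a href='/'><i></i>tobefound</a>", "html.parser")

nested_string = nested.a.string  # inherited from the single <i> child
mixed_string = mixed.a.string    # None, because <a> has two children

# With both a tag name and string=, the filter compares against Tag.string.
nested_hits = nested.find_all("a", string="tobefound")
mixed_hits = mixed.find_all("a", string="tobefound")

print(repr(nested_string), repr(mixed_string))
print(len(nested_hits), len(mixed_hits))
```

Only the first document yields a match, because only there does the <a> tag have a defined .string to compare against.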

To match the <a> tag I recommend defining a function to perform the exact match you want, and passing that function into find_all() (as per http://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function):

def a_with_tobefound_in_children(tag):
    return tag.name=='a' and "tobefound" in tag.children

BeautifulSoup("<a href='/'><i></i>tobefound</a>", "html.parser").find_all(a_with_tobefound_in_children)
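
A variation on the same idea, offered only as a sketch: rather than looking for the string among the direct children, the matcher below compares the tag's combined text, so it also matches when the string sits deeper inside nested markup. The helper name a_with_text is hypothetical and not part of Beautiful Soup, and the markup again assumes an empty <i> tag before the string.

```python
from bs4 import BeautifulSoup


def a_with_text(wanted):
    """Build a matcher for <a> tags whose combined text equals `wanted`."""
    def matcher(tag):
        # get_text() joins every string found under the tag, so an empty
        # <i></i> sibling does not prevent the comparison from succeeding.
        return tag.name == "a" and tag.get_text(strip=True) == wanted
    return matcher


# Assumed markup: an empty <i> tag followed by the link text.
soup = BeautifulSoup("<a href='/'><i></i>tobefound</a>", "html.parser")
matches = soup.find_all(a_with_text("tobefound"))
print(matches)
```

Whether an exact comparison or a substring test is the better fit depends on how much surrounding text the links are expected to carry.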

Mathieu Clabaut (mathieu.clabaut) wrote on 2015-09-29:

Ok.

I had overlooked the « a single tag and nothing else » part of the documentation. I understand now.
Thank you very much for taking the time to shed some light on how it works!
