Change find_all(text='something') to match .text instead .string

Bug #1366856 reported by Martin Häcker
34
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Beautiful Soup
Won't Fix
Undecided
Unassigned

Bug Description

Right now it is very confusing that find_all matches .string instead of .text when given a text argument. This produces the problem that you cannot search for tags that have a specific text inside a sub tag as .string will just return None in that case.

As an added benefit, that change would make BS4 behave more like jQuery which would make it easier to learn.

Revision history for this message
Carey Metcalfe (carey-a-metcalfe-deactivatedaccount) wrote :

+1

The fix is simple from a code point of view. 'found.string' has to be changed to 'found.text' on this line: http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/element.py#L1520 .

Ensuring that it doesn't break compatibility might be an issue since it's in the SoupStrainer which is used for basically everything to do with searching.

Still, it seems intuitive that 'find(text="blah")' should match the text property, not the string, especially since if you needed to match the string, you could just use 'find(string="blah")'.

Revision history for this message
Leonard Richardson (leonardr) wrote :

I'm not comfortable with this change on two levels. First, changing the behavior of text= will break backwards compatibility. If you call it something like all_text= it would be okay, but then you're not making text= less confusing.

Second, .text is very different from .string in that it's a calculated property, not a preset string value. Calculating .text for a tag requires recursively crawling all children of the tag and concatenating all their .string values into one big string. This is fine when you just need to rip all the text out of a tag, but it's a performance disaster when you call it from find(), which _also_ recursively crawls all children of the tag.

To seriously implement this would mean caching the value of .text. This is not a bad idea in and of itself, but then we'd have to recursively invalidate the cache when the tree changes, which would be a huge source of rarely-seen bugs.

Implementing text= this way is also going to give you way too many results: if a tag matches, all its parents also match. The current text argument (which uses .string) is smart enough to do the right thing. Making this new system do the right thing (return the _minimal_ set of tags that match) would be more complicated.

Finally, it's not difficult to do this from within Beautiful Soup if this is what you really want to do:

BeautifulSoup(markup).find_all(lambda x: 'matching text' in x.text)

For all these reasons, I'm rejecting this feature request.

Changed in beautifulsoup:
status: New → Won't Fix
Revision history for this message
pR0Ps (pr0ps) wrote :

Would it then be possible to rename the "text" arg to "string" in all the "find*" functions to align with the matching behavior?

Ex: 'find_all(self, name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs)'

This would obviously be a breaking change, causing codebases that specify 'text="blah"' to use the computed text property, rather than string (although it would be fine if it was used positionally). Possibly for a later release?

This would preserve the current behavior of searching by string, not text, but allow users to use the text property in searches if they explicitly ask to (something that is currently not possible without using a function).

Revision history for this message
Martin Häcker (spamfaenger) wrote :

I am all for renaming the argument from text= to string= to match what is actually implemented. text= could be retained as a forwarding alias for some versions that can be deprecated to slowly free it up to be reintroduced with the behavior that actually matches the .text property.

Is there anything that would prevent this from working?

Revision history for this message
Martin Häcker (spamfaenger) wrote :

Ping?

Revision history for this message
Leonard Richardson (leonardr) wrote :

I'm willing to add string= as an alias for text= and to start using string= in the documentation. Absent a sensible alternative implementation for text= I won't be changing any more than that, and after I do add that implementation it will probably be severala couple years before text= changes its behavior.

Revision history for this message
Martin Häcker (spamfaenger) wrote :

Rejoice! I like it. The deprecation cycle has to start somewhere!

Best regards,
Martin

Changed in beautifulsoup:
status: Won't Fix → Confirmed
Revision history for this message
Leonard Richardson (leonardr) wrote :

Revision 370 introduces string as an alias for text in the find* methods. I'm going to close this issue; if anyone comes up with an efficient way to implement searching the equivalent of get_text(), it should be put into a new issue.

Changed in beautifulsoup:
status: Confirmed → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released
Changed in beautifulsoup:
status: Fix Released → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.