find and find_all: string matching doesn't work when comments are present.
Bug #1713129 reported by Thomas Proctor
This bug report is a duplicate of:
Bug #1645513: Change find() behavior when searching for both a tag and a string.
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | New | Undecided | Unassigned |
Bug Description
The attached file contains an example of the bug. When `find` or `find_all` is called with a string/text match (the `string` or `text` argument) and the `name` argument is also specified, no matches are returned for an element that contains a comment in the HTML.
My example uses a generic regex (".*", which should match all text), but I believe the bug also shows up with plain text matching.
See Bug example for an example.
Versioning:
bs4: 4.6.0
python: 3.4.3
lxml parser
Thanks for filing this issue. The behavior you're seeing is a side effect of the way .string works. This is close enough to issue 1698990 that I'm going to mark it as a duplicate.
Passing 'tag' and 'string' into a find() method makes it look for a tag whose .string value is that string. If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None. (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string)
The <p> tag that contains both a string and a comment has its .string set to None, so your problem_soup.find('p', string=re.compile(r".*")) doesn't match anything -- there's a string inside the tag, but there's other stuff as well, so .string is undefined and an attempt to match on .string will match nothing.
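A minimal sketch of the behavior described above, with one possible workaround. The markup and the helper `p_with_matching_text` are made up for illustration and are not taken from the attached file; the sketch uses the built-in `html.parser` rather than lxml so it has no extra dependency.

```python
import re
from bs4 import BeautifulSoup, Comment, NavigableString

# Illustrative markup (an assumption, not the reporter's attachment).
html = "<div><p>plain text</p><p>text with comment<!-- note --></p></div>"
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r".*")

plain, commented = soup.find_all("p")

# .string is only defined when a tag has exactly one child.
plain_string = plain.string          # the single text child
commented_string = commented.string  # None: a string plus a comment

# A tag name combined with string= is matched against .string,
# so the <p> that also holds a comment is skipped.
matched = soup.find_all("p", string=pattern)


def p_with_matching_text(tag):
    """Hypothetical helper: match <p> tags on their direct, non-comment text."""
    if tag.name != "p":
        return False
    return any(
        isinstance(child, NavigableString)
        and not isinstance(child, Comment)
        and pattern.search(child)
        for child in tag.contents
    )


# Passing a function to find_all() sidesteps the .string check entirely.
workaround = soup.find_all(p_with_matching_text)
```

Here `matched` contains only the first `<p>`, while `workaround` contains both. The helper inspects direct children only; whether nested text should also count is a design choice that depends on the markup being searched.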
You might be interested in issue 1645513, a proposal to change the behavior of find() when given both a tag and a string. You're not the first person to expect find() to behave differently than it does, but it seems like the solution I propose in https://bugs.launchpad.net/beautifulsoup/+bug/1645513/comments/3 would also not behave the way you expect it to.