finding scripts of type="application/ld+json"
Bug #1875715 reported by Jason Essebag
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Invalid | Undecided | Unassigned |
Bug Description
In version 4.8.2 the following script would work to find all scripts of type="applicati
For example, data = json.loads(
In version 4.9.0 this soup.find returns None.
Thanks for taking the time to file this issue.
This is expected; this change in behavior is the main new feature in the 4.9.0 feature release. Tags like <script> and <style> don't contain human-readable text, and including their content in the output of get_text() (for which .text is an alias) made that functionality less useful for most users.
The original discussion that led to that change is here: https://bugs.launchpad.net/beautifulsoup/+bug/1868861
The behavior of get_text() may change again in the future, again with the goal of helping people extract human-readable text from documents: https://bugs.launchpad.net/beautifulsoup/+bug/1768330
The simplest way to fix your regression is to use .string or .encode_contents() instead of .text. The .text attribute makes judgements about what parts of the page are 'text'. .string and .encode_contents() are better when you just need to get the string or markup inside a tag.
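As a minimal sketch of that suggestion: the snippet below reads a JSON-LD payload through .string and through .encode_contents() rather than .text. The sample markup, the "html.parser" choice, and the variable names are assumptions made for illustration; they are not taken from the original report.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical sample page; the original report's markup is not shown.
html = """
<html><head>
<script type="application/ld+json">{"@type": "Article", "headline": "Hello"}</script>
</head><body><p>Visible text</p></body></html>
"""

soup = BeautifulSoup(html, "html.parser")
script = soup.find("script", type="application/ld+json")

# .string returns the tag's single string child, with no judgement
# about whether that string counts as human-readable text.
data = json.loads(script.string)

# .encode_contents() returns the markup inside the tag as bytes;
# json.loads() accepts UTF-8 bytes directly.
data_from_bytes = json.loads(script.encode_contents())

print(data["headline"])
```

Note that .string is None when a tag is empty or has more than one child, so real pages may need a check before calling json.loads().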
You can also explicitly tell get_text() to consider scripts and stylesheets as "text", by passing a list of NavigableString subclasses into get_text as `types`:
from bs4 import BeautifulSoup
from bs4.element import (NavigableString, CData, Script, Stylesheet)
soup = BeautifulSoup("<script>This is text</script>")
# By default, script contents are not treated as text.
soup.get_text()
# u''
# Passing Script and Stylesheet in `types` opts back in.
soup.get_text(types=(NavigableString, CData, Script, Stylesheet))
# u'This is text'
But, again, using .string or .encode_contents seems more reliable, because you sidestep the judgement calls as to what is considered "text".
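For the use case in the bug title, here is one way the .string approach might be applied across every JSON-LD block on a page. This is only a sketch: extract_json_ld is a hypothetical helper name, and the sample page and the "html.parser" choice are assumptions.

```python
import json

from bs4 import BeautifulSoup


def extract_json_ld(html):
    """Parse every JSON-LD <script> block in html (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    payloads = []
    for script in soup.find_all("script", type="application/ld+json"):
        # .string is None for an empty tag, so skip those.
        if script.string is None:
            continue
        payloads.append(json.loads(script.string))
    return payloads


# Hypothetical sample page with two JSON-LD blocks, one empty JSON-LD
# block, and one ordinary script that should be ignored.
page = """
<script type="application/ld+json">{"@type": "Person", "name": "Ada"}</script>
<script type="text/javascript">var x = 1;</script>
<script type="application/ld+json"></script>
<script type="application/ld+json">{"@type": "Organization", "name": "Example"}</script>
"""

results = extract_json_ld(page)
print([item["@type"] for item in results])
```

A block containing only whitespace or malformed JSON would still raise in json.loads(), so production code may want to catch json.JSONDecodeError.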