finding scripts of type="application/ld+json"

Bug #1875715 reported by Jason Essebag
This bug affects 1 person
Affects         Status    Importance   Assigned to   Milestone
Beautiful Soup  Invalid   Undecided    Unassigned

Bug Description

In version 4.8.2, the following call worked to find a script of type="application/ld+json" and read its contents:

data = json.loads(soup.find("script", type="application/ld+json").text)

In version 4.9.0 this soup.find returns None.

Leonard Richardson (leonardr) wrote :

Thanks for taking the time to file this issue.

This is expected; this change in behavior is the main new feature in the 4.9.0 feature release. Tags like <script> and <style> don't contain human-readable text, and including their content in the output of get_text() (for which .text is an alias) made that functionality less useful for most users.

The original discussion that led to that change is here:
https://bugs.launchpad.net/beautifulsoup/+bug/1868861

The behavior of get_text() may change again in the future, again with the goal of helping people extract human-readable text from documents:
https://bugs.launchpad.net/beautifulsoup/+bug/1768330

The simplest way to fix your regression is to use .string or .encode_contents() instead of .text. The .text attribute makes judgements about what parts of the page are 'text'. .string and .encode_contents() are better when you just need to get the string or markup inside a tag.

You can also explicitly tell get_text() to consider scripts and stylesheets as "text", by passing a list of NavigableString subclasses into get_text as `types`:

from bs4 import BeautifulSoup
from bs4.element import (NavigableString, CData, Script, Stylesheet)
soup = BeautifulSoup("<script>This is text</script>", "html.parser")
soup.get_text()
# u''
soup.get_text(types=(NavigableString, CData, Script, Stylesheet))
# u'This is text'

But, again, using .string or .encode_contents seems more reliable, because you sidestep the judgement calls as to what is considered "text".
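Applied to the json.loads() call from the original report, the .string and .encode_contents() suggestions might look like the sketch below. The markup is made up for illustration and is not the reporter's page; it assumes the script tag holds exactly one string of valid JSON.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example (not from the report).
markup = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Person", "name": "Example"}'
    '</script></head></html>'
)

soup = BeautifulSoup(markup, "html.parser")
tag = soup.find("script", type="application/ld+json")

# .string returns the tag's single child string without deciding
# whether it counts as human-readable "text".
data = json.loads(tag.string)

# .encode_contents() returns the inner markup as bytes (UTF-8 by default);
# json.loads() accepts bytes on Python 3.6 and later.
data_from_bytes = json.loads(tag.encode_contents())

print(data["name"])
```

Both calls read the tag's contents directly, so they should not be affected by changes to what get_text() treats as text.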

Changed in beautifulsoup:
status: New → Invalid
Jason Essebag (jason-essebag) wrote :

Could you please clarify why the following does not work.

soup.find("script", type="application/ld+json").string
or this
soup.find("script", type="application/ld+json").encode_contents()

instead of the old version's way:
soup.find("script", type="application/ld+json").text

To be clear, soup.find("script", type="application/ld+json") returned a result, with or without .text in the older version.

Leonard Richardson (leonardr) wrote :

Sure, I'll just need to see the markup you used to build 'soup'.

Here's some simple markup and a script to test it out:

from bs4 import BeautifulSoup, __version__
print(__version__)
markup = '<html><script type="application/ld+json">some ld+json</script>'

soup = BeautifulSoup(markup, 'html.parser')
result = soup.find("script", type="application/ld+json")
print(type(result))
print(repr(result.string))
print(repr(result.encode_contents()))
print(repr(result.text))

Here's the output when I run the script:

4.9.0
<class 'bs4.element.Tag'>
'some ld+json'
b'some ld+json'
''
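Since the original report mentions finding all scripts of this type, here is a rough sketch of the same approach with find_all(). The helper name load_ld_json and the sample markup are hypothetical, not part of Beautiful Soup or the reporter's code. It skips tags whose .string is None, which is what .string gives for an empty tag or a tag with more than one child.

```python
import json
from bs4 import BeautifulSoup


def load_ld_json(markup):
    """Hypothetical helper: parse every ld+json script block in markup."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for tag in soup.find_all("script", type="application/ld+json"):
        # Skip tags with no single string child (for example, empty tags).
        if tag.string is None:
            continue
        results.append(json.loads(tag.string))
    return results


# Hypothetical markup, invented for this example.
markup = (
    '<script type="application/ld+json">{"name": "first"}</script>'
    '<script type="application/ld+json"></script>'
    '<script type="text/javascript">var x = 1;</script>'
    '<script type="application/ld+json">{"name": "second"}</script>'
)

print(load_ld_json(markup))
```

Checking for None explicitly also helps tell "find() returned nothing" apart from "the tag was found but its text came back empty", which is the distinction this thread turned on.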

Jason Essebag (jason-essebag) wrote :

Thanks, Leonard.
