Beautiful Soup

finding scripts of type="application/ld+json"

Bug #1875715 reported by Jason Essebag on 2020-04-28

This bug affects 1 person

Affects         Status   Importance  Assigned to  Milestone
Beautiful Soup  Invalid  Undecided   Unassigned

Bug Description

In version 4.8.2, the following code worked to find scripts of type="application/ld+json" and parse their contents. For example:

data = json.loads(soup.find("script", type="application/ld+json").text)

In version 4.9.0, this soup.find returns None.

Leonard Richardson (leonardr) wrote on 2020-04-28:

Thanks for taking the time to file this issue.

This is expected; this change in behavior is the main new feature in the 4.9.0 feature release. Tags like <script> and <style> don't contain human-readable text, and including their content in the output of get_text() (for which .text is an alias) made that functionality less useful for most users.

The original discussion that led to that change is here:
https://bugs.launchpad.net/beautifulsoup/+bug/1868861

The behavior of get_text() may change again in the future, again with the goal of helping people extract human-readable text from documents:
https://bugs.launchpad.net/beautifulsoup/+bug/1768330

The simplest way to fix your regression is to use .string or .encode_contents() instead of .text. The .text attribute makes judgements about what parts of the page are 'text'. .string and .encode_contents() are better when you just need to get the string or markup inside a tag.

You can also explicitly tell get_text() to consider scripts and stylesheets as "text", by passing a list of NavigableString subclasses into get_text as `types`:

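from bs4 import BeautifulSoup
from bs4.element import (NavigableString, CData, Script, Stylesheet)
soup = BeautifulSoup("<script>This is text</script>", "html.parser")
# By default, script contents are not treated as text.
soup.get_text()
# u''
# Passing `types` makes get_text() include Script and Stylesheet strings.
soup.get_text(types=(NavigableString, CData, Script, Stylesheet))
# u'This is text'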

But, again, using .string or .encode_contents() seems more reliable, because you sidestep the judgement calls as to what is considered "text".
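As a minimal sketch of the .string approach applied to the original use case (parsing JSON-LD), something like the following should work. The markup here is a hypothetical example, not the reporter's actual page, and the sketch assumes the script tag contains exactly one string.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; not taken from the reporter's page.
markup = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Person", "name": "Example"}'
    '</script></head></html>'
)

soup = BeautifulSoup(markup, "html.parser")
tag = soup.find("script", type="application/ld+json")

# .string returns the tag's single child string, independent of what
# get_text() decides to count as human-readable text.
data = json.loads(tag.string)
print(data["name"])
```

Because .string does not depend on get_text(), this should not be affected by the 4.9.0 change described above.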

Changed in beautifulsoup:
status:	New → Invalid

Jason Essebag (jason-essebag) wrote on 2020-04-29:

Could you please clarify why the following does not work:

soup.find("script", type="application/ld+json").string

or this:

soup.find("script", type="application/ld+json").encode_contents()

instead of the old version's way:
soup.find("script", type="application/ld+json").text

To be clear, soup.find("script", type="application/ld+json") returned a result, with or without .text in the older version.

Leonard Richardson (leonardr) wrote on 2020-04-29:

Sure, I'll just need to see the markup you used to build 'soup'.

Here's some simple markup and a script to test it out:

from bs4 import BeautifulSoup, __version__
print(__version__)
markup = '<html><script type="application/ld+json">some ld+json</script>'

soup = BeautifulSoup(markup, 'html.parser')
result = soup.find("script", type="application/ld+json")
print(type(result))
print(repr(result.string))
print(repr(result.encode_contents()))
print(repr(result.text))

Here's the output when I run the script:

4.9.0
<class 'bs4.element.Tag'>
'some ld+json'
b'some ld+json'
''
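Building on the test script above, here is one possible sketch for collecting every JSON-LD block in a document with find_all() and .string. The helper name extract_json_ld and the sample markup are hypothetical, introduced only for illustration. Skipping empty or invalid blocks is an assumption about what a caller would want, not something Beautiful Soup prescribes.

```python
import json
from bs4 import BeautifulSoup


def extract_json_ld(markup):
    """Return the parsed contents of every JSON-LD <script> in markup.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for tag in soup.find_all("script", type="application/ld+json"):
        raw = tag.string
        if raw is None:
            # .string is None when the tag is empty or has several children.
            continue
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError:
            # Skip blocks that are not valid JSON rather than failing.
            continue
    return results


# Hypothetical sample: one valid block, one invalid, one empty, one unrelated.
sample = (
    '<script type="application/ld+json">{"name": "first"}</script>'
    '<script type="application/ld+json">not json</script>'
    '<script type="application/ld+json"></script>'
    '<script type="text/javascript">var x = 1;</script>'
)
print(extract_json_ld(sample))
```

Checking for None before calling json.loads() guards against empty script tags, which would otherwise raise a TypeError.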

Jason Essebag (jason-essebag) wrote on 2020-04-29:

Thanks, Leonard.
