Some tags never contain 'text' per the HTML spec

Bug #1868861 reported by Geoffrey Fairchild on 2020-03-24
Affects: Beautiful Soup
Importance: Undecided
Assigned to: Unassigned

Bug Description

First off, for what I'm about to describe, I'm using Python 3.6.9, bs4 4.8.2, and lxml 4.5.0.

I maintain a website that's built around a listserv for physicians around the world to discuss emerging infectious diseases. Users send emails to a designated email address, and those emails are parsed to pull out the content. That content is then thrown into a database for storage, and there's a whole UI around it so that users can browse and search and whatnot.

I use https://github.com/SpamScope/mail-parser to parse the emails. Some emails have both plaintext content and HTML content, and some just have one or the other. I preferentially store the plaintext content, but if only HTML content is present, I use bs4 to parse it and pull out the text (using get_text()), throwing away all HTML.
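A minimal sketch of that fallback logic (illustrative only -- the real pipeline uses mail-parser to produce the two parts, and the report uses the lxml parser; `html.parser` is used here just to keep the sketch dependency-free, and the `extract_body` name is made up):

```python
from bs4 import BeautifulSoup

def extract_body(plaintext, html):
    """Prefer the plaintext part; otherwise strip the HTML part down to text."""
    if plaintext:
        return plaintext.strip()
    return BeautifulSoup(html, 'html.parser').get_text().strip()
```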

I recently encountered an email that was throwing errors. I ended up tracking it down, and it turns out that it was an HTML content-only email, and bs4 isn't properly cleaning up some of the HTML comments. I'm attaching a sanitized version of the HTML content portion of the email, and below is code that you can run to see it for yourself:

from bs4 import BeautifulSoup

with open('email.html') as infile:
    soup = BeautifulSoup(infile.read(), 'lxml')

print(soup.get_text().strip())

What I expect to see is provided in expected_output.txt, and what I actually get is provided in actual_output.txt.

It seems like the get_text() method needs to be a little more aggressive with cleaning HTML comments. Thanks a bunch for your help!

Geoffrey Fairchild (gfairchild) wrote :

Here's my expected output.

Geoffrey Fairchild (gfairchild) wrote :

Here's my actual output.

Leonard Richardson (leonardr) wrote :

Geoffrey, thanks for taking the time to report this issue.

get_text() is a bit of a blunt instrument, but it does know to filter out comments and other 'strings' that are not generally considered part of a document's text.

In the example document you gave, html.parser parses the contents of the <style> tag as an HTML comment, but lxml parses it as a CSS stylesheet that for some reason starts with "<!--". lxml's behavior comes from the HTML 5 spec (https://html.spec.whatwg.org/#the-style-element); html.parser doesn't treat the <style> tag differently from any other tag.

So in one sense this is a problem of a difference between parsers. However the underlying problem is parser-independent. Per the HTML5 spec, nothing inside a <style> tag should be considered "text" -- it's a stylesheet. What the contents of the <style> tag look like is irrelevant. A document that looks like "<style>This is text.</style>" actually contains no text, just a broken stylesheet.

There are a lot of places where get_text() comes up short, and this is another one. get_text() is not very reliable, but it is very popular, so this is always a dilemma for me.

The best solution may be to parse the contents of the <style> tag as a special type of string which is ignored by get_text() the same way a Comment or ProcessingInstruction is. That way you can find special strings of that type and treat them specially if you need to for any reason; it's not just about hiding that string from get_text().

summary: - get_text() misses cleaning some HTML comments
+ Some tags never contain 'text' per the HTML spec
Changed in beautifulsoup:
status: New → Confirmed
Leonard Richardson (leonardr) wrote :

Revision 564 introduces special treatment for <style>, <script>, and <template> tags.
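With that fix in place, a quick sketch of the new behaviour (assuming Beautiful Soup 4.9 or later, where the `Stylesheet` string subclass lives in `bs4.element`):

```python
from bs4 import BeautifulSoup
from bs4.element import Stylesheet

markup = "<style><!-- p { color: red } --></style><p>Real text</p>"
soup = BeautifulSoup(markup, "html.parser")

# The <style> contents are stored as a Stylesheet string, a NavigableString
# subclass that get_text() skips just like Comment or ProcessingInstruction.
print(soup.get_text())                            # Real text
print(isinstance(soup.style.string, Stylesheet))  # True
```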

Changed in beautifulsoup:
status: Confirmed → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released
Geoffrey Fairchild (gfairchild) wrote :

Thanks so much for the detailed explanation and the fix. I really appreciate it!

Fake Name (lemuix-2) wrote :

While disambiguating between rendered text and non-markup text does make sense, this change broke a bunch of my code.

I think there's a coherent argument to be made that "text" in the context of HTML is anything that's not HTML markup, rather than anything that is expected to be visible in a browser.

So basically, there needs to be a way to select the old behaviour. Really, I think it would make more sense to have a parameter to `get_text()` that specifies the type of text and, if not present, defaults to all text (both text for rendering and text for external runtimes).

Something like `get_text(type="rendered|script|style")`.

Fake Name (lemuix-2) wrote :

Ironically, the OP has a valid point about whether the contents of HTML comments should be included in the output of `get_text()`, since HTML comments are an HTML directive rather than just a different type of text.

Fake Name (lemuix-2) wrote :

An alternative would be to provide dedicated `get_script()` and `get_styles()` methods, or something similar.

There seems to be no way to easily get the contents of Script elements. `<tag>.strings` also doesn't return the contained strings (this isn't even text!), and the same goes for `<tag>.stripped_strings`. Does "string" mean "text" now?

Short of manually iterating over a tag's contents and pulling out each child element's `string` (which, strangely, *doesn't* hide the contents of a `Script` tag), there seems to be no nice way to do this.

Incidentally, the fact that the `Script` class, which is a subclass of `NavigableString`, has no `strings` when inside a tag but does have a `string` is extremely confusing. If there's going to be a distinction between text and strings, it should at least be consistent.

Changed in beautifulsoup:
status: Fix Released → New
Leonard Richardson (leonardr) wrote :

I'm reopening this issue to get more information, and will file a new issue once I have a better idea of what to do. I think almost all of the issues you're having can be addressed with more documentation.

The methods that deal with text do give the caller a chance to configure what is considered 'text'. The get_text method takes an argument 'types':

    def get_text(self, separator=u"", strip=False,
                 types=(NavigableString, CData)):
...
        :types: A tuple of NavigableString subclasses. Any strings of
            a subclass not found in this list will be ignored. By
            default, this means only NavigableString and CData objects
            will be considered. So no comments, processing instructions,
            stylesheets, etc.

To make get_text consider scripts to be 'text', you would pass bs4.Script into the types tuple.

The '.strings' property calls the '_all_strings' internal method with default arguments, and '_all_strings' also takes the 'types' argument. (In fact, get_text calls _all_strings and joins the results.) But _all_strings is an internal method, and I can't make .strings stop being a property. This is the piece that can't be fixed with documentation.

With respect to getting the contents of an element, there's a method encode_contents() that I think will do what you want. It's not documented because it's a deprecated method that provides compatibility with Beautiful Soup 3. But if people want to use it I can un-deprecate it and document it. String types aren't an issue with encode_contents(), because it's rendering HTML markup as it would appear in a web page, not trying to apply a human notion (or HTML-spec notion) of what is "text".

So, to put all of that into an example Python script:

---
markup = """<script>some javascript</script>"""
from bs4 import BeautifulSoup, NavigableString, CData, Script
soup = BeautifulSoup(markup, 'lxml')
script = soup.script

print("get_text()")
print(script.get_text())
# (no output -- Script strings are filtered out by default)
print(script.get_text(types=[NavigableString, CData, Script]))
# some javascript

print("")
print(".strings")
print([x for x in soup.script.strings])
# []
print(
    [
        x for x in soup.script._all_strings(
            types=[NavigableString, CData, Script]
        )
    ]
)
# ['some javascript']

print("")
print("encode_contents()")
print(script.encode_contents())
# b'some javascript'
---

Will this solve the problems you bring up?

Changed in beautifulsoup:
status: New → In Progress
Fake Name (lemuix-2) wrote :

The `types` parameter to `get_text()` almost completely solves my issue. I'd still quibble over the ergonomics: in many contexts where I'm working with a BeautifulSoup object I don't have `bs4` imported, so having to import bs4 and remember the types I care about is annoying. I think this could be handled pretty easily by a special case where `types=True` returns all types.
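For what it's worth, a small wrapper gets most of the way there today -- a sketch (the `get_all_text` name is made up; it just feeds the known string subclasses through the existing `types` parameter):

```python
from bs4 import BeautifulSoup
from bs4.element import CData, NavigableString, Script, Stylesheet

def get_all_text(tag, separator="", strip=False):
    # Hypothetical helper approximating the pre-4.9 behaviour: treat script
    # and stylesheet contents as text again (comments stay excluded).
    return tag.get_text(separator=separator, strip=strip,
                        types=(NavigableString, CData, Script, Stylesheet))

soup = BeautifulSoup("<p>hi</p><script>var x;</script>", "html.parser")
print(soup.get_text())     # hi
print(get_all_text(soup))  # hivar x;
```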

My points about the `.strings` accessors were basically a comment that, after `get_text()` broke, I went through every other option for accessing the strings in a tag and none of them were replacements for the old behaviour.

I'd agree that this is basically a documentation issue.

Currently, a *bunch* of functionality is basically undocumented: optional parameters aren't shown in the documentation, and there's no comprehensive API overview that *does* show them. I literally only found out on Stack Overflow that you can pass `strip=True` to `get_text()`!

I understand the desire to keep the main documentation page (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) simple, presumably for newcomers, but *somewhere* should list *all* the parameters each function can possibly take. In the past, I've done this with Sphinx via the `sphinx.ext.autodoc` extension, which generates class/function listings with their parameters (and pulls the documentation out of the docstrings). I see in `doc/source/conf.py` that it's enabled, but it's not seeing the Python sources.

Looking at Sphinx files I put together in the past, there was some horribly hacky boilerplate like

```
import os
import sys
sys.path.append(os.path.abspath('../'))
```

needed in `conf.py` to get sphinx to properly find the module sources. Perhaps that's what's going on here.

Other notes:

 - This was a pretty major breaking change. I kind of assumed that a 4.8.2 -> 4.9.0 bump wouldn't include something like this; it should instead have come with a major version increment (e.g. semver), though I can understand that major version 4 is kind of important (it's `bs4`, after all).
 - Please also enable `'sphinx.ext.viewcode',`. Launchpad is crap for browsing code, so including the code in the docs would be nice!
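For reference, a `conf.py` sketch combining the two suggestions above (the `'../..'` path assumes the docs live in `doc/source/` under the repo root -- adjust to the actual layout):

```python
# doc/source/conf.py (sketch -- paths depend on the actual repo layout)
import os
import sys

# Let autodoc import the bs4 package from the repo root.
sys.path.insert(0, os.path.abspath('../..'))

extensions = [
    'sphinx.ext.autodoc',   # pull API docs out of docstrings
    'sphinx.ext.viewcode',  # link each documented object to its source
]
```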

Fake Name (lemuix-2) wrote :

Oh, also: I still think it'd be better to have this be opt-in rather than opt-out, but that's not something I'm too dead-set on.

Fake Name (lemuix-2) wrote :

OK, here's a walkthrough of poking Sphinx into doing something useful. No PR because bzr is annoying (do you have a git repo somewhere?).

 - From the repo root: `sphinx-apidoc -o doc/source bs4/`. This will generate a stub `modules.rst` and associated files in `doc/source`.
 - Remove `bs4.builder.rst` because I don't think it's relevant here.
 - Somewhere in `index.rst`, insert:

```
Contents:

.. toctree::
   :maxdepth: 3

   modules
```

This will inject an autodoc-generated class listing that is completely comprehensive.

It might be nice to manually manage the toc a bit, but just having it is super handy.

Fake Name (lemuix-2) wrote :

(Oh, and in `conf.py` update `extensions` with `'sphinx.ext.viewcode'`)
