Some tags never contain 'text' per the HTML spec

Bug #1868861 reported by Geoffrey Fairchild on 2020-03-24
Affects: Beautiful Soup
Importance: Undecided
Assigned to: Unassigned

Bug Description

First off, for what I'm about to describe, I'm using Python 3.6.9, bs4 4.8.2, and lxml 4.5.0.

I maintain a website that's built around a listserv for physicians around the world to discuss emerging infectious diseases. Users send emails to a designated email address, and those emails are parsed to pull out the content. That content is then thrown into a database for storage, and there's a whole UI around it so that users can browse and search and whatnot.

I use https://github.com/SpamScope/mail-parser to parse the emails. Some emails have both plaintext content and HTML content, and some just have one or the other. I preferentially store the plaintext content, but if only HTML content is present, I use bs4 to parse it and pull out the text (using get_text()), throwing away all HTML.
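A minimal sketch of that fallback logic (illustrative only -- the real pipeline uses mail-parser to produce the two parts, and the report uses the lxml parser; `html.parser` is used here just to keep the sketch dependency-free, and the `extract_body` name is made up):

```python
from bs4 import BeautifulSoup

def extract_body(plaintext, html):
    """Prefer the plaintext part; otherwise strip the HTML part down to text."""
    if plaintext:
        return plaintext.strip()
    return BeautifulSoup(html, 'html.parser').get_text().strip()
```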

I recently encountered an email that was throwing errors. I ended up tracking it down, and it turns out that it was an HTML content-only email, and bs4 isn't properly cleaning up some of the HTML comments. I'm attaching a sanitized version of the HTML content portion of the email, and below is code that you can run to see it for yourself:

from bs4 import BeautifulSoup

with open('email.html') as infile:
    soup = BeautifulSoup(infile.read(), 'lxml')

print(soup.get_text().strip())

What I expect to see is provided in expected_output.txt, and what I actually get is provided in actual_output.txt.

It seems like the get_text() method needs to be a little more aggressive with cleaning HTML comments. Thanks a bunch for your help!

Geoffrey Fairchild (gfairchild) wrote :

Here's my expected output.

Geoffrey Fairchild (gfairchild) wrote :

Here's my actual output.

Leonard Richardson (leonardr) wrote :

Geoffrey, thanks for taking the time to report this issue.

get_text() is a bit of a blunt instrument, but it does know to filter out comments and other 'strings' that are not generally considered part of a document's text.

In the example document you gave, html.parser parses the contents of the <style> tag as an HTML comment, but lxml parses it as a CSS stylesheet that for some reason starts with "<!--". lxml's behavior comes from the HTML 5 spec (https://html.spec.whatwg.org/#the-style-element); html.parser doesn't treat the <style> tag differently from any other tag.

So in one sense this is a problem of a difference between parsers. However the underlying problem is parser-independent. Per the HTML5 spec, nothing inside a <style> tag should be considered "text" -- it's a stylesheet. What the contents of the <style> tag look like is irrelevant. A document that looks like "<style>This is text.</style>" actually contains no text, just a broken stylesheet.

There are a lot of places where get_text() comes up short, and this is another one. get_text() is not very reliable, but it is very popular, so this is always a dilemma for me.

The best solution may be to parse the contents of the <style> tag as a special type of string which is ignored by get_text() the same way a Comment or ProcessingInstruction is. That way you can find special strings of that type and treat them specially if you need to for any reason; it's not just about hiding that string from get_text().

summary: - get_text() misses cleaning some HTML comments
+ Some tags never contain 'text' per the HTML spec
Changed in beautifulsoup:
status: New → Confirmed
Leonard Richardson (leonardr) wrote :

Revision 564 introduces special treatment for <style>, <script>, and <template> tags.
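With that fix in place, a quick sketch of the new behaviour (assuming Beautiful Soup 4.9 or later, where the `Stylesheet` string subclass lives in `bs4.element`):

```python
from bs4 import BeautifulSoup
from bs4.element import Stylesheet

markup = "<style><!-- p { color: red } --></style><p>Real text</p>"
soup = BeautifulSoup(markup, "html.parser")

# The <style> contents are stored as a Stylesheet string, a NavigableString
# subclass that get_text() skips just like Comment or ProcessingInstruction.
print(soup.get_text())                            # Real text
print(isinstance(soup.style.string, Stylesheet))  # True
```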

Changed in beautifulsoup:
status: Confirmed → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released
Geoffrey Fairchild (gfairchild) wrote :

Thanks so much for the detailed explanation and the fix. I really appreciate it!

Fake Name (lemuix-2) wrote :

While disambiguating between rendered text and non-markup text does make sense, this change broke a bunch of my code.

I think there's a coherent argument to be made that "text" in the context of HTML is anything that's not HTML markup, rather than anything that is expected to be visible in a browser.

So basically, there needs to be a way to select the old behaviour. Really, I think it would make more sense to have a parameter to `get_text()` that specifies the type of text and, if not present, defaults to all text (both text for rendering and text for external runtimes).

Something like `get_text(type="rendered|script|style")`.

Fake Name (lemuix-2) wrote :

Ironically, the OP has a valid point about whether the contents of HTML comments should be included in the output of `get_text()`, since HTML comments are an HTML directive rather than just a different type of text.

Fake Name (lemuix-2) wrote :

An alternative would be to provide dedicated `get_script()` and `get_styles()` methods, or something similar.

There seems to be no way to easily get the contents of Script elements. `<tag>.strings` also doesn't return the contained strings (this isn't even text!), and the same goes for `<tag>.stripped_strings`. Does "string" mean "text" now?

Short of manually iterating over a tag's contents and pulling out each child element's `string` (which, strangely, *doesn't* hide the contents of a `Script` tag), there seems to be no nice way to do this.

Incidentally, the fact that the `Script` class, which is a subclass of `NavigableString`, has no `strings` when inside a tag but does have a `string` is extremely confusing. If there's going to be a distinction between text and strings, it should at least be consistent.

Changed in beautifulsoup:
status: Fix Released → New
Leonard Richardson (leonardr) wrote :

I'm reopening this issue to get more information, and will file a new issue once I have a better idea of what to do. I think almost all of the issues you're having can be addressed with more documentation.

The methods that deal with text do give the caller a chance to configure what is considered 'text'. The get_text method takes an argument 'types':

    def get_text(self, separator=u"", strip=False,
                 types=(NavigableString, CData)):
...
        :types: A tuple of NavigableString subclasses. Any strings of
            a subclass not found in this list will be ignored. By
            default, this means only NavigableString and CData objects
            will be considered. So no comments, processing instructions,
            stylesheets, etc.

To make get_text consider scripts to be 'text', you would pass bs4.Script into the types tuple.

The '.strings' property calls the '_all_strings' internal method with default arguments, and '_all_strings' also takes the 'types' argument. (In fact, get_text calls _all_strings and joins the results.) But _all_strings is an internal method, and I can't make .strings stop being a property. This is the piece that can't be fixed with documentation.

With respect to getting the contents of an element, there's a method encode_contents() that I think will do what you want. It's not documented because it's a deprecated method that provides compatibility with Beautiful Soup 3. But if people want to use it I can un-deprecate it and document it. String types aren't an issue with encode_contents(), because it's rendering HTML markup as it would appear in a web page, not trying to apply a human notion (or HTML-spec notion) of what is "text".

So, to put all of that into an example Python script:

---
markup = """<script>some javascript</script>"""
from bs4 import BeautifulSoup, NavigableString, CData, Script
soup = BeautifulSoup(markup, 'lxml')
script = soup.script

print("get_text()")
print(script.get_text())
# (no output -- Script strings are filtered out by default)
print(script.get_text(types=[NavigableString, CData, Script]))
# some javascript

print("")
print(".strings")
print([x for x in soup.script.strings])
# []
print(
    [
        x for x in soup.script._all_strings(
            types=[NavigableString, CData, Script]
        )
    ]
)
# ['some javascript']

print("")
print("encode_contents()")
print(script.encode_contents())
# b'some javascript'
---

Will this solve the problems you bring up?

Changed in beautifulsoup:
status: New → In Progress
Fake Name (lemuix-2) wrote :

The `types` parameter to `get_text()` almost completely solves my issue. I'd still quibble over the ergonomics: in many contexts where I'm working with a BeautifulSoup object I don't have `bs4` imported, so having to import bs4 and remember the types I care about is annoying. I think this could be handled pretty easily by a special case where `types=True` returns all types.
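For what it's worth, a small wrapper gets most of the way there today -- a sketch (the `get_all_text` name is made up; it just feeds the known string subclasses through the existing `types` parameter):

```python
from bs4 import BeautifulSoup
from bs4.element import CData, NavigableString, Script, Stylesheet

def get_all_text(tag, separator="", strip=False):
    # Hypothetical helper approximating the pre-4.9 behaviour: treat script
    # and stylesheet contents as text again (comments stay excluded).
    return tag.get_text(separator=separator, strip=strip,
                        types=(NavigableString, CData, Script, Stylesheet))

soup = BeautifulSoup("<p>hi</p><script>var x;</script>", "html.parser")
print(soup.get_text())     # hi
print(get_all_text(soup))  # hivar x;
```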

My points about the `.strings` accessors were basically a comment that, after `get_text()` broke, I went through every other option for accessing the strings in a tag and none of them were replacements for the old behaviour.

I'd agree that this is basically a documentation issue.

Currently, a *bunch* of functionality is basically undocumented: optional parameters aren't shown in the documentation, and there's no comprehensive API overview that *does* show them. I literally only found out on Stack Overflow that you can pass `strip=True` to `get_text()`!

I understand the desire to keep the main documentation page (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) simple, presumably for newcomers, but *somewhere* should list *all* the parameters each function can possibly take. In the past, I've done this with Sphinx via the `sphinx.ext.autodoc` extension, which generates class/function listings with their parameters (and pulls the documentation out of the docstrings). I see in `doc/source/conf.py` that it's enabled, but it's not seeing the Python sources.

Looking at Sphinx files I put together in the past, there was some horribly hacky boilerplate like

```
import os
import sys
sys.path.append(os.path.abspath('../'))
```

needed in `conf.py` to get sphinx to properly find the module sources. Perhaps that's what's going on here.

Other notes:

 - This was a pretty major breaking change. I kind of assumed that a 4.8.2 -> 4.9.0 bump wouldn't include something like this; it should instead have come with a major version increment (e.g. semver), though I can understand that major version 4 is kind of important (it's `bs4`, after all).
 - Please also enable `'sphinx.ext.viewcode',`. Launchpad is crap for browsing code, so including the code in the docs would be nice!
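For reference, a `conf.py` sketch combining the two suggestions above (the `'../..'` path assumes the docs live in `doc/source/` under the repo root -- adjust to the actual layout):

```python
# doc/source/conf.py (sketch -- paths depend on the actual repo layout)
import os
import sys

# Let autodoc import the bs4 package from the repo root.
sys.path.insert(0, os.path.abspath('../..'))

extensions = [
    'sphinx.ext.autodoc',   # pull API docs out of docstrings
    'sphinx.ext.viewcode',  # link each documented object to its source
]
```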

Fake Name (lemuix-2) wrote :

Oh, also: I still think it'd be better to have this be opt-in rather than opt-out, but that's not something I'm too dead-set on.

Fake Name (lemuix-2) wrote :

OK, here's a walkthrough of poking Sphinx into doing something useful. No PR because bzr is annoying (do you have a git repo somewhere?).

 - From the repo root: `sphinx-apidoc -o doc/source bs4/`. This will generate a stub `modules.rst` and associated files in `doc/source`.
 - Remove `bs4.builder.rst` because I don't think it's relevant here.
 - Somewhere in `index.rst`, insert:

```
Contents:

.. toctree::
   :maxdepth: 3

   modules
```

This will inject an autodoc-generated class listing that is completely comprehensive.

It might be nice to manually manage the toc a bit, but just having it is super handy.

Fake Name (lemuix-2) wrote :

(Oh, and in `conf.py` update `extensions` with `'sphinx.ext.viewcode'`)
