SVG inlined in HTML confuses beautifulsoup

Bug #1873640 reported by on 2020-04-19
Affects: Beautiful Soup

Bug Description

The page in question has some 20 SVG images inlined. All browsers support that.

I'm trying to extract my strings for translations. With respect to this source:

    <tspan font-family="Helvetica Neue" font-size="8" font-weight="400" fill="black" x="1.112"
        y="8">Hand wash in very hot soapy

Beautifulsoup thinks that is:

    Hand wash in very hot soapy\n', '

Yup, a trailing newline (perhaps correctly?), and a ', ' sequence that's not in the source at all.

Leonard Richardson (leonardr) wrote :

Can you provide the Python code you're using to extract this text?

Here's my best guess at a recreation:

    import requests
    from bs4 import BeautifulSoup
    markup = requests.get("").content
    soup = BeautifulSoup(markup, 'html.parser')
    svg = soup.find('svg', width="162.7954")
    [x for x in svg.strings]

The result is:

['\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', ' Produced by OmniGraffle 7.15\n ', '2020-04-15 11:40:49 +0000', '\n', '\n', '\n', 'Canvas 3', '\n', '\n', 'Layer 1', '\n', '\n', '\n', '\n', '\n', '22\n ', '\n', '\n', '\n', '\n', '\n', 'Hand wash in very hot soapy\n ', '\n', 'water and dry before use\n ', '\n', '\n', '\n', '\n', '\n']

This is the closest I could get to your observed output.

(i-paul-h) wrote :

A little less and a little more than you need, sorry:

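    soup = BeautifulSoup(str(contents_of_that_file), 'html.parser')
    text = soup.find_all(text=True)
    blacklist = [
        # (the tag names listed here are missing from this copy)
        # there may be more elements you don't want, such as "style", etc.
    ]
    for t in text:
        # assumed test: the original expression before "not in" is missing
        if t.parent.name not in blacklist:
            if "Hand wash" in t.strip():
                print(">>" + t.strip() + "<<")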

Leonard Richardson (leonardr) wrote :

Running that code I get this output, which is what I'd expect:

>>Hand wash in very hot soapy<<

I get this result on both Python 2 and Python 3, with an old version of Beautiful Soup (4.6.1) and with the latest version (4.9.0). The result is the same whether I use requests to retrieve the page from within the Python script, or save it to a file with my web browser and then open that file inside the Python script.

It sounds like you get output that looks like this:

>>Hand wash in very hot soapy\n', '<<

Is that right?

Can you run this iteration and paste the output?

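    for t in text:
        # assumed test: the original expression before "not in" is missing,
        # and so is the rest of this loop
        if t.parent.name not in blacklist: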
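
One possible explanation for the stray \n', ' is offered here only as a guess, since the thread never shows how contents_of_that_file was built. If it is a list of lines (for example from readlines()) rather than a single string, str() returns the list's repr, and the quote-comma-quote between list items ends up inside the markup. The sketch below uses Python's standard html.parser module, which the 'html.parser' option in the code above builds on, so it runs without Beautiful Soup. TextCollector, matching_text and the two-line sample are hypothetical names and data, not taken from the report.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every text node, roughly what find_all(text=True) returns."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def matching_text(markup, needle="Hand wash"):
    """Return the stripped text nodes of `markup` that contain `needle`."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return [chunk.strip() for chunk in collector.chunks if needle in chunk]


# Hypothetical stand-in for the saved page: a list of lines, as readlines()
# would return it. The real page is not reproduced here.
lines = [
    '<tspan x="1.112" y="8">Hand wash in very hot soapy\n',
    '</tspan>\n',
]

joined = "".join(lines)  # the actual markup, as one string
as_repr = str(lines)     # the list's repr: brackets, quotes, commas, escaped \n

for label, markup in (("joined", joined), ("str(list)", as_repr)):
    for found in matching_text(markup):
        print(label, ">>" + found + "<<")

# prints:
# joined >>Hand wash in very hot soapy<<
# str(list) >>Hand wash in very hot soapy\n', '<<
```

If this assumption holds, passing "".join(lines), or the result of a single read() call, instead of str(...) would make the extra characters disappear.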
