SVG inlined in HTML confuses beautifulsoup

Bug #1873640 reported by paul@hammant.org on 2020-04-19
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Beautiful Soup
Undecided
Unassigned

Bug Description

https://cv-masks.github.io/ragmask-max.html has some 20 SVG images inlined. All browsers support that.

I'm trying to extract my strings for translations. With respect to this source:

    <tspan font-family="Helvetica Neue" font-size="8" font-weight="400" fill="black" x="1.112"
        y="8">Hand wash in very hot soapy
    </tspan>

Beautifulsoup thinks that is:

    Hand wash in very hot soapy\n', '

Yup, a trailing newline (perhaps correctly?), and a ', ' sequence that's not in the source at all.

Leonard Richardson (leonardr) wrote :

Can you provide the Python code you're using to extract this text?

Here's my best guess at a recreation:

    import requests
    from bs4 import BeautifulSoup
    markup = requests.get("https://cv-masks.github.io/ragmask-max.html").content
    soup = BeautifulSoup(markup, 'html.parser')
    svg = soup.find('svg', width="162.7954")
    [x for x in svg.strings]

The result is:

['\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', ' Produced by OmniGraffle 7.15\n ', '2020-04-15 11:40:49 +0000', '\n', '\n', '\n', 'Canvas 3', '\n', '\n', 'Layer 1', '\n', '\n', '\n', '\n', '\n', '22\n ', '\n', '\n', '\n', '\n', '\n', 'Hand wash in very hot soapy\n ', '\n', 'water and dry before use\n ', '\n', '\n', '\n', '\n', '\n']

This is the closest I could get to your observed output.
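
Most entries in that list are whitespace-only strings sitting between tags. As a minimal sketch of how those could be filtered out, the snippet below compares `.strings` with `.stripped_strings` on a trimmed stand-in for the page. The markup here is an assumption built from the `<tspan>` quoted in the report (the enclosing `<text>` element is hypothetical), not the real file.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed stand-in for one inlined SVG on the page.
markup = """<svg width="162.7954">
  <text>
    <tspan font-family="Helvetica Neue" font-size="8" x="1.112"
        y="8">Hand wash in very hot soapy
    </tspan>
  </text>
</svg>"""

soup = BeautifulSoup(markup, "html.parser")
svg = soup.find("svg", width="162.7954")

# .strings yields every text node, including whitespace between tags.
raw = list(svg.strings)

# .stripped_strings strips each node and skips the ones that end up empty.
clean = list(svg.stripped_strings)

print(raw)
print(clean)
```

Neither list should contain a `', '` sequence, since the parser only returns text that is present in the markup it was given.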

paul@hammant.org (i-paul-h) wrote :

A little less and a little more than you need, sorry:

    data = myfile.readlines()
    soup = BeautifulSoup(str(contents_of_that_file), 'html.parser')
    text = soup.find_all(text=True)
    blacklist = [
        '[document]',
        'noscript',
        'header',
        'html',
        'meta',
        'head',
        'input',
        'script',
        'dc:date',
        'title',
        # there may be more elements you don't want, such as "style", etc.
    ]
    for t in text:
        if t.parent.name not in blacklist:
            if "Hand wash" in t.strip():
                print(">>" + t.strip() + "<<")

Leonard Richardson (leonardr) wrote :

Running that code I get this output, which is what I'd expect:

>>Hand wash in very hot soapy<<

I get this result using Python 2 and Python 3, using an old version of Beautiful Soup (4.6.1) and the latest version (4.9.0). The result is the same whether I use requests to retrieve https://cv-masks.github.io/ragmask-max.html from within the Python script, or whether I use my web browser to save it to a file, then open the file inside the Python script.

It sounds like you get output that looks like this:

>>Hand wash in very hot soapy\n', '<<

Is that right?

Can you run this iteration and paste the output?

    for t in text:
        if t.parent.name not in blacklist:
            print(repr(t))
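
One possible explanation, not confirmed anywhere in this thread: the reporter's snippet calls `readlines()` and then passes `str(...)` of some value to `BeautifulSoup`. If that value is the list returned by `readlines()`, `str()` produces the printed form of a Python list, in which every newline is spelled out as a backslash followed by `n` and items are separated by `', '`. The sketch below uses only the standard library, with `io.StringIO` standing in for the saved file, to show where such a sequence could come from.

```python
import io

# Stand-in for the saved HTML file; the markup is the <tspan> from the report.
html = (
    '<tspan font-family="Helvetica Neue" font-size="8" x="1.112"\n'
    '    y="8">Hand wash in very hot soapy\n'
    '</tspan>\n'
)

lines = io.StringIO(html).readlines()  # a list of lines, each ending in a newline
as_list_repr = str(lines)              # printed form of the list, not the file text
as_text = "".join(lines)               # the actual file contents

# Backslash, "n", quote, comma, space, quote: the sequence from the report.
artifact = "\\n', '"

print(artifact in as_list_repr)  # True
print(artifact in as_text)       # False
```

If this is what happened, reading the file with `myfile.read()` (or joining the lines as above) before parsing would hand the parser the real markup.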
