SVG inlined in HTML confuses beautifulsoup

Bug #1873640 reported by paul@hammant.org
Affects: Beautiful Soup
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

https://cv-masks.github.io/ragmask-max.html has some 20 SVG images inlined. All browsers support that.

I'm trying to extract my strings for translations. With respect to this source:

    <tspan font-family="Helvetica Neue" font-size="8" font-weight="400" fill="black" x="1.112"
        y="8">Hand wash in very hot soapy
    </tspan>

Beautiful Soup extracts that text as:

    Hand wash in very hot soapy\n', '

Yup, a trailing newline (perhaps correctly?), and a ', ' sequence that's not in the source at all.

Leonard Richardson (leonardr) wrote :

Can you provide the Python code you're using to extract this text?

Here's my best guess at a recreation:

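    import requests
    from bs4 import BeautifulSoup

    # Fetch the page and parse it with Python's built-in HTML parser.
    markup = requests.get("https://cv-masks.github.io/ragmask-max.html").content
    soup = BeautifulSoup(markup, 'html.parser')
    # Find the SVG by its width attribute and list every string inside it.
    svg = soup.find('svg', width="162.7954")
    [x for x in svg.strings]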

The result is:

    ['\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', ' Produced by OmniGraffle 7.15\n ', '2020-04-15 11:40:49 +0000', '\n', '\n', '\n', 'Canvas 3', '\n', '\n', 'Layer 1', '\n', '\n', '\n', '\n', '\n', '22\n ', '\n', '\n', '\n', '\n', '\n', 'Hand wash in very hot soapy\n ', '\n', 'water and dry before use\n ', '\n', '\n', '\n', '\n', '\n']

This is the closest I could get to your observed output.
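
Most entries in the result list above are whitespace-only text nodes that sit between tags. A minimal sketch of filtering them out with plain Python follows; the `strings` list is a shortened copy of the output above, used here as sample data. Beautiful Soup's `stripped_strings` generator does similar filtering directly on a tag.

```python
# Shortened copy of the strings shown above, used as sample data.
strings = [
    '\n', ' Produced by OmniGraffle 7.15\n ', '2020-04-15 11:40:49 +0000',
    '\n', 'Canvas 3', '\n', 'Layer 1', '\n', '22\n ', '\n',
    'Hand wash in very hot soapy\n ', '\n', 'water and dry before use\n ', '\n',
]

# Strip surrounding whitespace and drop entries that were whitespace only.
cleaned = [s.strip() for s in strings if s.strip()]

for s in cleaned:
    print(repr(s))
```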

paul@hammant.org (i-paul-h) wrote :

A little less and a little more than you need, sorry:

    data = myfile.readlines()
    soup = BeautifulSoup(str(contents_of_that_file), 'html.parser')
    text = soup.find_all(text=True)
    blacklist = [
        '[document]',
        'noscript',
        'header',
        'html',
        'meta',
        'head',
        'input',
        'script',
        'dc:date',
        'title',
        # there may be more elements you don't want, such as "style", etc.
    ]
    for t in text:
        if t.parent.name not in blacklist:
            if "Hand wash" in t.strip():
                print(">>" + t.strip() + "<<")
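    
One possible explanation for the stray `\n', '` sequence, offered as an assumption that nothing in this thread confirms: if `contents_of_that_file` is the list returned by `readlines()`, then `str()` of that list is the list's repr, in which every line break becomes a literal backslash-n followed by `', '`. The sketch below uses only the standard library and a hypothetical two-line stand-in for the saved page.

```python
import io

# Hypothetical stand-in for the saved HTML file (not the real page).
myfile = io.StringIO(
    '<tspan x="1.112" y="8">Hand wash in very hot soapy\n'
    '</tspan>\n'
)

lines = myfile.readlines()   # a list of str, one entry per line
as_repr = str(lines)         # the repr of the list, not the file's text
as_text = "".join(lines)     # the file's actual text

print(as_repr)
print(as_text)
```

Parsing `as_repr` would place the literal characters `\n', '` inside the tspan text, and `strip()` would not remove them because they are not whitespace. Reading the file with `read()`, or joining the lines as in `as_text`, avoids that.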

Leonard Richardson (leonardr) wrote :

Running that code I get this output, which is what I'd expect:

    >>Hand wash in very hot soapy<<

I get this result using Python 2 and Python 3, using an old version of Beautiful Soup (4.6.1) and the latest version (4.9.0). The result is the same whether I use requests to retrieve https://cv-masks.github.io/ragmask-max.html from within the Python script, or whether I use my web browser to save it to a file, then open the file inside the Python script.

It sounds like you get output that looks like this:

    >>Hand wash in very hot soapy\n', '<<

Is that right?

Can you run this iteration and paste the output?

    for t in text:
        if t.parent.name not in blacklist:
            print(repr(t))
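
Printing `repr(t)` rather than `t` helps here because it shows exactly which characters a string holds. A minimal sketch with two made-up strings (both are illustrative assumptions, not output from the page): one ends in a real newline, the other in a literal backslash followed by the letter `n`.

```python
# Two hypothetical strings that sound alike when described in prose.
real_newline = "Hand wash in very hot soapy\n"     # ends with one newline character
literal_chars = "Hand wash in very hot soapy\\n"   # ends with a backslash and an 'n'

# strip() removes whitespace such as the newline, but not a backslash or a letter.
stripped_real = real_newline.strip()
stripped_literal = literal_chars.strip()

print(repr(stripped_real))
print(repr(stripped_literal))
print(len(real_newline), len(literal_chars))
```

If the `repr` output shows a doubled backslash, the backslash is a real character in the string, which points at the input text rather than at the parser.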

Leonard Richardson (leonardr) wrote :

Closing this bug as it's been over a year and I haven't been able to reproduce it.

Changed in beautifulsoup:
status: New → Won't Fix