Beautiful Soup

SVG inlined in HTML confuses beautifulsoup

Bug #1873640 reported by paul@hammant.org on 2020-04-19

This bug affects 1 person

Affects         Status     Importance  Assigned to  Milestone
Beautiful Soup  Won't Fix  Undecided   Unassigned

Bug Description

https://cv-masks.github.io/ragmask-max.html has some 20 SVG images inlined. All browsers support that.

I'm trying to extract my strings for translations. With respect to this source:

    <tspan font-family="Helvetica Neue" font-size="8" font-weight="400" fill="black" x="1.112"
        y="8">Hand wash in very hot soapy
    </tspan>

Beautifulsoup thinks that is:

Hand wash in very hot soapy\n', '

Yes, there is a trailing newline (perhaps correctly?), followed by a ', ' sequence that's not in the source at all.

Revision history for this message

Leonard Richardson (leonardr) wrote on 2020-04-19:

Can you provide the Python code you're using to extract this text?

Here's my best guess at a recreation:

    import requests
    from bs4 import BeautifulSoup
    markup = requests.get("https://cv-masks.github.io/ragmask-max.html").content
    soup = BeautifulSoup(markup, 'html.parser')
    svg = soup.find('svg', width="162.7954")
    [x for x in svg.strings]

The result is:

['\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', ' Produced by OmniGraffle 7.15\n ', '2020-04-15 11:40:49 +0000', '\n', '\n', '\n', 'Canvas 3', '\n', '\n', 'Layer 1', '\n', '\n', '\n', '\n', '\n', '22\n ', '\n', '\n', '\n', '\n', '\n', 'Hand wash in very hot soapy\n ', '\n', 'water and dry before use\n ', '\n', '\n', '\n', '\n', '\n']

This is the closest I could get to your observed output.
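
For readers without network access, here is a minimal offline sketch of the same check. The markup is a simplified stand-in for one inlined SVG, not the real page, so the surrounding elements and whitespace are assumptions. It compares `.strings` with `.stripped_strings`, which drops whitespace-only text nodes.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one inlined SVG (an assumption, not the real page markup).
markup = """<html><body>
<svg width="162.7954">
  <text>
    <tspan font-family="Helvetica Neue" font-size="8" x="1.112"
        y="8">Hand wash in very hot soapy
    </tspan>
  </text>
</svg>
</body></html>"""

soup = BeautifulSoup(markup, "html.parser")
svg = soup.find("svg", width="162.7954")

# Every text node under the <svg>, including whitespace-only ones.
raw_strings = list(svg.strings)
# The same text nodes with whitespace stripped and empty ones dropped.
clean_strings = list(svg.stripped_strings)

print(raw_strings)
print(clean_strings)
```

In this sketch none of the text nodes contains a `', '` sequence, which matches the output shown above.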


paul@hammant.org (i-paul-h) wrote on 2020-04-19:

A little less and a little more than you need, sorry:

    data=myfile.readlines()
    soup = BeautifulSoup(str(contents_of_that_file), 'html.parser')
    text = soup.find_all(text=True)
    blacklist = [
        '[document]',
        'noscript',
        'header',
        'html',
        'meta',
        'head',
        'input',
        'script',
        'dc:date',
        'title',
        # there may be more elements you don't want, such as "style", etc.
    ]
    for t in text:
        if t.parent.name not in blacklist:
            if "Hand wash" in t.strip():
                print(">>" + t.strip() + "<<")


Leonard Richardson (leonardr) wrote on 2020-04-19:

Running that code I get this output, which is what I'd expect:

>>Hand wash in very hot soapy<<

I get this result using Python 2 and Python 3, using an old version of Beautiful Soup (4.6.1) and the latest version (4.9.0). The result is the same whether I use requests to retrieve https://cv-masks.github.io/ragmask-max.html from within the Python script, or whether I use my web browser to save it to a file, then open the file inside the Python script.

It sounds like you get output that looks like this:

>>Hand wash in very hot soapy\n', '<<

Is that right?

Can you run this iteration and paste the output?

    for t in text:
        if t.parent.name not in blacklist:
            print(repr(t))
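
One possible explanation, which is a hypothesis and is not confirmed anywhere in this thread: the reporter's snippet calls `readlines()` and later wraps the parser input in `str(...)`. If the object passed to `str()` were the list returned by `readlines()`, the parser would receive the list's repr rather than the file contents. That repr contains a literal backslash-n followed by `', '` between lines. The sketch below uses only the standard library and a shortened, made-up two-line markup sample; `splitlines(keepends=True)` stands in for `readlines()`.

```python
# Shortened stand-in for two adjacent lines of the saved file (an assumption).
markup = (
    '<tspan x="1.112" y="8">Hand wash in very hot soapy\n'
    '</tspan>\n'
)

# readlines() returns a list of lines; splitlines(keepends=True) mimics that here.
lines = markup.splitlines(keepends=True)

# str() on a list gives the list's repr: brackets, quotes, commas, escaped newlines.
as_list_repr = str(lines)
# Joining the lines gives back the real file contents.
as_text = "".join(lines)

print(as_list_repr)
print(repr(as_text))
```

If this is what happened, the `repr(t)` loop above would show a doubled backslash in the affected text node, and passing `myfile.read()` or `"".join(lines)` to the parser would avoid it.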


Leonard Richardson (leonardr) wrote on 2021-10-25:

Closing this bug as it's been over a year and I haven't been able to reproduce it.

Changed in beautifulsoup:
status:	New → Won't Fix
