Change html5 formatter to escape only ambiguous ampersands

Bug #1902431 reported by Fake Name
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Beautiful Soup
Confirmed
Undecided
Unassigned

Bug Description

```
In [1]: import bs4

In [2]: bs4.__version__
Out[2]: '4.9.3'

In [3]: st = bs4.BeautifulSoup('<a href="https://test.example.com/?param_1=5&param_2=2&param_3=1">Link text</a>', "lxml")

In [4]: st
Out[4]: <html><body><a href="https://test.example.com/?param_1=5&amp;param_2=2&amp;param_3=1">Link text</a></body></html>

```

It's impossible to create a `<a>` tag where the `href` parameter has a query string with multiple parameters without bs4 escaping the ampersands in the query string, therefore breaking the link.

I think in general, automatically trying to escape any tag attributes is a deeply problematic behaviour. Escaping tag *contents* makes sense, but the attributes don't follow the same ruleset.

Revision history for this message
Fake Name (lemuix-2) wrote :

As a horrible hack, going into element.py and special-casing for `href` attributes makes things work, through it's probably super buggy:

```
        for key, val in attributes:
            if val is None:
                decoded = key
            else:
                if isinstance(val, list) or isinstance(val, tuple):
                    val = ' '.join(val)
                elif not isinstance(val, str):
                    val = str(val)
                elif (
                        isinstance(val, AttributeValueWithCharsetSubstitution)
                        and eventual_encoding is not None
                ):
                    val = val.encode(eventual_encoding)

                text = formatter.attribute_value(val)
                decoded = (
                    str(key) + '='
                    + formatter.quoted_attribute_value(text))
            attrs.append(decoded)
        close = ''
        closeTag = ''
```

to

```
        for key, val in attributes:
            if val is None:
                decoded = key
            else:
                if isinstance(val, list) or isinstance(val, tuple):
                    val = ' '.join(val)
                elif not isinstance(val, str):
                    val = str(val)
                elif (
                        isinstance(val, AttributeValueWithCharsetSubstitution)
                        and eventual_encoding is not None
                ):
                    val = val.encode(eventual_encoding)
                if key == 'href':
                    text = str(val)
                else:
                    text = formatter.attribute_value(val)
                decoded = (
                    str(key) + '='
                    + formatter.quoted_attribute_value(text))
            attrs.append(decoded)
        close = ''
        closeTag = ''

```

The core of the issue appears to be that the "minimal" EntitySubstitution() instance still replaces "&".

I have no idea what's "correct" from a spec perspective here, but I can say that it seems BS4 is unable to generate the valid HTML output I need here, and I don't *think* having multi-parameter query strings in a anchor tag is invalid in any variant of HTML.

Revision history for this message
Leonard Richardson (leonardr) wrote :

Thanks for taking the time to file this issue.

The behavior you're talking about -- which entities are escaped -- is under the control of the output formatter. (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters)

A simple way to stop ampersands from being escaped is to disable the formatter altogether:

---
from bs4 import BeautifulSoup
soup = BeautifulSoup('<a href="https://test.example.com/?param_1=5&param_2=2&param_3=1">Link text</a>', 'lxml')

print(soup.a.encode(formatter='html'))
# <a href="https://test.example.com/?param_1=5&amp;param_2=2&amp;param_3=1">Link text</a>

print(soup.a.encode(formatter=None))
# <a href="https://test.example.com/?param_1=5&param_2=2&param_3=1">Link text</a>
---

Unfortunately, this will stop ampersands (and other HTML entities) from being escaped _everywhere_ in the document, increasing the risk of an invalid document.

The default formatter does what it does because your example URL contains HTML entities (ampersands) which need to be entity-encoded when they're embedded in an HTML file. So I wouldn't say the link you're talking about is broken. If you render the output of Beautiful Soup in a web client and activate the link, the client will visit the correct URL. Similarly if you parse the encoded output of your script into a second BeautifulSoup object and look at the href:

---
from bs4 import BeautifulSoup
soup1 = BeautifulSoup('<a href="https://test.example.com/?param_1=5&param_2=2&param_3=1">Link text</a>', 'lxml')

output = soup1.encode()
soup2 = BeautifulSoup(output)

print(soup2.a['href'])
# https://test.example.com/?param_1=5&param_2=2&param_3=1
---

The ampersands are only escaped when the URL is part of an HTML file.

The validity of unencoded ampersands in HTML is a complex topic that has gotten more complex since I originally wrote the output formatting code. The good news for you is that the HTML5 spec does allow for unescaped ampersands so long as they are not ambiguous:

https://html.spec.whatwg.org/#attributes-2
https://html.spec.whatwg.org/#syntax-ambiguous-ampersand
https://mathiasbynens.be/notes/ambiguous-ampersands

Beautiful Soup includes an 'html5' outputter which can be modified -- I don't know how at the moment -- to escape only ambiguous ampersands. This will give you the feature you want: ampersands like the ones in your 'href' attribute can go into an HTML5 file unescaped.

I'm changing the summary of this issue to reflect the new feature to be added.

summary: - Query strings in link href attributes are being spuriously escaped
+ Change html5 formatter to escape only ambiguous ampersands
Changed in beautifulsoup:
status: New → Confirmed
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.