Comment 2 for bug 1902431

Leonard Richardson (leonardr) wrote : Re: Query strings in link href attributes are being spuriously escaped

Thanks for taking the time to file this issue.

The behavior you're talking about -- which characters are escaped as entities on output -- is under the control of the output formatter: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters

A simple way to stop ampersands from being escaped is to disable the formatter altogether:

---
from bs4 import BeautifulSoup
soup = BeautifulSoup('<a href="https://test.example.com/?param_1=5&param_2=2&param_3=1">Link text</a>', 'lxml')

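# decode() returns a str, so the printed output has no b'...' wrapper.
print(soup.a.decode(formatter='html'))
# <a href="https://test.example.com/?param_1=5&amp;param_2=2&amp;param_3=1">Link text</a>

# formatter=None turns off entity substitution entirely.
print(soup.a.decode(formatter=None))
# <a href="https://test.example.com/?param_1=5&param_2=2&param_3=1">Link text</a>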
---

Unfortunately, this will stop ampersands (and other special characters, such as angle brackets) from being escaped _everywhere_ in the document, increasing the risk of producing an invalid document.

The default formatter does what it does because your example URL contains ampersands, which are special characters in HTML and need to be entity-encoded when they're embedded in an HTML file. So I wouldn't say the link you're talking about is broken. If you render the output of Beautiful Soup in a web client and activate the link, the client will visit the correct URL. You get the same result if you parse the encoded output of your script into a second BeautifulSoup object and look at the href:

---
from bs4 import BeautifulSoup
soup1 = BeautifulSoup('<a href="https://test.example.com/?param_1=5&param_2=2&param_3=1">Link text</a>', 'lxml')

output = soup1.encode()
soup2 = BeautifulSoup(output, 'lxml')

print(soup2.a['href'])
# https://test.example.com/?param_1=5&param_2=2&param_3=1
---

The ampersands are only escaped when the URL is part of an HTML file.
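
The same round trip can be sketched with only the Python standard library, to show that this is ordinary HTML escaping rather than something specific to Beautiful Soup. This is a minimal illustration; the variable names are made up for the example.

```python
from html import escape, unescape

url = 'https://test.example.com/?param_1=5&param_2=2&param_3=1'

# Escaping is what happens when the URL is written into an HTML file.
embedded = escape(url, quote=True)
print(embedded)

# An HTML-aware consumer reverses the escaping and recovers the original URL.
print(unescape(embedded) == url)
```

The escaped form exists only inside the markup. Anything that reads the attribute through an HTML parser sees the original URL.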

The validity of unencoded ampersands in HTML is a complex topic that has gotten more complex since I originally wrote the output formatting code. The good news for you is that the HTML5 spec does allow for unescaped ampersands so long as they are not ambiguous:

https://html.spec.whatwg.org/#attributes-2
https://html.spec.whatwg.org/#syntax-ambiguous-ampersand
https://mathiasbynens.be/notes/ambiguous-ampersands

Beautiful Soup includes an 'html5' formatter which can be modified -- I don't know how at the moment -- to escape only ambiguous ampersands. That would give you the feature you want: ampersands like the ones in your 'href' attribute could go into an HTML5 file unescaped.
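
As a rough sketch of the escaping rule only (not of how the 'html5' formatter would actually be changed), the function below escapes an ampersand only when it could be read as the start of a character reference: `&#...` or `&name;`. The name `escape_risky_ampersands` is hypothetical, and the rule is deliberately simplified. It ignores legacy references without a trailing semicolon (such as `&copy`), which parsers still decode in text content, so treat it as an illustration under those assumptions.

```python
import re

# Match an ampersand only when it is followed by "#" (numeric reference)
# or by ASCII alphanumerics and a semicolon (named or ambiguous reference).
# Simplifying assumption: legacy semicolon-less names are not detected.
_RISKY_AMPERSAND = re.compile(r'&(?=#|[A-Za-z0-9]+;)')

def escape_risky_ampersands(text):
    """Hypothetical helper: escape only ampersands that look like references."""
    text = _RISKY_AMPERSAND.sub('&amp;', text)
    # Angle brackets are still escaped unconditionally.
    return text.replace('<', '&lt;').replace('>', '&gt;')

url = 'https://test.example.com/?param_1=5&param_2=2&param_3=1'
print(escape_risky_ampersands(url))
print(escape_risky_ampersands('AT&T &copy; &#169; &bogus;'))
```

The output-formatters documentation linked above describes passing a function as the `formatter` argument, so something like `soup.a.decode(formatter=escape_risky_ampersands)` may be one way to try this, depending on the Beautiful Soup version.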

I'm changing the summary of this issue to reflect the new feature to be added.