html.parser confuses URL variable names for unterminated character references
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Invalid | Undecided | Unassigned |
Bug Description
This is not a bug (at least, not a bug in Beautiful Soup). I'm posting this publicly for later reference.
Here's some HTMLParser code that exhibits unusual behavior:
from html.parser import HTMLParser
markup = '<a href="https:/
class MyParser(
def handle_
MyParser(
# a [('href', 'https:/
Specifically, "&noti" in the URL is transformed to "¬i". The "¬" character is the logical symbol for NOT, designated by &not; in HTML.
The underlying cause is that the HTML markup is ambiguous. The right way of writing that href attribute in an HTML link is to escape the ampersand:
https:/
And if you really wanted to write "¬i", the proper way is to terminate the character reference:
https:/
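As a minimal sketch of the escaping step, the standard library's html.escape can produce the escaped form before the URL is placed in an attribute. The URL below is a hypothetical stand-in (example.com and the "page" parameter are assumptions, not the URL from this report); only the "noti" parameter name mirrors the case described here.

```python
from html import escape, unescape

# Hypothetical URL for illustration; "noti" begins with the entity name "not".
raw_url = "https://example.com/search?page=1&noti=999"

# Escape the ampersand so that "&noti" cannot be read as a character reference.
escaped_url = escape(raw_url, quote=True)
link = f'<a href="{escaped_url}">link</a>'
print(link)

# Decoding the escaped attribute value recovers the original URL.
round_trip = unescape(escaped_url)
print(round_trip)
```

With the ampersand written as &amp;, the markup is no longer ambiguous, so any conforming parser should hand back the original URL.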
But unescaped, unterminated character references show up so frequently in HTML that Python's html.parser library chooses to interpret "&noti=999" as "¬i=999". It's an unusual example of the phenomenon described in "Differences between parsers" in the Beautiful Soup documentation (https:/
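The lenient rule can be seen in isolation with html.unescape from the standard library. This is only an illustration of how unterminated references are decoded in plain text; whether html.parser applies exactly the same rule to attribute values may depend on the Python version, so treat it as a sketch rather than a description of the parser's internals. The "&id=999" string is a made-up contrast case.

```python
from html import unescape

# Unterminated reference: "not" is recognized even without the semicolon.
unterminated = unescape("&noti=999")

# Escaped ampersand: decodes to a literal "&" followed by the parameter name.
escaped = unescape("&amp;noti=999")

# Terminated reference: explicitly asks for the NOT sign, then "i=999".
terminated = unescape("&not;i=999")

# A parameter name that does not begin with an entity name is left alone.
unrelated = unescape("&id=999")

for value in (unterminated, escaped, terminated, unrelated):
    print(value)
```

Only parameter names that happen to begin with one of the legacy entity names (such as "not", "copy", or "reg") are affected, which is why the problem appears so rarely.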
If you encounter this situation, my recommendation is to use the lxml parser, which does not interpret "&noti" as "¬i":
---
from bs4 import BeautifulSoup
markup = '<a href="https:/
soup = BeautifulSoup(
a = soup.find("a")
print(a["href"])
# output: "https:/
---
Changed in beautifulsoup: | |
status: | New → Invalid |
description: | updated |