2023-04-15 18:18:27 |
Leonard Richardson |
description |
This is not a bug (at least, not a bug in Beautiful Soup). I'm posting this publicly for later reference.
Here's some HTMLParser code that exhibits unusual behavior:
---
from html.parser import HTMLParser
markup = '<a href="https://somedomain.com/someresource.php?m=111&noti=999">'
class MyParser(HTMLParser):
    def handle_starttag(self, name, attrs):
        print(f"{name} {attrs} START")
MyParser().feed(markup)
# a [('href', 'https://somedomain.com/someresource.php?m=111¬i=999')] START
---
Specifically, "&noti" in the URL is transformed to "¬i". The "¬" character is the logical symbol for NOT, written as &not; in HTML.
The underlying cause is that the HTML markup is ambiguous. The right way of writing that href attribute in an HTML link is to escape the ampersand:
https://somedomain.com/someresource.php?m=111&amp;noti=999
And if you really wanted to write "¬i", the proper way is to terminate the character reference:
https://somedomain.com/someresource.php?m=111&not;i=999
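Both unambiguous spellings can be checked with the standard library's html module. This is a minimal sketch, assuming the link is being generated from Python; the variable names are only for illustration.

```python
import html

# The raw URL, as it would be typed into a browser's address bar.
raw_url = "https://somedomain.com/someresource.php?m=111&noti=999"

# html.escape() turns "&" into "&amp;", the unambiguous form for markup.
href = html.escape(raw_url)
print(href)

# Decoding the escaped form gives back the original URL.
print(html.unescape(href) == raw_url)

# A terminated reference is also unambiguous, but it means the NOT sign.
print(html.unescape("m=111&not;i=999"))
```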
But unescaped, unterminated character references show up so frequently in HTML that Python's html.parser library chooses to interpret "&noti=999" as "¬i=999". It's an unusual example of the phenomenon described in "Differences between parsers" in the Beautiful Soup documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers).
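The lenient reading can be reproduced without a parser. This is a minimal sketch using html.unescape, which accepts a handful of legacy references such as "&not" without the trailing semicolon. It illustrates the decoding rule only; how HTMLParser applies it to attribute values may differ between Python versions.

```python
import html

# No semicolon after "&not", and the ampersand is not escaped as "&amp;".
ambiguous = "https://somedomain.com/someresource.php?m=111&noti=999"

decoded = html.unescape(ambiguous)
print(decoded)

# True when "&not" was read as the NOT sign and the ampersand was consumed.
print("\u00ac" in decoded and "&" not in decoded)
```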
If you encounter this situation, my recommendation is to use the lxml parser, which does not interpret "&noti" as "¬i":
---
from bs4 import BeautifulSoup
markup = '<a href="https://somedomain.com/someresource.php?m=111&noti=999">'
soup = BeautifulSoup(markup, 'lxml')
a = soup.find("a")
print(a["href"])
# output: "https://somedomain.com/someresource.php?m=111&noti=999"
--- |
|