Beautiful Soup

Activity log for bug #2016391

Date	Who	What changed	Old value	New value	Message
2023-04-15 18:17:07	Leonard Richardson	bug			added bug
2023-04-15 18:17:16	Leonard Richardson	beautifulsoup: status	New	Invalid
2023-04-15 18:18:27	Leonard Richardson	description

Old value:

This is not a bug (at least, not a bug in Beautiful Soup). I'm posting this publicly for later reference.

Here's some HTMLParser code that exhibits unusual behavior:

from html.parser import HTMLParser
markup = '<a href="https://somedomain.com/someresource.php?m=111&noti=999">'
class MyParser(HTMLParser):
    def handle_starttag(self, name, attrs):
        print(f"{name} {attrs} START")
MyParser().feed(markup)
# a [('href', 'https://somedomain.com/someresource.php?m=111¬i=999')] START

Specifically, "&noti" in the URL is transformed to "¬i". That's the logical symbol for NOT, designated by &not; in HTML.

The underlying cause is that the HTML markup is ambiguous. The right way of writing that href attribute in HTML is:

https://somedomain.com/someresource.php?m=111&amp;noti=999

But unescaped, unterminated character references show up so frequently in HTML that Python's html.parser library chooses to interpret "&noti=999" as "¬i=999". It's an unusual example of the phenomenon described in "Differences between parsers" in the Beautiful Soup documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers)

If you encounter this situation, my recommendation is to use the lxml parser, which does not interpret "&noti" as "¬i":

---
from bs4 import BeautifulSoup
markup = '<a href="https://somedomain.com/someresource.php?m=111&noti=999">'
soup = BeautifulSoup(markup, 'lxml')
a = soup.find("a")
print(a["href"])
# output: "https://somedomain.com/someresource.php?m=111&noti=999"
---

New value:

This is not a bug (at least, not a bug in Beautiful Soup). I'm posting this publicly for later reference.

Here's some HTMLParser code that exhibits unusual behavior:

from html.parser import HTMLParser
markup = '<a href="https://somedomain.com/someresource.php?m=111&noti=999">'
class MyParser(HTMLParser):
    def handle_starttag(self, name, attrs):
        print(f"{name} {attrs} START")
MyParser().feed(markup)
# a [('href', 'https://somedomain.com/someresource.php?m=111¬i=999')] START

Specifically, "&noti" in the URL is transformed to "¬i". That's the logical symbol for NOT, designated by &not; in HTML.

The underlying cause is that the HTML markup is ambiguous. The right way of writing that href attribute in an HTML link is to escape the ampersand:

https://somedomain.com/someresource.php?m=111&amp;noti=999

And if you really wanted to write "¬i", the proper way is to terminate the character reference:

https://somedomain.com/someresource.php?m=111&not;i=999

But unescaped, unterminated character references show up so frequently in HTML that Python's html.parser library chooses to interpret "&noti=999" as "¬i=999". It's an unusual example of the phenomenon described in "Differences between parsers" in the Beautiful Soup documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers)

If you encounter this situation, my recommendation is to use the lxml parser, which does not interpret "&noti" as "¬i":

---
from bs4 import BeautifulSoup
markup = '<a href="https://somedomain.com/someresource.php?m=111&noti=999">'
soup = BeautifulSoup(markup, 'lxml')
a = soup.find("a")
print(a["href"])
# output: "https://somedomain.com/someresource.php?m=111&noti=999"
---
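
The advice in the description about escaping the ampersand can be tried out with Python's standard html module. This is a minimal sketch rather than anything from the bug report itself: it assumes you control the code that generates the markup, and the variable names are illustrative. It uses html.unescape(), which decodes text the way a lenient HTML consumer would, to show why the unescaped URL is ambiguous and why the escaped form round-trips cleanly.

```python
import html

# The URL from the description, with a raw ampersand before "noti".
url = "https://somedomain.com/someresource.php?m=111&noti=999"

# Escape the ampersand before putting the URL into an attribute value.
escaped = html.escape(url, quote=True)
markup = f'<a href="{escaped}">'

# Decoding the raw URL treats "&not" as an unterminated character reference.
decoded_raw = html.unescape(url)

# Decoding the escaped URL recovers the original string.
decoded_escaped = html.unescape(escaped)

print(markup)
print(decoded_raw)
print(decoded_escaped)
```

With the ampersand written as "&amp;", the decoded value is the original URL, so the result no longer depends on how a given parser treats unterminated character references.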