html.parser confuses URL variable names for unterminated character references

Bug #2016391 reported by Leonard Richardson
Affects         Status   Importance  Assigned to  Milestone
Beautiful Soup  Invalid  Undecided   Unassigned

Bug Description

This is not a bug (at least, not a bug in Beautiful Soup). I'm posting this publicly for later reference.

Here's some HTMLParser code that exhibits unusual behavior:

---
from html.parser import HTMLParser

markup = '<a href="https://somedomain.com/someresource.php?m=111&noti=999">'

class MyParser(HTMLParser):
    def handle_starttag(self, name, attrs):
        # Print each start tag along with its decoded attribute values.
        print(f"{name} {attrs} START")

MyParser().feed(markup)
# a [('href', 'https://somedomain.com/someresource.php?m=111¬i=999')] START
---

Specifically, "&noti" in the URL is transformed to "¬i". The "¬" character is the logical NOT symbol, which HTML designates with the character reference &not;.

The underlying cause is that the HTML markup is ambiguous. The right way of writing that href attribute in an HTML link is to escape the ampersand:

https://somedomain.com/someresource.php?m=111&amp;noti=999

And if you really wanted to write "¬i", the proper way is to terminate the character reference:

https://somedomain.com/someresource.php?m=111&not;i=999
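
As a rough illustration of the two unambiguous spellings above, here is a minimal sketch using html.escape and html.unescape from the Python standard library. The variable names are my own, and it only demonstrates the escaping round trip; it says nothing about Beautiful Soup itself.

```python
from html import escape, unescape

# The URL as it exists outside of HTML (same example URL as above).
raw_url = "https://somedomain.com/someresource.php?m=111&noti=999"

# Escaping for use in an attribute value turns "&" into "&amp;".
escaped_url = escape(raw_url, quote=True)
print(escaped_url)
# https://somedomain.com/someresource.php?m=111&amp;noti=999

# Decoding the escaped form gives back the original URL, "&noti" intact.
print(unescape(escaped_url) == raw_url)
# True

# A properly terminated reference decodes to the NOT sign followed by "i".
print(unescape("m=111&not;i=999"))
# m=111¬i=999
```

If you generate the markup yourself, escaping attribute values this way before writing them out avoids the ambiguity entirely.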

But unescaped, unterminated character references show up so frequently in HTML that Python's html.parser library chooses to interpret "&noti=999" as "¬i=999". It's an unusual example of the phenomenon described in "Differences between parsers" in the Beautiful Soup documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers).
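
The lenient decoding can be seen in isolation with html.unescape from the standard library, which also accepts certain legacy references without a trailing semicolon. This is only a sketch of the rule, on the assumption that it mirrors what happened to the attribute value above; the exact behavior of html.parser itself may differ between Python versions. The "page" parameter is a made-up name for contrast.

```python
from html import unescape

# "not" is a legacy entity name that is recognized even without ";",
# so the text after "&" decodes to the NOT sign followed by "i=999".
print(unescape("m=111&noti=999"))
# m=111¬i=999

# A parameter name that does not begin with a known entity name
# (hypothetical "page" parameter) is left untouched.
print(unescape("m=111&page=2"))
# m=111&page=2
```

So whether a query string survives depends on whether one of its parameter names happens to start with an entity name such as "not".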

If you encounter this situation, my recommendation is to use the lxml parser, which does not interpret "&noti" as "¬i":

---
from bs4 import BeautifulSoup
markup = '<a href="https://somedomain.com/someresource.php?m=111&noti=999">'
soup = BeautifulSoup(markup, 'lxml')
a = soup.find("a")
print(a["href"])
# output: https://somedomain.com/someresource.php?m=111&noti=999
---

Changed in beautifulsoup:
status: New → Invalid
description: updated