Tag attributes may contain values not valid in HTML but easily converted
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Beautiful Soup | Fix Committed | Undecided | Unassigned | |
Bug Description
This is borderline user error, but I thought I'd report it if the fix was easy.
In some code, I defined a Tag attribute value by creating an "attrs" dictionary value. However, I set it to an integer value:
====
for i, e in enumerate(
e.attrs[
====
However, when I wrote unit tests for the code, they were unexpectedly failing. The culprit was that Tag equality does not consider these two to be equivalent:
====
tag.attrs[
tag.attrs[
====
Here is a testcase:
====
import bs4
html1 = "<body>
html2 = "<body><div data-idx=
body1 = bs4.BeautifulSo
body2 = bs4.BeautifulSo
body1.find(
print("String equality:", str(body1) == str(body2))
print("Object equality:", body1 == body2)
====
If there's a simple fix to stick a str() somewhere in the Tag equality code, great! If not, no worries.
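The mismatch can be reproduced without a parser, assuming (as the report suggests) that Tag equality compares the attribute dictionaries directly. Below is a minimal sketch that uses plain dicts as stand-ins for `tag.attrs`; `stringify_attrs` is a hypothetical workaround helper, not part of Beautiful Soup.

```python
# Plain dicts standing in for tag.attrs; no Beautiful Soup objects involved.
attrs_set_in_code = {"data-idx": 0}    # integer assigned by the program
attrs_from_markup = {"data-idx": "0"}  # same attribute as parsed from markup


def stringify_attrs(attrs):
    """Hypothetical workaround: copy attrs with numeric/boolean values as str."""
    # bool is a subclass of int, so True/False are converted as well.
    return {
        key: str(value) if isinstance(value, (int, float)) else value
        for key, value in attrs.items()
    }


print(attrs_set_in_code == attrs_from_markup)                   # False
print(stringify_attrs(attrs_set_in_code) == attrs_from_markup)  # True
```

Until a fix lands, assigning `str(i)` instead of `i` when setting the attribute avoids the problem at the source.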
summary: - Tag object equality is confused by integer tag attribute values
+ Tag attributes may contain values not valid in HTML but easily converted
Changed in beautifulsoup:
status: New → In Progress
Some relevant quotes from the HTML spec:
https://html.spec.whatwg.org/#attributes-2
Attribute values are a mixture of text and character references...
https://html.spec.whatwg.org/#boolean-attributes
If the [boolean] attribute is present, its value must either be the empty string or a value that is an
ASCII case-insensitive match for the attribute's canonical name, with no leading or trailing whitespace.
The values "true" and "false" are not allowed on boolean attributes. To represent a false value, the
attribute has to be omitted altogether.
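The quoted rule for boolean attributes can be sketched as a small check. This is an illustration of the spec text only; `is_valid_boolean_attribute_value` is a hypothetical name and nothing here is Beautiful Soup API.

```python
import string

# Maps A-Z to a-z only, so the comparison is ASCII case-insensitive
# (str.lower() would also fold some non-ASCII characters).
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def is_valid_boolean_attribute_value(name, value):
    """Hypothetical check of the quoted rule for a boolean attribute that is present."""
    if not isinstance(value, str):
        return False  # attribute values are text, so True or 1 do not qualify
    if value == "":
        return True
    # No stripping: leading or trailing whitespace makes the value invalid.
    return value.translate(_ASCII_LOWER) == name.translate(_ASCII_LOWER)


print(is_valid_boolean_attribute_value("disabled", ""))          # True
print(is_valid_boolean_attribute_value("disabled", "DISABLED"))  # True
print(is_valid_boolean_attribute_value("disabled", "true"))      # False
print(is_valid_boolean_attribute_value("disabled", True))        # False
```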
So numerics and booleans are not allowed in attribute values at all. With respect to booleans, this is inconsistent with the way Beautiful Soup works now:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<a>")
soup.a['v'] = True
soup.a
# <a v="True"></a>
I think converting the values to strings on the way in would solve the larger problem, and as a side effect your equality example would start working.
I'm hesitant to convert _all_ attribute values to strings on the way in, since someone might be using a custom value that contains more information than a simple string (in fact, Beautiful Soup does this internally for the encoding in a <meta> tag). But handling only numerics and booleans should be a small enough break in backwards compatibility that it could go into 4.13.
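A rough sketch of the narrower conversion described above, assuming it is applied at the point where an attribute value is assigned. The names `normalize_attribute_value` and `CharsetLikeValue` are hypothetical, and this is not the code that was committed; in particular, whether `True` should become `"True"` or something closer to the boolean-attribute rule is left open here.

```python
def normalize_attribute_value(value):
    """Hypothetical: convert numerics and booleans to str, leave the rest untouched."""
    # bool is a subclass of int; it is listed explicitly for readability.
    if isinstance(value, (bool, int, float)):
        return str(value)
    return value


class CharsetLikeValue(str):
    """Stand-in for a custom str subclass that carries extra information."""


custom = CharsetLikeValue("utf-8")

print(repr(normalize_attribute_value(0)))         # '0'
print(repr(normalize_attribute_value(True)))      # 'True'
print(normalize_attribute_value(custom) is custom)  # True
```

Because strings and other objects pass through unchanged, custom values keep their type, while an integer assigned in code would compare equal to the same attribute parsed from markup.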