Tag attributes may contain values not valid in HTML but easily converted

Bug #2065525 reported by Chris Papademetrious
This bug affects 1 person
Affects: Beautiful Soup
Status: Fix Committed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

This is borderline user error, but I thought I'd report it if the fix was easy.

In some code, I set a Tag attribute by assigning directly to the tag's "attrs" dictionary. However, I assigned an integer rather than a string:

====
for i, e in enumerate(children):
    e.attrs["data-idx"] = i
====

However, when I wrote unit tests for the code, they were unexpectedly failing. The culprit was that Tag equality does not consider these two to be equivalent:

====
tag.attrs["data-idx"] = 0
tag.attrs["data-idx"] = "0"
====

Here is a testcase:

====
import bs4

html1 = "<body><div/></body>"
html2 = "<body><div data-idx='0'/></body>"

body1 = bs4.BeautifulSoup(html1, "lxml").find("body")
body2 = bs4.BeautifulSoup(html2, "lxml").find("body")

body1.find("div").attrs["data-idx"] = 0

print("String equality:", str(body1) == str(body2))
print("Object equality:", body1 == body2)
====

If there's a simple fix to stick a str() somewhere in the Tag equality code, great! If not, no worries.
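Until a fix lands, a workaround on the calling side is to convert numbers to strings before they go into "attrs". The sketch below uses plain dicts as stand-ins for Tag.attrs so that it runs without Beautiful Soup installed. The helper name stringify_attrs is hypothetical, and the sketch assumes that Tag equality compares the attribute dictionaries directly, which is what the failing testcase above suggests.

```python
def stringify_attrs(attrs):
    """Return a copy of an attribute dict with int/float values as strings.

    Hypothetical helper, not part of Beautiful Soup. Booleans are left
    alone here because bool is a subclass of int and needs its own rules.
    """
    return {
        name: str(value)
        if isinstance(value, (int, float)) and not isinstance(value, bool)
        else value
        for name, value in attrs.items()
    }


# Plain dicts standing in for Tag.attrs (an assumption for illustration).
set_in_code = {"data-idx": 0}           # what the enumerate() loop produced
parsed_from_markup = {"data-idx": "0"}  # what a parser produces from markup

print(set_in_code == parsed_from_markup)                   # False
print(stringify_attrs(set_in_code) == parsed_from_markup)  # True
```

In the original loop, the equivalent one-line change is to assign str(i) instead of i.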

Leonard Richardson (leonardr) wrote :

Some relevant quotes from the HTML spec:

https://html.spec.whatwg.org/#attributes-2

 Attribute values are a mixture of text and character references...

https://html.spec.whatwg.org/#boolean-attributes

 If the [boolean] attribute is present, its value must either be the empty string or a value that is an
 ASCII case-insensitive match for the attribute's canonical name, with no leading or trailing whitespace.
 The values "true" and "false" are not allowed on boolean attributes. To represent a false value, the
 attribute has to be omitted altogether.

So numeric and boolean values are not allowed as attribute values at all. For booleans, this is inconsistent with the way Beautiful Soup works now:

====
from bs4 import BeautifulSoup

# Name the parser explicitly to avoid the "no parser specified" warning.
soup = BeautifulSoup("<a>", "html.parser")
soup.a['v'] = True
soup.a
# <a v="True"></a>
====

I think converting the values to strings on the way in would solve the larger problem, and as a side effect your equality example would start working.

I'm hesitant to convert _all_ attribute values to strings on the way in, since someone might be using a custom value that contains more information than a simple string (in fact, Beautiful Soup does this internally for the encoding in a <meta> tag). But handling numerics and booleans should be a small enough change to backwards compatibility that 4.13 could hold it.
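A minimal sketch of that "convert on the way in" idea, following only the spec text quoted above: numbers become strings, a true boolean becomes the empty string, a false boolean means the attribute is omitted, and anything else (strings, custom value objects) is passed through untouched. The names OMIT, normalize_attr_value and normalize_attrs are hypothetical, and this is not the code that was committed to Beautiful Soup.

```python
from numbers import Number

# Hypothetical sentinel meaning "leave this attribute out of the markup".
OMIT = object()


def normalize_attr_value(value):
    """Convert easily converted values to what the quoted HTML rules allow."""
    # bool must be tested before Number, because bool is a subclass of int.
    if isinstance(value, bool):
        # A present boolean attribute may be the empty string;
        # a false one has to be omitted altogether.
        return "" if value else OMIT
    if isinstance(value, Number):
        return str(value)
    # Strings and richer custom value objects pass through untouched.
    return value


def normalize_attrs(attrs):
    """Return a new dict with values normalized and omitted attributes dropped."""
    normalized = {}
    for name, value in attrs.items():
        new_value = normalize_attr_value(value)
        if new_value is not OMIT:
            normalized[name] = new_value
    return normalized


print(normalize_attrs(
    {"data-idx": 0, "hidden": True, "disabled": False, "class": ["a", "b"]}
))
# {'data-idx': '0', 'hidden': '', 'class': ['a', 'b']}
```

Checking for specific types, instead of calling str() on everything, is what keeps custom value objects intact.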

summary: - Tag object equality is confused by integer tag attribute values
+ Tag attributes may contain values not valid in HTML but easily converted
Changed in beautifulsoup:
status: New → In Progress
Leonard Richardson (leonardr) wrote :

My initial implementation was merged in as of revision 6d529cf1147dce8158cc06f94877a711caa4c788. If you set a tag's attribute value to a number, a boolean, or None, the value will be handled according to the restrictions of the HTML or XML spec.

Changed in beautifulsoup:
status: In Progress → Fix Committed