Make it possible to customize the data structure used to store multi-valued attributes
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Beautiful Soup | Fix Committed | Undecided | Unassigned | | |
Bug Description
This is an enhancement request.
It would be nice to have convenience methods to add/remove keywords from the "class" attribute.
For example,
====
elt.add_
elt.add_
elt.remove_
elt.remove_
====
The methods should accept string or list-of-string values. Existing keywords should not be duplicated. remove_class() should delete the "class" attribute if it becomes empty.
I could propose a merge request for this, if you're open to it.
These are the helper functions we currently use:
====
def add_class(tag, these_classes):
if isinstance(
for this_class in these_classes:
if tag.get('class', None) is None:
if not this_class.
return tag
def remove_class(tag, these_classes):
if isinstance(
for this_class in these_classes:
if tag.get('class', None):
if not tag['class']:
del tag['class']
return tag
====
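The helper functions above are cut off in this report, so here is a minimal sketch of the behavior the request describes, not a recovery of the original code. It assumes only that `tag` supports mapping-style access (`get`, item assignment, `del`), which holds for a plain `dict` as well as for a Beautiful Soup tag. The `_as_list` helper is a hypothetical name introduced for the illustration, and treating a bare string as one keyword is an assumption.

```python
def _as_list(these_classes):
    # Hypothetical helper: accept one string or an iterable of strings.
    # Assumption: a bare string is a single keyword, not a space-separated list.
    if isinstance(these_classes, str):
        return [these_classes]
    return list(these_classes)


def _current_classes(tag):
    # Normalize whatever is stored under "class" into a fresh list.
    current = tag.get('class', None)
    if current is None:
        return []
    if isinstance(current, str):
        return current.split()
    return list(current)


def add_class(tag, these_classes):
    """Add each keyword to tag['class'] unless it is already present."""
    current = _current_classes(tag)
    for this_class in _as_list(these_classes):
        if this_class and this_class not in current:
            current.append(this_class)
    if current:
        tag['class'] = current
    return tag


def remove_class(tag, these_classes):
    """Remove each keyword and delete "class" once nothing is left."""
    if tag.get('class', None) is None:
        return tag
    doomed = set(_as_list(these_classes))
    remaining = [c for c in _current_classes(tag) if c not in doomed]
    if remaining:
        tag['class'] = remaining
    else:
        del tag['class']
    return tag


# Usage with a plain dict standing in for a tag's attributes.
attrs = {'id': 'intro'}
add_class(attrs, 'note')
add_class(attrs, ['note', 'warning'])   # 'note' is not duplicated
remove_class(attrs, 'note')
remove_class(attrs, ['warning'])        # "class" is now removed entirely
```

Because the sketch never assumes more than mapping-style access, the same two functions could be exercised against a dictionary in tests before being pointed at real tags.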
We also have a helper function to test if a keyword exists in "class":
====
def has_class(tag, this_class):
return bool(tag.
====
but this would be provided by self-testing from #2052936, if that comes to be:
====
if elt.matches(class_ = ...):
# ...
====
This request is asking to create a distinction between the "class" attribute and other attributes that I don't think is appropriate. I can think of a better way to offer functionality like this, but I don't know how useful it'd be.
The main time Beautiful Soup treats the "class" attribute as special is when working around the fact that "class" is also a Python reserved word. The rest of the time, "class" is treated as part of a family of attributes that Beautiful Soup calls multi-valued attributes. The HTML spec calls these "CDATA list attributes," with "class" being the most common.
There are lots of CDATA list attributes: "accesskey" and "dropzone" work the same way as "class", and for certain tags, attributes like "rel" or "headers" may also work like that. For XML documents, you can set up the CDATA list attributes however you want. For this reason I'm *very* reluctant to add methods with "class" in the name.
(A dictionary listing HTML's CDATA list attributes is here; this configures Beautiful Soup's default behavior when given an HTML document: https://git.launchpad.net/beautifulsoup/tree/bs4/builder/__init__.py?h=4.13#n522)
The crucial moment in Beautiful Soup where we treat CDATA list attributes differently from regular attributes is here: https://git.launchpad.net/beautifulsoup/tree/bs4/builder/__init__.py?h=4.13#n394
The attribute value comes out of there as either a string (for normal attributes) or a list of strings (for CDATA list attributes).
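To make the string-versus-list distinction concrete, here is a simplified sketch of that kind of lookup. It is not Beautiful Soup's actual code: the table below is a cut-down stand-in for the real dictionary, and `convert_attribute_value` is a hypothetical name. It assumes a `"*"` key for attributes that are multi-valued on every tag, plus per-tag entries.

```python
# Cut-down stand-in for the table of CDATA list attributes.
# "*" applies to every tag; any other key applies to that tag name only.
CDATA_LIST_ATTRIBUTES = {
    "*": {"class", "accesskey", "dropzone"},
    "a": {"rel"},
    "td": {"headers"},
    "th": {"headers"},
}


def convert_attribute_value(tag_name, attr_name, value,
                            table=CDATA_LIST_ATTRIBUTES):
    """Return a list of strings for a CDATA list attribute, else the string."""
    universal = table.get("*", set())
    tag_specific = table.get(tag_name.lower(), set())
    if attr_name in universal or attr_name in tag_specific:
        # Multi-valued: split on whitespace into separate values.
        return value.split()
    # Ordinary attribute: leave the string untouched.
    return value


class_value = convert_attribute_value("p", "class", "note warning")
id_value = convert_attribute_value("p", "id", "note warning")
rel_on_a = convert_attribute_value("a", "rel", "nofollow noopener")
rel_on_p = convert_attribute_value("p", "rel", "nofollow noopener")
```

Here `class_value` and `rel_on_a` are lists, while `id_value` and `rel_on_p` stay plain strings, which is why any helper that works on the value has to cope with both shapes.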
I think the best way to do what you want would be to define a subclass of `list` or `set`, add helper methods to that class, and have Beautiful Soup instantiate that class to hold the values of a CDATA list attribute. That way the functionality isn't specific to the "class" attribute; it'd be additional complexity available to the attribute value itself, if the attribute value happened to be of this kind.
Here's my summary of the functionality you want on this class:
1. Check for duplicates on insertion, like set does.
2. Treat insertion of a list as insertion of every item in the list (like list.extend does)
3. Preserve the original value order, like list does.
4. Empty values are treated as the absence of a value.
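Points 1 through 3 can be sketched as a small `list` subclass. This is only an illustration of the idea under discussion; `UniqueValueList` is a hypothetical name, not something Beautiful Soup provides, and point 4 is deliberately left out.

```python
class UniqueValueList(list):
    """Hypothetical list subclass: keeps insertion order, skips duplicates."""

    def __init__(self, values=()):
        super().__init__()
        self.extend(values)

    def append(self, value):
        if isinstance(value, str):
            # Point 1: ignore a value that is already present.
            if value not in self:
                super().append(value)
        else:
            # Point 2: appending a list inserts each of its items.
            self.extend(value)

    def extend(self, values):
        # Point 3: items are added in the order given.
        for value in values:
            self.append(value)


values = UniqueValueList(["note", "warning", "note"])
values.append("hint")
values.append(["warning", "draft"])
```

Only `append()` and `extend()` are overridden here, so `insert()`, `+=`, and item assignment would still let duplicates through; closing those gaps is part of what makes the interface drift away from what `list` offers.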
#3 is the current behavior and I'd want it preserved. I don't think there's a method you could add to this class that would give you #4 in a backwards compatible way (currently an empty list for a CDATA list attribute becomes a ). You could do #2 but unless you're okay with using both append() and extend(), your interface would deviate from what `list` offers.
As for #1, you could definitely do it, but again backwards compatibility is the issue. I've seen some really weird stuff and I'm almost positive somebody's workflow depends on sticking the same CSS class into a tag multiple times.
So I'm not wild about the "subclass of `list`" idea either--there are too many backwards compatibility pitfalls and deviations from how Python's built-in data structures work.
However, there has been a trend in recent years where I enable advanced use cases by allowing users to customize which classes Beautiful Soup instantiates in different circumstances. See for example the "element_cl...