Beautiful Soup

Bug #2052943
Comment #1

Leonard Richardson (leonardr) wrote on 2024-02-13: Re: Provide convenience methods to add/remove class keywords

This request is asking to create a distinction between the "class" attribute and other attributes that I don't think is appropriate. I can think of a better way to offer functionality like this, but I don't know how useful it'd be.

The main time Beautiful Soup treats the "class" attribute as special is when working around the fact that "class" is also a Python reserved word. The rest of the time, "class" is treated as part of a family of attributes that Beautiful Soup calls multi-valued attributes. The HTML spec calls these "CDATA list attributes," with "class" being the most common.
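As a small illustration of the two behaviors described above, the following sketch parses one tag with the built-in "html.parser" builder. The markup and variable names are made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id">text</p>', "html.parser")

# "class" is a multi-valued attribute: its value is split on whitespace into a list.
classes = soup.p["class"]

# "id" is a regular attribute: its value stays a single string, spaces included.
identifier = soup.p["id"]

# "class" is a reserved word in Python, so keyword searches use class_ instead.
matches = soup.find_all("p", class_="strikeout")
```

Here `classes` holds the two class names as separate strings, while `identifier` is the single string "my id".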

There are lots of CDATA list attributes: "accesskey" and "dropzone" work the same way as "class", and for certain tags, attributes like "rel" or "headers" may also work like that. For XML documents, you can set up the CDATA list attributes however you want. For this reason I'm *very* reluctant to add methods with "class" in the name.

A dictionary listing HTML's CDATA list attributes is here; it configures Beautiful Soup's default behavior when given an HTML document: https://git.launchpad.net/beautifulsoup/tree/bs4/builder/__init__.py?h=4.13#n522

The crucial moment in Beautiful Soup where we treat CDATA list attributes differently from regular attributes is here:
https://git.launchpad.net/beautifulsoup/tree/bs4/builder/__init__.py?h=4.13#n394

The attribute value comes out of there as either a string (for normal attributes) or a list of strings (for CDATA list attributes).
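The string-or-list split can be pictured with a simplified stand-in. This is not the linked Beautiful Soup code: the `CDATA_LIST_ATTRIBUTES` table and the `split_multi_valued` function are hypothetical names, and the table is a tiny sample rather than the real dictionary.

```python
# Hypothetical table: tag name ("*" means any tag) -> attributes treated as lists.
CDATA_LIST_ATTRIBUTES = {
    "*": {"class", "accesskey", "dropzone"},
    "a": {"rel"},
    "td": {"headers"},
}


def split_multi_valued(tag_name, attrs):
    """Return a copy of attrs with list-valued attributes split on whitespace."""
    universal = CDATA_LIST_ATTRIBUTES.get("*", set())
    tag_specific = CDATA_LIST_ATTRIBUTES.get(tag_name.lower(), set())
    result = {}
    for name, value in attrs.items():
        if name in universal or name in tag_specific:
            result[name] = value.split()  # CDATA list attribute: list of strings
        else:
            result[name] = value  # normal attribute: plain string
    return result


link_attrs = split_multi_valued(
    "a", {"class": "x y", "rel": "nofollow noopener", "href": "/a b"}
)
para_attrs = split_multi_valued("p", {"rel": "nofollow noopener"})
```

In this sketch "rel" is split only on an "a" tag, which mirrors the point that some attributes are multi-valued only for certain tags.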

I think the best way to do what you want would be to define a subclass of `list` or `set`, add helper methods to that class, and have Beautiful Soup instantiate that class to hold the values of a CDATA list attribute. That way the functionality isn't specific to the "class" attribute; it'd be additional complexity available to the attribute value itself, if the attribute value happened to be of this kind.

Here's my summary of the functionality you want on this class:

1. Check for duplicates on insertion, like set does.
2. Treat insertion of a list as insertion of every item in the list (like list.extend does)
3. Preserve the original value order, like list does.
4. Empty values are treated as the absence of a value.

#3 is the current behavior and I'd want it preserved. I don't think there's a method you could add to this class that would give you #4 in a backwards compatible way (currently an empty list for a CDATA list attribute is written out as an empty attribute value, such as class=""). You could do #2, but unless you're okay with using both append() and extend(), your interface would deviate from what `list` offers.

As for #1, you could definitely do it, but again backwards compatibility is the issue. I've seen some really weird stuff and I'm almost positive somebody's workflow depends on sticking the same CSS class into a tag multiple times.

So I'm not wild about the "subclass of `list`" idea either--there are too many backwards compatibility pitfalls and deviations from how Python's built-in data structures work.
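To make the trade-offs concrete, here is a rough sketch of what such a subclass might look like. `TokenList`, `add`, and `discard` are hypothetical names, not part of Beautiful Soup. The sketch keeps `append()` and `extend()` untouched and puts the set-like behavior in separate methods, which is one way to sidestep the compatibility concerns, at the cost of a larger interface.

```python
class TokenList(list):
    """Hypothetical list subclass with set-like helpers; order is preserved."""

    def add(self, *tokens):
        """Append each token not already present.

        Each argument may be a string or an iterable of strings.
        Empty strings are ignored.
        """
        for token in tokens:
            items = [token] if isinstance(token, str) else token
            for item in items:
                if item and item not in self:
                    self.append(item)

    def discard(self, token):
        """Remove every occurrence of token; do nothing if it is absent."""
        self[:] = [item for item in self if item != token]


classes = TokenList(["btn", "btn"])  # pre-existing duplicates are left alone
classes.add("active", ["btn", "large"], "")
classes.discard("missing")

cleaned = TokenList(classes)
cleaned.discard("btn")
```

Because plain `list` methods still behave as before, code that deliberately appends the same CSS class twice would keep working with this kind of design.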

However, there has been a trend in recent years where I enable advanced use cases by allowing users to customize which classes Beautiful Soup instantiates in different circumstances. See for example the "element_classes" argument to the BeautifulSoup constructor. That's a dict that lets you specify your own drop-in replacements for Tag, NavigableString, and so on. Also the "string_containers" argument to the TreeBuilder constructor, which lets you map tags like "script" to NavigableString subclasses like Script. That lets you handle situations where you need to give special treatment to the strings inside certain tags.
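A minimal sketch of the "element_classes" mechanism follows. `MyTag` and `MyString` are placeholder subclasses invented for the example; a real use would add methods to them.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


class MyTag(Tag):
    """Drop-in replacement for Tag; extra helper methods could go here."""


class MyString(NavigableString):
    """Drop-in replacement for NavigableString."""


markup = "<div>some text</div>"
soup = BeautifulSoup(
    markup,
    "html.parser",
    # Map each built-in class to the subclass that should be instantiated instead.
    element_classes={Tag: MyTag, NavigableString: MyString},
)

div_is_custom = isinstance(soup.div, MyTag)
string_is_custom = isinstance(soup.div.string, MyString)
```

Without the `element_classes` argument, the same markup would produce ordinary `Tag` and `NavigableString` objects.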

So, if TreeBuilder had an argument "multi_valued_attribute_class", and that was the class Beautiful Soup instantiated with the list of values for any CDATA list attribute (with the default value of multi_valued_attribute_class being `list`), I'd be fine with also supporting an alternate implementation that was more optimized for managing CSS classes. But it would take this from a standard part of the Beautiful Soup API to a pretty advanced feature.
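The proposal could be pictured roughly as follows. Everything here is an assumption made for illustration: `SketchBuilder` is a toy stand-in and not Beautiful Soup's TreeBuilder, `UniqueTokens` is a hypothetical alternate container, and only the argument name "multi_valued_attribute_class" comes from the paragraph above.

```python
class SketchBuilder:
    """Toy stand-in for a tree builder; not Beautiful Soup's TreeBuilder."""

    def __init__(self, list_attributes=("class",), multi_valued_attribute_class=list):
        self.list_attributes = set(list_attributes)
        # Class instantiated with the values of every list attribute.
        self.multi_valued_attribute_class = multi_valued_attribute_class

    def convert(self, attrs):
        """Wrap the whitespace-split values of list attributes in the configured class."""
        converted = {}
        for name, value in attrs.items():
            if name in self.list_attributes:
                converted[name] = self.multi_valued_attribute_class(value.split())
            else:
                converted[name] = value
        return converted


class UniqueTokens(list):
    """Hypothetical alternate container: drops duplicate tokens, keeps order."""

    def __init__(self, tokens=()):
        super().__init__(dict.fromkeys(tokens))


default_attrs = SketchBuilder().convert({"class": "a b a", "id": "x y"})
custom_attrs = SketchBuilder(multi_valued_attribute_class=UniqueTokens).convert(
    {"class": "a b a"}
)
```

With the default of `list`, duplicates and ordering are left exactly as they were in the markup; only a caller who opts in to a different class sees different behavior.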

What do you think?
