Bug #2052943 “Make it possible to customize the data structure used to store multi-valued attributes” : Bugs : Beautiful Soup

Leonard Richardson (leonardr) wrote on 2024-02-13:

#1

This request is asking to create a distinction between the "class" attribute and other attributes that I don't think is appropriate. I can think of a better way to offer functionality like this, but I don't know how useful it'd be.

The main time Beautiful Soup treats the "class" attribute as special is when working around the fact that "class" is also a Python reserved word. The rest of the time, "class" is treated as part of a family of attributes that Beautiful Soup calls multi-valued attributes. The HTML spec calls these "CDATA list attributes," with "class" being the most common.

There are lots of CDATA list attributes: "accesskey" and "dropzone" work the same way as "class", and for certain tags, attributes like "rel" or "headers" may also work like that. For XML documents, you can set up the CDATA list attributes however you want. For this reason I'm *very* reluctant to add methods with "class" in the name.

A dictionary listing HTML's CDATA list attributes is here (this configures Beautiful Soup's default behavior when given an HTML document): https://git.launchpad.net/beautifulsoup/tree/bs4/builder/__init__.py?h=4.13#n522

The crucial moment in Beautiful Soup where we treat CDATA list attributes differently from regular attributes is here:
https://git.launchpad.net/beautifulsoup/tree/bs4/builder/__init__.py?h=4.13#n394

The attribute value comes out of there as either a string (for normal attributes) or a list of strings (for CDATA list attributes).
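
As a rough illustration of that step, the sketch below mimics the split in plain Python. It is not the code at the link above: `split_multi_valued` and `MULTI_VALUED` are hypothetical names, and the configuration shape (a dict mapping a tag name, or "*" for any tag, to a set of attribute names) is assumed from the `multi_valued_attributes` argument used later in this thread.

```python
from typing import Dict, List, Set, Union

# Assumed configuration shape: "*" applies to every tag,
# any other key applies only to the tag with that name.
MULTI_VALUED: Dict[str, Set[str]] = {"*": {"class"}, "a": {"rel"}}

AttrValue = Union[str, List[str]]


def split_multi_valued(
    tag_name: str,
    attrs: Dict[str, str],
    config: Dict[str, Set[str]] = MULTI_VALUED,
) -> Dict[str, AttrValue]:
    """Return a copy of attrs where multi-valued attributes become lists."""
    names = config.get("*", set()) | config.get(tag_name, set())
    result: Dict[str, AttrValue] = {}
    for name, value in attrs.items():
        if name in names:
            # Multi-valued attribute: split on whitespace into a list.
            result[name] = value.split()
        else:
            # Regular attribute: keep the string untouched.
            result[name] = value
    return result


print(split_multi_valued("a", {"class": "x y", "id": "x y", "rel": "nofollow"}))
```

Note that an empty attribute value splits into an empty list in this sketch, which is the case item #4 in the summary further down is concerned with.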

I think the best way to do what you want would be to define a subclass of `list` or `set`, add helper methods to that class, and have Beautiful Soup instantiate that class to hold the values of a CDATA list attribute. That way the functionality isn't specific to the "class" attribute; it'd be additional complexity available to the attribute value itself, if the attribute value happened to be of this kind.

Here's my summary of the functionality you want on this class:

1. Check for duplicates on insertion, like set does.
2. Treat insertion of a list as insertion of every item in the list (like list.extend does)
3. Preserve the original value order, like list does.
4. Empty values are treated as the absence of a value.

#3 is the current behavior and I'd want it preserved. I don't think there's a method you could add to this class that would give you #4 in a backwards compatible way (currently an empty list for a CDATA list attribute becomes a ). You could do #2 but unless you're okay with using both append() and extend(), your interface would deviate from what `list` offers.
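
To make the point about #2 concrete, here is how a plain Python `list` behaves today. This uses only built-in list methods; the variable names are just for illustration.

```python
values = ["a", "b"]
values.append("c")           # append() adds exactly one item
values.extend(["d", "e"])    # extend() adds every item of an iterable

nested = ["a"]
nested.append(["b", "c"])    # a list given to append() is stored as one nested item

print(values)
print(nested)
```

A subclass whose `append()` flattened a list argument would therefore behave differently from `list` for the same call.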

As for #1, you could definitely do it, but again backwards compatibility is the issue. I've seen some really weird stuff and I'm almost positive somebody's workflow depends on sticking the same CSS class into a tag multiple times.

So I'm not wild about the "subclass of `list`" idea either--there are too many backwards compatibility pitfalls and deviations from how Python's built-in data structures work.

However, there has been a trend in recent years where I enable advanced use cases by allowing users to customize which classes Beautiful Soup instantiates in different circumstances. See for example the "element_classes" argument to the BeautifulSoup constructor. That's a dict that lets you specify your own drop-in replacements for Tag, NavigableString, and so on. Also the "string_containers" argument to the TreeBuilder constructor, which lets you map tags like "script" to NavigableString subclasses like Script. That lets you handle situations where you need to give special treatment to the strings inside certain tags.

So, if TreeBuilder had an argument "multi_valued_attribute_class", and that was the class Beautiful Soup instantiated with the list of values for any CDATA list attribute (with the default value of multi_valued_attribute_class being `list`), I'd be fine with also supporting an alternate implementation that was more optimized for managing CSS classes. But it would take this from a standard part of the Beautiful Soup API to a pretty advanced feature.
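
A minimal sketch of that idea, in plain Python rather than Beautiful Soup's actual code. `TinyBuilder`, `make_values`, and `OrderedUniqueList` are hypothetical names made up for this illustration; only the argument name `multi_valued_attribute_class` comes from the proposal above.

```python
class TinyBuilder:
    """Hypothetical stand-in for a tree builder; not Beautiful Soup's code."""

    def __init__(self, multi_valued_attribute_class=list):
        # The class instantiated to hold the values of a multi-valued attribute.
        self.multi_valued_attribute_class = multi_valued_attribute_class

    def make_values(self, raw: str):
        # Split the raw attribute string on whitespace and wrap the pieces
        # in whichever class the caller configured.
        return self.multi_valued_attribute_class(raw.split())


class OrderedUniqueList(list):
    """A list that drops duplicates at construction but keeps first-seen order."""

    def __init__(self, values=()):
        # dict.fromkeys() keeps one key per distinct value, in insertion order.
        super().__init__(dict.fromkeys(values))


default_values = TinyBuilder().make_values("a b a")
custom_values = TinyBuilder(OrderedUniqueList).make_values("a b a")
print(type(default_values).__name__, default_values)
print(type(custom_values).__name__, custom_values)
```

With the default, existing behavior (including duplicates) is untouched; only callers who opt in to a different class see anything change.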

What do you think?

Chris Papademetrious (chrispitude) wrote on 2024-04-19:

#3

Your four-item summary of the requested functionality is spot-on.

Indeed, these methods could (and should) support other attributes, such as:

====
def add_class(self, these_classes, attname='class'):
    ...

def remove_class(self, these_classes, attname='class'):
    ...
====

For example, if I am working with non-HTML content (such as DITA XML source), then that will have its own conventions for list-of-strings attributes. For example, DITA has profiling condition attributes that are also multi-value:

====
dita_tag.add_class('expert', attname='audience')
====

I am completely fine with not using "class" in the name. (I am just used to XML::Twig's methods.) The terminology used in the BS4 documentation is "multi-valued attributes", so we should probably use something consistent with that. Some ideas are:

====
tag.add_value('foo')
tag.remove_value(['foo', 'bar'])

tag.add_multivalue('foo')
tag.remove_multivalue(['foo', 'bar'])
====

If the attribute name argument is required (instead of defaulting to 'class'), it would make the methods' purposes clearer in context:

====
tag.add_value('foo', attname='class')
tag.remove_value(['foo', 'bar'], attname='class')

tag.add_multivalue('foo', attname='class')
tag.remove_multivalue(['foo', 'bar'], attname='class')
====

or

====
tag.add_value('class', 'foo')
tag.remove_value('class', ['foo', 'bar'])

tag.add_multivalue('class', 'foo')
tag.remove_multivalue('class', ['foo', 'bar'])
====

What would the subclassed-list UI look like? How would it handle the addition of a value to an attribute that doesn't exist yet? Could it handle addition/removal of both single values and lists of values?

Leonard Richardson (leonardr) wrote on 2024-05-27:

#4

Coming back to this, the idea of a custom subclass for holding multi-valued attributes is very much in keeping with the idea introduced by the AttributeDict class I wrote for issue 2065525. The main difference is that I would not provide any alternate implementations for this class other than "regular list". This would let you experiment with your own implementations and if you came up with one that worked for me I could add it in a future version, either as an option or the default.

Leonard Richardson (leonardr) wrote on 2024-05-27:

#5

Revision 5104c45 of the 4.13 branch makes it possible to customize the data structure for multi-valued attributes. This will give you space to experiment with different possibilities; feel free to report back once you have something you think is a plausible replacement or alternative to a plain Python list.

summary:
- Provide convenience methods to add/remove class keywords
+ Make it possible to customize the data structure used to store multi-valued attributes
Changed in beautifulsoup:
status:	New → Fix Committed

Chris Papademetrious (chrispitude) wrote on 2024-06-08:

#7

I think I have something sort of working!

Using the 4.13 branch:

====
import bs4
from typing import (
    Any,
    List,
    Type,
)
from bs4.builder import TreeBuilder
from bs4.builder._htmlparser import HTMLParserTreeBuilder

default_builder: Type[TreeBuilder] = HTMLParserTreeBuilder

class UniqueAttributeValueList(List[str]):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def append(self, value: Any):
        if value not in self:
            super().append(value)

    def extend(self, values: List[Any]):
        for value in values:
            self.append(value)

    def remove(self, values: Any | List[Any]) -> None:
        if not isinstance(values, list):
            values = [values]
        for value in values:
            if value in self:
                super().remove(value)
        if not self:
            print("DELETE ATTRIBUTE???")
            # ???

builder = default_builder(
    multi_valued_attributes={"*": set(["class"])},
    attribute_value_list_class=UniqueAttributeValueList
)

markup = '<a class=""/>'
soup = bs4.BeautifulSoup(markup, builder=builder)
tag = soup.a

tag['class'].append('1')
print(tag)
tag['class'].append('1')
print(tag)
tag.attrs['class'].extend(['2', '2', '3'])
print(tag)
tag.attrs['class'].remove("2")
print(tag)
tag.attrs['class'].remove(["1", "3"])
print(tag)
====

gives the following output:

====
<a class="1"></a>
<a class="1"></a>
<a class="1 2 3"></a>
<a class="1 3"></a>
DELETE ATTRIBUTE???
<a class=""></a>
====

There are two things I need to figure out.

1. Right now it works only if the original HTML defines the attribute. If the attribute doesn't exist in the HTML, then I get a KeyError because there is no object in the attributes dictionary to operate on.

2. If I remove all the values, I still get class="", but somehow I want to remove the attribute completely.
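
One possible direction for both points is to handle them at the level of the attribute dictionary rather than inside the list class, since the list has no reference back to the dictionary that holds it. The sketch below is an assumption-laden illustration, not part of Beautiful Soup: `add_value` and `remove_value` are hypothetical helpers that work on any dict-like mapping, and they assume the stored value for a multi-valued attribute is a list (or list subclass).

```python
from typing import Iterable, List, MutableMapping, Union

Values = Union[str, Iterable[str]]


def _as_list(values: Values) -> List[str]:
    # Accept either a single string or an iterable of strings.
    return [values] if isinstance(values, str) else list(values)


def add_value(attrs: MutableMapping, name: str, values: Values, list_class=list) -> None:
    """Add values to attrs[name], creating the attribute if it is missing."""
    current = attrs.get(name)
    if current is None:
        # Point 1: no KeyError when the attribute does not exist yet.
        current = attrs[name] = list_class()
    for value in _as_list(values):
        if value not in current:
            current.append(value)


def remove_value(attrs: MutableMapping, name: str, values: Values) -> None:
    """Remove values from attrs[name]; drop the attribute once it is empty."""
    current = attrs.get(name)
    if current is None:
        return
    for value in _as_list(values):
        while value in current:
            current.remove(value)
    if not current:
        # Point 2: delete the key instead of leaving an empty value behind.
        del attrs[name]


attrs = {}
add_value(attrs, "class", "foo")
add_value(attrs, "class", ["foo", "bar"])
remove_value(attrs, "class", "foo")
remove_value(attrs, "class", ["bar"])
print(attrs)
```

With Beautiful Soup, the mapping passed in would presumably be `tag.attrs`, and `list_class` could be the custom class from the example above.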
