Make it possible to customize the data structure used to store multi-valued attributes

Bug #2052943 reported by Chris Papademetrious
This bug affects 1 person
Affects: Beautiful Soup
Status: Fix Committed
Importance: Undecided
Assigned to: Unassigned

Bug Description

This is an enhancement request.

It would be nice to have convenience methods to add/remove keywords from the "class" attribute.

For example,

====
elt.add_class('foo')
elt.add_class(['bar', 'baz'])

elt.remove_class('foo')
elt.remove_class(['bar', 'baz'])
====

The methods should accept string or list-of-string values. Existing keywords should not be duplicated. remove_class() should delete the "class" attribute if it becomes empty.

I could propose a merge request for this, if you're open to it.

These are the helper functions we currently use:

====
def add_class(tag, these_classes):
    if isinstance(these_classes, str):
        these_classes = [these_classes]
    for this_class in these_classes:
        if tag.get('class', None) is None:
            tag['class'] = []
        # strip() also rejects the empty string, which isspace() lets through
        if this_class.strip() and this_class not in tag['class']:
            tag['class'].append(this_class)
    return tag

def remove_class(tag, these_classes):
    if isinstance(these_classes, str):
        these_classes = [these_classes]
    if tag.get('class'):
        tag['class'] = [x for x in tag['class'] if x not in these_classes]
    # Delete the attribute once it is empty; the membership check avoids a
    # KeyError when the tag never had a "class" attribute.
    if 'class' in tag.attrs and not tag['class']:
        del tag['class']
    return tag
====

We also have a helper function to test if a keyword exists in "class":

====
def has_class(tag, this_class):
    return bool(tag.get('class', []) and this_class in tag['class'])
====

but this would be provided by self-testing from #2052936, if that comes to be:

====
if elt.matches(class_ = ...):
    # ...
====

Revision history for this message
Leonard Richardson (leonardr) wrote :

This request is asking to create a distinction between the "class" attribute and other attributes that I don't think is appropriate. I can think of a better way to offer functionality like this, but I don't know how useful it'd be.

The main time Beautiful Soup treats the "class" attribute as special is when working around the fact that "class" is also a Python reserved word. The rest of the time, "class" is treated as part of a family of attributes that Beautiful Soup calls multi-valued attributes. The HTML spec calls these "CDATA list attributes," with "class" being the most common.

There are lots of CDATA list attributes: "accesskey" and "dropzone" work the same way as "class", and for certain tags, attributes like "rel" or "headers" may also work like that. For XML documents, you can set up the CDATA list attributes however you want. For this reason I'm *very* reluctant to add methods with "class" in the name.

A dictionary listing HTML's CDATA list attributes is here (this configures Beautiful Soup's default behavior when given an HTML document): https://git.launchpad.net/beautifulsoup/tree/bs4/builder/__init__.py?h=4.13#n522

The crucial moment in Beautiful Soup where we treat CDATA list attributes differently from regular attributes is here:
https://git.launchpad.net/beautifulsoup/tree/bs4/builder/__init__.py?h=4.13#n394

The attribute value comes out of there as either a string (for normal attributes) or a list of strings (for CDATA list attributes).
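
To make the string-versus-list split concrete, here is a minimal standalone sketch. It is not Beautiful Soup's actual code: `MULTI_VALUED` and `parse_attribute_value` are hypothetical names, and the table is a small illustrative subset rather than the real dictionary linked above.

```python
# Hypothetical, abbreviated table. "*" applies to every tag; other keys
# apply only to the named tag.
MULTI_VALUED = {
    "*": {"class", "accesskey", "dropzone"},
    "a": {"rel"},
    "td": {"headers"},
}


def parse_attribute_value(tag_name, attr_name, raw_value):
    """Return a list of strings for a multi-valued attribute, else the raw string."""
    names = MULTI_VALUED.get("*", set()) | MULTI_VALUED.get(tag_name, set())
    if attr_name in names:
        # Split on runs of whitespace; an empty value yields an empty list.
        return raw_value.split()
    return raw_value


print(parse_attribute_value("a", "class", "btn  btn-primary"))
print(repr(parse_attribute_value("a", "id", "main nav")))
```

The first call returns a list and the second returns the string unchanged, which is the distinction everything else in this thread builds on.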

I think the best way to do what you want would be to define a subclass of `list` or `set`, add helper methods to that class, and have Beautiful Soup instantiate that class to hold the values of a CDATA list attribute. That way the functionality isn't specific to the "class" attribute; it'd be additional complexity available to the attribute value itself, if the attribute value happened to be of this kind.

Here's my summary of the functionality you want on this class:

1. Check for duplicates on insertion, like set does.
2. Treat insertion of a list as insertion of every item in the list (like list.extend does)
3. Preserve the original value order, like list does.
4. Empty values are treated as the absence of a value.

#3 is the current behavior and I'd want it preserved. I don't think there's a method you could add to this class that would give you #4 in a backwards compatible way (currently an empty list for a CDATA list attribute becomes an empty attribute value, like class=""). You could do #2 but unless you're okay with using both append() and extend(), your interface would deviate from what `list` offers.

As for #1, you could definitely do it, but again backwards compatibility is the issue. I've seen some really weird stuff and I'm almost positive somebody's workflow depends on sticking the same CSS class into a tag multiple times.

So I'm not wild about the "subclass of `list`" idea either--there are too many backwards compatibility pitfalls and deviations from how Python's built-in data structures work.
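
For reference, a rough sketch of what items #1 through #3 might look like as a plain `list` subclass. `UniqueValueList` is a hypothetical name, nothing here is part of Beautiful Soup, and it only illustrates the trade-offs discussed above.

```python
class UniqueValueList(list):
    """Hypothetical list subclass: keeps insertion order, ignores duplicates."""

    def __init__(self, iterable=()):
        super().__init__()
        self.extend(iterable)

    def append(self, value):
        # 1: skip values that are already present.
        if value not in self:
            super().append(value)

    def extend(self, values):
        # 2: a bare string counts as one value, not a sequence of characters.
        if isinstance(values, str):
            values = [values]
        for value in values:
            self.append(value)


values = UniqueValueList(["a", "b", "a"])
values.extend(["c", "b"])
values.append("a")
print(values)  # ['a', 'b', 'c']
```

Note that other mutators such as insert(), `+=`, and slice assignment are not overridden here and would still let duplicates in. Item #4 cannot be handled inside the list at all, because the list has no reference to the attribute dictionary that holds it.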

However, there has been a trend in recent years where I enable advanced use cases by allowing users to customize which classes Beautiful Soup instantiates in different circumstances. See for example the "element_cl...


Revision history for this message
Chris Papademetrious (chrispitude) wrote :

Your four-item summary of the requested functionality is spot-on.

Indeed, these methods could (and should) support other attributes, such as:

====
def add_class(self, these_classes, attname='class'):
    ...

def remove_class(self, these_classes, attname='class'):
    ...
====

For example, if I am working with non-HTML content (such as DITA XML source), then that will have its own conventions for list-of-strings attributes. For example, DITA has profiling condition attributes that are also multi-value:

====
dita_tag.add_class('expert', attname='audience')
====

I am completely fine with not using "class" in the name. (I am just used to XML::Twig's methods.) The terminology used in the BS4 documentation is "multi-valued attributes", so we should probably use something consistent with that. Some ideas are:

====
tag.add_value('foo')
tag.remove_value(['foo', 'bar'])

tag.add_multivalue('foo')
tag.remove_multivalue(['foo', 'bar'])
====

If the attribute name argument is required (instead of defaulting to 'class'), it would make the methods' purposes clearer in context:

====
tag.add_value('foo', attname='class')
tag.remove_value(['foo', 'bar'], attname='class')

tag.add_multivalue('foo', attname='class')
tag.remove_multivalue(['foo', 'bar'], attname='class')
====

or

====
tag.add_value('class', 'foo')
tag.remove_value('class', ['foo', 'bar'])

tag.add_multivalue('class', 'foo')
tag.remove_multivalue('class', ['foo', 'bar'])
====

What would the subclassed-list UI look like? How would it handle the addition of a value to an attribute that doesn't exist yet? Could it handle addition/removal of both single values and lists of values?

Revision history for this message
Leonard Richardson (leonardr) wrote :

Coming back to this, the idea of a custom subclass for holding multi-valued attributes is very much in keeping with the idea introduced by the AttributeDict class I wrote for issue 2065525. The main difference is that I would not provide any alternate implementations for this class other than "regular list". This would let you experiment with your own implementations and if you came up with one that worked for me I could add it in a future version, either as an option or the default.

Revision history for this message
Leonard Richardson (leonardr) wrote :

Revision 5104c45 of the 4.13 branch makes it possible to customize the data structure for multi-valued attributes. This will give you space to experiment with different possibilities; feel free to report back once you have something you think is a plausible replacement or alternative to a plain Python list.

summary: - Provide convenience methods to add/remove class keywords
+ Make it possible to customize the data structure used to store multi-valued attributes
Changed in beautifulsoup:
status: New → Fix Committed
Revision history for this message
Chris Papademetrious (chrispitude) wrote :

I think I have something sort of working!

Using the 4.13 branch:

====
import bs4
from typing import (
    Any,
    List,
    Type,
)
from bs4.builder import TreeBuilder
from bs4.builder._htmlparser import HTMLParserTreeBuilder

default_builder: Type[TreeBuilder] = HTMLParserTreeBuilder

class UniqueAttributeValueList(List[str]):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def append(self, value: Any):
        if value not in self:
            super().append(value)

    def extend(self, values: List[Any]):
        for value in values:
            self.append(value)

    def remove(self, values: Any | List[Any]) -> None:
        if not isinstance(values, list):
            values = [values]
        for value in values:
            if value in self:
                super().remove(value)
        if not self:
            print("DELETE ATTRIBUTE???")
            # ???

builder = default_builder(
    multi_valued_attributes={"*": set(["class"])},
    attribute_value_list_class=UniqueAttributeValueList
)

markup = '<a class=""/>'
soup = bs4.BeautifulSoup(markup, builder=builder)
tag = soup.a

tag['class'].append('1')
print(tag)
tag['class'].append('1')
print(tag)
tag.attrs['class'].extend(['2', '2', '3'])
print(tag)
tag.attrs['class'].remove("2")
print(tag)
tag.attrs['class'].remove(["1", "3"])
print(tag)
====

gives the following output:

====
<a class="1"></a>
<a class="1"></a>
<a class="1 2 3"></a>
<a class="1 3"></a>
DELETE ATTRIBUTE???
<a class=""></a>
====

There are two things I need to figure out.

1. Right now it works only if the original HTML defines the attribute. If the attribute doesn't exist in the HTML, then I get a KeyError because there is no object in the attributes dictionary to operate on.

2. If I remove all the values, I still get class="", but somehow I want to remove the attribute completely.
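
One possible direction for both points, as a sketch only: since the list cannot see the dictionary that holds it, the create-on-add and delete-on-empty logic could live in helpers that operate on the attribute dictionary instead. `add_values` and `remove_values` are hypothetical names; the code below is exercised against a plain dict and has not been tested against the 4.13 branch. The assumption is that passing `tag.attrs` would behave the same way, because it is a dictionary.

```python
def add_values(attrs, name, values):
    """Add values to a multi-valued attribute, creating the attribute if needed."""
    if isinstance(values, str):
        values = [values]
    # setdefault() covers the "attribute doesn't exist yet" case (point 1).
    # Note that it creates a plain list, not a custom list class.
    current = attrs.setdefault(name, [])
    for value in values:
        if value not in current:
            current.append(value)


def remove_values(attrs, name, values):
    """Remove values; delete the attribute once nothing is left (point 2)."""
    if isinstance(values, str):
        values = [values]
    current = attrs.get(name)
    if current is None:
        return
    # Slice assignment mutates in place, so the existing list object is kept.
    current[:] = [v for v in current if v not in values]
    if not current:
        del attrs[name]


attrs = {}
add_values(attrs, "class", ["1", "2", "2", "3"])
remove_values(attrs, "class", "2")
print(attrs)  # {'class': ['1', '3']}
remove_values(attrs, "class", ["1", "3"])
print(attrs)  # {}
```

This keeps the list subclass simple, at the cost of calling a function rather than a method on the attribute value.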
