Behavior change (regression?) when class_ provided as find argument contains trailing whitespace

Bug #1824502 reported by Sylvain
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Beautiful Soup
New
Undecided
Unassigned

Bug Description

On one of my projects, there is a element that has the following class attribute:

    'fa btn btn-outline-secondary btn-circle fa fa-backward sm ' # trailing whitespace

and that I used to retrieve using:

    soup.find('a', class_='fa btn btn-outline-secondary btn-circle fa fa-backward sm ') # trailing whitespace

Recently, someone reported that it didn't work for them. After a bit of investigation, we tried:

    soup.find('a', class_='fa btn btn-outline-secondary btn-circle fa fa-backward sm') # NO trailing whitespace

and it did work for them.

That behavior change does not seem expected to me.

I had a quick look at the source code/history for BeautifulSoup and here are my findings:
 - could it be related to https://bugs.launchpad.net/beautifulsoup/+bug/1787453 / https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/revision/484 ?
 - I tried to add things in "test_multivalued_attribute_with_whitespace" and did manage to reproduce the new behavior
 - would it make sense to "clean up" the argument provided as the class?

I am currently away from my dev computer but I can try to have a deeper look in the future to:
 - pinpoint the exact commit that changed the behavior
 - write corresponding unit test cases
 - suggest a fix

Thanks again for the great work you did on this project :)

(More details about the original bug: https://github.com/SylvainDe/ComicBookMaker/issues/83 . For the time being, I went for a workaround calling find with and without the trailing whitespace.)

Revision history for this message
Sylvain (sylvaind) wrote :

After a deeper investigation with the following unit-test added in class HTMLTreeBuilderSmokeTest(object):

    # Based on test_multivalued_attribute_with_whitespace
    def test_find_class_with_whitespace(self):
        markup = '<div class=" foo bar "></a>'
        soup = self.soup(markup)
        self.assertEqual(soup.div, soup.find('div', class_=" foo bar "))

I got the following results on the different revisions:

497 KO
450 KO
425 KO
417 KO <- behavior change (+ test_copy_preserves_encoding error)
416 OK (except for test_copy_preserves_encoding)
414 OK
410 OK
408 OK
400 OK

This would correspond to https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/revision/417 .

Revision history for this message
Sylvain (sylvaind) wrote :

Adding the relevant test in the relevant place:

    def test_find_with_multi_valued_attribute(self):
        soup = self.soup(
            "<div class='a b'>1</div><div class='a c'>2</div><div class='a d'>3</div><div class='a e '>4</div>"
        )
        r1 = soup.find('div', 'a d')
        r2 = soup.find('div', re.compile(r'a d'))
        r3, r4 = soup.find_all('div', ['a b', 'a d'])
        r5 = soup.find('div', 'a e ') # Should this one work ?
        r5 = soup.find('div', 'a e') # Or this one ? Or both ?
        self.assertEqual('3', r1.string)
        self.assertEqual('3', r2.string)
        self.assertEqual('1', r3.string)
        self.assertEqual('3', r4.string)
        self.assertEqual('4', r5.string)

Revision history for this message
Sylvain (sylvaind) wrote :

Here is my suggestion for a fix but I don't find it very satisfying:

         if not match and isinstance(match_against, unicode):
             # Exact string match
- match = markup == match_against
+ match = markup == match_against.strip()

Revision history for this message
Isaac Muse (facelessuser) wrote :

As I recall, BeautifulSoup has always turned certain attributes that are generally treated as lists to lists in the attribute dictionary. Previously, the logic would sometimes create lists with empty entries when dealing with leading or trailing whitespace `['', 'classname', '']`. This was weird, and kind of unexpected.

So the revision you referenced "fixed" that. To be honest, comparing the old methodology and the new methodology, neither ever preserved the class attribute exactly. In the old method, it would split on whitespace, and when doing certain comparisons it would rebuild them, but it never remembered exact characters, so ' class class2 ' would get turned to `['', 'class', 'class2', '']` and rebuilt for certain comparisons as ` class class2 ` simply joining the values in the list with a single space.

The new method now eliminates the leading and trailing whitespace which created empty entries as, instead of splitting on whitespace, it now just finds the non-whitespace values.

I always kind of wished BeatifulSoup had instead just preserved the original attribute text, and split it when evaluating it as it currently means you can never compare the "exact" text in things like CSS selectors (I currently support selector support) with `[class=" class class2 "]` or in your specific case with `class_`. But this has always been a problem in one form or another. I at least find the new methodology a bit more sane.

Revision history for this message
Isaac Muse (facelessuser) wrote :

Unfortunately whitespace was collapsed in my previous comment, but basically I was trying to illustrate that if you had multiple whitespace characters, bs4 always collapsed them between attributes so you never had 1:1 of what was in the document.

Revision history for this message
Sylvain (sylvaind) wrote :

Thanks for your feedback.

I understand that BeautifulSoup does things with whitespace which impacts their internal representation (which is used for string conversion, lookup, etc).
Maybe this choice is questionable, I am clearly not in position to argue for one choice over another.

What I do find unfortunate is that the corresponding change did change the behavior for the final user in a way which is not clearly documented (and maybe not fully intended) and is easy to trip over (if you just rely on what you see in HTML, you'll get things wrong).

Thus, I just wished that somehow, a similar processing what done on the input string provided to find to ensure that we do find what we'd expect to find (and what used to be found on older versions of BeautifulSoup).

Revision history for this message
Isaac Muse (facelessuser) wrote :

> Thus, I just wished that somehow, a similar processing what done on the input string provided to find to ensure that we do find what we'd expect to find (and what used to be found on older versions of BeautifulSoup).

I agree, but you could never use just what you see in HTML to compare the class as whitespace was collapsed and normalized by Beautiful Soup in the past as well as now, but yes, it now additionally strips trailing and leading whitespace. This does cause some confusion, while it also fixes the confusing issues with empty strings in attribute lists.

There are probably only really two ways to address this. The first, as you suggested, would be to augment the comparison to strip whitespace from the recreated list string and the user input at comparison time.

If it were me, I'd probably just break it for a major release (say a 5.0 release) and store the actual string content instead of a list and provide an API break the attribute content into a list if desired. I think it would make things more predictable in the long run. This is just my opinion though as that approach makes more sense to me, and I've never been a huge fan of the not being able to get at the actual attribute value.

Currently, you just have to remember that Beautiful Soup will do this for certain attributes, and plan your access accordingly.

Revision history for this message
Isaac Muse (facelessuser) wrote :

On the other hand, pre splitting classes, which is primarily how HTMl if navigated also makes things faster, so I don't know what is better.

Revision history for this message
Leonard Richardson (leonardr) wrote :

My fix to bug 1832978 makes it possible to customize or disable the process by which certain attributes are converted to lists. Although this doesn't precisely recreate the old behavior, it does make your underlying goal possible -- to find a tag by its exact string class value.

Revision history for this message
Sylvain (sylvaind) wrote :

Thanks for the feedback, I'll have a look at this. Thanks again for all the great work you did with BeautifulSoup!

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.