Improve search performance by keeping an index of tag contents

Bug #1974105 reported by Luke P
This bug affects 1 person
Affects: Beautiful Soup
Status: Triaged
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Hi,

I tested against a 3MB HTML document, but I want to run against even larger documents where the performance issue is even more severe. The document has an html > body > div, and the div directly contains many elements. To test, I selected the div and unwrapped it. I profiled this operation with cProfile and found that the vast majority of the time was spent within Tag.index(child), because the index is computed by linearly scanning the list of children for an identity match.
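A minimal sketch of that profiling step (hypothetical; "large.html" stands in for my document, which I can't share):

```python
import cProfile
import pstats
import bs4

# Parse a large document and profile the unwrap() of its single div.
with open("large.html", encoding="utf-8") as f:
    soup = bs4.BeautifulSoup(f, "html.parser")

with cProfile.Profile() as profiler:
    soup.html.body.div.unwrap()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```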

I've experimented with a fix that caches the index of every child in a tag's contents. The overall performance of my test case improved by a factor of ~20, and the performance of Tag.index() itself improved by a factor of ~500. The peak memory (measured with tracemalloc) increased from ~34MB to ~37MB, which seems minimal. The cache maps each child's id() to its index; it gets partially invalidated when children are added or removed, and is lazily reconstructed when you next fetch an index.
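A sketch of the corresponding memory measurement (same stand-in file):

```python
import tracemalloc
import bs4

# Measure peak memory around parse + unwrap with tracemalloc.
tracemalloc.start()
with open("large.html", encoding="utf-8") as f:
    soup = bs4.BeautifulSoup(f, "html.parser")
soup.html.body.div.unwrap()
current, peak = tracemalloc.get_traced_memory()
print(f"peak: {peak / 1_000_000:.0f} MB")
tracemalloc.stop()
```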

Here's a patch file to work with, demonstrating the fix.

```
--- .../element.py
+++ .../element_patched.py
@@ -279,7 +279,11 @@
         :return: `self`, no longer part of the tree.
         """
         if self.parent is not None:
-            del self.parent.contents[self.parent.index(self)]
+            del_index = self.parent.index(self)
+            del self.parent.contents[del_index]
+            del self.parent._contents_cache[id(self)]
+            self.parent._contents_cache_stale = min(self.parent._contents_cache_stale, del_index)
 
         #Find the two elements that would be next to each other if
         #this element (and any children) hadn't been parsed. Connect
@@ -405,6 +409,9 @@
         if new_childs_last_element.next_element is not None:
             new_childs_last_element.next_element.previous_element = new_childs_last_element
         self.contents.insert(position, new_child)
+        self._contents_cache[id(new_child)] = position
+        self._contents_cache_stale = min(self._contents_cache_stale, position + 1)
 
     def append(self, tag):
         """Appends the given PageElement to the contents of this one.
@@ -1051,6 +1058,9 @@
             self.known_xml = is_xml
         self.attrs = attrs
         self.contents = []
+        self._contents_cache = {}
+        self._contents_cache_stale = 0
         self.setup(parent, previous)
         self.hidden = False
 
@@ -1218,6 +1228,9 @@
             next = i.next_element
             i.__dict__.clear()
             i.contents = []
+            i._contents_cache = {}
+            i._contents_cache_stale = 0
             i = next
 
     def clear(self, decompose=False):
@@ -1283,9 +1296,14 @@
 
         :param element: Look for this PageElement in `self.contents`.
         """
-        for i, child in enumerate(self.contents):
-            if child is element:
-                return i
+        len_contents = len(self.contents)
+        while self._contents_cache_stale < len_contents:
+            self._contents_cache[id(self.contents[self._contents_cache_stale])] = self._contents_cache_stale
+            self._contents_cache_stale = self._contents_cache_stale + 1
+        i = self._contents_cache.get(id(element), -1)
+        if i != -1:
+            return i
         raise ValueError("Tag.index: element not in tag")
 
     def get(self, key, default=None):
```

Thanks for your work maintaining this!

Cheers,
Luke

Luke P (lkp80877984) wrote:

As a further optimization in replace_with: because you're swapping one element for another, extract() removes the old item from the cache and insert() adds the new item at the same position, so the cache doesn't actually get any more stale.

```
        old_parent = self.parent
        my_index = self.parent.index(self)
+       last_stale = old_parent._contents_cache_stale
        self.extract()
        old_parent.insert(my_index, replace_with)
+       old_parent._contents_cache_stale = last_stale
        return self
```

Cheers,
Luke

Luke P (lkp80877984) wrote:

You can also just set old_parent._contents_cache_stale = len(old_parent.contents) right after inserting, rather than dealing with last_stale; the call to index() fully initializes the cache.
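In diff form, that variant would look roughly like this (illustrative, untested):

```
        old_parent = self.parent
        my_index = self.parent.index(self)
        self.extract()
        old_parent.insert(my_index, replace_with)
+       # index() above fully initialized the cache, and the one-for-one
+       # swap keeps every entry valid
+       old_parent._contents_cache_stale = len(old_parent.contents)
        return self
```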

Cheers,
Luke

Leonard Richardson (leonardr) wrote:

Luke, thanks for filing this ticket and providing a proposed implementation. I'm going to take a serious look at this when I have time, but historically I've been reluctant to consider solutions that involve caching tree state. Because of the wide variety of ways Beautiful Soup trees are created, and the many things people do to them after they're created, it's difficult to prove that the cache is always invalidated when it should be.

Like I said, I'll look at this implementation on its own terms, but that's where I'm coming from in general.

If you can provide your original test document and test code, it would help me understand the problem you're trying to solve. Based on what you say, I think the fundamental problem is that the unwrap() implementation is inefficient because it's implemented as a series of calls to insert().

Luke P (lkp80877984) wrote:

"reluctant to ... involve caching"
I can completely understand that. It seems to me that others (including parsers) _can_ manipulate .contents directly, for example, which would invalidate the cache. I'm usually working in C#, where I think you can be a bit clearer about the supported public interface, and more easily have some hidden, private optimizations! Python is a bit freer.
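For instance (a hypothetical illustration of the hazard, not code from the patch):

```python
import bs4

soup = bs4.BeautifulSoup("<div><p>a</p><p>b</p></div>", "html.parser")
div = soup.div
div.index(div.contents[0])  # with the patch, this populates the cache
div.contents.reverse()      # bypasses insert()/extract(): cache is now stale
# A later div.index(...) could return an index from before the reversal.
```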

I do have a workaround, anyway. I've monkey-patched your library, wrapping and replacing a couple of functions to give the caching. It's not beautiful, but it works.
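The general shape of it (a sketch, not my actual monkey-patch):

```python
import bs4

# Wrap Tag.insert so the original implementation still runs, with cache
# bookkeeping alongside; index and extract get the same treatment.
_original_insert = bs4.Tag.insert

def _insert_with_bookkeeping(self, position, *args, **kwargs):
    result = _original_insert(self, position, *args, **kwargs)
    # ...update the per-tag index cache here, as in the patch above...
    return result

bs4.Tag.insert = _insert_with_bookkeeping
```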

"inefficient ... calls to insert"
The problem isn't just unwrap(), though. I have other scripts that, for example, extract many, many elements. Or insert many, many elements. Or replace many, many elements. Each of these functions is slow, because each one touches index(). With a 15MB file, with 150_000 children in a div, it's very slow to CALCULATE the index every time. I can't share this document, but you could generate such a document? html > body > div, with 150_000 paragraphs. Generate it with bs4 even, to exercise the insert operation!
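A sketch of such a generator (20_000 children instead of 150_000 keeps the quadratic blow-up visible without taking many minutes):

```python
import time
import bs4

# Build html > body > div with many <p> children, via bs4 itself.
soup = bs4.BeautifulSoup("<html><body><div></div></body></html>", "html.parser")
div = soup.html.body.div
for n in range(20_000):  # my real case has 150_000
    p = soup.new_tag("p")
    p.append(f"paragraph {n}")
    div.append(p)

# unwrap() re-inserts every child into <body>; each insert() first
# extract()s the child from the div, and extract() calls the O(n)
# Tag.index(), so the whole operation is O(n^2).
start = time.perf_counter()
div.unwrap()
print(f"unwrap of {len(soup.html.body.contents)} children took "
      f"{time.perf_counter() - start:.1f}s")
```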

"test code"
Uh... I have a test that randomly generates a series of operations to manipulate a string, so I have a known start and end point, then performs those operations on a bs4 document, and finally asserts that the document is equivalent to the end string. It can catch some issues (I make sure it breaks when I comment out lines), but not all, to be honest... does that sound at all helpful?

Thanks for your hard work maintaining this library!

Cheers,
Luke

Leonard Richardson (leonardr) wrote:

As you can tell, Beautiful Soup wasn't designed for tree manipulation and I've never seriously used it that way. If I could recommend another solution I would, but I don't know of anything other than jQuery (where the "unwrap" idea comes from).

Insofar as it's possible, implementing these operations the way they're implemented in jQuery might help. But after looking at the jQuery source, I see that the JavaScript functions quickly call out to DOM methods, which may well be backed by a well-tested cache of element locations.

If you want to share the test generation code, yes, that could be used as a fuzzer.

Luke P (lkp80877984) wrote:

"wasn't designed for"
Well, to your credit, it does that pretty darn well!

Here it is (btw, I'm learning Python, so you're welcome to give other feedback):

```
class TestMonkeyPatchBs4(unittest.TestCase):
    def test_bs4(self):
        for seed in range(0, 1000):
            # phase 1: randomly build a series of operations
            # starting with abcdex and ending in an empty string
            ops = []
            random.seed(seed)
            current = "abcdex"
            proposed_ops = [
                "+f",
                "+g",
                "+h",
                "+i",
                "+j",
                "!?",
                "!?",
                "!?",
                "!?",
                "!?",
                "-?",
                "-?",
                "-?",
                "-?",
                "-?",
                "@",
                "@",
            ]
            # there's an edge case where we do 5 removals, then a replace, which would err if there were only 5 chars
            # this is vanishingly unlikely, but can happen, so an extra x char fixes this
            random.shuffle(proposed_ops)

            while proposed_ops:
                proposed_op = proposed_ops.pop()
                i = random.randrange(0, len(current))
                char = current[i]
                if proposed_op[0] == "+":
                    ops.append(("+", i))
                    current = current[:i] + proposed_op[1] + current[i:]
                elif proposed_op[0] == "-":
                    ops.append(("-", i, char))
                    current = current[:i] + current[i + 1 :]
                elif proposed_op[0] == "!":
                    ops.append(("!", i, char))
                    inv_char = char.lower() if char.isupper() else char.upper()
                    current = current[:i] + inv_char + current[i + 1 :]
                else:
                    ops.append(("@",))  # note the trailing comma: ("@") is just a string
                # print(ops[-1])
                # print(current)

            while len(current) > 0:
                i = random.randrange(0, len(current))
                char = current[i]
                ops.append(("-", i, char))
                current = current[:i] + current[i + 1 :]
                # print(ops[-1])
                # print(current)

            # phase 2: run the html equivalent of the inverse of those operations, in reverse order
            # we should always end with abcde, for a reliable assertion
            soup = bs4.BeautifulSoup("<html></html>", "html.parser")
            body = soup.new_tag("body")
            soup.html.append(body)
            for rop in reversed(ops):
                if rop[0] == "+":
                    # exercise the inverse of insert, extract
                    body.contents[rop[1]].extract()
                elif rop[0] == "-":
                    # exercise the inverse of substring removal, insert
                    p_tag = soup.new_tag("p")
                    p_tag.append(rop[2])
                    body.insert(rop[1], p_tag)
                elif rop[0] == "!":
                    # exercise the inverse of substring replacement, replace_with
                    p_tag = soup.new_tag("p")
                    p_tag.appen...
```

summary: - slow performance within Tag.index(child)
+ Improve search performance by keeping an index of tag contents
Changed in beautifulsoup:
status: New → Triaged
Changed in beautifulsoup:
importance: Undecided → Wishlist