Improve search performance by keeping an index of tag contents

Bug #1974105 reported by Luke P
This bug affects 1 person
Affects: Beautiful Soup
Status: Triaged
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Hi,

I tested against a 3MB HTML document, but I want to run against even larger documents where the performance issue is even more severe. The document has an html > body > div, and the div directly contains many elements. To test, I selected the div and unwrapped it. I profiled this operation with cProfile and found that the vast majority of the time was spent within Tag.index(child), because the index is computed by linearly scanning the list of children for an identity match.
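A minimal sketch of that profiling step (hypothetical; "large.html" stands in for my document, which I can't share):

```python
import cProfile
import pstats
import bs4

# Parse a large document and profile the unwrap() of its single div.
with open("large.html", encoding="utf-8") as f:
    soup = bs4.BeautifulSoup(f, "html.parser")

with cProfile.Profile() as profiler:
    soup.html.body.div.unwrap()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```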

I've experimented with a fix that caches the index of every child in a tag's contents. The overall performance of my test case improved by a factor of ~20, and the performance of Tag.index() itself improved by a factor of ~500. The peak memory (measured with tracemalloc) increased from ~34MB to ~37MB, which seems minimal. The cache maps each child's id() to its index; it gets partially invalidated when children are added or removed, and is lazily reconstructed when you next fetch an index.
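A sketch of the corresponding memory measurement (same stand-in file):

```python
import tracemalloc
import bs4

# Measure peak memory around parse + unwrap with tracemalloc.
tracemalloc.start()
with open("large.html", encoding="utf-8") as f:
    soup = bs4.BeautifulSoup(f, "html.parser")
soup.html.body.div.unwrap()
current, peak = tracemalloc.get_traced_memory()
print(f"peak: {peak / 1_000_000:.0f} MB")
tracemalloc.stop()
```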

Here's a patch file to work with, demonstrating the fix.

```
--- .../element.py
+++ .../element_patched.py
@@ -279,7 +279,11 @@
         :return: `self`, no longer part of the tree.
         """
         if self.parent is not None:
-            del self.parent.contents[self.parent.index(self)]
+            del_index = self.parent.index(self)
+            del self.parent.contents[del_index]
+            del self.parent._contents_cache[id(self)]
+            self.parent._contents_cache_stale = min(self.parent._contents_cache_stale, del_index)
 
         #Find the two elements that would be next to each other if
         #this element (and any children) hadn't been parsed. Connect
@@ -405,6 +409,9 @@
         if new_childs_last_element.next_element is not None:
             new_childs_last_element.next_element.previous_element = new_childs_last_element
         self.contents.insert(position, new_child)
+        self._contents_cache[id(new_child)] = position
+        self._contents_cache_stale = min(self._contents_cache_stale, position + 1)
 
     def append(self, tag):
         """Appends the given PageElement to the contents of this one.
@@ -1051,6 +1058,9 @@
             self.known_xml = is_xml
         self.attrs = attrs
         self.contents = []
+        self._contents_cache = {}
+        self._contents_cache_stale = 0
         self.setup(parent, previous)
         self.hidden = False
 
@@ -1218,6 +1228,9 @@
             next = i.next_element
             i.__dict__.clear()
             i.contents = []
+            i._contents_cache = {}
+            i._contents_cache_stale = 0
             i = next
 
     def clear(self, decompose=False):
@@ -1283,9 +1296,14 @@
 
         :param element: Look for this PageElement in `self.contents`.
         """
-        for i, child in enumerate(self.contents):
-            if child is element:
-                return i
+        len_contents = len(self.contents)
+        while self._contents_cache_stale < len_contents:
+            self._contents_cache[id(self.contents[self._contents_cache_stale])] = self._contents_cache_stale
+            self._contents_cache_stale = self._contents_cache_stale + 1
+        i = self._contents_cache.get(id(element), -1)
+        if i != -1:
+            return i
         raise ValueError("Tag.index: element not in tag")
 
     def get(self, key, default=None):
```

Thanks for your work maintaining this!

Cheers,
Luke

Luke P (lkp80877984) wrote:

As a further optimization in replace_with: because you're swapping one element for another, extract() removes the old item from the cache and insert() adds the new item at the same position, so the cache doesn't actually get any more stale.

```
        old_parent = self.parent
        my_index = self.parent.index(self)
+       last_stale = old_parent._contents_cache_stale
        self.extract()
        old_parent.insert(my_index, replace_with)
+       old_parent._contents_cache_stale = last_stale
        return self
```

Cheers,
Luke

Luke P (lkp80877984) wrote:

You can also just set old_parent._contents_cache_stale = len(old_parent.contents) right after inserting, rather than dealing with last_stale; the call to index() fully initializes the cache.
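In diff form, that variant would look roughly like this (illustrative, untested):

```
        old_parent = self.parent
        my_index = self.parent.index(self)
        self.extract()
        old_parent.insert(my_index, replace_with)
+       # index() above fully initialized the cache, and the one-for-one
+       # swap keeps every entry valid
+       old_parent._contents_cache_stale = len(old_parent.contents)
        return self
```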

Cheers,
Luke

Leonard Richardson (leonardr) wrote:

Luke, thanks for filing this ticket and providing a proposed implementation. I'm going to take a serious look at this when I have time, but historically I've been reluctant to consider solutions that involve caching tree state. Because of the wide variety of ways Beautiful Soup trees are created, and the many things people do to them after they're created, it's difficult to prove that the cache is always invalidated when it should be.

Like I said, I'll look at this implementation on its own terms, but that's where I'm coming from in general.

If you can provide your original test document and test code, it would help me understand the problem you're trying to solve. Based on what you say, I think the fundamental problem is that the unwrap() implementation is inefficient because it's implemented as a series of calls to insert().

Luke P (lkp80877984) wrote:

"reluctant to ... involve caching"
I can completely understand that. It seems to me that others (including parsers) _can_ manipulate .contents directly, for example, which would invalidate the cache. I'm usually working in C#, where I think you can be a bit clearer about the supported public interface, and more easily have some hidden, private optimizations! Python is a bit freer.
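For instance (a hypothetical illustration of the hazard, not code from the patch):

```python
import bs4

soup = bs4.BeautifulSoup("<div><p>a</p><p>b</p></div>", "html.parser")
div = soup.div
div.index(div.contents[0])  # with the patch, this populates the cache
div.contents.reverse()      # bypasses insert()/extract(): cache is now stale
# A later div.index(...) could return an index from before the reversal.
```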

I do have a workaround, anyway. I've monkey-patched your library, wrapping and replacing a couple of functions to give the caching. It's not beautiful, but it works.
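The general shape of it (a sketch, not my actual monkey-patch):

```python
import bs4

# Wrap Tag.insert so the original implementation still runs, with cache
# bookkeeping alongside; index and extract get the same treatment.
_original_insert = bs4.Tag.insert

def _insert_with_bookkeeping(self, position, *args, **kwargs):
    result = _original_insert(self, position, *args, **kwargs)
    # ...update the per-tag index cache here, as in the patch above...
    return result

bs4.Tag.insert = _insert_with_bookkeeping
```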

"inefficient ... calls to insert"
The problem isn't just unwrap(), though. I have other scripts that, for example, extract many, many elements. Or insert many, many elements. Or replace many, many elements. Each of these functions is slow, because each one touches index(). With a 15MB file, with 150_000 children in a div, it's very slow to CALCULATE the index every time. I can't share this document, but you could generate such a document? html > body > div, with 150_000 paragraphs. Generate it with bs4 even, to exercise the insert operation!
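A sketch of such a generator (20_000 children instead of 150_000 keeps the quadratic blow-up visible without taking many minutes):

```python
import time
import bs4

# Build html > body > div with many <p> children, via bs4 itself.
soup = bs4.BeautifulSoup("<html><body><div></div></body></html>", "html.parser")
div = soup.html.body.div
for n in range(20_000):  # my real case has 150_000
    p = soup.new_tag("p")
    p.append(f"paragraph {n}")
    div.append(p)

# unwrap() re-inserts every child into <body>; each insert() first
# extract()s the child from the div, and extract() calls the O(n)
# Tag.index(), so the whole operation is O(n^2).
start = time.perf_counter()
div.unwrap()
print(f"unwrap of {len(soup.html.body.contents)} children took "
      f"{time.perf_counter() - start:.1f}s")
```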

"test code"
Uh... I have a test that randomly generates a series of operations to manipulate a string, so I have a known start and end point, then performs those operations on a bs4 document, and finally asserts that the document is equivalent to the end string. It can catch some issues (I make sure it breaks when I comment out lines), but not all, to be honest... does that sound at all helpful?

Thanks for your hard work maintaining this library!

Cheers,
Luke

Leonard Richardson (leonardr) wrote:

As you can tell, Beautiful Soup wasn't designed for tree manipulation and I've never seriously used it that way. If I could recommend another solution I would, but I don't know of anything other than jQuery (where the "unwrap" idea comes from).

Insofar as it's possible, implementing these operations the way they're implemented in jQuery might help. But after looking at the jQuery source, I see that the JavaScript functions quickly call out to DOM methods, which may well be backed by a well-tested cache of element locations.

If you want to share the test generation code, yes, that could be used as a fuzzer.

Luke P (lkp80877984) wrote:

"wasn't designed for"
Well, to your credit, it does that pretty darn well!

Here it is (btw, I'm learning Python, so you're welcome to give other feedback):

```
class TestMonkeyPatchBs4(unittest.TestCase):
    def test_bs4(self):
        for seed in range(0, 1000):
            # phase 1: randomly build a series of operations
            # starting with abcdex and ending in an empty string
            ops = []
            random.seed(seed)
            current = "abcdex"
            proposed_ops = [
                "+f",
                "+g",
                "+h",
                "+i",
                "+j",
                "!?",
                "!?",
                "!?",
                "!?",
                "!?",
                "-?",
                "-?",
                "-?",
                "-?",
                "-?",
                "@",
                "@",
            ]
            # there's an edge case where we do 5 removals, then a replace, which would err if there were only 5 chars
            # this is vanishingly unlikely, but can happen, so an extra x char fixes this
            random.shuffle(proposed_ops)

            while proposed_ops:
                proposed_op = proposed_ops.pop()
                i = random.randrange(0, len(current))
                char = current[i]
                if proposed_op[0] == "+":
                    ops.append(("+", i))
                    current = current[:i] + proposed_op[1] + current[i:]
                elif proposed_op[0] == "-":
                    ops.append(("-", i, char))
                    current = current[:i] + current[i + 1 :]
                elif proposed_op[0] == "!":
                    ops.append(("!", i, char))
                    inv_char = char.lower() if char.isupper() else char.upper()
                    current = current[:i] + inv_char + current[i + 1 :]
                else:
                    ops.append(("@",))  # note the trailing comma: ("@") is just a string
                # print(ops[-1])
                # print(current)

            while len(current) > 0:
                i = random.randrange(0, len(current))
                char = current[i]
                ops.append(("-", i, char))
                current = current[:i] + current[i + 1 :]
                # print(ops[-1])
                # print(current)

            # phase 2: run the html equivalent of the inverse of those operations, in reverse order
            # we should always end with abcde, for a reliable assertion
            soup = bs4.BeautifulSoup("<html></html>", "html.parser")
            body = soup.new_tag("body")
            soup.html.append(body)
            for rop in reversed(ops):
                if rop[0] == "+":
                    # exercise the inverse of insert, extract
                    body.contents[rop[1]].extract()
                elif rop[0] == "-":
                    # exercise the inverse of substring removal, insert
                    p_tag = soup.new_tag("p")
                    p_tag.append(rop[2])
                    body.insert(rop[1], p_tag)
                elif rop[0] == "!":
                    # exercise the inverse of substring replacement, replace_with
                    p_tag = soup.new_tag("p")
                    p_tag.appen...
```

summary: - slow performance within Tag.index(child)
+ Improve search performance by keeping an index of tag contents
Changed in beautifulsoup:
status: New → Triaged
Changed in beautifulsoup:
importance: Undecided → Wishlist