Improve search performance by keeping an index of tag contents
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Beautiful Soup |
Triaged
|
Wishlist
|
Unassigned |
Bug Description
Hi,
I tested against a 3MB HTML document, but I want to run against even larger documents where the performance issue is even more severe. The document has a html > body > div, and the div directly contains many elements. To test, I was selecting the div, and unwrapping it. I profiled this operation with cProfile, and found that the vast majority of the time was spent within tag.Index(child), as the index is implemented by enumerating the list of children for a match.
I've experimented with a fix that attempts to cache the index of all contents. The overall performance of my test case improved by a factor of ~20, and the performance of the Index function improved by a factor of ~500. The memory peak (measured with tracemalloc) increased from ~34MB to ~37MB, which seems minimal. It maps child id to index. It gets partially invalidated when children are added/removed, and lazily reconstructed when you fetch the index.
Here's a patch file to work with, demonstrating the fix.
```
--- .../element.py
+++ .../element_
@@ -279,7 +279,11 @@
:return: `self`, no longer part of the tree.
"""
if self.parent is not None:
- del self.parent.
+ del_index = self.parent.
+ del self.parent.
+ del self.parent.
+ self.parent.
#Find the two elements that would be next to each other if
#this element (and any children) hadn't been parsed. Connect
@@ -405,6 +409,9 @@
if new_childs_
+ self._contents_
+ self._contents_
def append(self, tag):
"""Appends the given PageElement to the contents of this one.
@@ -1051,6 +1058,9 @@
self.attrs = attrs
+ self._contents_
+ self._contents_
@@ -1218,6 +1228,9 @@
next = i.next_element
+ i._contents_cache = {}
+ i._contents_
i = next
def clear(self, decompose=False):
@@ -1283,9 +1296,14 @@
:param element: Look for this PageElement in `self.contents`.
"""
- for i, child in enumerate(
- if child is element:
- return i
+ len_contents = len(self.contents)
+ while self._contents_
+ self._contents_
+ self._contents_
+ i = self._contents_
+ if i != -1:
+ return i
raise ValueError(
def get(self, key, default=None):
```
Thanks for your work maintaining this!
Cheers,
Luke
summary: |
- slow performance within Tag.index(child) + Improve search performance by keeping an index of tag contents |
Changed in beautifulsoup: | |
status: | New → Triaged |
Changed in beautifulsoup: | |
importance: | Undecided → Wishlist |
As a further optimization in replace_with - because you're swapping one for one, extract removes the removed item from the cache, insert adds the inserted item to the cache, the cache doesn't actually get any more stale.
``` index(self) _contents_ cache_stale
self.extract( )
old_parent. insert( my_index, replace_with) _contents_ cache_stale = last_stale
old_parent = self.parent
my_index = self.parent.
+ last_stale = old_parent.
+ old_parent.
return self
```
Cheers,
Luke