Tag.sourceline and Tag.sourcepos aren't always set

Bug #2065904 reported by Chris Papademetrious
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Beautiful Soup
Fix Committed
Undecided
Unassigned

Bug Description

As a follow-up to this message group discussion:

====
copy.copy(soup) takes longer than expected
https://groups.google.com/g/beautifulsoup/c/r3AI81NFFxo
====

there is a call to find() in the __getattr__() method of Tag objects:

====
def __getattr__(self, tag):
    """Calling tag.subtag is the same as calling tag.find(name="subtag")"""
    if len(tag) > 3 and tag.endswith('Tag'):
        ...
    elif not tag.startswith("__") and not tag == "contents":
        return self.find(tag)
====

and this is getting called by the clone() method for these attribute queries:

====
def _clone(self):
    ...
    clone = type(self)(
        ...
        sourceline=self.sourceline, sourcepos=self.sourcepos,
====

Here are some runtimes:

parse HTML, create soup: 15 seconds
copy.copy() - original: 108 seconds
copy.copy() - without sourceline/sourcepos: 47 seconds

Retrieval of these two attributes is over half the copy.copy() runtime. We do many copy-and-modify operations during document processing, so hopefully this can be improved.

Revision history for this message
Leonard Richardson (leonardr) wrote :

As you discovered in the thread, the problem is that sourceline and sourcepos aren't always set on Tag. When a nonexistent attribute of Tag is accessed, Beautiful Soup treats it as a call to find() and starts looking for a child tag of that name. Here, sourceline and sourcepos just don't have values, because lxml doesn't provide that information (at least the ways we use it). The values should be set to None in the Tag constructor if they're not provided.

As of revision 8900598 in the 4.13 branch, sourceline and sourcepos are always set. Try your benchmark again and see what kind of improvement you see.

Changed in beautifulsoup:
status: New → Fix Committed
summary: - Improve copy.copy() runtime
+ Tag.sourceline and Tag.sourcepos aren't always set
Revision history for this message
Leonard Richardson (leonardr) wrote :

Actually there's a slight issue with my implementation, as there is behavior that relies on being able to stop .sourcepos and .sourceline from being set so you can easily search for tags with those names. I'm sure I did it this way for backwards compatibility reasons, but I don't think anyone was actually using it this wayu, and the previous implementation had a couple big problems. First, the same code can do drastically different things in different contexts. Second, Tag.sourcepos and Tag.sourceline don't have a consistent data type, which is an issue as of 4.13. So I'm leaning towards enforcing the behavior I just added.

Revision history for this message
Chris Papademetrious (chrispitude) wrote :

I wish I remembered what machine and file I generated the previous runtimes with, as the machine/file I chose this time have a much lower ratio between parsing and cloning.

On my Windows laptop under WSL:

====
4.12.3
git checkout 7fb51753743644e23dfb3e7a964d387583dd0bc0
parse HTML, create soup: 56 seconds
copy.copy(): 141 seconds

4.13 - before fix
git checkout 5104c454e47242dfa68e9c6c3b7659e054230017
parse HTML, create soup: 71 seconds
copy.copy(): 150 seconds

4.13 - after fix
git checkout 55c2e34525380ffd83867818169968826a625609
parse HTML, create soup: 71 seconds
copy.copy(): 53 seconds
====

On a Linux box (same HTML):

====
4.12.3
git checkout 7fb51753743644e23dfb3e7a964d387583dd0bc0
parse HTML, create soup: 22 seconds
copy.copy(): 51 seconds

4.13 - before fix
git checkout 5104c454e47242dfa68e9c6c3b7659e054230017
parse HTML, create soup: 28 seconds
copy.copy(): 55 seconds

4.13 - after fix
git checkout 55c2e34525380ffd83867818169968826a625609
parse HTML, create soup: 28 seconds
copy.copy(): 24 seconds
====

Great results!! Please feel free to close this as fixed.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.