Beautiful Soup

Tag.sourceline and Tag.sourcepos aren't always set

Bug #2065904 reported by Chris Papademetrious on 2024-05-16

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Beautiful Soup	Fix Committed	Undecided	Unassigned

Bug Description

As a follow-up to this message group discussion:

====
copy.copy(soup) takes longer than expected
https://groups.google.com/g/beautifulsoup/c/r3AI81NFFxo
====

there is a call to find() in the __getattr__() method of Tag objects:

====
def __getattr__(self, tag):
    """Calling tag.subtag is the same as calling tag.find(name="subtag")"""
    if len(tag) > 3 and tag.endswith('Tag'):
        ...
    elif not tag.startswith("__") and not tag == "contents":
        return self.find(tag)
====

and this is getting called by the clone() method for these attribute queries:

====
def _clone(self):
    ...
    clone = type(self)(
        ...
        sourceline=self.sourceline, sourcepos=self.sourcepos,
====

Here are some runtimes:

parse HTML, create soup: 15 seconds
copy.copy() - original: 108 seconds
copy.copy() - without sourceline/sourcepos: 47 seconds

Retrieval of these two attributes is over half the copy.copy() runtime. We do many copy-and-modify operations during document processing, so hopefully this can be improved.

Revision history for this message

Leonard Richardson (leonardr) wrote on 2024-05-27:

As you discovered in the thread, the problem is that sourceline and sourcepos aren't always set on Tag. When a nonexistent attribute of Tag is accessed, Beautiful Soup treats it as a call to find() and starts looking for a child tag of that name. Here, sourceline and sourcepos just don't have values, because lxml doesn't provide that information (at least the ways we use it). The values should be set to None in the Tag constructor if they're not provided.

As of revision 8900598 in the 4.13 branch, sourceline and sourcepos are always set. Try your benchmark again and see what kind of improvement you see.

Changed in beautifulsoup:
status:	New → Fix Committed
summary:	- Improve copy.copy() runtime + Tag.sourceline and Tag.sourcepos aren't always set

Revision history for this message

Leonard Richardson (leonardr) wrote on 2024-05-27:

Actually there's a slight issue with my implementation, as there is behavior that relies on being able to stop .sourcepos and .sourceline from being set so you can easily search for tags with those names. I'm sure I did it this way for backwards compatibility reasons, but I don't think anyone was actually using it this wayu, and the previous implementation had a couple big problems. First, the same code can do drastically different things in different contexts. Second, Tag.sourcepos and Tag.sourceline don't have a consistent data type, which is an issue as of 4.13. So I'm leaning towards enforcing the behavior I just added.

Revision history for this message

Chris Papademetrious (chrispitude) wrote on 2024-05-27:

I wish I remembered what machine and file I generated the previous runtimes with, as the machine/file I chose this time have a much lower ratio between parsing and cloning.

On my Windows laptop under WSL:

====
4.12.3
git checkout 7fb51753743644e23dfb3e7a964d387583dd0bc0
parse HTML, create soup: 56 seconds
copy.copy(): 141 seconds

4.13 - before fix
git checkout 5104c454e47242dfa68e9c6c3b7659e054230017
parse HTML, create soup: 71 seconds
copy.copy(): 150 seconds

4.13 - after fix
git checkout 55c2e34525380ffd83867818169968826a625609
parse HTML, create soup: 71 seconds
copy.copy(): 53 seconds
====

On a Linux box (same HTML):

====
4.12.3
git checkout 7fb51753743644e23dfb3e7a964d387583dd0bc0
parse HTML, create soup: 22 seconds
copy.copy(): 51 seconds

4.13 - before fix
git checkout 5104c454e47242dfa68e9c6c3b7659e054230017
parse HTML, create soup: 28 seconds
copy.copy(): 55 seconds

4.13 - after fix
git checkout 55c2e34525380ffd83867818169968826a625609
parse HTML, create soup: 28 seconds
copy.copy(): 24 seconds
====

Great results!! Please feel free to close this as fixed.

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.