Tag.sourceline and Tag.sourcepos aren't always set
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Committed | Undecided | Unassigned |
Bug Description
As a follow-up to this message group discussion:
====
copy.copy(soup) takes longer than expected
https:/
====
there is a call to find() in the __getattr__() method of Tag objects:
====
def __getattr__(self, tag):
    """Calling tag.subtag is the same as calling tag.find(
    if len(tag) > 3 and tag.endswith(
        ...
    elif not tag.startswith(
        return self.find(tag)
====
and this is getting called by the _clone() method for these attribute queries:
====
def _clone(self):
    ...
    clone = type(self)(
        ...
====
Here are some runtimes:
parse HTML, create soup: 15 seconds
copy.copy() - original: 108 seconds
copy.copy() - without sourceline/
Retrieval of these two attributes is over half the copy.copy() runtime. We do many copy-and-modify operations during document processing, so hopefully this can be improved.
As you discovered in the thread, the problem is that sourceline and sourcepos aren't always set on Tag. When a nonexistent attribute of Tag is accessed, Beautiful Soup treats it as a call to find() and starts looking for a child tag of that name. Here, sourceline and sourcepos just don't have values, because lxml doesn't provide that information (at least not in the ways we use it). The values should be set to None in the Tag constructor if they're not provided.
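To illustrate the mechanism, here is a minimal sketch using a toy class. Node, find_calls, and set_source_attrs are hypothetical names made up for this example, and this is not Beautiful Soup's actual Tag code. It relies only on the fact that Python calls __getattr__ when normal attribute lookup fails, so an attribute that was never assigned falls through to a tree search, while one assigned None does not.

```python
class Node:
    """Toy stand-in for a tag; NOT Beautiful Soup's real Tag class."""

    find_calls = 0  # counts how often the expensive fallback search runs

    def __init__(self, name, children=(), sourceline=None, set_source_attrs=True):
        self.name = name
        self.children = list(children)
        if set_source_attrs:
            # Always assign, even when the value is None, so that normal
            # attribute lookup succeeds and __getattr__ is never consulted.
            self.sourceline = sourceline

    def find(self, name):
        """Depth-first search for a descendant with the given name."""
        Node.find_calls += 1
        for child in self.children:
            if child.name == name:
                return child
            found = child.find(name)
            if found is not None:
                return found
        return None

    def __getattr__(self, name):
        # Reached only when normal lookup fails. Guard dunder names and
        # "children" so a half-initialized object cannot recurse forever.
        if name.startswith("__") or name == "children":
            raise AttributeError(name)
        return self.find(name)


leaves = [Node("p", set_source_attrs=False) for _ in range(100)]
unset = Node("body", leaves, set_source_attrs=False)  # attribute never assigned
fixed = Node("body", leaves, set_source_attrs=True)   # attribute assigned None

Node.find_calls = 0
unset.sourceline                 # falls through to find() and walks the tree
calls_when_unset = Node.find_calls

Node.find_calls = 0
fixed.sourceline                 # plain attribute lookup, no search
calls_when_set = Node.find_calls
```

Both accesses return None, but only the first one pays for a search of every descendant, which is why the cost grows with document size.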
As of revision 8900598 in the 4.13 branch, sourceline and sourcepos are always set. Try your benchmark again and see what kind of improvement you see.
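For rerunning the comparison, a small timing helper along these lines may be useful. It is only a sketch: best_time and the stand-in document are assumptions made for illustration, and in the real benchmark the parsed soup would take the place of the dictionary.

```python
import copy
import time


def best_time(func, repeat=3):
    """Return the fastest wall-clock time, in seconds, over `repeat` calls."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in document (hypothetical); replace with the parsed soup.
document = {"tags": [{"name": "p", "attrs": {}} for _ in range(1000)]}

elapsed = best_time(lambda: copy.copy(document))
print(f"copy.copy(): {elapsed:.6f} seconds")
```

Taking the minimum of several runs reduces noise from other processes, which matters when comparing the before and after numbers.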