lxml TreeBuilder does not give ElementFilter a chance to filter a Doctype
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Beautiful Soup | Fix Committed | Undecided | Unassigned | |
Bug Description
Markup:
<!DOCTYPE html>
<html>
</html>
html.parser TreeBuilder code for handling the doctype:
def handle_decl(self, data:str) -> None:
    data = data[len("DOCTYPE "):]
lxml TreeBuilder code for the same:
def doctype(self, name:str, pubid:str, system:str) -> None:
    assert self.soup is not None
    doctype = Doctype.
endData() checks with the active ElementFilter whether a string should be created at all. But the lxml TreeBuilder doesn't call endData(), because lxml passes the content of the doctype as three pieces of information instead of a single string. The lxml TreeBuilder calls Doctype.
To fix, move the essential code out of Doctype.
object_was_parsed() is not called directly anywhere else in the html.parser or lxml TreeBuilders, so this is probably the only bug of this sort. (It is called in the html5lib TreeBuilder, but that TreeBuilder doesn't support ElementFilter anyway.)
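
The difference between the two code paths can be illustrated with a toy model. This is only a sketch: ToySoup, end_data(), allow_string and the three helper functions are hypothetical stand-ins written for this example, not Beautiful Soup's real classes or signatures, and the "fixed" variant shows just one possible shape of a fix (joining the pieces into a string and sending it through the filter-aware path), which may differ from what the committed revision does.

```python
# Toy model only: every name here is a hypothetical stand-in,
# not Beautiful Soup's real API.
from typing import Callable, List, Optional


class ToySoup:
    """Collects parsed strings, consulting an optional filter first."""

    def __init__(self, allow_string: Optional[Callable[[str], bool]] = None) -> None:
        self.allow_string = allow_string  # plays the role of an ElementFilter
        self.buffer: List[str] = []
        self.tree: List[str] = []

    def handle_data(self, data: str) -> None:
        # Buffer incoming text until end_data() is called.
        self.buffer.append(data)

    def end_data(self) -> None:
        # Filter-aware path: the filter decides whether a string is created.
        text = "".join(self.buffer)
        self.buffer = []
        if not text:
            return
        if self.allow_string is not None and not self.allow_string(text):
            return
        self.object_was_parsed(text)

    def object_was_parsed(self, obj: str) -> None:
        # Unconditional path: attaches the object without asking the filter.
        self.tree.append(obj)


def single_string_doctype(soup: ToySoup, data: str) -> None:
    # Builder that gets the doctype as one string and goes through end_data().
    soup.end_data()
    soup.handle_data(data[len("DOCTYPE "):])
    soup.end_data()


def three_part_doctype_buggy(soup: ToySoup, name: str, pubid: str, system: str) -> None:
    # Builder that gets three pieces and attaches the result directly,
    # so the filter is never consulted.
    soup.object_was_parsed(name)


def three_part_doctype_fixed(soup: ToySoup, name: str, pubid: str, system: str) -> None:
    # Assumed shape of a fix: build the string, then use the filter-aware path.
    soup.end_data()
    soup.handle_data(name)
    soup.end_data()
```

With a filter that rejects every string, the single-string builder and the fixed three-part builder leave the tree empty, while the buggy three-part builder still adds the doctype.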
Changed in beautifulsoup:
status: New → Confirmed
Fixed in revision 8db6f83.