lxml TreeBuilder does not give ElementFilter a chance to filter a Doctype

Bug #2062000 reported by Leonard Richardson
This bug affects 1 person
Affects         Status         Importance  Assigned to  Milestone
Beautiful Soup  Fix Committed  Undecided   Unassigned

Bug Description

Markup:

<!DOCTYPE html>
<html>
</html>

html.parser TreeBuilder code for handling the doctype:

    def handle_decl(self, data:str) -> None:
        self.soup.endData()
        data = data[len("DOCTYPE "):]
        self.soup.handle_data(data)
        self.soup.endData(Doctype)

lxml TreeBuilder code for the same:

    def doctype(self, name:str, pubid:str, system:str) -> None:
        assert self.soup is not None
        self.soup.endData()
        doctype = Doctype.for_name_and_ids(name, pubid, system)
        self.soup.object_was_parsed(doctype)

endData() checks with the active ElementFilter whether a string should be created at all. The lxml TreeBuilder does call endData(), but only to flush any text that came before the doctype; the doctype's own content never goes through it, because lxml passes that content as three pieces of information (name, public ID, system ID) instead of a single string. The lxml TreeBuilder calls Doctype.for_name_and_ids() to turn the three pieces into a Doctype object and hands that object straight to object_was_parsed(), so the ElementFilter is never consulted.
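
The difference between the two paths can be illustrated with a toy model. This is only a sketch: ToySoup, allow_string and reject_everything are made-up names, not Beautiful Soup's real classes. The only thing it borrows from the description above is that endData() asks the filter before creating a string and object_was_parsed() does not.

```python
from typing import Callable, List, Tuple


class ToySoup:
    """A toy stand-in for the parser state; NOT Beautiful Soup's real class."""

    def __init__(self, allow_string: Callable[[str], bool]) -> None:
        self.allow_string = allow_string        # plays the role of ElementFilter
        self.pending: List[str] = []            # text buffered by handle_data()
        self.tree: List[Tuple[str, str]] = []   # (kind, text) pairs that were kept

    def handle_data(self, data: str) -> None:
        self.pending.append(data)

    def endData(self, kind: str = "NavigableString") -> None:
        if not self.pending:
            return
        text = "".join(self.pending)
        self.pending = []
        # The filter is consulted here, before any object is created.
        if self.allow_string(text):
            self.object_was_parsed((kind, text))

    def object_was_parsed(self, obj: Tuple[str, str]) -> None:
        # No filter check: whatever arrives here goes into the tree.
        self.tree.append(obj)


def reject_everything(text: str) -> bool:
    """A filter that refuses every string."""
    return False


# html.parser-style path: the doctype text is buffered, then endData() decides.
soup_a = ToySoup(reject_everything)
soup_a.endData()
soup_a.handle_data("html")
soup_a.endData("Doctype")

# lxml-style path: a ready-made object is handed to object_was_parsed().
soup_b = ToySoup(reject_everything)
soup_b.endData()
soup_b.object_was_parsed(("Doctype", "html"))

print(soup_a.tree)  # []
print(soup_b.tree)  # [('Doctype', 'html')]
```

In the toy model the filter rejects everything, yet the second path still ends up with a doctype in the tree, which is the shape of the bug described here.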

To fix, move the essential code out of Doctype.for_name_and_ids() into a new method that just returns the string. The lxml TreeBuilder can then use that method and send the string through handle_data() and endData(Doctype), the same way the html.parser TreeBuilder does. for_name_and_ids() could be deprecated at that point, but it may be useful for people creating documents, so I probably won't deprecate it.
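
A rough sketch of what such a string-only helper might look like. The function name doctype_string is hypothetical, and the exact output format is an assumption based on the usual shape of a doctype declaration (name, then PUBLIC or SYSTEM identifiers in quotes); the code actually committed may differ.

```python
from typing import Optional


def doctype_string(
    name: Optional[str], pub_id: Optional[str], system_id: Optional[str]
) -> str:
    """Hypothetical helper: join lxml's three doctype pieces into one string.

    Missing pieces are assumed to arrive as None or an empty string.
    """
    value = name or ""
    if pub_id:
        value += ' PUBLIC "%s"' % pub_id
        if system_id:
            value += ' "%s"' % system_id
    elif system_id:
        value += ' SYSTEM "%s"' % system_id
    return value


print(doctype_string("html", None, None))        # html
print(doctype_string("note", None, "note.dtd"))  # note SYSTEM "note.dtd"
```

With a helper like this, the lxml handler could pass the resulting string to handle_data() and then call endData(Doctype), mirroring the html.parser code quoted above, so the ElementFilter gets its chance to reject the doctype.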

object_was_parsed() is not called directly anywhere else in the html.parser or lxml TreeBuilders, so this is probably the only bug of this sort. (It is called in the html5lib TreeBuilder, but that TreeBuilder doesn't support ElementFilter anyway.)

Changed in beautifulsoup:
status: New → Confirmed
Leonard Richardson (leonardr) wrote :

Fixed in revision 8db6f83.

Changed in beautifulsoup:
status: Confirmed → Fix Committed