Beautiful Soup

lxml TreeBuilder does not give ElementFilter a chance to filter a Doctype

Bug #2062000 reported by Leonard Richardson on 2024-04-17

This bug affects 1 person

Affects         Status         Importance  Assigned to  Milestone
Beautiful Soup  Fix Committed  Undecided   Unassigned

Bug Description

Markup:

<!DOCTYPE html>
<html>
</html>

html.parser TreeBuilder code for handling the doctype:

    def handle_decl(self, data:str) -> None:
        self.soup.endData()
        data = data[len("DOCTYPE "):]
        self.soup.handle_data(data)
        self.soup.endData(Doctype)

lxml TreeBuilder code for the same:

    def doctype(self, name:str, pubid:str, system:str) -> None:
        assert self.soup is not None
        self.soup.endData()
        doctype = Doctype.for_name_and_ids(name, pubid, system)
        self.soup.object_was_parsed(doctype)

endData() checks with the active ElementFilter whether a string should be created at all. But the lxml TreeBuilder never routes the doctype through endData(): its only endData() call flushes whatever text came before the doctype. This happens because lxml passes the content of the doctype as three pieces of information (name, public ID, system ID) instead of a single string. The lxml TreeBuilder calls Doctype.for_name_and_ids() to turn the three pieces into a Doctype object and calls object_was_parsed() on that object directly, so the ElementFilter is never consulted.
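A toy model may make the two code paths easier to compare. ToySoup and its allow_string callback are hypothetical stand-ins invented for this sketch; they are not Beautiful Soup's real classes, and the real endData() does more than this. The only point is that a filter consulted inside endData() is skipped when a caller goes straight to object_was_parsed().

```python
from typing import Callable, List, Tuple


class ToySoup:
    """Simplified stand-in for the object a TreeBuilder talks to (hypothetical)."""

    def __init__(self, allow_string: Callable[[str], bool]) -> None:
        # allow_string plays the role of the ElementFilter check (assumption).
        self.allow_string = allow_string
        self.buffer: List[str] = []
        self.tree: List[Tuple[str, str]] = []

    def handle_data(self, data: str) -> None:
        # Accumulate text until endData() decides what to do with it.
        self.buffer.append(data)

    def endData(self, container: str = "NavigableString") -> None:
        # Flush the buffer; the filter is consulted here and only here.
        if not self.buffer:
            return
        text = "".join(self.buffer)
        self.buffer = []
        if self.allow_string(text):
            self.object_was_parsed((container, text))

    def object_was_parsed(self, obj: Tuple[str, str]) -> None:
        # Adds the object to the tree unconditionally.
        self.tree.append(obj)


def reject_everything(text: str) -> bool:
    """A filter that refuses to create any string."""
    return False


# Path taken by the html.parser TreeBuilder: the doctype goes through endData().
filtered = ToySoup(reject_everything)
filtered.endData()
filtered.handle_data("html")
filtered.endData("Doctype")

# Path taken by the lxml TreeBuilder: the doctype skips endData().
bypassed = ToySoup(reject_everything)
bypassed.endData()
bypassed.object_was_parsed(("Doctype", "html"))
```

In this model, filtered.tree stays empty while bypassed.tree ends up holding the doctype, even though both use the same filter.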

To fix, move the essential code out of Doctype.for_name_and_ids() into a new method that just returns the string. The lxml TreeBuilder can then use that method and feed the string through handle_data() and endData(Doctype), the same way the html.parser TreeBuilder does. for_name_and_ids() could be deprecated at that point, but it may be useful for people creating documents, so I probably won't deprecate it.
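A minimal sketch of what such a string-returning helper might look like. The name doctype_string is hypothetical, and the exact formatting rules inside for_name_and_ids() are not shown in this report; the sketch assumes the conventional DOCTYPE layout (name, then an optional PUBLIC or SYSTEM identifier) and that a missing identifier arrives as None.

```python
from typing import Optional


def doctype_string(name: Optional[str],
                   pub_id: Optional[str],
                   system_id: Optional[str]) -> str:
    """Join the three doctype fields into one string (hypothetical helper)."""
    value = name or ""
    if pub_id is not None:
        # A public identifier may be followed by a system identifier.
        value += f' PUBLIC "{pub_id}"'
        if system_id is not None:
            value += f' "{system_id}"'
    elif system_id is not None:
        # A system identifier on its own uses the SYSTEM keyword.
        value += f' SYSTEM "{system_id}"'
    return value


simple = doctype_string("html", None, None)            # "html"
system_only = doctype_string("note", None, "note.dtd")  # 'note SYSTEM "note.dtd"'
```

With a helper like this, the lxml doctype() callback would have a plain string to hand to handle_data() before calling endData(Doctype), which is what gives the ElementFilter its chance to reject it.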

object_was_parsed() is not called directly anywhere else in the html.parser or lxml TreeBuilders, so this is probably the only bug of this sort. (It is called in the html5lib TreeBuilder, but that TreeBuilder doesn't support ElementFilter anyway.)

Leonard Richardson (leonardr) on 2024-04-17

Changed in beautifulsoup:
status:	New → Confirmed

Leonard Richardson (leonardr) wrote on 2024-05-24:

Fixed in revision 8db6f83.

Changed in beautifulsoup:
status:	Confirmed → Fix Committed
