Beautiful Soup

Customize html.parser TreeBuilder behavior when encountering the same attribute twice within a tag

Bug #1878209 reported by Leonard Richardson on 2020-05-12

This bug affects 1 person

Affects         Status        Importance  Assigned to  Milestone
Beautiful Soup  Fix Released  Undecided   Unassigned

Bug Description

Consider markup like this:

    <a href="http://a" href="http://b">

When this markup is parsed using the lxml or html5lib tree builders, the value of the 'href' attribute is the first one encountered; in this case "http://a". When the html.parser builder is used, the value is the _last_ one encountered; in this case "http://b".

With lxml and html5lib, there's nothing we can do to change this behavior. With html.parser, however, we get a list of attributes from the parser, and we can do what we want with that information.
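
To illustrate the point about html.parser, here is a minimal sketch using only the standard library. The class name AttributeRecorder and the tag name "a" are assumptions made for the example; only the 'href' values come from this report. It shows that the parser hands over every attribute, duplicates included, as a list of (name, value) pairs, and that naively turning the list into a dict keeps the last value.

```python
from html.parser import HTMLParser


class AttributeRecorder(HTMLParser):
    """Hypothetical helper: records the raw attribute list of each start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; duplicates are preserved.
        self.seen.append((tag, attrs))


recorder = AttributeRecorder()
recorder.feed('<a href="http://a" href="http://b">')

tag, attrs = recorder.seen[0]
print(attrs)

# Building a dict from the list lets later values overwrite earlier ones,
# which matches the "last one wins" behavior described above.
last_wins = dict(attrs)
print(last_wins["href"])
```

Because both values are still present in the list, the choice of which one to keep can be made after parsing.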

The solution should look like this:

Add an 'on_duplicate_attribute' argument to the HTMLParserTreeBuilder constructor, which can take the following values:

* REPLACE - replace the earlier value with the later value. This will be the default to maintain backwards compatibility.
* IGNORE - Ignore the later value. This will let the user give html.parser the same behavior as lxml and html5lib.
* A callable, which will be called for every value after the first one.
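
The three options above can be sketched in plain Python as follows. This is an illustration of the proposed policy, not the Beautiful Soup implementation: the function collect_attributes, the string constants, and the callable's (attributes_so_far, key, value) signature are all assumptions made for the example.

```python
REPLACE = "replace"  # assumed spelling of the constants
IGNORE = "ignore"


def collect_attributes(attrs, on_duplicate=REPLACE):
    """Hypothetical helper: fold a list of (name, value) pairs into a dict,
    applying the chosen policy whenever a name is seen again."""
    result = {}
    for key, value in attrs:
        if key not in result:
            result[key] = value          # first occurrence is always stored
        elif on_duplicate == REPLACE:
            result[key] = value          # later value overwrites earlier one
        elif on_duplicate == IGNORE:
            pass                         # keep the first value
        elif callable(on_duplicate):
            # Assumed signature: the callable decides what to do.
            on_duplicate(result, key, value)
        else:
            raise ValueError(f"unknown on_duplicate value: {on_duplicate!r}")
    return result


def accumulate(attributes_so_far, key, value):
    """Example callable: gather every duplicate value into a list."""
    if not isinstance(attributes_so_far[key], list):
        attributes_so_far[key] = [attributes_so_far[key]]
    attributes_so_far[key].append(value)


attrs = [("href", "http://a"), ("href", "http://b")]
replaced = collect_attributes(attrs)
ignored = collect_attributes(attrs, on_duplicate=IGNORE)
accumulated = collect_attributes(attrs, on_duplicate=accumulate)
print(replaced, ignored, accumulated)
```

Passing the partially built dict to the callable is one possible design; it lets the caller keep, replace, merge, or reject values without the builder needing to know about each strategy.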

At this point there will be enough features involving customizing the tree builder that this process should have its own section in the documentation.

Leonard Richardson (leonardr) on 2020-05-12

Changed in beautifulsoup:
status:	New → Triaged

Leonard Richardson (leonardr) wrote on 2020-05-17:

Revision 572 contains the implementation; documentation is forthcoming.

Leonard Richardson (leonardr) wrote on 2020-05-17:

Revision 575 has the documentation.

Changed in beautifulsoup:
status:	Triaged → Fix Committed

Leonard Richardson (leonardr) wrote on 2020-05-17:

Released in 4.9.1.

Changed in beautifulsoup:
status:	Fix Committed → Fix Released
