Customize html.parser TreeBuilder behavior when encountering the same attribute twice within a tag

Bug #1878209 reported by Leonard Richardson
Affects: Beautiful Soup
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Consider markup like this:

<a href="http://a" href="http://b"/>

When this markup is parsed using the lxml or html5lib tree builders, the value of the 'href' attribute is the first one encountered; in this case "http://a". When the html.parser builder is used, the value is the _last_ one encountered; in this case "http://b".

With lxml and html5lib, there's nothing we can do to change this behavior; but with html.parser, we get a list of attributes from the parser and we can do what we want with that information.
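To illustrate the point about html.parser handing over the full attribute list, here is a minimal sketch using only the standard library. AttrRecorder is a hypothetical helper written for this example and is not part of Beautiful Soup; it just records what handle_starttag receives.

```python
from html.parser import HTMLParser


class AttrRecorder(HTMLParser):
    """Hypothetical helper: records the raw attribute list of each start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples in document order,
        # so a repeated attribute appears once per occurrence.
        self.seen.append((tag, attrs))


recorder = AttrRecorder()
recorder.feed('<a href="http://a" href="http://b"/>')
recorder.close()
print(recorder.seen)
```

Because both 'href' values arrive in order, the tree builder is free to keep the first, keep the last, or do something else with them.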

The solution should look like this:

Add an 'on_duplicate_attribute' argument to the HTMLParserTreeBuilder constructor which can take the following values:

* REPLACE - replace the earlier value with the later value. This will be the default to maintain backwards compatibility.
* IGNORE - Ignore the later value. This will let the user give html.parser the same behavior as lxml and html5lib.
* A callable - called for each value after the first one, so the user can decide what to do with it.
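The three options above could be sketched roughly as follows, on top of the standard library's HTMLParser rather than the real HTMLParserTreeBuilder. This is an illustration under stated assumptions, not the implementation from revision 572: the class DuplicateAwareParser, the helpers parse_attrs and accumulate, the string values of the constants, and the three-argument callback signature are all assumptions of this sketch.

```python
from html.parser import HTMLParser

# Assumed constant values for this sketch.
REPLACE = "replace"  # the later value overwrites the earlier one
IGNORE = "ignore"    # the later value is dropped


class DuplicateAwareParser(HTMLParser):
    """Hypothetical parser that applies a duplicate-attribute policy."""

    def __init__(self, on_duplicate_attribute=REPLACE):
        super().__init__()
        self.on_duplicate_attribute = on_duplicate_attribute
        self.tags = []

    def handle_starttag(self, tag, attrs):
        attr_dict = {}
        for key, value in attrs:
            if key not in attr_dict:
                attr_dict[key] = value          # first occurrence
            elif callable(self.on_duplicate_attribute):
                # Assumed signature: (attributes so far, key, new value).
                self.on_duplicate_attribute(attr_dict, key, value)
            elif self.on_duplicate_attribute == IGNORE:
                pass                            # keep the first value
            else:
                attr_dict[key] = value          # REPLACE (default)
        self.tags.append((tag, attr_dict))


def parse_attrs(markup, policy=REPLACE):
    """Return the attribute dict of the first start tag in the markup."""
    parser = DuplicateAwareParser(policy)
    parser.feed(markup)
    parser.close()
    return parser.tags[0][1]


def accumulate(attrs_so_far, key, value):
    """Example callable: collect every duplicate value into a list."""
    if not isinstance(attrs_so_far[key], list):
        attrs_so_far[key] = [attrs_so_far[key]]
    attrs_so_far[key].append(value)


MARKUP = '<a href="http://a" href="http://b"/>'
print(parse_attrs(MARKUP))              # REPLACE policy
print(parse_attrs(MARKUP, IGNORE))      # IGNORE policy
print(parse_attrs(MARKUP, accumulate))  # callable policy
```

Passing the attributes collected so far to the callable is one possible design; it lets the caller merge values, raise an error, or log the duplicate without the parser needing to know which.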

At this point there will be enough features involving customizing the tree builder that this process should have its own section in the documentation.

Changed in beautifulsoup:
status: New → Triaged
Leonard Richardson (leonardr) wrote :

Revision 572 contains the implementation; documentation is forthcoming.

Leonard Richardson (leonardr) wrote :

Revision 575 has the documentation.

Changed in beautifulsoup:
status: Triaged → Fix Committed
Leonard Richardson (leonardr) wrote :

Released in 4.9.1.

Changed in beautifulsoup:
status: Fix Committed → Fix Released