Customize html.parser TreeBuilder behavior when encountering the same attribute twice within a tag
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Released | Undecided | Unassigned |
Bug Description
Consider markup like this:
<a href="http://
When this markup is parsed using the lxml or html5lib tree builders, the value of the 'href' attribute is the first one encountered; in this case "http://
With lxml and html5lib, there's nothing we can do to change this behavior; but with html.parser, we get a list of attributes from the parser and we can do what we want with that information.
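As a rough illustration of that last point, the sketch below feeds a tag with a repeated attribute to the standard library's `html.parser.HTMLParser` and records what `handle_starttag` receives. The `AttrRecorder` class and the sample markup are made up for this example; they are not part of Beautiful Soup.

```python
from html.parser import HTMLParser


class AttrRecorder(HTMLParser):
    """Hypothetical helper: records each start tag with its raw attribute list."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs in source order;
        # html.parser does not collapse repeated attribute names.
        self.seen.append((tag, attrs))


recorder = AttrRecorder()
# Illustrative markup only: the same attribute appears twice in one tag.
recorder.feed('<a href="first" href="second">link</a>')
recorder.close()

print(recorder.seen)
```

Because both pairs reach the handler, the code that builds the tree can decide which value to keep.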
The solution should look like this:
Add a 'on_duplicate_
* REPLACE - replace the earlier value with the later value. This will be the default to maintain backwards compatibility.
* IGNORE - Ignore the later value. This will let the user give html.parser the same behavior as lxml and html5lib.
* A callable, which will be called with every value after the first one.
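The three options above could be sketched roughly as follows. This is a standalone illustration, not the Beautiful Soup implementation: the `merge_attrs` function, the `REPLACE` and `IGNORE` constants as plain strings, and the `(merged, name, value)` signature of the callable are all assumptions made for the example.

```python
# Hypothetical policy markers; the real values may differ.
REPLACE = "replace"
IGNORE = "ignore"


def merge_attrs(attrs, on_duplicate=REPLACE):
    """Collapse a list of (name, value) pairs into a dict.

    on_duplicate decides what happens when a name repeats:
    REPLACE keeps the later value, IGNORE keeps the earlier one,
    and a callable is invoked for every value after the first.
    """
    merged = {}
    for name, value in attrs:
        if name not in merged:
            merged[name] = value          # first occurrence is always stored
        elif on_duplicate == REPLACE:
            merged[name] = value          # later value wins
        elif on_duplicate == IGNORE:
            continue                      # earlier value wins
        elif callable(on_duplicate):
            on_duplicate(merged, name, value)  # assumed signature
        else:
            raise ValueError(f"unknown duplicate policy: {on_duplicate!r}")
    return merged


def collect_all(merged, name, value):
    """Example callable: accumulate every value for a name into a list."""
    existing = merged[name]
    if not isinstance(existing, list):
        existing = [existing]
        merged[name] = existing
    existing.append(value)


pairs = [("href", "first"), ("href", "second")]
print(merge_attrs(pairs))               # later value kept
print(merge_attrs(pairs, IGNORE))       # earlier value kept
print(merge_attrs(pairs, collect_all))  # both values kept
```

Passing a callable leaves the choice to the caller, for example to keep all values or to log the duplicate.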
At this point there will be enough features involving customizing the tree builder that this process should have its own section in the documentation.
Changed in beautifulsoup:
status: New → Triaged
Revision 572 contains the implementation; documentation is forthcoming.