Give the target parser interface access to sourceline and sourcepos
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | New | Undecided | Unassigned |
Bug Description
I'm the developer of Beautiful Soup, passing on a feature request from some of my users: see https:/
When Beautiful Soup asks lxml to parse an HTML or XML document, it uses the target parser interface to receive parser events and build element objects. This is similar to what we do when using html5lib or html.parser as the parser.
When using html5lib and html.parser, we're able to keep track of the source of every tag that came from a source document -- line number and character position. With html5lib we do this by looking at parser.
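For illustration, here is a minimal sketch of how position tracking can work with html.parser from the standard library. It is not Beautiful Soup's actual tree builder code; the class name `PositionTrackingParser` and the `positions` list are hypothetical names made up for this example. It relies only on `HTMLParser.getpos()`, which returns the current line number (1-based) and offset (0-based).

```python
from html.parser import HTMLParser


class PositionTrackingParser(HTMLParser):
    """Hypothetical helper: records where each start tag was found."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() reports the (line, offset) of the tag being handled.
        line, offset = self.getpos()
        self.positions.append((tag, line, offset))


markup = "<html>\n  <body>\n    <p>Hello</p>\n  </body>\n</html>"
parser = PositionTrackingParser()
parser.feed(markup)
parser.close()

for tag, line, offset in parser.positions:
    print(f"<{tag}> at line {line}, offset {offset}")
```

The request in this report is for something comparable to be reachable from inside lxml's target callbacks.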
When using the default lxml parsing code, the source line of a given element is available as Element.sourceline. But when using the target parser interface, this information isn't available. This makes it impossible for Beautiful Soup to keep track of where in the document a tag was originally found.
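The difference can be sketched as follows. This is a small illustration, not Beautiful Soup's code: `CollectingTarget` is a hypothetical target class, and the sample document is made up. The default parse exposes `sourceline` on each element, while the target callbacks receive only the tag name and attributes.

```python
from lxml import etree

XML = b"<root>\n  <child/>\n</root>"

# Default parsing: lxml builds the tree itself and records line numbers.
root = etree.fromstring(XML)
default_lines = [(el.tag, el.sourceline) for el in root.iter()]


class CollectingTarget:
    """Hypothetical target object that just records parser events."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        # Only the tag name and attributes arrive here; no position data.
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, data):
        pass

    def close(self):
        return self.events


# Target parsing: fromstring() returns whatever the target's close() returns.
parser = etree.XMLParser(target=CollectingTarget())
events = etree.fromstring(XML, parser)

print(default_lines)
print(events)
```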
It may be that the solution is to change Beautiful Soup's lxml tree builder to use a custom Element class lookup instead of the target parser interface. I think this is possible--it would make the lxml tree builder more like the html5lib tree builder. But it seems like a lot of work to get a relatively small feature working, so I thought I'd see if there's any interest in expanding the target parser interface, or if there's some other way of doing this that I don't see.
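A rough sketch of the custom Element class lookup alternative, under the assumption that lxml builds the tree itself and Beautiful Soup would wrap or convert the result afterwards. `TrackedElement` and its `describe()` method are hypothetical names for this example. Note that this route provides `sourceline` only; it does not by itself give a character position.

```python
from lxml import etree


class TrackedElement(etree.ElementBase):
    """Hypothetical element class; sourceline is inherited from lxml."""

    def describe(self):
        return f"{self.tag}@{self.sourceline}"


# Ask the parser to instantiate TrackedElement for every element it creates.
lookup = etree.ElementDefaultClassLookup(element=TrackedElement)
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

root = etree.fromstring(b"<root>\n  <child/>\n</root>", parser)
print([el.describe() for el in root.iter()])
```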
Python : sys.version_
lxml.etree : (4, 4, 1, 0)
libxml used : (2, 9, 3)
libxml compiled : (2, 9, 3)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)