Give the target parser interface access to sourceline and sourcepos

Bug #1846906 reported by Leonard Richardson
This bug affects 1 person
Affects: lxml
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

I'm the developer of Beautiful Soup, passing on a feature request from some of my users: see https://bugs.launchpad.net/beautifulsoup/+bug/1742921.

When Beautiful Soup asks lxml to parse an HTML or XML document, it uses the target parser interface to receive parser events and build element objects. This is similar to what we do when using html5lib or html.parser as the parser.

When using html5lib or html.parser, we're able to keep track of where every tag came from in the source document (line number and character position). With html5lib we do this by calling parser.tokenizer.stream.position() when the DOM object is created. With html.parser, the parser and the event target are the same object, so we can get the current position during parsing by calling self.getpos().
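
For illustration, here is a minimal sketch of the html.parser approach described above. It is not Beautiful Soup's actual tree builder code; the class name PositionTrackingParser and its positions attribute are hypothetical. Only HTMLParser, handle_starttag() and getpos() come from the standard library.

```python
from html.parser import HTMLParser


class PositionTrackingParser(HTMLParser):
    """Hypothetical parser that records where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []  # list of (tag, line, column) tuples

    def handle_starttag(self, tag, attrs):
        # The parser is also the event target, so the current position
        # is available here. getpos() returns (lineno, offset), where
        # lineno is 1-based and offset is a 0-based column.
        line, column = self.getpos()
        self.positions.append((tag, line, column))


parser = PositionTrackingParser()
parser.feed("<html>\n  <body>\n    <p>Hello</p>\n  </body>\n</html>")
parser.close()

for tag, line, column in parser.positions:
    print(tag, line, column)
```

Because the handler runs on the parser object itself, no extra plumbing is needed to reach the position. That is the piece the lxml target interface lacks.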

When using the default lxml parsing code, the source line of a given element is available as Element.sourceline. But when using the target parser interface, this information isn't available. This makes it impossible for Beautiful Soup to keep track of where in the document a tag was originally found.
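
To make the gap concrete, here is a rough sketch that parses the same XML both ways. The function compare_position_info and the class EventCollector are hypothetical names made up for this example; the lxml calls used (etree.fromstring, etree.XMLParser(target=...), Element.sourceline, and the start/end/data/close target methods) are the documented ones.

```python
def compare_position_info(xml_text):
    """Parse xml_text with lxml's default builder and with a target parser."""
    # Imported inside the function so the sketch only needs lxml when called.
    from lxml import etree

    # Default tree building: every element remembers its source line.
    root = etree.fromstring(xml_text)
    default_lines = [(el.tag, el.sourceline) for el in root.iter()]

    class EventCollector:
        """Hypothetical parser target that just records the events it gets."""

        def __init__(self):
            self.events = []

        def start(self, tag, attrib):
            # Only the tag name and attributes arrive here; nothing in
            # the arguments says where in the document the tag was found.
            self.events.append(("start", tag))

        def end(self, tag):
            self.events.append(("end", tag))

        def data(self, data):
            pass  # text content is ignored in this sketch

        def close(self):
            # Whatever close() returns becomes the result of the parse call.
            return self.events

    parser = etree.XMLParser(target=EventCollector())
    target_events = etree.fromstring(xml_text, parser)
    return default_lines, target_events
```

Calling compare_position_info("<root>\n  <child/>\n</root>") returns line numbers for the default parse, but only tag names for the target parse.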

It may be that the solution is to change Beautiful Soup's lxml tree builder to use a custom Element class lookup instead of the target parser interface. I think this is possible; it would make the lxml tree builder more like the html5lib tree builder. But it seems like a lot of work to get a relatively small feature working, so I thought I'd see if there's any interest in expanding the target parser interface, or if there's some other way of doing this that I don't see.

Python : sys.version_info(major=3, minor=5, micro=0, releaselevel='final', serial=0)
lxml.etree : (4, 4, 1, 0)
libxml used : (2, 9, 3)
libxml compiled : (2, 9, 3)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)
