Add a way of reporting line number and position of html elements

Bug #1742921 reported by Petr Dlouhý
This bug affects 3 people
Affects: Beautiful Soup
Status: Fix Released
Importance: Wishlist
Assigned to: Unassigned
Milestone: (none)

Bug Description

It would be nice if Beautiful Soup could report the line number and position of HTML elements.
It was once requested here:
https://groups.google.com/forum/?fromgroups#!searchin/beautifulsoup/line$20numbers%7Csort:date/beautifulsoup/sy2skfowsso/j8vh7mhcmgUJ
And we would like such functionality for the Linkchecker project:
https://github.com/linkcheck/linkchecker/pull/119

Tags: feature
Leonard Richardson (leonardr) wrote :

Thanks for putting this feature request into an issue. I don't plan to implement this feature, but I'm marking the issue as confirmed and I will accept a pull request that adds this feature.

Changed in beautifulsoup:
status: New → Confirmed
tags: added: feature
Changed in beautifulsoup:
importance: Undecided → Wishlist
Chris Mayo (chris-mayo) wrote :

I've had a go at this, with html.parser. Patch attached (unfortunately I'm less successful trying to get bzr and launchpad working).

This adds the variables lineno and offset to the Tag class (named as per html.parser).
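For readers unfamiliar with where those names come from: the standard library's html.parser exposes the current position through HTMLParser.getpos(), which returns a (lineno, offset) pair. The following is only a minimal sketch of that mechanism, not the attached patch; the PositionRecorder class and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class PositionRecorder(HTMLParser):
    """Hypothetical helper: records where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []  # list of (tag name, lineno, offset)

    def handle_starttag(self, tag, attrs):
        # getpos() reports the 1-based line number and 0-based column
        # of the construct currently being handled.
        lineno, offset = self.getpos()
        self.positions.append((tag, lineno, offset))


markup = "<p>one</p>\n<p>two <b>bold</b></p>"
recorder = PositionRecorder()
recorder.feed(markup)
recorder.close()

for name, lineno, offset in recorder.positions:
    print(name, lineno, offset)
```

A tree builder sitting on top of html.parser could copy these two values onto each element as it is created, which is roughly the idea described above.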

Let me know what you think about this approach and what would be needed to get it included.
I have got a version of linkchecker working with this.

Chris Mayo (chris-mayo) wrote :

Patch updated for Beautiful Soup 4.8.0

Leonard Richardson (leonardr) wrote :

Chris,

Thanks for this work. Unfortunately I forgot you had a patch for this until after I released 4.8.0.

I adapted your patch in revision 515. The TreeBuilder keyword arguments introduced in 4.8.0 make it easy to turn this feature off. In revision 516 I wrote similar code for html5lib. Unfortunately it looks like lxml won't work: for performance reasons we use the target parser interface, which doesn't provide any access to this information.

The main change I made with this patch is I renamed the fields to "sourceline" and "sourcepos". "sourceline" is what lxml calls this concept, so if we ever do support lxml we'll have a consistent naming scheme. And "sourceline" and "sourcepos" are less likely than "lineno" and "offset" to be tag names in real markup.
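As a rough illustration of how the renamed fields can be read, here is a small sketch. It assumes an installed Beautiful Soup release that includes these revisions and uses the html.parser tree builder; the store_line_numbers keyword is, as far as I understand the documentation, the constructor argument that switches the feature off. The sample markup is made up.

```python
from bs4 import BeautifulSoup

markup = "<p>one</p>\n<p>two <b>bold</b></p>"

# html.parser is one of the builders that tracks positions;
# per the comment above, lxml does not.
soup = BeautifulSoup(markup, "html.parser")

# Collect (name, line, position) for every tag in document order.
positions = [(tag.name, tag.sourceline, tag.sourcepos)
             for tag in soup.find_all(True)]
for entry in positions:
    print(entry)

# Assumption: store_line_numbers=False is forwarded to the tree builder
# and disables position tracking, leaving the fields unset.
plain = BeautifulSoup(markup, "html.parser", store_line_numbers=False)
print(plain.p.sourceline)
```

With html.parser, sourceline counts lines from 1 and sourcepos is the column offset of the tag within that line; html5lib may report positions differently, so the two should not be assumed interchangeable.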

Changed in beautifulsoup:
status: Confirmed → Fix Committed
Chris Mayo (chris-mayo) wrote :

Great to see it committed. Thanks for finishing it off. I knew the patch wasn't 100% ready, but as the idea turned out to be relatively simple to implement, I thought it was worth trying to get it going.

Our conversion of LinkChecker to Python 3 proceeds slowly. I suspect you will have a new release out before we do. In the meantime it's easy to point Travis CI at an updated copy so we can keep developing. Having line numbers means we don't need to throw existing code away or reduce what the user sees. Being able to replace the old parser so easily also helps the conversion and should reduce maintenance in the future.

Naming is always a challenge. The renaming only meant editing two lines of LinkChecker code and one line of a test, and we are now running against the new revision.

Stu (stu-axon) wrote :

"we use the target parser interface, which doesn't provide any access to this information."

Is it possible to request the feature in libxml2?

(their bugtracker is now here)
https://gitlab.gnome.org/GNOME/libxml2

I would, but don't know enough about that interface to report a good bug.

Leonard Richardson (leonardr) wrote :

I filed a feature request with lxml here: https://bugs.launchpad.net/lxml/+bug/1846906

Changed in beautifulsoup:
status: Fix Committed → Fix Released