Comment 19 for bug 1768330

Leonard Richardson (leonardr) wrote :

This is promising. I will want to play around with the API for maximum forwards compatibility, but the output of this algorithm definitely looks good. I have two questions before you do any more work, and I think you can answer both of them at the same time.

First, can you try this on some real web pages and see if it gives you the results you want? A simple case that's also very common would be extracting the "meat" of a web page's content: the product information or the news article on a page that also contains a lot of peripheral stuff.

Second, I'm a bit concerned about any code that looks like this:

for element in soup.descendants:
    ...
    this_block_container = element.find_parent(block_elements)

The concern is that you're calling a tree navigation method (find_parent) inside a loop driven by another tree navigation method (soup.descendants), which is very bad for performance. That said, find_parent is the least-bad tree navigation method to call in this situation, so the cost might turn out to be acceptable. So when you try this on real web pages, please also gather some timing information so I can compare it to the current implementation of get_text().
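
Something along these lines would probably be enough for the timing comparison. It is only a rough sketch: nested_walk is a hypothetical stand-in for the loop quoted above rather than the actual patch, BLOCK_ELEMENTS is an assumed tag list, and best_time and compare are names made up for this example.

```python
import timeit

# Assumed list of block-level tags; the real patch may use a different set.
BLOCK_ELEMENTS = ["p", "div", "li", "td", "blockquote"]


def best_time(func, repeat=5, number=10):
    """Return the best per-call time in seconds over several timing runs."""
    # Taking the minimum over repeats reduces noise from other processes.
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


def nested_walk(soup, block_elements):
    """Stand-in for the quoted loop: one find_parent() call per descendant."""
    containers = []
    for element in soup.descendants:
        containers.append(element.find_parent(block_elements))
    return containers


def compare(html, block_elements=BLOCK_ELEMENTS):
    """Time get_text() against the nested walk on a single document."""
    # Imported here so best_time() stays usable without Beautiful Soup installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return {
        "get_text": best_time(soup.get_text),
        "nested_walk": best_time(lambda: nested_walk(soup, block_elements)),
    }
```

Calling compare() on the saved HTML of a few real pages (a news article, a product page) and posting the two numbers for each would give a fair picture. The nested walk only covers the traversal, so the real implementation will be somewhat slower than what this measures.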