Beautiful Soup

Bug #1768330
Comment #22

Chris Papademetrious (chrispitude) wrote on 2024-01-10 (last edit on 2024-01-10):

Our current algorithm for this problem uses copy.copy() to make a copy of the soup, iterates through all block elements and inserts a special NavigableString separator ("<<BLOCK>>") before and after each one, calls soup.text, then replaces every run of one or more separator strings with a single space. I didn't like this approach because I had to copy and destructively modify the soup.
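
For reference, a minimal sketch of that separator approach. This is an illustration, not our production code: the BLOCK_ELEMENTS list is an abbreviated stand-in, the function name is hypothetical, and the whitespace handling in the regular expression (plus the final strip) is an assumption.

```python
import copy
import re

from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Assumption: an abbreviated, illustrative list of HTML block elements.
BLOCK_ELEMENTS = ["div", "p", "li", "ul", "ol", "h1", "h2", "h3", "table", "tr", "td"]
SEPARATOR = "<<BLOCK>>"
# One or more separators, together with any whitespace around them
# (the exact pattern is an assumption).
SEPARATOR_RUN = re.compile(r"(?:\s*" + re.escape(SEPARATOR) + r"\s*)+")


def text_via_separators(soup):
    """Return the text of `soup` with block boundaries turned into single spaces."""
    work = copy.copy(soup)  # the caller's soup must not be modified
    # find_all() returns a complete list, so inserting while looping is safe.
    for tag in work.find_all(BLOCK_ELEMENTS):
        tag.insert_before(NavigableString(SEPARATOR))
        tag.insert_after(NavigableString(SEPARATOR))
    return SEPARATOR_RUN.sub(" ", work.text).strip()


soup = BeautifulSoup("<div><p>ABC</p><p>DEF <b>GHI</b></p></div>", "html.parser")
print(text_via_separators(soup))
```

The copy is what makes this expensive: every document is duplicated and then edited just to read its text.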

On a test set of about 30k HTML files, this new algorithm returns 100% identical results to our current algorithm, and no copy/modification was needed.

A note on how I arrived at this algorithm... Originally I thought about trying to iterate through Tag descendants and keep track of what I entered and left to maintain a "current" block context. Knowing when we cross start tags was fine - that's precisely what the iterator is - but knowing when we cross end tags was difficult. I could derive it by comparing this start tag to the last start tag, but to determine if the next Tag was *inside* or *after* the previous start tag, I had to query the ancestry of containing elements. And if I'm going to do that each time, I might as well skip the Tags and just check the block context of each NavigableString object.

Would inlining a hardcoded (and simplified!) version of find_parent() in this algorithm resolve your concerns? Is it the algorithmic inclusion of an inner loop within an outer loop, or is it some technical aspect of nesting iterable things that have to maintain context as the interpreter jumps between the execution locations? (And fortunately, there are typically very few (if any) levels separating NavigableStrings from their containing block element.)
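
To make the question concrete, here is roughly what I mean by an inlined, simplified find_parent(): a plain walk up the .parent links that stops at the first block element. The helper name nearest_block and the short BLOCK_ELEMENTS set are hypothetical, chosen only for this sketch.

```python
from bs4 import BeautifulSoup

# Assumption: an illustrative subset of block element names.
BLOCK_ELEMENTS = {"div", "p", "li", "td"}


def nearest_block(node, block_names=BLOCK_ELEMENTS):
    """Walk up the .parent links; return the closest block ancestor, or None."""
    parent = node.parent
    while parent is not None:
        if parent.name in block_names:
            return parent
        parent = parent.parent
    return None


soup = BeautifulSoup("<div><p>DEF <b>GHI</b></p></div>", "html.parser")
string = soup.find(string="GHI")
block = nearest_block(string)
print(block.name)
# The library call finds the very same Tag object.
print(block is string.find_parent(list(BLOCK_ELEMENTS)))
```

The inner loop runs once per ancestor between the string and its block, which in this example is two steps (<b>, then <p>).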

I am glad you are looking to provide a generalized solution that can be configured. For example, if I am working with DITA XML content:

https://www.oxygenxml.com/dita/1.3/specs/index.html

then the lists of block and inline elements will be different:

https://blog.oxygenxml.com/topics/refactoring_inserting_reformatting.html

@turfurken - with this algorithm, "<p>ABC<br/>DEF</p>" would return "ABCDEF". The algorithm would need an "elif" branch to push a single space on the strings list for <br/> elements. As @leonardr said, we don't want to hardcode format-specific heuristics, so hopefully whatever handler configurability magic he comes up with can support this type of thing.
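
A rough sketch of what that could look like, with the element lists passed in rather than hardcoded. This is a reconstruction under assumptions, not the eventual API: the function name blockwise_text and the parameters block_names and space_names are hypothetical, and comments are skipped on the assumption that they should not contribute text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag


def blockwise_text(root, block_names, space_names=()):
    """Join the strings under `root`, adding a space at each block boundary.

    block_names: element names treated as blocks (caller-supplied).
    space_names: element names that contribute a single space, e.g. "br".
    """
    pieces = []
    last_block = None
    for node in root.descendants:
        if isinstance(node, NavigableString):
            if isinstance(node, Comment):
                continue
            # Nearest enclosing block element of this string, if any.
            block = next((p for p in node.parents if p.name in block_names), None)
            # Compare by identity: Tag equality compares content, so two
            # separate but identical blocks would otherwise look the same.
            if pieces and block is not last_block:
                pieces.append(" ")
            pieces.append(str(node))
            last_block = block
        elif isinstance(node, Tag) and node.name in space_names:
            pieces.append(" ")
    return "".join(pieces)


soup = BeautifulSoup("<p>ABC<br/>DEF</p>", "html.parser")
print(blockwise_text(soup, {"p"}))
print(blockwise_text(soup, {"p"}, space_names={"br"}))
```

The first call reproduces the "ABCDEF" result described above; the second keeps the two halves apart. A DITA caller would pass its own block list instead of the HTML one.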
