design limitation/ feature request: extract() creates empty lines, please consider this patch as proof of concept

Bug #1882067 reported by ptoche
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Beautiful Soup
New
Undecided
Unassigned

Bug Description

The patch below (untested but hopefully safe) removes empty lines left by extract(). See a comparison of output at the bottom. Thanks.

from bs4 import BeautifulSoup

def strip_empty_tags(soup, patch=False):
    """
    Remove empty tags from a BeautifulSoup object.

    Args:
        soup (bs4.BeautifulSoup): html content with empty tags

    Returns:
        soup (bs4.BeautifulSoup): original soup with empty tags removed

    """
    for item in soup.find_all():
        if len(item.get_text(strip=True)) == 0:
            if patch:
                item.extract_patched()
            else:
                item.extract()
    return soup

from bs4 import PageElement

def extract_patched(self, _self_index=None):
    """Destructively rips this element out of the tree.

    :param _self_index: The location of this element in its parent's
       .contents, if known. Passing this in allows for a performance
       optimization.

    :return: `self`, no longer part of the tree.
    """
    if self.parent is not None:
        if _self_index is None:
            _self_index = self.parent.index(self)
        del self.parent.contents[_self_index]

        # PATCH STARTS HERE
        # remove empty line introduced by extract():
        # check that nearby parent.contents is really empty before deleting
        for i in range(_self_index, _self_index+1):
            if str(self.parent.contents[i]).strip('\t\r\n') == '':
                del self.parent.contents[i]
        # PATCH ENDS HERE

    #Find the two elements that would be next to each other if
    #this element (and any children) hadn't been parsed. Connect
    #the two.
    last_child = self._last_descendant()
    next_element = last_child.next_element

    if (self.previous_element is not None and
        self.previous_element is not next_element):
        self.previous_element.next_element = next_element
    if next_element is not None and next_element is not self.previous_element:
        next_element.previous_element = self.previous_element
    self.previous_element = None
    last_child.next_element = None

    self.parent = None
    if (self.previous_sibling is not None
        and self.previous_sibling is not self.next_sibling):
        self.previous_sibling.next_sibling = self.next_sibling
    if (self.next_sibling is not None
        and self.next_sibling is not self.previous_sibling):
        self.next_sibling.previous_sibling = self.previous_sibling
    self.previous_sibling = self.next_sibling = None
    return self
PageElement.extract_patched = extract_patched

html = '''<html>
  <head>
    <title>Unknown</title>
  </head>
  <body>LOOSE TEXT
    <a></a>
    <p></p>
    <div>BODY</div>
    <b></b>
    <i></i> # COMMENT
  </body>
</html>'''

# With .extract()
soup = BeautifulSoup(html, features='lxml')
print(strip_empty_tags(soup, patch=False))
<html>
<head>
<title>Unknown</title>
</head>
<body>LOOSE TEXT

<div>BODY</div>

 # COMMENT
  </body>
</html>

# With .extract_patched()
soup = BeautifulSoup(html, features='lxml')
print(strip_empty_tags(soup, patch=True))
<html>
<head>
<title>Unknown</title>
</head>
<body>LOOSE TEXT
    <div>BODY</div>
 # COMMENT
  </body>
</html>

Revision history for this message
ptoche (ptoche) wrote :

If you search for BeautifulSoup remove empty lines, you'll find many users wanting to remove the empty lines. All the solutions I've seen involve extracting text from the trees, stripping the strings of empty lines, and joining the strings together again. This may remove too many empty lines and/or mess things up and also it returns a string and loses the beauty of the soup.

Revision history for this message
ptoche (ptoche) wrote :

(can't edit comments?)

I should have put the patch inside a try/except statement, in case an index out-of-range error shows up (I haven't had that problem with my limited testing).

Revision history for this message
ptoche (ptoche) wrote :

For backward compatibility, could add a `strip` argument:

    def extract_patched(self, _self_index=None, strip=False):

Revision history for this message
ptoche (ptoche) wrote :

An exception does occur sometimes (in my experience, `_self_index+1` out of bound), but so far no problem with (adding a `strip=False` argument for backward compatibility):

# PATCH for BeautifulSoup extract() method
from bs4 import PageElement
def extract_patched(self, _self_index=None, strip=False):
    """Destructively rips this element out of the tree.

    :param _self_index: The location of this element in its parent's
       .contents, if known. Passing this in allows for a performance
       optimization.

    :return: `self`, no longer part of the tree.
    """
    if self.parent is not None:
        if _self_index is None:
            _self_index = self.parent.index(self)
        del self.parent.contents[_self_index]

        # ! PATCH STARTS HERE !
        # remove empty line introduced by extract():
        # check that nearby parent.contents is really empty before deleting
        if strip:
            try:
                for i in range(_self_index-1, _self_index+1):
                    if str(self.parent.contents[i]).strip() == '':
                        del self.parent.contents[i]
            except:
                pass
        # ! PATCH ENDS HERE !

    #Find the two elements that would be next to each other if
    #this element (and any children) hadn't been parsed. Connect
    #the two.
    last_child = self._last_descendant()
    next_element = last_child.next_element

    if (self.previous_element is not None and
        self.previous_element is not next_element):
        self.previous_element.next_element = next_element
    if next_element is not None and next_element is not self.previous_element:
        next_element.previous_element = self.previous_element
    self.previous_element = None
    last_child.next_element = None

    self.parent = None
    if (self.previous_sibling is not None
        and self.previous_sibling is not self.next_sibling):
        self.previous_sibling.next_sibling = self.next_sibling
    if (self.next_sibling is not None
        and self.next_sibling is not self.previous_sibling):
        self.next_sibling.previous_sibling = self.previous_sibling
    self.previous_sibling = self.next_sibling = None
    return self
# ! APPLY PATCH !
PageElement.extract = extract_patched

Revision history for this message
Leonard Richardson (leonardr) wrote :

Before you spend too much additional time on this, I want to let you know that I don't think this change makes sense as part of the extract() method. This seems really specific to your situation and not something that should be in the core code.

Beautiful Soup 4.8.0 introduced a cleanup method called smooth() which consolidates adjacent NavigableString objects, a situation that mainly happens through calling tree-modification methods such as extract():

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#smooth

If this were to go into the core code, my gut feeling is that it would take the form of an argument to smooth(). smooth() merges adjacent NavigableStrings; you want to find adjacent NavigableStrings and delete some of them.

Revision history for this message
ptoche (ptoche) wrote :

Hi Leonard, thanks for your reply!

My effort is likely a result of my limited understanding of what the built-in functions can do. I agree that the keyword extract suggests nothing more than removal, so that extracting a tag in a sentence like this:

    A single-line sentence with <b>useless</b> tag.

can reasonably be expected to yield:

    A single-line sentence with tag.

with two empty spaces between 'with' and 'tag'.

Or a sentence like this:

    A multiple-line sentence with
    <b>useless</b>
    tag.

can reasonably be expected to yield:

    A multiple-line sentence with

    tag.

with one empty line in the middle.

However, it won't be all that often that users desire these blank spaces/lines. Now I understand that you may not feel that the option to remove these spaces should be accessed inside the `extract()` method. And indeed it makes sense to have a separate smooth-ing function. I had tried to use `smooth()` before hacking `extract()`, but my attempts had failed somewhere. I do not have a clear recollection of why, perhaps because I wasn't looping over all the children properly. I'm new to the package and my first reaction was to browse for solutions on stackoverflow and in github repos: Most of the approaches involve parsing a list and join-ing the strings together. As I was passing BeautifulSoup objects and tags around, I was bothered by the detour to string and thought about hacking extract(). But let me see if I can make `smooth()` do what I was looking for.

Thanks again!

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.