Beautiful Soup

design limitation/ feature request: extract() creates empty lines, please consider this patch as proof of concept

Bug #1882067 reported by ptoche on 2020-06-04

This bug report is a duplicate of: Bug #1768330: get_text() doesn't retain semantic separation of two sentences in separate paragraphs. Edit Remove

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Beautiful Soup	New	Undecided	Unassigned

Bug Description

The patch below (untested but hopefully safe) removes empty lines left by extract(). See a comparison of output at the bottom. Thanks.

from bs4 import BeautifulSoup

def strip_empty_tags(soup, patch=False):
"""
Remove empty tags from a BeautifulSoup object.

Args:
soup (bs4.BeautifulSoup): html content with empty tags

Returns:
soup (bs4.BeautifulSoup): original soup with empty tags removed

    """
    for item in soup.find_all():
        if len(item.get_text(strip=True)) == 0:
            if patch:
                item.extract_patched()
            else:
                item.extract()
    return soup

from bs4 import PageElement

def extract_patched(self, _self_index=None):
"""Destructively rips this element out of the tree.

    :param _self_index: The location of this element in its parent's
       .contents, if known. Passing this in allows for a performance
       optimization.

        # PATCH STARTS HERE
        # remove empty line introduced by extract():
        # check that nearby parent.contents is really empty before deleting
        for i in range(_self_index, _self_index+1):
            if str(self.parent.contents[i]).strip('\t\r\n') == '':
                del self.parent.contents[i]
        # PATCH ENDS HERE

    #Find the two elements that would be next to each other if
    #this element (and any children) hadn't been parsed. Connect
    #the two.
    last_child = self._last_descendant()
    next_element = last_child.next_element

    if (self.previous_element is not None and
        self.previous_element is not next_element):
        self.previous_element.next_element = next_element
    if next_element is not None and next_element is not self.previous_element:
        next_element.previous_element = self.previous_element
    self.previous_element = None
    last_child.next_element = None

    self.parent = None
    if (self.previous_sibling is not None
        and self.previous_sibling is not self.next_sibling):
        self.previous_sibling.next_sibling = self.next_sibling
    if (self.next_sibling is not None
        and self.next_sibling is not self.previous_sibling):
        self.next_sibling.previous_sibling = self.previous_sibling
    self.previous_sibling = self.next_sibling = None
    return self
PageElement.extract_patched = extract_patched

html = '''<html>
  <head>
    <title>Unknown</title>
  </head>
  <body>LOOSE TEXT
    <a></a>
    <p></p>
    <div>BODY</div>
    <b></b>
    <i></i> # COMMENT
  </body>
</html>'''

# With .extract()
soup = BeautifulSoup(html, features='lxml')
print(strip_empty_tags(soup, patch=False))
<html>
<head>
<title>Unknown</title>
</head>
<body>LOOSE TEXT

# COMMENT
</body>
</html>

# With .extract_patched()
soup = BeautifulSoup(html, features='lxml')
print(strip_empty_tags(soup, patch=True))
<html>
<head>
<title>Unknown</title>
</head>
<body>LOOSE TEXT
<div>BODY</div>
# COMMENT
</body>
</html>

Revision history for this message

ptoche (ptoche) wrote on 2020-06-04:

If you search for BeautifulSoup remove empty lines, you'll find many users wanting to remove the empty lines. All the solutions I've seen involve extracting text from the trees, stripping the strings of empty lines, and joining the strings together again. This may remove too many empty lines and/or mess things up and also it returns a string and loses the beauty of the soup.

Revision history for this message

ptoche (ptoche) wrote on 2020-06-04:

(can't edit comments?)

I should have put the patch inside a try/except statement, in case an index out-of-range error shows up (I haven't had that problem with my limited testing).

Revision history for this message

ptoche (ptoche) wrote on 2020-06-04:

For backward compatibility, could add a `strip` argument:

def extract_patched(self, _self_index=None, strip=False):

Revision history for this message

ptoche (ptoche) wrote on 2020-06-05:

An exception does occur sometimes (in my experience, `_self_index+1` out of bound), but so far no problem with (adding a `strip=False` argument for backward compatibility):

# PATCH for BeautifulSoup extract() method
from bs4 import PageElement
def extract_patched(self, _self_index=None, strip=False):
"""Destructively rips this element out of the tree.

    :param _self_index: The location of this element in its parent's
       .contents, if known. Passing this in allows for a performance
       optimization.

        # ! PATCH STARTS HERE !
        # remove empty line introduced by extract():
        # check that nearby parent.contents is really empty before deleting
        if strip:
            try:
                for i in range(_self_index-1, _self_index+1):
                    if str(self.parent.contents[i]).strip() == '':
                        del self.parent.contents[i]
            except:
                pass
        # ! PATCH ENDS HERE !

An exception does occur sometimes (in my experience, `_self_index+1` out of bound), but so far no problem with (adding a `strip=False` argument for backward compatibility):

# PATCH for BeautifulSoup extract() method
from bs4 import PageElement
def extract_patched(self, _self_index=None, strip=False):
    """Destructively rips this element out of the tree.

:param _self_index: The location of this element in its parent's
       .contents, if known. Passing this in allows for a performance
       optimization.

:return: `self`, no longer part of the tree.
    """
    if self.parent is not None:
        if _self_index is None:
            _self_index = self.parent.index(self)
        del self.parent.contents[_self_index]
        
        # ! PATCH STARTS HERE !
        # remove empty line introduced by extract():
        # check that nearby parent.contents is really empty before deleting
        if strip:
            try:
                for i in range(_self_index-1, _self_index+1):
                    if str(self.parent.contents[i]).strip() == '':
                        del self.parent.contents[i]
            except: 
                pass
        # ! PATCH ENDS HERE !
    
    #Find the two elements that would be next to each other if
    #this element (and any children) hadn't been parsed. Connect
    #the two.
    last_child = self._last_descendant()
    next_element = last_child.next_element

Revision history for this message

Leonard Richardson (leonardr) wrote on 2020-06-05:

Before you spend too much additional time on this, I want to let you know that I don't think this change makes sense as part of the extract() method. This seems really specific to your situation and not something that should be in the core code.

Beautiful Soup 4.8.0 introduced a cleanup method called smooth() which consolidates adjacent NavigableString objects, a situation that mainly happens through calling tree-modification methods such as extract():

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#smooth

If this were to go into the core code, my gut feeling is that it would take the form of an argument to smooth(). smooth() merges adjacent NavigableStrings; you want to find adjacent NavigableStrings and delete some of them.

Revision history for this message

ptoche (ptoche) wrote on 2020-06-09:

Hi Leonard, thanks for your reply!

My effort is likely a result of my limited understanding of what the built-in functions can do. I agree that the keyword extract suggests nothing more than removal, so that extracting a tag in a sentence like this:

A single-line sentence with <b>useless</b> tag.

can reasonably be expected to yield:

A single-line sentence with tag.

with two empty spaces between 'with' and 'tag'.

Or a sentence like this:

    A multiple-line sentence with
    <b>useless</b>
    tag.

can reasonably be expected to yield:

A multiple-line sentence with

tag.

with one empty line in the middle.

However, it won't be all that often that users desire these blank spaces/lines. Now I understand that you may not feel that the option to remove these spaces should be accessed inside the `extract()` method. And indeed it makes sense to have a separate smooth-ing function. I had tried to use `smooth()` before hacking `extract()`, but my attempts had failed somewhere. I do not have a clear recollection of why, perhaps because I wasn't looping over all the children properly. I'm new to the package and my first reaction was to browse for solutions on stackoverflow and in github repos: Most of the approaches involve parsing a list and join-ing the strings together. As I was passing BeautifulSoup objects and tags around, I was bothered by the detour to string and thought about hacking extract(). But let me see if I can make `smooth()` do what I was looking for.

Thanks again!

Report a bug

This report contains Public information

Everyone can see this information.

Duplicate of bug #1768330 Remove

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.