Design limitation / feature request: extract() leaves empty lines; please consider this patch as a proof of concept
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | New | Undecided | Unassigned |
Bug Description
The patch below (untested but hopefully safe) removes empty lines left by extract(). See a comparison of output at the bottom. Thanks.
from bs4 import BeautifulSoup
def strip_empty_
"""
Remove empty tags from a BeautifulSoup object.
Args:
soup (bs4.BeautifulS
Returns:
soup (bs4.BeautifulS
"""
for item in soup.find_all():
if len(item.
if patch:
else:
return soup
from bs4 import PageElement
def extract_
"""
:param _self_index: The location of this element in its parent's
.contents, if known. Passing this in allows for a performance
:return: `self`, no longer part of the tree.
"""
if self.parent is not None:
if _self_index is None:
del self.parent.
# PATCH STARTS HERE
# remove empty line introduced by extract():
# check that nearby parent.contents is really empty before deleting
for i in range(_self_index, _self_index+1):
if str(self.
del self.parent.
# PATCH ENDS HERE
#Find the two elements that would be next to each other if
#this element (and any children) hadn't been parsed. Connect
#the two.
last_child = self._last_
next_element = last_child.
if (self.previous_
if next_element is not None and next_element is not self.previous_
self.
last_
self.parent = None
if (self.previous_
and self.previous_
if (self.next_sibling is not None
and self.next_sibling is not self.previous_
self.
return self
PageElement.
html = '''<html>
<head>
<title>
</head>
<body>LOOSE TEXT
<a></a>
<p></p>
<div>BODY</div>
<b></b>
<i></i> # COMMENT
</body>
</html>'''
# With .extract()
soup = BeautifulSoup(html, features='lxml')
print(strip_
<html>
<head>
<title>
</head>
<body>LOOSE TEXT
<div>BODY</div>
# COMMENT
</body>
</html>
# With .extract_patched()
soup = BeautifulSoup(html, features='lxml')
print(strip_
<html>
<head>
<title>
</head>
<body>LOOSE TEXT
<div>BODY</div>
# COMMENT
</body>
</html>
If you search for "BeautifulSoup remove empty lines", you'll find many users who want to get rid of them. All the solutions I've seen extract the text from the tree, strip the empty lines from the strings, and join the strings back together. This may remove too many empty lines or otherwise mangle the output, and it returns a string, losing the beauty of the soup.
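For anyone who needs this without patching the library, here is a minimal sketch of the same idea done from the outside: when an element is extracted, the whitespace-only string that follows it is extracted too, so the result is still a soup rather than a string. The helper names `extract_with_trailing_whitespace` and `remove_empty_tags` and the `tidy` flag are made up for this example and are not part of the Beautiful Soup API. It uses the built-in `html.parser` and a smaller HTML sample than the one above, and it is not the patch itself.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def extract_with_trailing_whitespace(element):
    """Detach `element`; also detach a whitespace-only string right after it.

    Hypothetical helper, not part of the Beautiful Soup API.
    """
    # Grab the sibling first: extract() resets element.next_sibling to None.
    following = element.next_sibling
    element.extract()
    # Exact type check so Comment/CData (NavigableString subclasses) are kept.
    if type(following) is NavigableString and not following.strip():
        following.extract()
    return element


def remove_empty_tags(soup, tidy=True):
    """Remove tags with no text; with tidy=True, drop the leftover blank line too."""
    # find_all() returns a list, so detaching nodes while looping is safe.
    for tag in soup.find_all():
        # Caveat: this test also matches void tags such as <br> or <img>.
        if not tag.get_text(strip=True):
            if tidy:
                extract_with_trailing_whitespace(tag)
            else:
                tag.extract()
    return soup


html = "<body>\n<a></a>\n<p></p>\n<div>BODY</div>\n<b></b>\n</body>"

plain = remove_empty_tags(BeautifulSoup(html, "html.parser"), tidy=False)
tidied = remove_empty_tags(BeautifulSoup(html, "html.parser"), tidy=True)

print(repr(str(plain)))   # blank lines remain where <a>, <p>, <b> used to be
print(repr(str(tidied)))  # the whitespace after each removed tag is gone as well
```

Only the string directly after the removed element is touched, and only if it is pure whitespace, so loose text and comments stay where they are. Whether the library itself should do this inside extract() is a separate question, since some callers may rely on the whitespace nodes staying put.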