get_text() does not produce all output after calls to extract
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | New | Undecided | Unassigned |
Bug Description
BeautifulSoup version 4.4.0
lxml version 3.4.4
html5lib version 0.999999
OSX version 10.10.5
RHEL version 6.7
An example of one of the pages I am trying to parse is at: http://
This page has been downloaded and is also included as the attachment testpage.html. I tried to include the Python example code as well, but Launchpad only allows one attachment at a time, so I have embedded the example code in the description instead.
I wanted to remove the <script> tags, so I consulted the web and found:
* http://
* http://
If I use the extract method as suggested, the result of get_text() is not what is expected.
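For reference, on a small self-contained document the same extract-then-get_text pattern behaves as expected. A minimal sketch (the markup and the stdlib html.parser backend here are illustrative, not taken from the attached page or the report's lxml setup):

```python
from bs4 import BeautifulSoup

html = """<html><body>
<script>var x = 1;</script>
<p>Visible text</p>
</body></html>"""

# html.parser is Python's built-in backend; the report itself uses lxml.
soup = BeautifulSoup(html, "html.parser")

# Detach every <script> tag, as suggested by the pages linked above.
for s in soup.find_all("script"):
    s.extract()

# On this small document, the script content is gone and the
# remaining text is still returned.
print(soup.get_text())
```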
Here is the example Python 2.7 code:
import urllib2
from bs4 import BeautifulSoup
import lxml
data = open("testpage.html")
soup = BeautifulSoup(data, "lxml")
[s.extract() for s in soup.findAll('script')]
print soup.get_text()
On RHEL 6.7 with Python 2.7, the enclosed Python code displays "The Wolfram Functions Site\n\n". On the Mac it produces no output.
It does not return the other text on the page, such as "This site is created with Mathematica and is developed and maintained by Wolfram Research with partial support from the National Science Foundation".
Also, after the extract() calls, any calls to findAll() for non-script tags fail to find them.
I have duplicated the issue on Linux and Mac OSX 10.10.5, with Python 2.7 and Python 3.4. I also tried the html5lib parser.
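As a further point of comparison (not part of the original report), Beautiful Soup also offers decompose(), which destroys a tag in place rather than detaching it; trying it alongside extract() may help narrow the problem down. A sketch with illustrative markup and the built-in html.parser backend:

```python
from bs4 import BeautifulSoup

markup = "<html><body><script>ignore()</script><p>Kept</p></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# decompose() removes the tag and its contents from the tree entirely,
# instead of returning a detached subtree as extract() does.
for tag in soup.find_all("script"):
    tag.decompose()

print(soup.get_text())
```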
This may be a duplicate of bug https:/
I've added the test Python script.