get_text() does not produce all output after calls to extract

Bug #1489208 reported by Shawn M. Jones
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Beautiful Soup | New | Undecided | Unassigned |

Bug Description

BeautifulSoup version 4.4.0
lxml version 3.4.4
html5lib version 0.999999
OSX version 10.10.5
RHEL version 6.7

An example of one of the pages I am trying to parse is at: http://web.archive.org/web/20070630041429/http://functions.wolfram.com/

This page has been downloaded and is also included as the attachment testpage.html. I tried to attach the Python example code as well, but Launchpad only lets me add one attachment at a time, so I have embedded the example code in the description instead.

I wanted to remove the <script> tags, so I consulted the web and found:
* http://stackoverflow.com/questions/5598524/can-i-remove-script-tags-with-beautifulsoup
* http://stackoverflow.com/questions/22799990/beatifulsoup4-get-text-still-has-javascript

If I use the extract method as suggested, the result of get_text() is not what is expected.

Here is the example Python 2.7 code:

import urllib2
from bs4 import BeautifulSoup
import lxml

data = open("testpage.html")
soup = BeautifulSoup(data, "lxml")
[ s.extract() for s in soup.findAll("script") ]
print soup.get_text()

On RHEL 6.7 with Python 2.7, the enclosed Python code displays "The Wolfram Functions Site\n\n". On the Mac it produces no output.

It does not return the other text on the page, such as "This site is created with Mathematica and is developed and maintained by Wolfram Research with partial support from the National Science Foundation".

Also, calls to findAll() for non-script tags return no matches.
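
For comparison, the expected result can be approximated without Beautiful Soup. The following is a minimal Python 3 sketch using only the standard library's html.parser; it is not the library's implementation of get_text(), and the class name, helper function, and inline sample HTML are hypothetical stand-ins for testpage.html. It shows the behavior the report expects: all text outside <script> elements is returned.

```python
from html.parser import HTMLParser


class TextWithoutScripts(HTMLParser):
    """Collect text nodes, skipping anything inside <script> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.script_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_depth += 1

    def handle_endtag(self, tag):
        if tag == "script" and self.script_depth > 0:
            self.script_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a <script> element.
        if self.script_depth == 0:
            self.chunks.append(data)


def text_without_scripts(html):
    """Return the concatenated text of `html`, excluding script bodies."""
    parser = TextWithoutScripts()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Hypothetical miniature of the attached page, not the real testpage.html.
sample = (
    "<html><head><title>The Wolfram Functions Site</title>"
    "<script>var hidden = 1;</script></head>"
    "<body><p>This site is created with Mathematica</p>"
    "<script>document.write('x');</script></body></html>"
)
result = text_without_scripts(sample)
print(result)
```

Running the same helper on the attached page gives a baseline to compare against the get_text() output on each platform.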

I have reproduced the issue on Linux and Mac OSX 10.10.5, with both Python 2.7 and Python 3.4. I also tried the html5lib parser.

This may be a duplicate of bug https://bugs.launchpad.net/beautifulsoup/+bug/1483789, but I am not sure.

Shawn M. Jones (jones-shawn-m) wrote :

I've added the test Python script.
