get_text() does not produce all output after calls to extract
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | New | Undecided | Unassigned |
Bug Description
BeautifulSoup version 4.4.0
lxml version 3.4.4
html5lib version 0.999999
OSX version 10.10.5
RHEL version 6.7
An example of one of the pages I am trying to parse is at: http://
This page has been downloaded and is also included as the attachment testpage.html. I tried to include the Python example code as well, but Launchpad only allows one attachment at a time, so I have embedded the example code in the description instead.
I wanted to remove the <script> tags, so I consulted the web and found:
* http://
* http://
If I use the extract method as suggested, the result of get_text() is not what is expected.
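For reference, on a small self-contained document the same extract-then-get_text pattern behaves as expected. A minimal sketch (the markup and the stdlib html.parser backend here are illustrative, not taken from the attached page or the report's lxml setup):

```python
from bs4 import BeautifulSoup

html = """<html><body>
<script>var x = 1;</script>
<p>Visible text</p>
</body></html>"""

# html.parser is Python's built-in backend; the report itself uses lxml.
soup = BeautifulSoup(html, "html.parser")

# Detach every <script> tag, as suggested by the pages linked above.
for s in soup.find_all("script"):
    s.extract()

# On this small document, the script content is gone and the
# remaining text is still returned.
print(soup.get_text())
```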
Here is the example Python 2.7 code:
import urllib2
from bs4 import BeautifulSoup
import lxml
data = open("testpage.html")
soup = BeautifulSoup(data, "lxml")
[s.extract() for s in soup.findAll('script')]
print soup.get_text()
On RHEL 6.7 with Python 2.7, the enclosed Python code displays "The Wolfram Functions Site\n\n". On the Mac it produces no output.
It does not return the other text on the page, such as "This site is created with Mathematica and is developed and maintained by Wolfram Research with partial support from the National Science Foundation".
Also, after the extract() calls, any calls to findAll() for non-script tags fail to find them.
I have duplicated the issue on Linux and Mac OSX 10.10.5, with Python 2.7 and Python 3.4. I also tried the html5lib parser.
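As a further point of comparison (not part of the original report), Beautiful Soup also offers decompose(), which destroys a tag in place rather than detaching it; trying it alongside extract() may help narrow the problem down. A sketch with illustrative markup and the built-in html.parser backend:

```python
from bs4 import BeautifulSoup

markup = "<html><body><script>ignore()</script><p>Kept</p></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# decompose() removes the tag and its contents from the tree entirely,
# instead of returning a detached subtree as extract() does.
for tag in soup.find_all("script"):
    tag.decompose()

print(soup.get_text())
```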
This may be a duplicate of bug https:/
I've added the test Python script.