Incorrect handling of HTML <noscript> tag
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Won't Fix | Undecided | Unassigned |
Bug Description
BeautifulSoup appears to handle <noscript> as if it were similar to <pre> or <code>, but that is not at all what it exists for. See e.g. http://
Here is a simple test script, basically copied from the BS4 intro documentation: http://
from bs4 import BeautifulSoup
h = open('noscript.
b = BeautifulSoup(h)
print b.get_text()
Attached is a simple HTML file which I used for testing. I would expect it to render like e.g. Lynx or Firefox render it, or BS4 without the <noscript> tag (repeated empty lines trimmed for legibility):
noscript test
test
... but instead, I get this:
noscript test
<a href="http://
(again, with empty lines trimmed somewhat).
This is BeautifulSoup 4.1.0 on Debian Wheezy with Python 2.7.3 out of the box.
vnix$ apt-cache policy python-bs4
python-bs4:
Installed: 4.1.0-1
Candidate: 4.1.0-1
Version table:
*** 4.1.0-1 0
500 http://
100 /var/lib/
vnix$ apt-cache policy python
python:
Installed: 2.7.3-4
Candidate: 2.7.3-4+deb7u1
Version table:
2.7.3-4+deb7u1 0
500 http://
*** 2.7.3-4 0
100 /var/lib/
Changed in beautifulsoup: | |
status: | New → Invalid |
status: | Invalid → Won't Fix |
The problem is limited to the html5lib tree builder. Where other parsers process the contents of a <noscript> tag as HTML, html5lib presents the contents of a <noscript> tag as a literal string. This has been commented on elsewhere: http://blog.futtta.be/2010/12/01/venus-doesnt-love-noscript/
I agree that the other tree builders' behavior is closer to the spirit of the HTML specification.
Given the complexity of any workaround, unless I have a brilliant idea soon I'm going to close this bug INVALID as a difference between parsers, possibly a bug in html5lib.
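For anyone who wants to check the parser difference described above on their own installation, here is a rough sketch. It is not the reporter's script: the sample markup and the names `SAMPLE_HTML`, `noscript_summary` and `compare_parsers` are made up for illustration, and it is written for Python 3 rather than the Python 2.7 used in the report. It only relies on passing the parser name as the second argument to `BeautifulSoup` and on `bs4.FeatureNotFound`, which bs4 raises when a requested parser is not installed. What html5lib reports may depend on the installed html5lib version, so no output is claimed for it here.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in for the attached test file, which is not reproduced here.
SAMPLE_HTML = (
    "<html><head><title>noscript test</title></head>"
    '<body><noscript><a href="http://example.com/">test</a></noscript></body></html>'
)


def noscript_summary(markup, parser):
    """Return (text, has_anchor) for the first <noscript> tag, or None if absent.

    has_anchor is True when the parser built a real <a> element inside
    <noscript>, and False when the contents were kept as a plain string.
    """
    soup = BeautifulSoup(markup, parser)
    noscript = soup.find("noscript")
    if noscript is None:
        return None
    return noscript.get_text(), noscript.find("a") is not None


def compare_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Run noscript_summary under each parser; record None for uninstalled ones."""
    results = {}
    for parser in parsers:
        try:
            results[parser] = noscript_summary(markup, parser)
        except FeatureNotFound:
            # The requested tree builder is not available on this system.
            results[parser] = None
    return results


if __name__ == "__main__":
    for parser, result in compare_parsers(SAMPLE_HTML).items():
        print(parser, result)
```

If the result for a given parser has `has_anchor` set to False and the text contains raw `<a href=...>` markup, that parser is treating the `<noscript>` contents as a literal string. In that case, naming a different parser explicitly when constructing the soup is probably the simplest way to sidestep the issue.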