Incorrect handling of HTML <noscript> tag

Bug #1277464 reported by era
This bug affects 1 person
Affects         Status     Importance  Assigned to  Milestone
Beautiful Soup  Won't Fix  Undecided   Unassigned

Bug Description

BeautifulSoup appears to handle <noscript> as if it were similar to <pre> or <code>, but that is not at all what it exists for. See e.g. http://www.w3.org/wiki/HTML/Elements/noscript for brief documentation from the W3C.

Here is a simple test script, basically copied from the BS4 intro documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

  from bs4 import BeautifulSoup

  # No parser is named, so BS4 picks the "best" one that is installed.
  with open('noscript.html') as h:
      b = BeautifulSoup(h)
  print(b.get_text())

Attached is a simple HTML file which I used for testing. I would expect the output to match what Lynx or Firefox renders, or what BS4 produces when the <noscript> tag is removed (repeated empty lines trimmed for legibility):

noscript test

test

... but instead, I get this:

noscript test

<a href="http://noscript.example.com/">test</a>

(again, with empty lines trimmed somewhat).

This is BeautifulSoup 4.1.0 on Debian Wheezy with Python 2.7.3 out of the box.

vnix$ apt-cache policy python-bs4
python-bs4:
  Installed: 4.1.0-1
  Candidate: 4.1.0-1
  Version table:
 *** 4.1.0-1 0
        500 http://mirror.example.com/debian/ wheezy/main amd64 Packages
        100 /var/lib/dpkg/status

vnix$ apt-cache policy python
python:
  Installed: 2.7.3-4
  Candidate: 2.7.3-4+deb7u1
  Version table:
     2.7.3-4+deb7u1 0
        500 http://mirror.example.com/debian/ wheezy/main amd64 Packages
 *** 2.7.3-4 0
        100 /var/lib/dpkg/status

Leonard Richardson (leonardr) wrote :

The problem is limited to the html5lib tree builder. Where other parsers process the contents of a <noscript> tag as HTML, html5lib presents the contents of a <noscript> tag as a literal string. This has been commented on elsewhere: http://blog.futtta.be/2010/12/01/venus-doesnt-love-noscript/

I agree that the other tree builders' behavior is closer to the spirit of the HTML specification.
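The parser difference described here can be checked with a short comparison. This is a minimal sketch, not part of the original report: the `HTML` string is a made-up stand-in for the attached noscript.html, and `text_by_parser` is a hypothetical helper name. html5lib is only tried if it is importable, and its output may depend on the html5lib version installed, so no expected output is given for it.

```python
import importlib.util

from bs4 import BeautifulSoup

# Assumed stand-in for the attached test file (not the reporter's actual file).
HTML = (
    "<html><body><p>noscript test</p>"
    '<noscript><a href="http://noscript.example.com/">test</a></noscript>'
    "</body></html>"
)


def text_by_parser(markup, parsers=("html.parser", "html5lib")):
    """Return {parser name: get_text() output} for each available parser."""
    results = {}
    for name in parsers:
        # html.parser ships with Python; skip third-party parsers
        # that are not installed instead of raising.
        if name != "html.parser" and importlib.util.find_spec(name) is None:
            continue
        results[name] = BeautifulSoup(markup, name).get_text()
    return results


if __name__ == "__main__":
    for name, text in text_by_parser(HTML).items():
        print(name, repr(text))
```

Naming the parser explicitly, as in `BeautifulSoup(markup, "html.parser")`, is also the simplest way to avoid depending on whichever tree builder happens to be installed.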

Given the complexity of any workaround, unless I have a brilliant idea soon I'm going to close this bug INVALID as a difference between parsers, possibly a bug in html5lib.
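For readers who need to keep html5lib, one possible post-processing step is to re-parse any <noscript> element that contains only a literal string. This is a rough sketch under stated assumptions, not a fix endorsed in this bug: `reparse_noscript` is a hypothetical helper, and the `SIMULATED` input uses escaped markup parsed with html.parser to imitate the literal-string result, so the example does not depend on html5lib being installed.

```python
from bs4 import BeautifulSoup


def reparse_noscript(soup, parser="html.parser"):
    """Re-parse <noscript> elements whose content is literal markup text."""
    for tag in soup.find_all("noscript"):
        if tag.find(True) is not None:
            continue  # already contains real tags, nothing to do
        literal = tag.get_text()
        if "<" not in literal:
            continue  # plain text, no markup to recover
        fragment = BeautifulSoup(literal, parser)
        tag.clear()
        # Copy the list first: append() moves each node out of the fragment.
        for child in list(fragment.contents):
            tag.append(child)
    return soup


# Assumption: escaped markup imitates the literal string seen in the report.
SIMULATED = (
    "<p>noscript test</p>"
    "<noscript>&lt;a href=\"http://noscript.example.com/\"&gt;"
    "test&lt;/a&gt;</noscript>"
)

soup = BeautifulSoup(SIMULATED, "html.parser")
before = soup.get_text()
reparse_noscript(soup)
after = soup.get_text()

if __name__ == "__main__":
    print(repr(before))
    print(repr(after))
```

The sketch ignores edge cases such as nested documents or <noscript> inside <head>, which is part of why a general workaround inside the library would be complex.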

Changed in beautifulsoup:
status: New → Invalid
status: Invalid → Won't Fix