Beautiful Soup

Incorrect handling of HTML <noscript> tag

Bug #1277464 reported by era on 2014-02-07

This bug affects 1 person

Affects         Status     Importance  Assigned to  Milestone
Beautiful Soup  Won't Fix  Undecided   Unassigned

Bug Description

BeautifulSoup appears to handle <noscript> as if it were similar to <pre> or <code>, but that is not at all what it exists for. See e.g. http://www.w3.org/wiki/HTML/Elements/noscript for brief documentation from the W3C.

Here is a simple test script, basically copied from the BS4 intro documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

  from bs4 import BeautifulSoup

  # No parser is named, so BS4 picks the best tree builder installed.
  with open('noscript.html') as h:
      b = BeautifulSoup(h)
  print(b.get_text())

Attached is a simple HTML file which I used for testing. I would expect the output to match what Lynx or Firefox render, or what BS4 produces when the <noscript> tag is removed (repeated empty lines trimmed for legibility):

noscript test

test

... but instead, I get this:

noscript test

(again, with empty lines trimmed somewhat).

This is BeautifulSoup 4.1.0 on Debian Wheezy with Python 2.7.3 out of the box.

vnix$ apt-cache policy python-bs4
python-bs4:
  Installed: 4.1.0-1
  Candidate: 4.1.0-1
  Version table:
*** 4.1.0-1 0
        500 http://mirror.example.com/debian/ wheezy/main amd64 Packages
        100 /var/lib/dpkg/status

vnix$ apt-cache policy python
python:
  Installed: 2.7.3-4
  Candidate: 2.7.3-4+deb7u1
  Version table:
     2.7.3-4+deb7u1 0
        500 http://mirror.example.com/debian/ wheezy/main amd64 Packages
*** 2.7.3-4 0
        100 /var/lib/dpkg/status

era (era) wrote on 2014-02-07:

Attachment: Simple HTML file for reproducing (152 bytes, text/html)

Leonard Richardson (leonardr) wrote on 2014-12-12:

The problem is limited to the html5lib tree builder. Where other parsers process the contents of a <noscript> tag as HTML, html5lib presents them as a literal string. This has been commented on elsewhere (http://blog.futtta.be/2010/12/01/venus-doesnt-love-noscript/). I agree that the other tree builders' behavior is closer to the spirit of the HTML specification.

Given the complexity of any workaround, unless I have a brilliant idea soon I'm going to close this bug INVALID as a difference between parsers, possibly a bug in html5lib.
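
For illustration, the "contents processed as HTML" side of this difference can be seen with Python's built-in html.parser module, which backs one of Beautiful Soup's tree builders. This is only a minimal sketch using the standard library on its own (Python 3); the TextAndTags class and the SAMPLE markup are made up for the example and are not the file attached to this report.

```python
from html.parser import HTMLParser


class TextAndTags(HTMLParser):
    """Record every start tag and every non-blank text chunk the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


# Hypothetical sample markup, not the attachment from this bug report.
SAMPLE = (
    "<html><head><title>noscript test</title></head>"
    "<body><noscript><p>test</p></noscript></body></html>"
)

collector = TextAndTags()
collector.feed(SAMPLE)
collector.close()

# The <p> inside <noscript> is reported as a real tag, and its text as ordinary data.
print(collector.tags)
print(collector.text)
```

With the default settings, the <p> inside <noscript> appears in collector.tags and its text in collector.text, which is the behavior the reporter expected. A parser that treats <noscript> as raw text would instead hand back the inner markup as a single string.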

Leonard Richardson (leonardr) on 2015-06-24

Changed in beautifulsoup:
status:	New → Invalid
status:	Invalid → Won't Fix
