Creating BeautifulSoup object with Strainer caused crash

Bug #672771 reported by Nicholas Campion
This bug affects 1 person
Affects: Beautiful Soup
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Using the fairly simple program:
===start code====
#! /usr/bin/env python

import re
import sys
sys.path.append('libs')
from BeautifulSoup import BeautifulSoup,SoupStrainer

f = open('test.html','r')
doc = f.read()

listing_divs_only = SoupStrainer('div')
soup = BeautifulSoup(doc, parseOnlyThese=listing_divs_only)

=====end code====
I was able to reliably reproduce the following crash:
=====stack trace====
Traceback (most recent call last):
  File "./android_market_parser.py", line 17, in <module>
    soup = BeautifulSoup(doc, parseOnlyThese=listing_divs_only)
  File "libs/BeautifulSoup.py", line 1228, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "libs/BeautifulSoup.py", line 892, in __init__
    self._feed()
  File "libs/BeautifulSoup.py", line 917, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.6/sgmllib.py", line 104, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/sgmllib.py", line 138, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/sgmllib.py", line 296, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.6/sgmllib.py", line 345, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/lib/python2.6/sgmllib.py", line 381, in handle_starttag
    method(attrs)
  File "libs/BeautifulSoup.py", line 1318, in start_meta
    self._feed(self.declaredHTMLEncoding)
  File "libs/BeautifulSoup.py", line 917, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.6/sgmllib.py", line 104, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/sgmllib.py", line 138, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/sgmllib.py", line 296, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.6/sgmllib.py", line 345, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/lib/python2.6/sgmllib.py", line 381, in handle_starttag
    method(attrs)
  File "libs/BeautifulSoup.py", line 1322, in start_meta
    tag.containsSubstitutions = True
=====end stack trace====
I was able to get past the issue by adding the following check:
=====patch====
--- libs/BeautifulSoup.py.orig 2010-11-08 15:47:19.000000000 -0600
+++ libs/BeautifulSoup.py 2010-11-08 15:41:41.978114003 -0600
@@ -1318,7 +1318,7 @@
                         self._feed(self.declaredHTMLEncoding)
                         raise StopParsing
         tag = self.unknown_starttag("meta", attrs)
-        if tagNeedsEncodingSubstitution:
+        if tagNeedsEncodingSubstitution and not tag == None:
             tag.containsSubstitutions = True

 class StopParsing(Exception):
=====end of patch=====
However, I don't know enough about the design of BeautifulSoup to say if the change is valid.
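
Editorial sketch of what the patch guards against. The trace above ends at `tag.containsSubstitutions = True` without showing the exception line, so the following is an assumption: the patch suggests `tag` was `None` there, which would be the case if `unknown_starttag` returns nothing for a tag the SoupStrainer rejects (here, `meta` when only `div` is wanted). The names `FakeTag`, `make_tag`, and `mark_substitution` are hypothetical stand-ins, not part of BeautifulSoup.py.

```python
# Hypothetical stand-ins; none of these names come from BeautifulSoup.py.

class FakeTag(object):
    """Minimal stand-in for a parsed tag."""

    def __init__(self, name):
        self.name = name
        self.containsSubstitutions = False


def make_tag(name, wanted_names):
    """Return a FakeTag, or None when the name is filtered out.

    Assumption: this mirrors unknown_starttag returning None for tags
    that the strainer rejects.
    """
    if name not in wanted_names:
        return None
    return FakeTag(name)


def mark_substitution(tag, needs_substitution):
    """Set the flag only when a tag object actually exists."""
    if needs_substitution and tag is not None:
        tag.containsSubstitutions = True
    return tag


# A filter that keeps only "div" drops the "meta" tag, so tag is None.
dropped = mark_substitution(make_tag("meta", ["div"]), True)
# A filter that keeps "meta" yields a real tag, which gets flagged.
kept = mark_substitution(make_tag("meta", ["meta"]), True)
```

Without the `tag is not None` check, the assignment on a filtered-out tag would raise `AttributeError`. `tag is not None` is the more idiomatic spelling of the patch's `not tag == None`.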

Leonard Richardson (leonardr) wrote :

Without knowing the contents of test.html I can't reproduce this. Beautiful Soup 4 handles encoding substitutions differently, so this bug probably doesn't happen there.

Changed in beautifulsoup:
status: New → Won't Fix
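
Editorial note: for readers moving to Beautiful Soup 4, a minimal sketch of the equivalent call follows. It assumes the `bs4` package is installed and uses the standard-library `html.parser` backend. The function name `parse_divs_only` is hypothetical. Whether this avoids the crash on the reporter's file is untested, since test.html was never attached.

```python
def parse_divs_only(markup):
    """Parse markup with Beautiful Soup 4, keeping only <div> elements."""
    # Imported inside the function so that defining it does not require bs4.
    from bs4 import BeautifulSoup, SoupStrainer

    only_divs = SoupStrainer("div")
    # parse_only is the Beautiful Soup 4 name for BS3's parseOnlyThese.
    return BeautifulSoup(markup, "html.parser", parse_only=only_divs)
```

Calling `parse_divs_only(doc)` with the contents of the HTML file returns a soup containing only the `div` elements and their children, so a `meta` tag in the document head is never turned into a tag object.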