Creating BeautifulSoup object with Strainer caused crash
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Won't Fix | Undecided | Unassigned |
Bug Description
Using the fairly simple program:
===start code====
#! /usr/bin/env python
import re
import sys
sys.path.
from BeautifulSoup import BeautifulSoup,
f = open('test.
doc = f.read()
listing_divs_only = SoupStrainer('div')
soup = BeautifulSoup(doc, parseOnlyThese=
=====end code====
I was able to reliably create the following crash:
=====stack trace====
Traceback (most recent call last):
File "./android_
soup = BeautifulSoup(doc, parseOnlyThese=
File "libs/Beautiful
BeautifulSt
File "libs/Beautiful
self._feed()
File "libs/Beautiful
SGMLParser.
File "/usr/lib/
self.goahead(0)
File "/usr/lib/
k = self.parse_
File "/usr/lib/
self.
File "/usr/lib/
self.
File "/usr/lib/
method(attrs)
File "libs/Beautiful
self.
File "libs/Beautiful
SGMLParser.
File "/usr/lib/
self.goahead(0)
File "/usr/lib/
k = self.parse_
File "/usr/lib/
self.
File "/usr/lib/
self.
File "/usr/lib/
method(attrs)
File "libs/Beautiful
tag.
=====end stack trace====
I was able to get past the issue by adding the following check:
=====patch====
--- libs/BeautifulS
+++ libs/BeautifulS
@@ -1318,7 +1318,7 @@
tag = self.unknown_
- if tagNeedsEncodin
+ if tagNeedsEncodin
class StopParsing(
=====end of patch=====
However, I don't know enough about the design of BeautifulSoup to say if the change is valid.
Without knowing the contents of test.html I can't reproduce this. Beautiful Soup 4 handles encoding substitutions differently, so this bug probably doesn't happen there.
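For anyone who wants to check this against Beautiful Soup 4, here is a minimal sketch of what the equivalent call might look like. It assumes the `beautifulsoup4` package is installed and uses the built-in `html.parser` backend. The helper name `parse_divs_only` and the sample markup are made up for illustration; they are not the reporter's script or test.html, so this cannot confirm whether the original crash is gone.

```python
try:
    # Third-party package: beautifulsoup4 (Beautiful Soup 4).
    from bs4 import BeautifulSoup, SoupStrainer
except ImportError:
    BeautifulSoup = SoupStrainer = None


def parse_divs_only(markup):
    """Hypothetical helper: parse only the <div> elements of `markup`."""
    if BeautifulSoup is None:
        raise RuntimeError("beautifulsoup4 is not installed")
    only_divs = SoupStrainer("div")
    # parse_only is the Beautiful Soup 4 name for the old parseOnlyThese argument.
    return BeautifulSoup(markup, "html.parser", parse_only=only_divs)


if __name__ == "__main__":
    # Invented sample input, standing in for the unknown test.html.
    sample = "<html><body><p>skip</p><div>one</div><div>two</div></body></html>"
    soup = parse_divs_only(sample)
    print([div.get_text() for div in soup.find_all("div")])
```

If the original test.html still triggers a failure with a call like this, attaching the file would make the problem reproducible.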