Beautiful Soup fails to sanitize unquoted style attributes

Bug #403640 reported by Kasuko
This bug affects 2 people
Affects: Beautiful Soup
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

This bug manifests itself in the program Sipie from http://sourceforge.net/projects/sipie/.

Sipie attempts to parse the page at http://www.sirius.com/sirius/servlet/MediaPlayerLogin/subscriber

The source of this page contains the tag <input type="password" name="password" style={height:21px;} value="" size="30" maxlength="20">, in which the value of the style attribute is not quoted. Beautiful Soup fails to handle this, resulting in the following:

Traceback (most recent call last):
  File "/usr/bin/gtkSipie", line 8, in <module>
    load_entry_point('Sipie==0.1196144357', 'gui_scripts', 'gtkSipie')()
  File "/usr/lib/python2.6/site-packages/Sipie/gtkPlayer.py", line 88, in gtkPlayer
    for selectable in sipie.getStreams():
  File "/usr/lib/python2.6/site-packages/Sipie/Factory.py", line 375, in getStreams
    streams = self.tryGetStreams()
  File "/usr/lib/python2.6/site-packages/Sipie/Factory.py", line 299, in tryGetStreams
    soup = BeautifulSoup(data)
  File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 263, in parse_starttag
    % (rawdata[k:endpos][:20],))
  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: junk characters in start tag: u'{height:21px;} value', at line 145, column 26

I am currently running Arch Linux with beautiful-soup version 3.1.0.1, but there have been reports on the SourceForge page for Sipie that the problem occurs on other platforms as well. Apparently 3.0.7 was able to sanitize this.

I would be glad to provide any other information I can gather; just ask.

Thank You
Kasuko
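
For anyone stuck on 3.1.0.1, one possible stopgap is to quote the offending value before the markup reaches the parser. The following is only a minimal sketch, assuming the sole problem is an unquoted, brace-delimited style value like the one in the tag above. The helper name quote_brace_styles is hypothetical and is not part of Beautiful Soup or Sipie.

```python
import re

# Hypothetical helper, not part of Beautiful Soup or Sipie.
# Matches style=<brace-delimited value> where the value is NOT already quoted.
_UNQUOTED_BRACE_STYLE = re.compile(
    r'(\bstyle\s*=\s*)(\{[^}>"\']*\})',
    re.IGNORECASE,
)


def quote_brace_styles(markup):
    """Wrap unquoted style={...} attribute values in double quotes."""
    return _UNQUOTED_BRACE_STYLE.sub(r'\1"\2"', markup)


# The tag quoted in the bug description.
tag = ('<input type="password" name="password" style={height:21px;} '
       'value="" size="30" maxlength="20">')

fixed = quote_brace_styles(tag)
print(fixed)
```

The cleaned string could then be passed to the parser in place of the raw page data. Values that are already quoted are left alone, so applying the helper twice is harmless; other kinds of malformed attributes are not covered by this sketch.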

Kasuko (kasuko)
description: updated
Leonard Richardson (leonardr) wrote :

The parsers used by BS4 handle this markup correctly.
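
As a rough illustration of the comment above, Python 3's standard-library html.parser (one of the parsers Beautiful Soup 4 can be configured to use) accepts the tag from the bug description without raising. This sketch assumes Python 3 and deliberately does not import Beautiful Soup itself, so it shows only the underlying parser's tolerance; the class name AttrCollector is made up for the example.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Illustrative subclass that records each start tag and its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.seen.append((tag, dict(attrs)))


# The tag quoted in the bug description, with its unquoted style value.
markup = ('<input type="password" name="password" style={height:21px;} '
          'value="" size="30" maxlength="20">')

parser = AttrCollector()
parser.feed(markup)
parser.close()

tag, attrs = parser.seen[0]
print(tag, attrs["style"])
```

The unquoted value is kept verbatim as the value of the style attribute instead of being rejected as junk characters in the start tag.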

Changed in beautifulsoup:
status: New → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released