UnicodeDecodeError when parsing http://www.projekt6.de/?feed=podcast

Bug #252506 reported by Thomas Perl on 2008-07-28
Affects              Status         Importance   Assigned to      Milestone
gPodder              Fix Released   Medium
feedparser (Ubuntu)  Fix Released   Medium       Luca Falavigna

Bug Description

Trying to parse the RSS feed http://www.projekt6.de/?feed=podcast with feedparser yields the following traceback:

thp@macbook:~$ python
Python 2.5.2 (r252:60911, Apr 21 2008, 11:12:42)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import feedparser
>>> f = feedparser.parse('http://www.projekt6.de/?feed=podcast')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/lib/python-support/python2.5/feedparser.py", line 2624, in parse
    feedparser.feed(data)
  File "/var/lib/python-support/python2.5/feedparser.py", line 1441, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 138, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 315, in parse_endtag
    self.finish_endtag(tag)
  File "/usr/lib/python2.5/sgmllib.py", line 355, in finish_endtag
    self.unknown_endtag(tag)
  File "/var/lib/python-support/python2.5/feedparser.py", line 476, in unknown_endtag
    method()
  File "/var/lib/python-support/python2.5/feedparser.py", line 1318, in _end_content
    value = self.popContent('content')
  File "/var/lib/python-support/python2.5/feedparser.py", line 700, in popContent
    value = self.pop(tag)
  File "/var/lib/python-support/python2.5/feedparser.py", line 641, in pop
    output = _resolveRelativeURIs(output, self.baseuri, self.encoding)
  File "/var/lib/python-support/python2.5/feedparser.py", line 1594, in _resolveRelativeURIs
    p.feed(htmlSource)
  File "/var/lib/python-support/python2.5/feedparser.py", line 1441, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 333, in finish_starttag
    self.unknown_starttag(tag, attrs)
  File "/var/lib/python-support/python2.5/feedparser.py", line 1589, in unknown_starttag
    _BaseHTMLProcessor.unknown_starttag(self, tag, attrs)
  File "/var/lib/python-support/python2.5/feedparser.py", line 1458, in unknown_starttag
    value = unicode(value, self.encoding)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-8: unsupported Unicode code range

I've created a patch against the most recent feedparser.py in Ubuntu 8.04 that fixes this problem by replacing invalid characters instead of failing completely.
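
The patch itself is not reproduced in this report, so the following is only a minimal sketch of the general idea (replace undecodable bytes rather than raise). It is written in Python 3 syntax for convenience, whereas the traceback above comes from Python 2.5, and the helper name `decode_lenient` is hypothetical, not something feedparser provides:

```python
def decode_lenient(raw, encoding="utf-8"):
    """Decode bytes to text, replacing invalid sequences instead of raising.

    Hypothetical helper for illustration only; not part of feedparser.
    """
    if isinstance(raw, str):
        # Already text, nothing to decode.
        return raw
    try:
        # Strict decoding first, so valid input is returned unchanged.
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Fall back to substituting U+FFFD for each undecodable sequence.
        return raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    # Valid UTF-8 passes through untouched.
    print(decode_lenient(b"caf\xc3\xa9"))
    # Invalid bytes become replacement characters instead of a traceback.
    print(repr(decode_lenient(b"abc\xff\xfedef")))
```

The trade-off is that a feed with a few bad bytes still parses, at the cost of U+FFFD characters appearing where the invalid data was.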

Thomas Perl (thp) wrote :
David Futcher (bobbo) wrote :

Debdiff to apply this patch.

Changed in feedparser:
status: New → Confirmed
Changed in gpodder:
status: Unknown → In Progress
Luca Falavigna (dktrkranz) wrote :

Sponsored, thanks ;)

Changed in feedparser:
assignee: nobody → dktrkranz
importance: Undecided → Medium
status: Confirmed → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package feedparser - 4.1-10ubuntu1

---------------
feedparser (4.1-10ubuntu1) intrepid; urgency=low

  * Add utf8_decoding.patch to stop errors decoding feeds with
    invalid characters (LP: #252506)

 -- David Futcher <email address hidden> Thu, 31 Jul 2008 20:05:34 +0100

Changed in feedparser:
status: Fix Committed → Fix Released
Changed in gpodder:
status: In Progress → Fix Released
Changed in gpodder:
importance: Unknown → Medium