Feeds with non-UTF8 characters can't be parsed

Bug #1508316 reported by François Marier
This bug affects 1 person
Affects       Status        Importance  Assigned to      Milestone
PlanetFilter  Fix Released  Medium      François Marier

Bug Description

The attached feed contains a byte sequence that is not valid UTF-8 (according to "file", it's an ISO-8859-1 file), and parsing it fails with the following error:

Warning: 'http://active.inspection.gc.ca/eng/util/newrsse.asp?cid=40' is not a valid feed (not well-formed (invalid token): line 9, column 70)
Traceback (most recent call last):
  File "/usr/bin/planetfilter", line 478, in <module>
    if main():
  File "/usr/bin/planetfilter", line 476, in main
    return process_config(args.configfile, args.output, args.force)
  File "/usr/bin/planetfilter", line 438, in process_config
    document = parse_feed(contents, url)
  File "/usr/bin/planetfilter", line 404, in parse_feed
    noentities = remove_html_entities(contents)
  File "/usr/bin/planetfilter", line 372, in remove_html_entities
    ret = contents.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd2 in position 457: invalid continuation byte
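
The traceback shows the crash coming from an unconditional contents.decode('utf-8'). As a minimal sketch of one way to tolerate such feeds (not necessarily the fix that was released), the decode could fall back to ISO-8859-1 when UTF-8 fails. The helper name decode_feed_bytes and its fallback parameter are hypothetical and do not come from planetfilter itself:

```python
def decode_feed_bytes(contents, fallback="iso-8859-1"):
    """Decode raw feed bytes to text (hypothetical helper, not planetfilter code).

    Tries strict UTF-8 first; if the bytes are not valid UTF-8, decodes them
    with the fallback encoding instead of raising UnicodeDecodeError.
    """
    try:
        return contents.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every possible byte to a code point, so this
        # decode cannot fail, although the result may be wrong if the
        # feed was really in some other legacy encoding.
        return contents.decode(fallback)


if __name__ == "__main__":
    # 0xd2 followed by a space is an invalid UTF-8 sequence, similar to
    # the byte reported in the traceback above.
    print(decode_feed_bytes(b"abc \xd2 def"))
```

Trying UTF-8 first keeps well-formed feeds unchanged; only input that would previously have crashed takes the fallback path.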

François Marier (fmarier) wrote :
Changed in planetfilter:
status: Fix Committed → Fix Released