Feeds with non-UTF8 characters can't be parsed

Bug #1508316 reported by François Marier
This bug affects 1 person
Affects        Status         Importance   Assigned to       Milestone
PlanetFilter   Fix Released   Medium       François Marier

Bug Description

The attached feed contains a byte sequence that is not valid UTF-8 (according to "file", it's an ISO-8859-1 file) and fails to parse with the following error:

Warning: 'http://active.inspection.gc.ca/eng/util/newrsse.asp?cid=40' is not a valid feed (not well-formed (invalid token): line 9, column 70)
Traceback (most recent call last):
  File "/usr/bin/planetfilter", line 478, in <module>
    if main():
  File "/usr/bin/planetfilter", line 476, in main
    return process_config(args.configfile, args.output, args.force)
  File "/usr/bin/planetfilter", line 438, in process_config
    document = parse_feed(contents, url)
  File "/usr/bin/planetfilter", line 404, in parse_feed
    noentities = remove_html_entities(contents)
  File "/usr/bin/planetfilter", line 372, in remove_html_entities
    ret = contents.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd2 in position 457: invalid continuation byte
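
The traceback shows that the crash comes from an unconditional contents.decode('utf-8') on the raw feed bytes. One possible way to avoid it, sketched below, is to try UTF-8 first and fall back to ISO-8859-1 when that fails. This is only an illustration and not necessarily the fix that was released; the helper name decode_feed and its fallback parameter are hypothetical and do not come from PlanetFilter.

```python
def decode_feed(contents, fallback="iso-8859-1"):
    """Decode raw feed bytes, trying UTF-8 first and then a fallback charset.

    Hypothetical helper for illustration; not part of PlanetFilter.
    """
    try:
        return contents.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a code point, so decoding
        # with the default fallback cannot raise UnicodeDecodeError.
        return contents.decode(fallback)


# Byte 0xd2 followed by an ASCII letter is invalid UTF-8 (the same kind of
# "invalid continuation byte" error as in the traceback), but it is valid
# ISO-8859-1.
text = decode_feed(b"\xd2x")
```

The fallback assumes that a feed which is not valid UTF-8 is ISO-8859-1, as "file" reports for the attached one. A feed in some other legacy encoding would decode without an error but with the wrong characters, so honouring the encoding declared by the feed itself, when there is one, would be more robust.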

François Marier (fmarier) wrote :
Changed in planetfilter:
status: Fix Committed → Fix Released