PlanetFilter

Feeds with non-UTF8 characters can't be parsed

Bug #1508316 reported by François Marier on 2015-10-21

This bug affects 1 person

Affects       Status        Importance  Assigned to      Milestone
PlanetFilter  Fix Released  Medium      François Marier  PlanetFilter 0.6.0

Bug Description

The attached feed contains a byte that is not valid UTF-8 (according to "file", it's an ISO-8859-1 file), and parsing it fails with the following error:

Warning: 'http://active.inspection.gc.ca/eng/util/newrsse.asp?cid=40' is not a valid feed (not well-formed (invalid token): line 9, column 70)
Traceback (most recent call last):
  File "/usr/bin/planetfilter", line 478, in <module>
    if main():
  File "/usr/bin/planetfilter", line 476, in main
    return process_config(args.configfile, args.output, args.force)
  File "/usr/bin/planetfilter", line 438, in process_config
    document = parse_feed(contents, url)
  File "/usr/bin/planetfilter", line 404, in parse_feed
    noentities = remove_html_entities(contents)
  File "/usr/bin/planetfilter", line 372, in remove_html_entities
    ret = contents.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd2 in position 457: invalid continuation byte
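
The traceback shows the crash comes from an unconditional contents.decode('utf-8') on the raw feed bytes. The following is a minimal sketch of one way to avoid it, by falling back to ISO-8859-1 when strict UTF-8 decoding fails. It is not necessarily the fix that shipped in PlanetFilter 0.6.0, and the helper name decode_feed is hypothetical, not part of PlanetFilter.

```python
def decode_feed(contents: bytes) -> str:
    """Decode raw feed bytes, preferring UTF-8.

    Hypothetical helper for illustration only; not taken from PlanetFilter.
    """
    try:
        return contents.decode('utf-8')
    except UnicodeDecodeError:
        # ISO-8859-1 maps every possible byte to a code point, so this
        # fallback never raises. Assumption: feeds that are not valid
        # UTF-8 are ISO-8859-1, as "file" reported for the attached feed.
        return contents.decode('iso-8859-1')


# A byte like the 0xd2 from the traceback, followed by an ASCII byte,
# is an invalid UTF-8 sequence but a valid ISO-8859-1 character.
sample = b'caf\xd2 menu'
decoded = decode_feed(sample)
```

Because the ISO-8859-1 fallback accepts any input, a feed in some other legacy encoding would decode without error but with wrong characters. Honouring the encoding declared by the feed itself, where one is present, would be more robust.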