indexing fails silently with non-utf8 text attachments
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Zim | Confirmed | Medium | Unassigned |
Bug Description
I use zim rev336 on Ubuntu 10.04.1. I once attached a simple .txt file to a page and saw that it appeared in zim's outline sidebar as if it were a subpage. This is surprising, but the attached text can be displayed within zim, which is nice. Afterwards, however, every search returned no results, and later every page I clicked was empty! (just a white background - quite scary for a moment)
Starting "zim -V -D" shows that indexing throws an exception (see below) if the .txt attachment contains non-utf8 characters. Later zim's page history also gets poisoned somehow, so that clicking to open other pages throws similar exceptions. For normal users who don't start zim in a terminal, all of this happens silently.
I propose to treat .txt files as pages only if they contain a valid zim header. As a workaround for now, I just rename .txt to .log; zim then treats the text as a normal attachment and does not get confused anymore.
To reproduce, add a non-utf8 .txt file to your notebook and start zim:
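A minimal sketch of the proposed check, not taken from zim's code. It assumes that zim page files start with the line "Content-Type: text/x-zim-wiki"; the function name looks_like_zim_page is made up for illustration. Because the file is read as raw bytes, a non-utf8 attachment cannot trigger a decode error here.

```python
# Assumption: zim wiki pages begin with this header line.
ZIM_HEADER = b"Content-Type: text/x-zim-wiki"


def looks_like_zim_page(path):
    """Return True if the file starts with the assumed zim header.

    Hypothetical helper: only the first few bytes are read, and they are
    compared as bytes, so no text decoding takes place.
    """
    with open(path, "rb") as fh:
        first = fh.read(len(ZIM_HEADER))
    return first == ZIM_HEADER
```

With a check like this, a plain .txt attachment would be listed as an attachment rather than as a subpage, whatever its encoding.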
$ echo "Chuchichästli" | iconv -t latin1 >~/mynotebook/
$ zim -V -D
_______
ERROR: Got an exception while indexing "<IndexPath: non-utf8>":
Traceback (most recent call last):
File "/home/
self.
File "/home/
for type, href, _ in page.get_links():
File "/home/
tree = self.get_
File "/home/
self._parsetree = self._fetch_
File "/home/
lines = lines or self.source.
File "/home/
lines = self._readlines()
File "/home/
lines = file.readlines()
File "/usr/lib/
return self.reader.
File "/usr/lib/
data = self.read()
File "/usr/lib/
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-10: invalid data
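The decode failure itself can be shown without zim. The sketch below is written for Python 3, so the exact error message differs from the one in the traceback above; it builds the same bytes that the iconv command produces and tries to decode them as UTF-8.

```python
# "\u00e4" is the letter a-umlaut; latin-1 encodes it as the single byte 0xE4.
data = "Chuchich\u00e4stli\n".encode("latin-1")

try:
    data.decode("utf-8")
    failed_at = None
except UnicodeDecodeError as err:
    # err.start is the offset of the first byte that could not be decoded.
    failed_at = err.start
    print("decode failed at byte offset", failed_at)
```

The offset reported is that of the 0xE4 byte, which is not a valid UTF-8 sequence when followed by a plain ASCII letter.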
tags: added: error-handling import
tags: added: unicode; removed: error-handling import missing redesign
tags: added: index
There are two issues here. One is that zim cannot handle the non-utf8 content. This I cannot fix, but error handling should be more graceful so that it does not hamper usage.
The second issue is that zim cannot detect that this file is an attachment other than by opening it and looking at the headers. So it assumes the file is a page, hence it shows up in the index. Maybe we should build in such a check and ignore any file that does not have the zim headers. However, this will need some careful design so as not to cause zim to do a lot of slow file reads when listing pages in a certain folder.
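One way the more graceful handling could look, as a sketch only: try strict UTF-8 first, and on failure log a warning and fall back to replacing the bad bytes, so that one file cannot abort the whole indexing run. The function name read_text_tolerant and the logger name are hypothetical and not part of zim.

```python
import logging

logger = logging.getLogger("example.index")  # hypothetical logger name


def read_text_tolerant(path):
    """Read a file as UTF-8, degrading gracefully on invalid bytes."""
    with open(path, "rb") as fh:
        raw = fh.read()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        logger.warning(
            "%s is not valid UTF-8 (first bad byte at offset %d); "
            "replacing undecodable bytes", path, err.start)
        # Each undecodable byte becomes U+FFFD instead of raising.
        return raw.decode("utf-8", errors="replace")
```

The replaced text is lossy, so a caller would probably want to treat such a file as read-only rather than write the result back.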