indexing fails silently with non-utf8 text attachments

Reported by Oliver Joos on 2011-01-20
This bug affects 2 people
Affects: Zim
Importance: Medium
Assigned to: Unassigned

Bug Description

I use zim rev336 on Ubuntu 10.04.1. Once I attached a simple .txt file to a page and saw that it appears in zim's outline sidebar as if it were a subpage. This is surprising, but the attached text can be displayed within zim, which is nice. But afterwards every search returned no results, and later every page I clicked was empty! (just a white background, quite scary for a moment)

Starting "zim -V -D" shows that indexing throws an exception (see below) if the .txt attachment contains non-utf8 characters. Later, zim's page history also gets poisoned somehow, so that clicking to open other pages throws similar exceptions. For normal users who don't start zim in a terminal, all of this happens silently.

I propose to treat .txt files as pages only if they contain a valid zim header. As a workaround for now, I just rename .txt to .log; zim then treats the text as a normal attachment and no longer gets confused.

To reproduce, add a non-utf8 .txt file to your notebook and start zim:
$ echo "Chuchichästli" | iconv -t latin1 >~/mynotebook/
$ zim -V -D

_____________________________________________________
ERROR: Got an exception while indexing "<IndexPath: non-utf8>":
Traceback (most recent call last):
  File "/home/oliver/zim_NOBACKUP/zim/zim/index.py", line 436, in _do_update
    self._index_page(path, page)
  File "/home/oliver/zim_NOBACKUP/zim/zim/index.py", line 525, in _index_page
    for type, href, _ in page.get_links():
  File "/home/oliver/zim_NOBACKUP/zim/zim/notebook.py", line 1724, in get_links
    tree = self.get_parsetree()
  File "/home/oliver/zim_NOBACKUP/zim/zim/notebook.py", line 1626, in get_parsetree
    self._parsetree = self._fetch_parsetree()
  File "/home/oliver/zim_NOBACKUP/zim/zim/stores/files.py", line 190, in _fetch_parsetree
    lines = lines or self.source.readlines()
  File "/home/oliver/zim_NOBACKUP/zim/zim/fs.py", line 919, in readlines
    lines = self._readlines()
  File "/home/oliver/zim_NOBACKUP/zim/zim/fs.py", line 937, in _readlines
    lines = file.readlines()
  File "/usr/lib/python2.6/codecs.py", line 674, in readlines
    return self.reader.readlines(sizehint)
  File "/usr/lib/python2.6/codecs.py", line 583, in readlines
    data = self.read()
  File "/usr/lib/python2.6/codecs.py", line 472, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-10: invalid data
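
The decode failure at the bottom of this traceback can be reproduced without zim. A minimal sketch in Python 3 (the traceback above comes from Python 2.6, so the exact message text differs); the variable names are illustrative only:

```python
# Encode the test word as Latin-1, then try to read it back as UTF-8,
# which is the decoding step that fails at the end of the traceback.
latin1_bytes = "Chuchichästli\n".encode("latin-1")

try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as error:
    # The offending byte is the Latin-1 "ä" (0xE4) right after "Chuchich".
    failed_at = error.start
    print("cannot decode as UTF-8, first bad byte at offset", failed_at)
else:
    failed_at = None
```

The offset matches the start of the "position 8-10" range reported in the original error.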

There are two issues here. One is that zim cannot handle the non-utf8 content. This I cannot fix, but error handling should be more graceful so it does not hamper usage.

The second issue is that zim cannot tell that this file is an attachment other than by opening it and looking at the headers. So it assumes the file is a page, hence it shows up in the index. Maybe we should build in such a check and ignore any file that does not have the zim headers. However, this will need some careful design so as not to cause zim to do a lot of slow file reads when listing pages in a certain folder.
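
The first issue, more graceful error handling, could look roughly like the sketch below. This is not zim's actual code: `read_page_lines`, its return convention and the logger name are hypothetical, and it is written for Python 3. The idea is only that a file which is not valid UTF-8 gets logged and skipped instead of raising out of the indexer.

```python
import logging

logger = logging.getLogger("zim.index.sketch")  # hypothetical logger name


def read_page_lines(path):
    """Return the lines of a UTF-8 text file, or None if it cannot be decoded.

    Hypothetical helper: the caller is expected to skip the file when None
    is returned, rather than aborting the whole index update.
    """
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.readlines()
    except UnicodeDecodeError as error:
        logger.warning("Skipping %s: not valid UTF-8 (%s)", path, error)
        return None
```

Returning None keeps the failure visible in the log while letting search and page navigation continue for the rest of the notebook.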

Changed in zim:
status: New → Confirmed
importance: Undecided → Medium
tags: added: missing redesign
Oliver Joos (oliver-joos) wrote :

Zim pages are always utf8, and that seems perfectly ok to me.

> build in such a check and ignore any file that does not have the zim headers
Yes, I think so. But I don't see why this would slow anything down. All .txt files are opened anyway now. The additional check would mean that with certain .txt files Zim can stop reading after the first few lines and does not have to update its outline etc. - so less I/O and less CPU needed. Files with other extensions should not be affected at all.
An important question is: which part(s) of the header identify a Zim page? It should stay easy to add a Zim page with an external text editor or script.

> this will need some careful design
Agreed. That's one of Zim's qualities, isn't it? ;-)

On Fri, Jan 21, 2011 at 5:21 PM, Oliver Joos <email address hidden> wrote:
> Zim pages are always utf8, and that seems perfectly ok to me.
>
>> build in such a check and ignore any file that does not have the zim headers
> Yes, I think so. But I don't see why this would slow anything down. All .txt files are opened anyway now. The additional check would mean that with certain .txt files Zim can stop reading after the first few lines and does not have to update its outline etc. - so less I/O and less CPU needed. Files with other extensions should not be affected at all.

This is correct for indexing. However there are a few more places
where the directory listing is used. Need to double check those and
make them use the index instead.

Also need to make the filtering for the attachment folder more
advanced. Right now it just ignores all text files; it should be
updated to ignore only zim page text files.

> An important question is: which part(s) of the header identify a Zim page? It should stay easy to add a Zim page with an external text editor or script.

That would just be the first line, the Content-Type line.

-- Jaap
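
The check discussed in this reply, looking only at the first line, could be sketched as follows. The helper name `looks_like_zim_page` is hypothetical, and the code only assumes that a page starts with a `Content-Type` header line as stated above; it does not check the header's value. Written for Python 3.

```python
HEADER_PREFIX = b"Content-Type:"  # assumption: only the header name is checked


def looks_like_zim_page(path):
    """Return True if the first line of the file starts with Content-Type."""
    # Read bytes, not text, so a non-UTF-8 attachment cannot raise a
    # UnicodeDecodeError here, and read at most one short line to keep
    # the extra I/O per file small.
    with open(path, "rb") as handle:
        first_line = handle.readline(256)
    return first_line.startswith(HEADER_PREFIX)
```

Under such a check, a file created with a bare `echo "Hello"` would no longer count as a page, which is the trade-off behind the question about keeping pages easy to create with an external editor or script.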

Oliver Joos (oliver-joos) wrote :

I tried to check whether using the index would change anything: I did 'echo "Hello" >MyNotebook/TestPage.txt', and with the current Zim I was not able to make this TestPage appear in Zim's outline unless I rebuilt the index. If I did not miss anything, it seems ok to use the index instead of the directory listing.

> Also need to make the filtering for the attachment folder more advanced.
Do you mean the Attachment Browser plugin? For plugins it would be nice to have a function that enumerates Zim pages, so that they don't have to do the filtering on their own.

> That would just be the first line, the Content-Type line.
Sounds good to me. The first line will probably never change in the future. This would also allow adding Zim pages to the freedesktop mime database one day. (The filtering capabilities of this database are a bit limited.)
The only difficulty I see here is a BOM:
http://greeennotebook.com/2010/06/watch-out-for-the-byte-order-mark-bom-in-linux/
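
To illustrate the BOM concern: a file saved as UTF-8 with a byte order mark starts with three extra bytes, so a naive comparison of the first line against the expected header would fail. A sketch of one way around it, using the `codecs.BOM_UTF8` constant from the Python standard library; the function name is hypothetical and this is not taken from zim:

```python
import codecs


def strip_utf8_bom(first_line):
    """Return the bytes of first_line without a leading UTF-8 BOM, if any."""
    if first_line.startswith(codecs.BOM_UTF8):
        return first_line[len(codecs.BOM_UTF8):]
    return first_line
```

Applying this to the raw first line before looking for the Content-Type header would let pages written by BOM-emitting editors still be recognized.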

tags: added: error-handling import
Patrik Dufresne (ikus060-gmail) wrote :

According to my trace log, it seems I am experiencing a similar issue when using the search box.

My version of zim is 0.52

Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/zim/gui/searchdialog.py", line 66, in search
    self.results_treeview.set_query( string )
  File "/usr/lib/pymodules/python2.6/zim/gui/searchdialog.py", line 104, in set_query
    self.selection.search(self.query)
  File "/usr/lib/pymodules/python2.6/zim/search.py", line 236, in search
    self.update(self._process_group(query.root, selection))
  File "/usr/lib/pymodules/python2.6/zim/search.py", line 293, in _process_group
    self._process_content(contentterms, scope, group.operator))
  File "/usr/lib/pymodules/python2.6/zim/search.py", line 396, in _process_content
    tree = page.get_parsetree()
  File "/usr/lib/pymodules/python2.6/zim/notebook.py", line 1856, in get_parsetree
    self._parsetree = self._fetch_parsetree()
  File "/usr/lib/pymodules/python2.6/zim/stores/files.py", line 210, in _fetch_parsetree
    lines = lines or self.source.readlines()
  File "/usr/lib/pymodules/python2.6/zim/fs.py", line 1015, in readlines
    lines = self._readlines()
  File "/usr/lib/pymodules/python2.6/zim/fs.py", line 1033, in _readlines
    lines = file.readlines()
  File "/usr/lib/python2.6/codecs.py", line 674, in readlines
    return self.reader.readlines(sizehint)
  File "/usr/lib/python2.6/codecs.py", line 583, in readlines
    data = self.read()
  File "/usr/lib/python2.6/codecs.py", line 472, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 36-38: invalid data

Oliver Joos (oliver-joos) wrote :

@Patrik: what did you do to get the Traceback above? I was not able to reproduce it. What do you search for? Can you isolate which of your pages is causing this? What does it contain?

tags: added: unicode
removed: error-handling import missing redesign