Zim

chinese path or filename supported

Bug #572805 reported by aosp on 2010-05-01
This bug affects 5 people
Affects: Zim
Importance: High
Assigned to: Unassigned

Bug Description

I am using zim v0.46 on Windows.

Zim does not run correctly on my Chinese Windows OS. I see these errors:
* zim can't run from a path that contains Chinese characters. At launch it returns "UnicodeDecodeError: 'ascii' codec can't decode byte 0xd6 in position 3: ordinal not in range(128)".
* it can't create a note with a Chinese page name.
* it can't attach an image or other attachment from folders with Chinese names.

The errors disappeared after I modified some source code.

I checked the code and may have found the problem. On a Chinese Windows OS we can't directly use the following code for a path:
unicode(path, 'utf-8')
or
path.encode('utf-8')

but the following is correct:
unicode(path, 'gb2312')

The following code may also support Korean or Japanese:
unicode(path, sys.getfilesystemencoding())
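
The launch error quoted above can be reproduced without zim. This is only a minimal sketch, written for Python 3 although the report is about Python 2; the sample path `C:\中文` and the helper name `try_decode` are assumptions made for illustration and do not come from the report.

```python
# Bytes of a path as a Chinese Windows system using GB2312 might hand them over.
# The sample path is hypothetical.
raw = "C:\\中文".encode("gb2312")


def try_decode(data, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: %s" % exc


# ascii and utf-8 both reject the bytes; only the matching code page works.
for enc in ("ascii", "utf-8", "gb2312"):
    print(enc, "->", try_decode(raw, enc))
```

With this sample the first non-ASCII byte is 0xd6 at position 3, the same byte and position as in the reported message, which suggests the failing string was a path of the form `X:\` followed by a GB2312-encoded character.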

* line 388 in zim/fs.py
 class WindowsPath(UnixPath):
...
 def _abspath(path):
...
  # return os.path.abspath(path) # original code
  # my fixed code
  path = os.path.abspath(path)
  if isinstance(path, unicode) :
   return path
  else :
   return unicode(path, sys.getfilesystemencoding())

* before class UnixPath in zim/fs.py
# my added code
def tryUnicode(path):
 if isinstance(path, unicode) :
  return path
 elif isinstance(path, Dir) :
  return unicode(path)
 else :
  return unicode(path, sys.getfilesystemencoding())

* line 191 in zim/fs.py
class UnixPath(object):
...
 def __init__(self, path):
...
   # path = map(unicode, path) # original code
   # my fixed code
   path = map(tryUnicode, path)

so that all paths are already unicode on Windows
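
For readers on Python 3, the idea behind `tryUnicode` can be sketched roughly as follows. This is not the patch itself: the name `to_text` is hypothetical, `str` and `bytes` stand in for Python 2's `unicode` and `str`, and `os.fspath()` is used here as a stand-in for the `Dir` branch.

```python
import os
import sys


def to_text(path, encoding=None):
    """Return path as text; decode byte strings with the filesystem encoding."""
    if isinstance(path, str):
        return path
    if isinstance(path, bytes):
        return path.decode(encoding or sys.getfilesystemencoding())
    # Path-like objects (the role Dir plays in the patch) go through os.fspath().
    return to_text(os.fspath(path), encoding)


# Mixed input: one text part and one GB2312 byte string (hypothetical values).
parts = ["notes", "中文".encode("gb2312")]
print([to_text(p, "gb2312") for p in parts])
```

The explicit `encoding` argument exists only so the example behaves the same on any machine; in the patch the encoding always comes from `sys.getfilesystemencoding()`.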

* line 166 in zim/fs.py
def isdir(path):
 # return os.path.isdir(path.encode('utf-8')) # original code
 # my fixed code
 return os.path.isdir(path)

def isfile(path):
 # return os.path.isfile(path.encode('utf-8'))
 # my fixed code
 return os.path.isfile(path)

* line 275 in zim/fs.py
 def iswritable(self):
...
   # return os.access(self.path.encode('utf-8'), os.W_OK) # original code
   # my fixed code
   return os.access(self.path, os.W_OK)

* line 40 in zim.py
 # argv = [arg.decode('utf-8') for arg in sys.argv] # original code
 # my fixed code. I think there is no need to decode on Windows.
 argv = [arg for arg in sys.argv]

* line 69 in zim/applications.py
 def _checkargs(self, cwd, args):
...
  # argv = [a.encode('utf-8') for a in self._cmd(args)] # original code
  # my fixed code
  argv = [a for a in self._cmd(args)]
  # if cwd: # original code
  # cwd = unicode(cwd).encode('utf-8') # original code
  return cwd, argv

* line 1021 in zim/gui/widgets.py
 def get_file(self):
...
  # else: return File(path) # original code
  # my fixed code. It converts the file name string that GTK returns to unicode.
  else: return File(path.decode('utf-8'))

On Sat, 2010-05-01 at 05:54 +0000, aosp wrote:
> == the zim on my Chinese Windows OS can't running correctly with some errors :
> * zim can't run in a path with Chinese character. It will return "UnicodeDecodeError: 'ascii' codec can't decode byte 0xd6 in position 3: ordinal not in range(128)" at launch.
> * it can't create a note with Chinese page name.
> * it can't attach a image or other attachment from Chinese named folders.
>
> And the errors disappeared after I modified some source code.
>
> I try to check the code and maybe find the problem.In Chinese Windows OS we can't directly use the following code for a path:
> unicode(path, 'utf-8')
> or
> path.encode('utf-8')
>
> but the following is correct:
> unicode(path, 'gb2312')
>
> if using the following code, it maybe support Korean or Japanese.
> unicode(path, sys.getfilesystemencoding())

This looks like a good fix, will include it. Only question is whether we
need to have compatibility mode for current users. Really need to be
sure this does not break anything for current users.

Regards,

Jaap

aosp (aosp) wrote :

I think these problems only appear on non-English versions of Windows.
I don't know how many Chinese (or other language) users zim has. But I have taken notes in zim for a long time, ever since it was rewritten in Python, and I modify the source code for Chinese after every new zim release because zim keeps becoming more powerful. I would like to run more experiments on different language systems to check that the new code I submitted does not break anything for current users.

Regards,

Aosp

See also bug #561121 for issue with locale "C".

Changed in zim:
status: New → Confirmed
importance: Undecided → High

Summarizing: this change is based on the fact that most Python filesystem functions already do the encoding themselves. All we need to do is decode back to unicode after reading the files. It seems a good idea to at least try decoding as UTF-8 when decoding in the preferred encoding fails (see bug #561121). Any file that cannot be decoded even after the fallback should be treated as invalid (?).
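
The decode-with-fallback idea in the summary above might look roughly like this in Python 3. It is a sketch under assumptions, not the code committed to zim: the function name `decode_name` and the choice to return `None` for an undecodable name are both made up for illustration.

```python
def decode_name(raw, preferred, fallback="utf-8"):
    """Decode a file name, trying the preferred encoding before the fallback.

    Returns None when neither encoding fits, so the caller can treat the
    name as invalid.
    """
    for encoding in (preferred, fallback):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None


# A UTF-8 name on a system whose preferred encoding is plain ASCII (locale "C").
print(decode_name("Ελληνικά".encode("utf-8"), "ascii"))
# Bytes that are valid in neither encoding are reported as invalid.
print(decode_name(b"\xff\xfe", "ascii"))
```

One caveat: the fallback only triggers when the preferred codec actually rejects the bytes. A permissive single-byte codec such as latin-1 accepts any input, so a wrong preferred encoding of that kind would silently produce garbled names instead of falling back.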

To avoid unicode encoding errors we should also do the encoding ourselves and handle errors. I propose applying utf-8 + url encoding for any chars that could not be encoded.

An exception to this rule is for win32 where the API is slightly different and prefers unicode strings.

See this page for some details: http://kofoto.rosdahl.net/wiki/UnicodeInPython
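
The proposal above to apply "utf-8 + url encoding" to characters that cannot be encoded could be sketched as follows for Python 3. The function name `encode_name` is hypothetical, and the thread does not show how zim implemented this, so treat it as one possible reading of the proposal.

```python
from urllib.parse import quote


def encode_name(name, encoding):
    """Encode name; percent-escape the UTF-8 bytes of characters that do not fit."""
    out = bytearray()
    for char in name:
        try:
            out += char.encode(encoding)
        except UnicodeEncodeError:
            # Fall back to the UTF-8 bytes of this character, written as %XX.
            out += quote(char.encode("utf-8")).encode("ascii")
    return bytes(out)


print(encode_name("caf\u00e9", "ascii"))   # the accented letter is escaped
print(encode_name("中", "gb2312"))          # fits the target encoding as-is
```

This sketch does not escape a literal `%` in the input, so the mapping is not reversible as written; a real implementation would need to handle that case.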

Alex Tu (alextu) wrote :

Thanks for your kind response.
But I ran into an issue when I execute the Python code on Windows.

I followed the README file and installed
gtk+-2.16.6
python-2.6.5
python-gtk-2.16.0
python-gobject-2.20.0

When I execute D:\zim-0.46>test.py
it causes an error and responds:

WARNING: Can not import 'xdg.Mime' - falling back to 'mimetypes'
Traceback (most recent call last):
  File "D:\zim-0.46\test.py", line 326, in <module>
    main()
  File "D:\zim-0.46\test.py", line 169, in main
    test = loader.loadTestsFromName(name)
  File "C:\Python26\lib\unittest.py", line 584, in loadTestsFromName
    parent, obj = obj, getattr(obj, part)
AttributeError: 'module' object has no attribute 'applications'

Is there any more I need to do?

@CCTU: this is a generic error when running the test. May or may not be related to the bug report above.

Made fixes for filesystem encoding. Committed in revision 239. Works for me with both ascii and utf-8 encoding.

Please test with Chinese encoding and let me know if any errors.

Attached a code snapshot for testing.

Changed in zim:
status: Confirmed → In Progress
Alex Tu (alextu) wrote :

I am a Chinese Windows XP user.
I can see the Chinese index and search for Chinese characters with the "pyzim revision 239" you provided.
It seems to fix this issue.

But I need to install pyCairo before executing "pyzim revision 239", otherwise it shows

------------------------------------
D:\zim-0.46-chinese>zim.py
WARNING: Can not import 'xdg.Mime' - falling back to 'mimetypes'
Traceback (most recent call last):
  File "D:\zim-0.46-chinese\zim.py", line 45, in <module>
    zim.main(argv)
  File "D:\zim-0.46-chinese\zim\__init__.py", line 290, in main
    import zim.gui
  File "D:\zim-0.46-chinese\zim\gui\__init__.py", line 30, in <module>
    import gtk
  File "C:\Python26\lib\site-packages\gtk-2.0\gtk\__init__.py", line 40, in <module>
    from gtk import _gtk
ImportError: No module named cairo
------------------------------------

so I installed pyCairo 1.8 from http://alex.matan.ca/install-cairo-wxpyton-pycairo-python-windows and then pyzim works.

But I still could not execute test.py; it reports a bad file name error, as shown below.
------------------------------------
D:\zim-0.46-chinese>test.py
Traceback (most recent call last):
  File "D:\zim-0.46-chinese\test.py", line 326, in <module>
    main()
  File "D:\zim-0.46-chinese\test.py", line 155, in main
    tests.set_environ()
  File "D:\zim-0.46-chinese\tests\__init__.py", line 43, in set_environ
    shutil.rmtree(tmpdir)
  File "C:\Python26\lib\shutil.py", line 216, in rmtree
    rmtree(fullname, ignore_errors, onerror)
  File "C:\Python26\lib\shutil.py", line 216, in rmtree
    rmtree(fullname, ignore_errors, onerror)
  File "C:\Python26\lib\shutil.py", line 216, in rmtree
    rmtree(fullname, ignore_errors, onerror)
  File "C:\Python26\lib\shutil.py", line 221, in rmtree
    onerror(os.remove, fullname, sys.exc_info())
  File "C:\Python26\lib\shutil.py", line 219, in rmtree
    os.remove(fullname)
WindowsError: [Error 123] 檔案名稱、目錄名稱或磁碟區標籤語法錯誤。: './tests/tmp/export_ExportedFiles\\utf8\\??????'
------------------------------------

The file name under .\tests\tmp\export_ExportedFiles\utf8\ is "טכניון".
It looks like it is in Hebrew.
Attached is a snapshot of the folder ".\tests\tmp\export_ExportedFiles\utf8\ טכניון" for reference.
http://picasaweb.google.com.tw/lh/photo/EtC4etmixRgBDZlzHYzs8A?feat=directlink
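
Windows error 123 means that the file name, directory name, or volume label syntax is incorrect. One plausible reading of the `??????` in the traceback is that the Hebrew name could not be represented in the system's byte encoding and was replaced character by character. The sketch below (Python 3) illustrates that reading only; the code page `cp950` is an assumption based on the Traditional Chinese error message, and `fits` is a hypothetical helper.

```python
name = "טכניון"  # the file name quoted in the comment above


def fits(text, encoding):
    """Return True when every character of text can be encoded."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(fits(name, "cp950"))                     # assumed system code page
print(name.encode("cp950", errors="replace"))  # lossy conversion to bytes
print(fits(name, "utf-8"))
```

A name reduced to question marks no longer identifies the file, and `?` is not a legal character in a Windows file name, which would be consistent with `os.remove` failing on it.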

Zim is really a good note application.
I am very happy about chinese index and chinese search supported in windows XP.
Thanks very much! :-)

Updated the snapshot with some more fixes (revision 242). Just tested under windows, and no tests failing due to character encoding (some other errors I still need to investigate).

Please confirm Chinese characters are still working...

@CCTU: the pycairo dependency is not due to zim, this is due to the pygtk version you installed.

Alex Tu (alextu) wrote :

After testing v242 I found some issues.

Issue 1.
It works fine if I only create pages with Chinese titles at the top level.
But it crashes if I create a subpage with a Chinese or English title after creating some pages with Chinese titles.
Attached are snapshots and error log files.
Subpage with Chinese title:
snapshot : http://picasaweb.google.com.tw/lh/photo/UhhcBKnS90vIOsHTiDL_pw?feat=directlink
error log : http://dl.dropbox.com/u/2495108/v242_chinese_sub_title_failed.png.txt

Subpage with English title:
snapshot : http://picasaweb.google.com.tw/lh/photo/6-u2lTsqEZ4xtEi3yi7sqw?feat=directlink
error log : http://dl.dropbox.com/u/2495108/v242_chinese_sub_title_failed_eng.png.txt

Issue 2.
If I saved the notes before issue 1 occurred, some pages with Chinese titles disappear the next time zim opens the notebook.
This issue does not occur in zim-ver239.
Attached are a snapshot and an error log file.
snapshot : http://picasaweb.google.com.tw/lh/photo/pYwBlCWhhvNyMUEcWeQryA?feat=directlink
error log : http://dl.dropbox.com/u/2495108/V242_index_faile.png.txt

On the other hand, I found some issues in zim-ver239.

Issue 1.
Same as issue 1 of ver 242.

Issue 2.
Sometimes the content search does not return all content written in Chinese when I search for Chinese characters.

aosp (aosp) wrote :

Thanks for your work.
I tried the snapshot of the zim sources at rev242 on my Windows, but some things still do not work correctly, just as CCTU described.
I think it is still a path unicode problem. At line 205, in the decode function in fs.py, I tried this modification:

202 def decode(path):
203 if isinstance(path, unicode):
204 try:
205 return path.decode(ENCODING) --> return path

After changing it to return path directly instead of decoding, zim works completely in Chinese. I haven't tried it on my Chinese Ubuntu yet.
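
A likely explanation for why this change helps: in Python 2, calling `.decode()` on an object that is already `unicode` first encodes it implicitly with the default codec (normally ASCII), and that step fails for Chinese text. The sketch below mimics this explicitly in Python 3. The value of `ENCODING` is an assumption, since the thread does not show how it is set, and `py2_style_decode` is a hypothetical name.

```python
ENCODING = "gb2312"  # assumed value; the thread does not show how ENCODING is set


def py2_style_decode(text):
    """Mimic Python 2's unicode.decode(): implicit ASCII encode, then decode."""
    return text.encode("ascii").decode(ENCODING)


def decode(path):
    """Leave text untouched and decode only byte strings."""
    if isinstance(path, str):
        return path
    return path.decode(ENCODING)


page = "中文"
try:
    py2_style_decode(page)
except UnicodeEncodeError as exc:
    print("old behaviour fails:", exc)
print(decode(page) == page)
```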

Can someone give me a few page names with chinese characters that can be used as file names under the gb2312 encoding ? I want to add these to the test suite to ensure this never breaks again.

I see some names in the screenshot, but I can not type them myself ...

Maybe I asked this already before but can not find it. Please remind me in that case.

aosp (aosp) wrote :

Here are some of my notes.

Alex Tu (alextu) wrote :

The attached note was created on Chinese Windows XP, but I don't know whether it is encoded in gb2312.
I don't know how to check which encoding it uses.

This note has two pages with Chinese titles.
The screenshot of the attached note is http://picasaweb.google.com.tw/lh/photo/qwIeyUaR1KmqBac44KQ_gg?feat=directlink

OK, applied another patch - hope I got it right this time. Please test it.

aosp (aosp) wrote :

The rev246 version runs normally in Chinese, but after I run the 'Re-build Index' command the tree view crashes.

I checked the index.db file. The contents of the basename field in the pages table had become garbled Chinese character codes again. I think maybe the sqlite index function did not encode or decode.
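
The suspicion above is not confirmed in the thread, but the general pattern it points at (decode once at the file system boundary, then hand text to sqlite) can be sketched in Python 3 as follows. Only the table name `pages` and the column `basename` come from the comment; the rest of the schema and the sample name are hypothetical, and this is not zim's indexing code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (basename TEXT)")

raw = "中文".encode("gb2312")   # bytes as they might come from the file system
name = raw.decode("gb2312")     # decode once, at the boundary

# Bind text rather than raw bytes: sqlite3 stores str values as UTF-8 TEXT.
conn.execute("INSERT INTO pages (basename) VALUES (?)", (name,))
stored = conn.execute("SELECT basename FROM pages").fetchone()[0]
print(stored == name)
```

Binding the undecoded GB2312 bytes instead would store them as an opaque blob, and any later attempt to read them as UTF-8 text would give the kind of garbled names described above.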

Alex Tu (alextu) wrote :

rev246 can add pages with Chinese titles, but it produces some garbled titles the next time zim starts.
For example:
step 1. This is a screenshot of adding pages with Chinese titles, which works.
http://picasaweb.google.com.tw/lh/photo/oIn9KHYE3oSBvXaqJUO_Cg?feat=directlink

step 2. This is after restarting zim: some garbled titles were produced, and the latest title is gone.
http://picasaweb.google.com.tw/lh/photo/0i_Uzdr3S8iIDcizXs4Srg?feat=directlink

The attached log message for v246 is the output when I restart zim (step 2).

OK, one more fix applied. Now tested successfully on a windows system with Greek characters in the page name. Should work for Chinese characters as well. See rev256.

Alex Tu (alextu) wrote :

Excuse me, but where can I get rev256 to test Chinese page names?

On Mon, May 31, 2010 at 6:02 PM, CCTU <email address hidden> wrote:
> Excuse me, but where could I get rev256 for test chinese page name?

Sorry for not uploading a snapshot. The pre-release of the windows
installer should be up to rev257. See this mail for details:
https://lists.launchpad.net/zim-wiki/msg00730.html

-- Jaap

Alex Tu (alextu) wrote :

That is a perfect solution for Chinese page names.
Thanks very much!!

But I found two issues in v257.

Issue 1:
An error message pops up when I close Zim.
http://picasaweb.google.com.tw/lh/photo/eh5O9r1mK_vlAbIPEtKu_Q?feat=directlink

It seems that some plug-in is missing from the zim Windows package.
The attached file "error_log_after_close_zim" is the log message.

Issue 2:
In the search window, I cannot double-click a search result to open the page.
http://picasaweb.google.com.tw/lh/photo/XiykJLauPrkHAhRK1iIslA?feat=directlink
This function worked in the previous version.

On Tue, Jun 1, 2010 at 8:18 AM, CCTU <email address hidden> wrote:
> That is a perfect solution for chinese page name.
> Thanks very much!!
>
> But I found two issue in v257.
>
> Issue 1:
> An error message pop up when I close Zim.
> http://picasaweb.google.com.tw/lh/photo/eh5O9r1mK_vlAbIPEtKu_Q?feat=directlink
>
> It seems like there lost some plug-in in zim windows package.
> attached file "error_log_after_close_zim" is log message.

I think this popup is a known issue; there are several bugs open
already related to this.

> Issue 2:
> In search window, I could not double click searched item to open page.
> http://picasaweb.google.com.tw/lh/photo/XiykJLauPrkHAhRK1iIslA?feat=directlink
> This function is ok in previous version.
>
> ** Attachment added: "error_log_after_close_zim"
>   http://launchpadlibrarian.net/49482196/zim.exe.log

This is a new bug, please open a new bug report and attach this error log.

Regards,

Jaap

aosp (aosp) wrote :

I tested v257. The Chinese page name bug has been fixed, but I found another issue. If I put zim in a path that contains some Chinese characters, like this:

c:\test\(some Chinese characters)\zim\zim.py

The software runs but cannot open the dialog from the menu "Edit - Preference".

Error Logs:

Traceback (most recent call last):
  File "C:\test\新建文件夹\zim\zim\gui\__init__.py", line 1060, in show_preferences
  File "C:\test\新建文件夹\zim\zim\gui\preferencesdialog.py", line 75, in __init__
  File "C:\test\新建文件夹\zim\zim\gui\preferencesdialog.py", line 163, in __init__
  File "C:\test\新建文件夹\zim\zim\gui\preferencesdialog.py", line 291, in __init__
  File "C:\test\新建文件夹\zim\zim\gui\preferencesdialog.py", line 262, in __init__
  File "C:\test\新建文件夹\zim\zim\plugins\__init__.py", line 45, in list_plugins
  File "C:\test\新建文件夹\zim\zim\fs.py", line 317, in __init__
zim.errors.Error: BUG: invalid input, file names should be in ascii, or given as unicode

Unspecified error...

If the path contains only English characters, the error does not appear.

@aosp: could you run the following and post the result?

In a dos prompt run:

  $ python
  >>> import sys
  >>> sys.path

thanks.

aosp (aosp) wrote :

Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path
['', 'E:\\Development\\python\\lib\\site-packages\\pil-1.1.7b1-py2.5-win32.egg',
 'E:\\Development\\python\\lib\\site-packages\\django-1.1-py2.5.egg', 'E:\\Devel
opment\\python\\lib\\site-packages\\web.py-0.33-py2.5.egg', 'E:\\Development\\py
thon\\lib\\site-packages\\flup-1.0.3.dev_20100221-py2.5.egg', 'E:\\Development\\
python\\lib\\site-packages\\simplejson-2.1.1-py2.5-win32.egg', 'E:\\Development\
\python\\lib\\site-packages\\setuptools-0.6c11-py2.5.egg', 'E:\\Development\\pyt
hon\\lib\\site-packages\\paramiko-1.7.4-py2.5.egg', 'D:\\Program Files\\ArcGIS\\
bin', 'C:\\Windows\\system32\\python25.zip', 'E:\\Development\\python\\DLLs', 'E
:\\Development\\python\\lib', 'E:\\Development\\python\\lib\\plat-win', 'E:\\Dev
elopment\\python\\lib\\lib-tk', 'E:\\Development\\python', 'E:\\Development\\pyt
hon\\lib\\site-packages', 'E:\\Development\\python\\lib\\site-packages\\gtk-2.0'
]
>>>

Hmm, no unicode paths in sys.path - that doesn't help much...
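
A slightly more targeted version of the `sys.path` check requested above might report the type of each entry and whether it contains non-ASCII characters. This is a hypothetical diagnostic written for Python 3, not something used in the thread; the helper name `describe_entries` is made up.

```python
import os
import sys


def describe_entries(entries):
    """Return (entry, type name, is_ascii) for each path entry."""
    report = []
    for entry in entries:
        # os.fsdecode() accepts str, bytes and path-like objects.
        text = os.fsdecode(entry)
        is_ascii = all(ord(char) < 128 for char in text)
        report.append((entry, type(entry).__name__, is_ascii))
    return report


for entry, kind, is_ascii in describe_entries(sys.path):
    print(kind, "ascii" if is_ascii else "NON-ASCII", repr(entry))
```

In the output posted above every entry is a plain ASCII byte string, so a check like this would have flagged nothing there; the localized directory only shows up in the module file names in the traceback.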

OK, added another check for localized names in the file path, hope this helps.

Closing this bug as unicode file names now tested to work on windows. If you find any corner cases please open a new report.

Thanks for your help getting this working.

Changed in zim:
status: In Progress → Fix Released