When browsing annotations in the main interface, Chinese text cannot be searched

Bug #1929325 reported by liuyun
This bug affects 1 person
Affects: calibre
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

calibre version: calibre 5.18, 64-bit
Operating system: Windows 10 20H2

When browsing annotations in the main interface (not the reading interface, where search works normally), searching for Chinese text returns no results, while English text can be found.

Kovid Goyal (kovid) wrote :

Works for me. Steps I tried:

1) Open an epub book
2) Create a highlight
3) Add the note 搜索hello to it
4) Close book
5) Restart calibre
6) Open the annotations browser and type 搜索 in the search bar
7) The newly created note is found

Kovid Goyal (kovid)
Changed in calibre:
status: New → Invalid
liuyun (liuyun99) wrote :

Thank you for your reply. I followed steps 1 to 7 and found that Chinese text is still not searchable. Some English words can be found (e.g. book), while others cannot (e.g. hello), as shown in the following screen recording. (GIF images in an email reply cannot be displayed, so I am giving the link instead.)
https://i.loli.net/2021/05/24/8meb4NhSQaGjcCg.gif

Kovid Goyal (kovid) wrote :

Strange, I don't see how that's possible. Attach the metadata.db file from your calibre library folder and I will see if I can reproduce it with that.

liuyun (liuyun99) wrote :

My calibre is installed on drive D rather than drive C. I don't know whether that is related to the bug.

liuyun (liuyun99) wrote :

I may have found the problem. English words have spaces between them, and matching is done word by word. Chinese text can only be matched exactly, against a whole run of text bounded by separators (spaces, commas, periods, etc.), as shown in the GIF. The reading interface search handles this well when the "include" or "regular expression" option is selected.

Kovid Goyal (kovid) wrote : Re: calibre bug 1929325

Sadly this is a limitation of SQLite FTS5, which is what the
annotations browser uses to do full text searching. It only tokenizes
input based on delimiters and does not do general ICU tokenization.
https://<email address hidden>/msg112010.html

Until and unless it gains support for Unicode tokenization, searching in
languages that don't use delimiters is not going to work. The FTS5 module
does support creating custom tokenizers, so one could in theory create
one based on Unicode. However, doing so is too much work for me; patches
welcome.
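
The limitation described above can be reproduced outside calibre with a minimal sketch. It assumes a Python build whose bundled SQLite includes FTS5 with its default tokenizer; the table name `notes`, the column `body`, and the helper `search` are illustrative only and are not calibre's actual schema or code.

```python
import sqlite3

# In-memory database with an FTS5 table using the default tokenizer.
# "notes" and "body" are hypothetical names, not calibre's schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE notes USING fts5(body)")
conn.executemany(
    "INSERT INTO notes(body) VALUES (?)",
    [("搜索hello",), ("搜索 hello",), ("a good book",)],
)


def search(query):
    """Return the bodies of all notes matching an FTS5 phrase query."""
    # Double quotes make FTS5 treat the text as one phrase string.
    phrase = '"' + query.replace('"', '""') + '"'
    rows = conn.execute(
        "SELECT body FROM notes WHERE notes MATCH ?", (phrase,)
    )
    return [row[0] for row in rows]


for q in ("book", "搜索", "hello", "搜索hello"):
    print(q, "->", search(q))
```

Since the default tokenizer splits only at separator characters, the note 搜索hello should be indexed as a single token. Queries for 搜索 or hello should then find only the note that contains a space, which matches the behavior shown in the reporter's screen recording.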

 status wontfix

Changed in calibre:
status: Invalid → Won't Fix
liuyun (liuyun99) wrote :

SQLite + FTS5 + a Chinese tokenizer may be a solution.
Simple tokenizer: A sqlite3 fts5 tokenizer which supports Chinese and PinYin.
https://github.com/wangfenjin/simple

Kovid Goyal (kovid) wrote :

That's not really a workable solution. The problem is that this tokenizer
will work for Chinese, but not for Japanese/Thai/Khmer, which are the
other languages that don't have word separators. So one needs a full
ICU-based tokenizer, not a "simple" one. And then, IIRC, changing tokenizers
requires a full rebuild of the index, which is not something to be
undertaken lightly.

A better workaround would be to pre-tokenize the text fed to SQLite by
breaking up the words using ICU and separating them with something like
the zero width non-joiner, before feeding it to SQLite.
The problems with this approach are:

1) One would need to tokenize the queries as well
2) Annotations don't have a defined language, but Unicode word breaking
requires a language. So one would have to split up text using Unicode
script properties, then break the chunks based on detected language.

This is a fair bit of work but probably easier than writing a custom
tokenizer. calibre already has ICU bindings that can be used to split
text into words given a language, used for spell checking in the editor.
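
The pre-tokenizing workaround described above can be sketched roughly as follows. This is not calibre's implementation. As a loudly labeled assumption, ICU word breaking is replaced by a crude stand-in that treats every character in the basic CJK Unified Ideographs block as its own word, and the delimiter-based tokenizer is emulated with a regular expression instead of SQLite. The names `pretokenize`, `tokens`, and `phrase_matches` are hypothetical.

```python
import re

ZWNJ = "\u200c"  # zero width non-joiner, used as an invisible separator

# ASSUMPTION: crude stand-in for ICU word breaking. Each character in the
# basic CJK Unified Ideographs block is treated as a separate "word".
_CJK = r"\u4e00-\u9fff"
_BOUNDARY = re.compile(rf"(?<=[{_CJK}])(?=\w)|(?<=\w)(?=[{_CJK}])")


def pretokenize(text):
    """Insert ZWNJ at every word boundary the stand-in rule detects."""
    return _BOUNDARY.sub(ZWNJ, text)


def tokens(text):
    """Emulate a delimiter-based tokenizer: runs of word characters."""
    return re.findall(r"\w+", text.lower())


def phrase_matches(query, document):
    """True if the pre-tokenized query occurs as a phrase in the document.

    Both sides go through pretokenize(), reflecting point 1 above.
    """
    q = tokens(pretokenize(query))
    d = tokens(pretokenize(document))
    return bool(q) and any(
        d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1)
    )


print(tokens(pretokenize("搜索hello")))
print(phrase_matches("搜索", "搜索hello"))
print(phrase_matches("hello", "搜索hello"))
```

Splitting per character is only a placeholder. A real implementation would need proper word segmentation per script and language, as point 2 above notes, and that is the part this sketch skips.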

liuyun (liuyun99) wrote :

Thank you for your reply; you have considered this thoroughly. I hope calibre keeps getting better and better.

Kovid Goyal (kovid) wrote : Fixed in master

Fixed in branch master. The fix will be in the next release. calibre is usually released every alternate Friday.

 status fixreleased

Changed in calibre:
status: Won't Fix → Fix Released
Eli Schwartz (eschwartz) wrote :

Ref: https://github.com/kovidgoyal/calibre/commit/d8752252e61d36a249923b59ef5f524f62f1a393

And surrounding changes. Note that calibre now includes a SQLite fts plugin based on ICU and libstemmer.

Since all the C/C++ changes were present in the latest release (5.22), you should be able to follow the calibre manual's development guide to plug in the Python sources and test this out in development mode.

Then you could check if it works, and if not, offer additional improvement suggestions ahead of the general public release?

Kovid Goyal (kovid) wrote : Re: calibre bug 1929325

I have uploaded beta builds that can be used for testing:
https://download.calibre-ebook.com/betas/

liuyun (liuyun99) wrote :

Thanks for the author's hard work. I've updated to calibre 5.23, and searching in the annotations browser now works fine. Also, I just submitted a new feature enhancement suggestion.

https://bugs.launchpad.net/calibre/+bugs?search=Search&field.bug_reporter=liuyun99
