Comment 2 for bug 383102

Alexander Belchenko (bialix) wrote : Re: [Bug 383102] [NEW] bzr search can't find non-ascii text

Robert Collins wrote:
> status triaged
> importance medium
>
>> As you can see, search tries to find unicode text in the plain file
>> (cp1251 encoded). It could never match.
>
> So, if the file content was utf8 it would be fine. Is there some way
> bzr-search can determine the encoding of the file at the time it indexes
> it? I know we can use the BOM for unicode text files. Perhaps there is a
> library out there that can do a good job.
>
> bzr-search needs a fixed index it can lookup in quickly, so it needs to
> generate unicode terms from the files it indexes. To date it's been
> pretty simplistic and assumed all content was utf8 :- clearly not
> true :).

I think bzr-search should use file content "as is", without decoding it
to unicode, because there is currently no way to guess the encoding with
complete reliability, and bzr has no file properties for attaching this
sort of info to the committed content.

In qbzr we use a special command-line option, --encoding, to specify
the file content encoding for diff/annotate. This approach works well.
The default encoding there is utf-8.

I suggest providing a similar option for the search command, e.g.

bzr search тест --encoding cp1251

so the encoding argument would be used to encode the command-line (unicode)
argument тест into that specific encoding, and the resulting bytes would
then be used verbatim for the search.
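
A rough sketch of the idea in Python, only to illustrate encoding the
query and matching it against raw file bytes. The helper names
encode_query and search_bytes are hypothetical and not part of
bzr-search, and the --encoding option for search is the proposal here,
not an existing flag:

```python
def encode_query(term, encoding="utf-8"):
    """Encode a unicode search term into the byte form used by the file."""
    return term.encode(encoding)


def search_bytes(content, term, encoding="utf-8"):
    """Return True if the encoded term occurs verbatim in the raw bytes."""
    return encode_query(term, encoding) in content


# Simulate a file saved as cp1251, as in the bug report.
content = "это тест".encode("cp1251")

# With the right encoding the raw bytes match.
found_cp1251 = search_bytes(content, "тест", "cp1251")

# With the utf-8 default the byte sequences differ, so nothing matches.
found_utf8 = search_bytes(content, "тест")
```

This is a plain substring scan, not an indexed lookup. A real index
would need to store terms as raw bytes for the same trick to work.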

Does it make sense for you?