bzr search can't index non-ascii text

Bug #383102 reported by Alexander Belchenko
This bug affects 1 person
Affects             Status    Importance  Assigned to  Milestone
bzr search plugin   Triaged   High        Unassigned

Bug Description

cp1251 strings in indexed content are ignored by the indexing process.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 383102] [NEW] bzr search can't find non-ascii text

 status triaged
 importance medium

> As you can see, search tries to find unicode text in the plain file
> (cp1251 encoded). It can never match.

So, if the file content was utf8 it would be fine. Is there some way
bzr-search can determine the encoding of the file at the time it indexes
it? I know we can use the BOM for unicode text files. Perhaps there is a
library out there that can do a good job.

bzr-search needs a fixed index it can look up in quickly, so it needs to
generate unicode terms from the files it indexes. To date it's been
pretty simplistic and assumed all content was utf8, which is clearly not
true :).

-Rob
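
The BOM idea mentioned above can be sketched roughly as follows. This is only an illustration, not bzr-search code: the helper names sniff_bom and decode_for_indexing are hypothetical, and the sketch is written for Python 3 rather than the Python 2 that bzrlib ran on. Note that a cp1251 file carries no BOM, so it would still fall through to the fallback encoding.

```python
import codecs

# Longer BOMs come first so UTF-32 LE is not mistaken for UTF-16 LE
# (the UTF-32 LE BOM starts with the same two bytes).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length) for a known BOM, else (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_for_indexing(data, fallback="utf-8"):
    """Decode bytes using the BOM if present, else the fallback encoding.

    Undecodable bytes are replaced instead of raising an error.
    """
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or fallback, errors="replace")
```

A caller that knows the real encoding (for example cp1251) could pass it as the fallback; without that knowledge the BOM check alone cannot identify legacy 8-bit encodings.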

Changed in bzr-search:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Alexander Belchenko (bialix) wrote :

Robert Collins wrote:
> status triaged
> importance medium
>
>> As you can see, search tries to find unicode text in the plain file
>> (cp1251 encoded). It can never match.
>
> So, if the file content was utf8 it would be fine. Is there some way
> bzr-search can determine the encoding of the file at the time it indexes
> it? I know we can use the BOM for unicode text files. Perhaps there is a
> library out there that can do a good job.
>
> bzr-search needs a fixed index it can look up in quickly, so it needs to
> generate unicode terms from the files it indexes. To date it's been
> pretty simplistic and assumed all content was utf8, which is clearly not
> true :).

I think bzr-search should use file content "as is", without decoding it
to unicode, because there is currently no way to guess the encoding with
complete reliability, and bzr has no file properties to attach this sort
of info to the committed content.

In qbzr we use a special command-line option, --encoding, to specify the
file content encoding for diff/annotate. This approach works well.
The default encoding there is utf-8.

I suggest providing a similar option for the search command, e.g.

bzr search тест --encoding cp1251

so that the encoding argument is used to encode the command-line (unicode)
argument тест to the specified encoding, and the resulting bytes are then
used verbatim for the search.

Does that make sense to you?
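
A minimal sketch of the proposal above, assuming the index (or the raw file content) is matched as bytes. The functions encode_query and raw_contains are hypothetical names for illustration and are not part of bzr-search or qbzr; the code is Python 3.

```python
def encode_query(term, encoding="utf-8"):
    """Encode a unicode command-line term to the bytes used for matching."""
    return term.encode(encoding)


def raw_contains(content, term, encoding="utf-8"):
    """Check whether the encoded term occurs verbatim in raw file bytes."""
    return encode_query(term, encoding) in content


# A file committed in cp1251: the bytes are stored "as is".
content = "foo тест bar".encode("cp1251")

print(raw_contains(content, "тест", "cp1251"))  # encoded the same way as the file
print(raw_contains(content, "тест"))            # default utf-8 yields different bytes
```

The design choice here is that the user, not the tool, supplies the encoding, which avoids guessing but means one query can only match files in one encoding at a time.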

Revision history for this message
Alexander Belchenko (bialix) wrote : Re: bzr search can't find non-ascii text

I've implemented an --encoding option; see the attached diff. Unfortunately, search still does not match non-ascii text for me, even with the correct encoding.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 383102] [NEW] bzr search can't find non-ascii text

On Thu, 2009-06-04 at 09:01 +0000, Alexander Belchenko wrote:
> I think bzr-search should use files content "as is", without decoding it
> to unicode. Because there is currently no way to absolutely correctly
> guess encoding and bzr has no file properties to attach this sort of
> info to the committed content.

This works too; OTOH it would be good to handle things like png with
metadata headers more sensibly.

> In qbzr we're using special command-line option --encoding to specify
> file content encoding for diff/annotate. This approach works well.
> Default encoding is utf-8 there.

> I suggest to provide similar option to search command, e.g.
>
> bzr search тест --encoding cp1251
>
> so this encoding argument will be used to encode command-line (unicode)
> argument тест to some specific encoding and then used verbatim to search.
>
> Does it make sense for you?

It certainly works better with bzr's lack of knowledge of file
encodings. But how will bzr-search know how to output the file's
contents sensibly? (For the hit preview).

-Rob

Revision history for this message
Alexander Belchenko (bialix) wrote :

Robert Collins wrote:
> On Thu, 2009-06-04 at 09:01 +0000, Alexander Belchenko wrote:
>> I think bzr-search should use files content "as is", without decoding it
>> to unicode. Because there is currently no way to absolutely correctly
>> guess encoding and bzr has no file properties to attach this sort of
>> info to the committed content.
>
> This works too; OTOH it would be good to handle things like png with
> metadata headers more sensibly.

Well, search for non-ascii text with my encoding patch *does not* work
for me. I don't know how to look at your raw indices, but I can provide
a test branch with Russian text.

>> In qbzr we're using special command-line option --encoding to specify
>> file content encoding for diff/annotate. This approach works well.
>> Default encoding is utf-8 there.
>
>> I suggest to provide similar option to search command, e.g.
>>
>> bzr search тест --encoding cp1251
>>
>> so this encoding argument will be used to encode command-line (unicode)
>> argument тест to some specific encoding and then used verbatim to search.
>>
>> Does it make sense for you?
>
> It certainly works better with bzr's lack of knowledge of file
> encodings. But how will bzr-search know how to output the file's
> contents sensibly? (For the hit preview).

There is only one way today: show it "as is". This is how annotate, cat
and diff work today, and people don't complain. I don't see a way
to make it better without file properties.

Revision history for this message
Robert Collins (lifeless) wrote :

On Tue, 2009-06-09 at 09:06 +0000, Alexander Belchenko wrote:

> Well, search for non-ascii text with my encoding patch *does not* work
> for me. I don't know how to look at your raw indices, but I can provide
> testing branch with russian text.

That would be great.
..
> There is only one way today: show it "as is". This is how annotate, cat
> and diff works today. And people don't complain. I don't see the way
> to make it better without file properties.

Could you also provide a small .py file that will do a test search? Just
something trivial using the API so I don't have to guess about whether
my command line encoding etc is doing the right thing.

-Rob

Revision history for this message
Alexander Belchenko (bialix) wrote :

Robert Collins wrote:
> On Tue, 2009-06-09 at 09:06 +0000, Alexander Belchenko wrote:
>
>> Well, search for non-ascii text with my encoding patch *does not* work
>> for me. I don't know how to look at your raw indices, but I can provide
>> testing branch with russian text.
>
> That would be great.
> ..
>> There is only one way today: show it "as is". This is how annotate, cat
>> and diff works today. And people don't complain. I don't see the way
>> to make it better without file properties.
>
> Could you also provide a small .py file that will do a test search? Just
> something trivial using the API so I don't have to guess about whether
> my command line encoding etc is doing the right thing.

Unfortunately, I'm not familiar with the bzr-search API. Can you give me
a template for the test script?

Revision history for this message
Robert Collins (lifeless) wrote :

On Tue, 2009-06-09 at 10:00 +0000, Alexander Belchenko wrote:

> I'm not familiar with bzr-search API unfortunately. Can you give me a
> template for the testing script?

For more, look at bzrlib.plugins.search.commands.cmd_search.

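from bzrlib.plugin import load_plugins
from bzrlib.transport import get_transport
load_plugins()  # so that bzrlib.plugins.search can be imported
import bzrlib.plugins.search.index
directory = '.'  # assumption: path or URL of the indexed branch
query_list = ['foo', 'bar']  # example value; replace with the real terms
trans = get_transport(directory)
index = bzrlib.plugins.search.index.open_index_url(trans.base)
query = [(query_item,) for query_item in query_list]  # one 1-tuple per term
index._branch.lock_read()
print list(index.search(query))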

So essentially, I'm asking for a value of query_list that I can use as
python source.

-Rob

Revision history for this message
Alexander Belchenko (bialix) wrote : Re: bzr search can't find non-ascii text

See the attached branch with the test.py script.

Revision history for this message
Alexander Belchenko (bialix) wrote :

Robert, with attached test script I can trigger this traceback:

C:\Temp\search>python "test1.py"
C:\work\Bazaar\bzr-repo\bzr\bzrlib\btree_index.py:1092: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  needed_keys = sorted(needed_keys)
Traceback (most recent call last):
  File "test1.py", line 29, in <module>
    print list(index.search(query))
  File "C:\work\Bazaar\plugins\search\index.py", line 607, in search
    for component, termlist, common_doc_keys in self._search_work(termlist):
  File "C:\work\Bazaar\plugins\search\index.py", line 555, in _search_work
    component.term_2_index.iter_entries(term_keys[2])):
  File "C:\work\Bazaar\bzr-repo\bzr\bzrlib\btree_index.py", line 1092, in iter_entries
    needed_keys = sorted(needed_keys)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd2 in position 0: ordinal not in range(128)
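
The traceback above is consistent with Python 2 implicitly decoding a byte string with the ascii codec when it is compared against a unicode string during sorting. The Python 3 sketch below makes that implicit step explicit. The helper name implicit_py2_decode is hypothetical, and the assumption that the test term began with a capital "Т" is inferred only from the reported byte 0xd2, which is that letter in cp1251.

```python
# Assumed test term; only its first byte (0xd2) is known from the traceback.
term_bytes = "Тест".encode("cp1251")


def implicit_py2_decode(raw):
    """Mimic the ascii decode Python 2 attempted when mixing str and unicode."""
    return raw.decode("ascii")


try:
    implicit_py2_decode(term_bytes)
    error = None
except UnicodeDecodeError as exc:
    # The first byte is outside the 7-bit ascii range, so decoding fails at once.
    error = exc

print(error)
```

Keeping every key in the query as the same type (all byte strings, or all unicode) avoids this implicit conversion altogether.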

Revision history for this message
Alexander Belchenko (bialix) wrote :

Robert, maybe you should enforce a check in your search API so that only plain strings are allowed in query_list.
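
A check of that kind might look like the sketch below. It is not the bzr-search API: check_query_terms is a hypothetical helper, and it is written for Python 3, where the "plain string" of Python 2 corresponds to bytes.

```python
def check_query_terms(query_list):
    """Reject non-byte-string terms, then build the 1-tuple query keys."""
    for term in query_list:
        if not isinstance(term, bytes):
            raise TypeError(
                "search terms must be byte strings, got %r" % (term,))
    return [(term,) for term in query_list]


print(check_query_terms([b"foo", "тест".encode("cp1251")]))
```

Failing early with a clear TypeError would replace the UnicodeDecodeError that otherwise surfaces deep inside the index layer.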

Revision history for this message
Robert Collins (lifeless) wrote :

Thanks. The UnicodeDecodeError errors show up because of the 'u' prefix before foo and bar. Neither should be affecting the search. I'm digging into the index layer now.

Revision history for this message
Robert Collins (lifeless) wrote :

The script doesn't search for a matchable combination, as bar isn't present.
Putting
index._branch.lock_read()
def test_term(term):
    print term, list(index.search([(term,)]))
test_term(TEST_RU)
test_term(FOO)
at the bottom shows the Russian cp1251 term not matching and foo matching.

Revision history for this message
Robert Collins (lifeless) wrote :

Hmm, I mistyped my change - of course it should be
test_term(TEST_RU.encode('cp1251'))

Anyhow:

(Pdb) print list(term_index.iter_all_entries())
[(<bzrlib.plugins.search.index.SuggestableBTreeGraphIndex object at 0x2617190>, ('1',), '4 1 765 142'),
 (<bzrlib.plugins.search.index.SuggestableBTreeGraphIndex object at 0x2617190>, ('2',), '5 1 925 142'),
 (<bzrlib.plugins.search.index.SuggestableBTreeGraphIndex object at 0x2617190>, ('bar',), '1 1 445 142'),
 (<bzrlib.plugins.search.index.SuggestableBTreeGraphIndex object at 0x2617190>, ('foo',), '0 1 285 142')]

This indicates that only four terms were found during the index phase: 1, 2, bar and foo. So the cp1251 string is being treated as whitespace or something similar.
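
One way such a result can arise is a tokenizer that only recognises ascii word characters. The sketch below is an assumption for illustration; the actual bzr-search tokenizer is not shown in this report, and the names ascii_terms and byte_terms are hypothetical. It is written for Python 3.

```python
import re

# Tokenizer that only accepts ascii letters, digits and underscore.
ASCII_WORD = re.compile(rb"[A-Za-z0-9_]+")
# Alternative that splits on whitespace only and keeps all other bytes.
NON_SPACE = re.compile(rb"\S+")


def ascii_terms(content):
    """Terms seen by an ascii-only tokenizer; high bytes act as separators."""
    return ASCII_WORD.findall(content)


def byte_terms(content):
    """Terms seen by a whitespace-splitting tokenizer; high bytes are kept."""
    return NON_SPACE.findall(content)


content = b"foo " + "тест".encode("cp1251") + b" bar 1 2"

print(sorted(ascii_terms(content)))
print(sorted(byte_terms(content)))
```

With the ascii-only pattern the cp1251 word vanishes and only 1, 2, bar and foo remain, which mirrors the four terms in the index dump above.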

Changed in bzr-search:
importance: Medium → High
summary: - bzr search can't find non-ascii text
+ bzr search can't index non-ascii text
description: updated
Revision history for this message
Alexander Belchenko (bialix) wrote :

I see what you mean.
