BluePrint searchtext= not returning correct results

Bug #1025357 reported by Antonio Rosales on 2012-07-16
This bug affects 3 people
Affects: Launchpad itself
Importance: Low
Assigned to: Unassigned

Bug Description

Searching https://blueprints.launchpad.net/~ubuntu-server for blueprints that contain "servercloud-q" in the "Show only blueprints containing" field (i.e. searchtext=) returns no results when it should return 39. Each of server's blueprints for this cycle was named with the prefix "servercloud-q" to make them easy to find by searching. This worked around the May/early June time frame, but seems to have stopped recently.

-Thanks.

Abel Deuring (adeuring) wrote :

As a workaround, you can search simply for "servercloud q". That's basically what was used internally before. (well, strictly speaking, the internal search term was "(servercloud & q) | servercloudq", but "servercloudq" would not yield a match.)

This issue is caused by my work on bug 29713, specifically to fix the problem described at the end of comment #7 that certain filenames cannot be searched. My conclusion was that it is best to simply not mangle any '-' inside a word.

This bug is a good example that we should do this again, but slightly modified. The current situation:

The FTI data is for example

select to_tsvector('servercloud-q-cloud-archive');
                               to_tsvector
-------------------------------------------------------------------------
 'archiv':5 'cloud':4 'q':3 'servercloud':2 'servercloud-q-cloud-arch':1

and the ts_query for "servercloud-q" is:

select ftq('servercloud-q');
                  ftq
---------------------------------------
 'servercloud-q' & 'servercloud' & 'q'

(This is the same as a direct call of to_tsquery())

So, the "blocker" is that "servercloud-q" is not part of the FTI.

We should probably re-introduce a form of "mangling of hyphens" so that we have a call of to_tsquery() like:

to_tsquery('servercloud-q | (servercloud & q)')

The result of this call is slightly redundant but should work:

select to_tsquery('servercloud-q | (servercloud & q)');
                         to_tsquery
-------------------------------------------------------------
 'servercloud-q' & 'servercloud' & 'q' | 'servercloud' & 'q'

Note that a simple s/-/ / for a search term will cause problems for words that are treated as file names or host names:

launchpad_dev=# select to_tsvector('file-name.txt');
    to_tsvector
-------------------
 'file-name.txt':1

select to_tsquery('file & name.txt');
         ftq
---------------------
 'file' & 'name.txt'

So here we must keep the '-'. The "redundant-looking" variant, calling to_tsquery('file-name.txt | (file & name.txt)'), makes searches succeed both for words like "servercloud-q-cloud-archive" as described here and for file/host names containing dashes.
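
The "term | (part & part)" style of query can be sketched as a small helper. This is only an illustration under stated assumptions: hyphen_query is a hypothetical name, it is not Launchpad's ftq(), and it handles a single search word without any of the other rewriting ftq() does.

```python
def hyphen_query(term: str) -> str:
    """Build 'term | (part1 & part2 ...)' for one hyphenated search word.

    Hypothetical helper for illustration; not Launchpad's actual ftq() logic.
    """
    # Split on '-' and ignore empty pieces from leading, trailing or double dashes.
    parts = [part for part in term.split("-") if part]
    if len(parts) < 2:
        # No inner hyphen: nothing to expand, pass the term through unchanged.
        return term
    return f"{term} | ({' & '.join(parts)})"


print(hyphen_query("servercloud-q"))  # servercloud-q | (servercloud & q)
print(hyphen_query("file-name.txt"))  # file-name.txt | (file & name.txt)
```

The resulting string is what would be handed to to_tsquery(); whether the original term or the split form matches then depends on how the parser tokenised the indexed text.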

Changed in launchpad:
status: New → Triaged
Abel Deuring (adeuring) wrote :

As a side note, be careful of other pitfalls: Names like "servercloud-a-foo" and a search for "servercloud-a" (old variant) or "servercloud a" (new suggested variant) would return all names containing "servercloud", including "servercloud-b-bar" and "servercloud-q-baz" because "a" is a stop word -- too frequently used in English...

Changed in launchpad:
importance: Undecided → Critical
Abel Deuring (adeuring) wrote :

I am not sure if this is really critical. After all, a slightly modified query will work.

And while I sketched a possible fix in my previous comment, I am not sure if we should implement it or simply keep the current behaviour: my impression of ftq() is that it tries to do too much DWIM anyway, and I'd prefer to make the function simpler and thus its behaviour more predictable.

A bad DWIM example: A '-' preceded by a space and preceding a word is converted into a '!', i.e., the term "-foo" is treated as "find texts that do not contain the word 'foo'". That's fine for words, but breaks utterly for numbers. The TS data for a simple calculation:

select to_tsvector('123-456');
   to_tsvector
------------------
 '-456':2 '123':1

"123-456" used as a search term:

select ftq('123-456');
      ftq
----------------
 '123' & '-456'

so there is no match because '456' != '-456':

select to_tsvector('123 - 456') @@ ftq('123-456');
 ?column?
----------
 f

Changed in launchpad:
importance: Critical → Low
Curtis Hovey (sinzui) on 2012-07-17
tags: added: lp-blueprints search specifications
Antonio Rosales (arosales) wrote :

@Abel,

Thanks for your work on this issue and for your explanation of it. I can confirm that searching for "servercloud q" or https://blueprints.launchpad.net/~ubuntu-server?searchtext=servercloud+q does return the correct result. Per your comment 2, if I understand correctly, going forward a search for "servercloud r" (blueprints named servercloud-r-foo) would return unique results (i.e. not return servercloud-q blueprints); the issue is when we name a future blueprint servercloud-q-r-foo. I just wanted to confirm that our naming scheme going forward will not present any issues. We plan to name blueprints as servercloud-<cycle>-<blueprint-name>.

-Thanks.

Abel Deuring (adeuring) wrote :

Antonio,

Right, if you have a name like "servercloud-q-r-foo", the name will be found if you search for "servercloud q" as well as for "servercloud r" -- probably not what you would expect ;)

The problem is that there is no way to search for adjacent words. The core search data provided by Postgres would allow this: for each word, its position in a given text is stored, so in theory there would be the option to limit the search to "return texts having 'servercloud' in position N and 'q' in position N+1", but AFAIK the rest of Postgres' search infrastructure does not provide it.
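
The "position N and position N+1" idea can be sketched with a toy position index. This is an assumption-laden model, not Postgres' parser: word_positions and adjacent are hypothetical names, words are split on '-' and whitespace only, and the extra compound token that to_tsvector() emits for hyphenated words is ignored.

```python
from collections import defaultdict


def word_positions(text: str) -> dict[str, list[int]]:
    """Map each word to its 1-based positions (toy tokeniser, not Postgres')."""
    positions = defaultdict(list)
    words = text.replace("-", " ").split()
    for index, word in enumerate(words, start=1):
        positions[word.lower()].append(index)
    return dict(positions)


def adjacent(text: str, first: str, second: str) -> bool:
    """True if `second` occurs at position N+1 for some position N of `first`."""
    positions = word_positions(text)
    following = set(positions.get(second.lower(), []))
    return any(n + 1 in following for n in positions.get(first.lower(), []))


print(adjacent("servercloud-q-r-foo", "servercloud", "q"))  # True
print(adjacent("servercloud-q-r-foo", "servercloud", "r"))  # False
```

With such a check, "servercloud r" would no longer match "servercloud-q-r-foo". PostgreSQL 9.6 and later added a phrase operator (`<->`) to tsquery that expresses this kind of adjacency condition, so it may be worth checking against the server version in use.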

The second possible problem is stop words, i.e., words that are not indexed. These are common English words like 'a', 'the', 'be', etc. The single-character words "a", "i", "s" and "t" are treated as stop words (meaning that they are not stored in the full text index). The reason for "a" and "i" is obvious; "s" is probably not indexed because it is used as the genitive marker (as in "Antonio's bug 1025357"); "t" is probably dropped because of words like "can't". (The single quotation mark is parsed as a word separator.)

Anyway, what I mean is: a search for "servercloud-s" would find texts containing "servercloud-b" or "servercloud-no-single-character-at-all" and so on, because the "s", being a stop word, is silently dropped from the query. That might be a reason to change the naming scheme for the releases "S" and "T" to avoid another unpleasant surprise ;)
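
The stop-word effect can be illustrated with a toy matcher. Everything here is an assumption made for illustration: STOP_WORDS contains only the four single-character words named above (the real English stop list is much longer), and tokens and matches are hypothetical helpers that split on '-' and whitespace rather than using Postgres' parser.

```python
# Only the single-character stop words mentioned in this thread; the real
# list used by the English text search configuration is much longer.
STOP_WORDS = {"a", "i", "s", "t"}


def tokens(text: str) -> set[str]:
    """Split on '-' and whitespace, then drop stop words (toy model)."""
    words = text.lower().replace("-", " ").split()
    return {word for word in words if word not in STOP_WORDS}


def matches(name: str, query: str) -> bool:
    """True if every non-stop-word of the query occurs in the name."""
    return tokens(query) <= tokens(name)


print(matches("servercloud-b-bar", "servercloud-s"))  # True: "s" was dropped
print(matches("servercloud-b-bar", "servercloud-q"))  # False: "q" is kept
```

Once the "s" is removed, the query is effectively just "servercloud", which is why a cycle letter that happens to be a stop word cannot narrow the results.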

Antonio Rosales (arosales) wrote :

@Abel,

Thanks for the additional information on the stop words. I wasn't expecting the behavior described for stop words such as:

'A search for "servercloud-s" would find texts containing "servercloud-b" or "servercloud-no-single-character-at-all" and so on because the "s", being a stop word, is silently dropped from the query.'

This is good to know as we approach the upcoming "s" and "t" cycles (thus probably why you pointed those two stop words out). We'll try to come up with a more unique naming scheme that avoids these stop words and still identifies blueprints only for that cycle when searching.

-Thanks,
Antonio
