BluePrint searchtext= not returning correct results
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Launchpad itself | | Low | Unassigned | |
Bug Description
Searching for the blueprints @ https:/
-Thanks.
Abel Deuring (adeuring) wrote : | #2 |
As a side note, be careful of other pitfalls: Names like "servercloud-a-foo" and a search for "servercloud-a" (old variant) or "servercloud a" (new suggested variant) would return all names containing "servercloud", including "servercloud-b-bar" and "servercloud-q-baz" because "a" is a stop word -- too frequently used in English...
Changed in launchpad: | |
importance: | Undecided → Critical |
Abel Deuring (adeuring) wrote : | #3 |
I am not sure if this is really critical. After all, a slightly modified query will work.
And while I sketched a possible fix in the previous comment, I am not sure if we should implement it or simply keep the current behaviour: My impression of ftq() is that it tries to do too much DWIM anyway, and I'd prefer to make the function simpler and thus its behaviour more predictable.
A bad DWIM example: A '-' preceded by a space and preceding a word is converted into a '!', i.e., the term "-foo" is treated as "find texts that do not contain the word 'foo'". That's fine for words, but breaks utterly for numbers. The TS data for a simple calculation:
select to_tsvector(
to_tsvector
------------------
'-456':2 '123':1
"123-456" used as a search term:
select ftq('123-456');
ftq
----------------
'123' & '-456'
so there is no match because '456' != '-456':
select to_tsvector('123 - 456') @@ ftq('123-456');
?column?
----------
f
Changed in launchpad: | |
importance: | Critical → Low |
tags: | added: lp-blueprints search specifications |
Antonio Rosales (arosales) wrote : | #4 |
@Abel,
Thanks for your work on this issue, and your explanation of the issue. I can confirm that searching for "servercloud q" or https:/
-Thanks.
Abel Deuring (adeuring) wrote : | #5 |
Antonio,
right, if you have a name like "servercloud-
The problem is that there is no way to search for adjacent words. The core search data provided by Postgres would allow this: for each word, its position in a given text is stored, so in theory there would be the option to limit the search to "return texts having 'servercloud' in position N and 'q' in position N+1", but AFAIK Postgres' remaining search infrastructure does not provide it.
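To illustrate the adjacency idea, here is a minimal Python sketch. It assumes a simplified model of the full text index: a mapping from each word to the list of positions at which it occurs, similar in spirit to what to_tsvector() prints. The function name has_adjacent and the sample data are hypothetical; they are not part of Launchpad or Postgres.

```python
def has_adjacent(positions, first, second):
    """Return True if `second` occurs directly after `first`.

    `positions` maps each indexed word to the list of positions at
    which it occurs, a simplified stand-in for tsvector data.
    """
    second_positions = set(positions.get(second, ()))
    return any(n + 1 in second_positions for n in positions.get(first, ()))


# Hypothetical index data for two texts.
text_q = {"servercloud": [1], "q": [2], "archive": [3]}
text_other = {"servercloud": [1], "archive": [2], "q": [3]}

print(has_adjacent(text_q, "servercloud", "q"))      # True
print(has_adjacent(text_other, "servercloud", "q"))  # False
```

A plain "servercloud & q" query would match both sample texts; only the position check tells them apart.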
The second possible problem is stop words, i.e., words that are not indexed. These are common English words like 'a', 'the', 'be' etc. The "single-character" words "a", "i", "s", "t" are treated as stop words (meaning that they are not stored in the full text index). The reason for "a" and "i" is obvious; "s" is probably not indexed because it is used as the "genitive marker" (like in "Antonio's bug 1025327"), and "t" is probably dropped because of words like "can't". (The single quotation mark is parsed as a word separator.)
Anyway, what I mean is: A search for "servercloud-s" would find texts containing "servercloud-b" or "servercloud-
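The stop word effect can be sketched in a few lines of Python. This is a deliberately simplified model, not Postgres' actual parser: it splits on any non-alphanumeric character, ignores stemming and compound hyphenated tokens, and assumes the stop word list consists only of the four single-character words named above. The names STOP_WORDS, indexed_words and matches are hypothetical.

```python
import re

# Assumption: only the single-character stop words named above.
STOP_WORDS = {"a", "i", "s", "t"}


def indexed_words(text):
    """Split on non-alphanumerics and drop stop words (simplified model)."""
    words = re.split(r"[^a-z0-9]+", text.lower())
    return {w for w in words if w and w not in STOP_WORDS}


def matches(search, text):
    """True if every non-stop word of `search` occurs in `text`."""
    return indexed_words(search) <= indexed_words(text)


print(matches("servercloud-s", "servercloud-b-bar"))  # True: "s" is dropped
print(matches("servercloud-q", "servercloud-b-bar"))  # False: "q" is kept
```

In this model the search for "servercloud-s" is reduced to "servercloud" alone, which is why names from other cycles would match.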
Antonio Rosales (arosales) wrote : | #6 |
@Abel,
Thanks for the additional information on the stop words. I wasn't expecting the behavior described for stop words such as:
'A search for "servercloud-s" would find texts containing "servercloud-b" or "servercloud-
This is good to know as we approach the upcoming "s" and "t" cycles (thus probably why you pointed those two stop words out). We'll try to come up with a more unique naming scheme that can avoid these stop words and still identify blueprints only for that cycle when searching.
-Thanks,
Antonio
As a workaround, you can search simply for "servercloud q". That's basically what was used internally before. (well, strictly speaking, the internal search term was "(servercloud & q) | servercloudq", but "servercloudq" would not yield a match.)
This issue is caused by my work on bug 29713, specifically to fix the problem described at the end of comment #7 that certain filenames cannot be searched. My conclusion was that it is best to simply not mangle any '-' inside a word.
This bug is a good example that we should do this again, but slightly modified. The current situation:
The FTI data is for example
select to_tsvector('servercloud-q-cloud-archive');
to_tsvector
-------
'archiv':5 'cloud':4 'q':3 'servercloud':2 'servercloud-q-cloud-arch':1
and the ts_query for "servercloud-q" is:
select ftq('servercloud-q');
ftq
-------
'servercloud-q' & 'servercloud' & 'q'
(This is the same as a direct call of to_tsquery())
So, the "blocker" is that "servercloud-q" is not part of the FTI.
We should probably re-introduce a form of "mangling of hyphens" so that we have a call of to_tsquery() like:
to_tsquery('servercloud-q | (servercloud & q)')
The result of this call is slightly redundant but should work:
select to_tsquery('servercloud-q | (servercloud & q)');
to_tsquery
-------
'servercloud-q' & 'servercloud' & 'q' | 'servercloud' & 'q'
Note that a simple s/-/ / for a search term will cause problems for words that are treated as file names or host names:
launchpad_dev=# select to_tsvector('file-name.txt');
to_tsvector
-------------------
'file-name.txt':1
select to_tsquery('file & name.txt');
ftq
-------
'file' & 'name.txt'
So here we must keep the '-'. The "redundant looking" call to_tsquery('file-name.txt | (file & name.txt)') makes searches succeed both for words like "servercloud-q-cloud-archive", as described here, and for file/host names containing dashes.
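The proposed rewriting of a search term could look roughly like the following Python sketch. It only shows the string manipulation for a single term; the function name hyphen_query is hypothetical, and the sketch assumes the term contains no characters that are special to to_tsquery() (no escaping is done). It is not the actual ftq() implementation.

```python
def hyphen_query(term):
    """Build a to_tsquery() argument for one search term.

    A term with fewer than two hyphen-separated parts is returned
    unchanged. Otherwise the original term is kept and OR-ed with
    the AND of its parts.
    """
    parts = [p for p in term.split("-") if p]
    if len(parts) < 2:
        return term
    return "{} | ({})".format(term, " & ".join(parts))


print(hyphen_query("servercloud-q"))  # servercloud-q | (servercloud & q)
print(hyphen_query("file-name.txt"))  # file-name.txt | (file & name.txt)
print(hyphen_query("servercloud"))    # servercloud
```

Keeping the original term as the first alternative is what preserves matches for file and host names, while the AND of the parts covers names that were indexed as separate words.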