bug search fails to find results when punctuation is adjacent to regular text in the document (e.g. '"from"', '<div>')

Bug #29713 reported by Stuart Bishop on 2006-01-26
This bug affects 10 people
Affects: Launchpad itself
Importance: Critical
Assigned to: Abel Deuring

Bug Description

This bug has expanded a bit since it was originally filed in 2006. Here is the current state of things.

The Problem:
============

Doing a bug search can fail (turn up no results) despite the fact that the *exact* search string appears in the titles of some bugs that should be within the scope of the search.

Examples:
=========

 * See bug 2753 (now a dupe of this). The thing being searched for was 'div' and the text indexed contained ' <div> '.

 * See bug #360642, which is now marked as a dup of this. The
   reporter says that searching for "from" failed to find results,
   even though that's in the title of
   https://bugs.edge.launchpad.net/ubuntu/+source/thunderbird/+bug/357864.
   Here is the title of that bug, using "/" as the delimiter since the
   title itself contains both double quotes and parens: /Editing the
   "From" field for the current email only (as text, not dropdown)/.

   I re-tested on 2010-02-22, and searching for either "from" or
   "From", with or without double quotes around it, still fails to
   turn up that bug. When I did a search for "from" (with no double
   quotes) with the "Across all projects" radio button selected, I got
   exactly one result: 508760. It seems very unlikely that there'd be
   exactly one hit for a search on "from" :-).

 * There are two bugs with the string "community-contributions.py" in
   their titles: bug #513608 (as of 2010-02-22 was in state
   "confirmed", with summary "community-contributions.py script should
   use Launchpad to determine who is not a Canonical employee") and
   bug #432742 (state "fix committed", with summary
   "community-contributions.py script erroring on some Unicode (?)
   input"). Both are in launchpad-foundations (not sure why, but no
   matter).

   Anyway, searching for "community-contributions.py" fails to turn up
   any results, whether done across all projects, in
   "launchpad-project", in "launchpad-foundations", or in
   "launchpad".

   Removing the ".py" and searching for "community-contributions" in
   launchpad-project gets two hits: 393407 (which contains the words
   "community" and "contributed" separately) and 484824 (which
   contains "community" and "contributions" separately), but we still
   don't get the bugs that have the exact match in their titles.

   Meanwhile, searching for "community-contributions" (again without
   the ".py") with "Across all projects" checked results in 19 hits
   (312766, 374090, 459701, 459701, 265028, 456301, 393407, 418469,
   484824, 250402, 263554, 250402, 250402, 357358, 411358, 453775,
   495391, 459701, 509094), none of which are the two I'm looking for.

A Possible Non-Example:
=======================

 * In the original repro recipe for this bug, the reporter said "If I
   search for 'sqlobject' on
   https://launchpad.net/products/launchpad/+bugs , I get no results
   despite this term being in the title of Bug #3096, which is
   currently in 'confirmed' status. Interestingly, you can see this
   bug in the full bug list."

   But bug #3096 is in "launchpad-foundations", and I'm not sure that
   searching for it in "launchpad" would work anyway, since
   "launchpad" is (AFAICT) just a grab-bag temporary holding area.
   So it may be that the original bug report here was a
   misunderstanding, but that coincidentally, there is a real bug
   whose symptoms match those that the original report described!

Possible causes
===============

Tokenisation of terms is done both in the DB and in Python. If the two are mismatched, some terms simply cannot be searched for, because the supplied search query will never match the indexed terms.

description: updated

I tried the search with 'sqlobject' and 'sqlobject.select' on https://launchpad.net/products/launchpad/+bugs and got some results, but bug 3096 didn't show up.

Weird.

Changed in malone:
status: Unconfirmed → Confirmed
Changed in malone:
assignee: nobody → stub
gpothier (gpothier) wrote :

Probably related: searching for "2.6.20-12" does not produce relevant results. I'd expect to find Bug #94083 and Bug #93648, which have that exact string in the title, but they don't appear.

Stuart Bishop (stub) on 2008-12-19
Changed in malone:
status: Confirmed → Triaged
Karl Fogel (kfogel) on 2010-02-22
summary: - Search for sqlobject bugs in launchpad product fails to find any results
+ bug search fails to find results despite exact search string being in
+ bug titles
Karl Fogel (kfogel) on 2010-02-22
description: updated
description: updated
description: updated
Stuart Bishop (stub) on 2011-07-04
Changed in launchpad:
assignee: Stuart Bishop (stub) → nobody
Ursula Junque (ursinha) on 2011-08-18
tags: added: search ubuntu-qa
removed: lp-bugs
Changed in launchpad:
importance: Medium → High
Curtis Hovey (sinzui) on 2011-10-22
Changed in launchpad:
importance: High → Low

I still believe this is high given its complete breakage of basic queries.

Changed in launchpad:
importance: Low → High
Francis J. Lacoste (flacoste) wrote :

Escalated by Kate

tags: added: escalated
Changed in launchpad:
importance: High → Critical
description: updated
Graham Binns (gmb) wrote :

I've assigned this to myself for the purposes of investigation; since Yellow Squad is but a couple of weeks (excepting the Christmas break and the Thunderepic in January) from feature rotation I might not be able to fix this in the time available, but I might at least be able to shed some more light on it.

Changed in launchpad:
assignee: nobody → Graham Binns (gmb)
Graham Binns (gmb) on 2011-12-12
Changed in launchpad:
assignee: Graham Binns (gmb) → nobody
description: updated
summary: - bug search fails to find results despite exact search string being in
- bug titles
+ bug search fails to find results when punctuation is adjacent to regular
+ text in the document (e.g. '"from"', '<div>')

Another example: In the Mixxx project, searching for '1.10 crash' or even '1.10 crashes' does not include a bug titled 'mixxx.exe 1.10 Beta1 immediately crashes at startup' in its results, I'm guessing due to the presence of the period within the version number.

Abel Deuring (adeuring) wrote :

_build_search_text_clause() in bugtasksearch.py generates these search
clauses:

    SQL("BugTaskFlat.fti @@ ftq(?)", params=(searchtext,))

So, no tokenisation in Python.

I added two bugs with some of the "bad search terms" mentioned above to my
local launchpad_dev DB:

select bug.description, bugtaskflat.fti
    from bugtaskflat, bug where bug.id=bugtaskflat.bug and bug.id>=16;

row 16:
  description: from "from" foo "bar" <div> community-contributions.py SQLObject.select 2.6.20-12 1.10 crash
  fti: '-12':8 '1.10':9 '2.6.20':7 'bar':4 'community-contributions.py':5 'crash':10 'foo':3 'sqlobject.select':6

row 17:
  description: from "from" foo "bar" div community-contributions.py SQLObject.select 2.6.20-12 1.10 crash
  fti: '-12':10 '1.10':11 '2.6.20':9 'bar':5 'community-contributions.py':7 'crash':12 'div':6 'foo':4 'sqlobject.select':8 'xxx':1B

The only difference between these rows is '<div>' vs. 'div'

Neither '<div>' nor 'div' appears in the first FTI: it seems that the FTI
tokenizer simply drops anything between '<' and '>'.

search queries:

select bug from bugtaskflat where fti @@ ftq('sqlobject.select');
-> no result.

select ftq('sqlobject.select');
                    ftq
--------------------------------------------
 'sqlobject' & 'select' | 'sqlobjectselect'
(1 row)

So, ftq('sqlobject.select') generates a reasonable expression -- but the
full text index stores 'sqlobject.select' instead of two words 'sqlobject'
and 'select'.

The query below works though:

select bug from bugtaskflat where fti @@ 'sqlobject.select';

 bug
-----
  16
  17

A search for "community-contributions.py" has the same problem: The index
stores the complete word, but:

select ftq('community-contributions.py');
                              ftq
---------------------------------------------------------------
 'communiti' & 'contribut' & 'py' | 'communitycontributionspi'
(1 row)
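
A rough way to see why these queries come back empty, as a sketch only: the
snippet below is a toy stand-in for the "@@" match, not how PostgreSQL
evaluates it. The lexeme set is copied from the fti column of row 16 above
(positions dropped), and the two query trees are transcribed by hand from
the ftq() outputs shown above. The names matches() and ROW_16_LEXEMES are
made up for illustration.

```python
# Lexemes stored for row 16, copied from the fti column shown above.
ROW_16_LEXEMES = {
    "-12", "1.10", "2.6.20", "bar", "community-contributions.py",
    "crash", "foo", "sqlobject.select",
}


def matches(query, lexemes):
    """Evaluate a tiny boolean query tree against a set of lexemes.

    A query is either a plain string (one lexeme) or a tuple of the
    form ("and" | "or", operand, operand, ...).
    """
    if isinstance(query, str):
        return query in lexemes
    op, *operands = query
    results = [matches(operand, lexemes) for operand in operands]
    return all(results) if op == "and" else any(results)


# ftq('sqlobject.select') -> 'sqlobject' & 'select' | 'sqlobjectselect'
ftq_sqlobject = ("or", ("and", "sqlobject", "select"), "sqlobjectselect")

# ftq('community-contributions.py')
#   -> 'communiti' & 'contribut' & 'py' | 'communitycontributionspi'
ftq_community = (
    "or",
    ("and", "communiti", "contribut", "py"),
    "communitycontributionspi",
)

print(matches(ftq_sqlobject, ROW_16_LEXEMES))       # False
print(matches(ftq_community, ROW_16_LEXEMES))       # False
# The literal term, as in: fti @@ 'sqlobject.select'
print(matches("sqlobject.select", ROW_16_LEXEMES))  # True
```

None of the lexemes that ftq() asks for are in the index, so both rewritten
queries fail, while the untouched term matches.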

"From" is probably in the set of stop words. I am not sure if it makes
sense to remove "from" from this set...

Abel Deuring (adeuring) wrote :

select bug from bugtaskflat where fti @@ ftq('2.6.20-12');
 bug
-----
(0 rows)

select ftq('2.6.20-12');
                       ftq
-------------------------------------------------
 ( '2' & '6' | '26' ) & ( '20' & '12' | '2012' )
(1 row)

while the FTI stores these numbers:

  fti: '-12':8 '1.10':9 '2.6.20':7 '

Abel Deuring (adeuring) wrote :

There are at least two issues:

1. Treatment of '<' and '>'.

Postgres' text search machinery seems to simply ignore HTML/XML nodes,
at least in the config we use:

select to_tsvector('<div>');
 to_tsvector
-------------

(1 row)

select to_tsquery('<div>');
NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
 to_tsquery
------------

(1 row)

But the function ftq() (part of our schema definition) strips '<' and '>'
before it generates the query object:

select ftq('<div>');
  ftq
-------
 'div'

2. Treatment of punctuation. From the definition of ftq():

        punctuation = r"[^\w\s\-\&\|\!\(\)']"
        query = re.sub(r"(?u)(\w)%s+(\w)" % (punctuation,), r"\1-\2", query)

This replaces 'object.attr' with 'object-attr' (as well as 'object<attr'
-> 'object-attr'), and this is later replaced with
'((object&attr)|objectattr)':

        def hyphen_repl(match):
            bits = match.group(0).split("-")
            return "((%s)|%s)" % ("&".join(bits), "".join(bits))
        query = re.sub(r"(?u)\b\w+-[\w\-]+\b", hyphen_repl, query)

but the FTI stores the original 'object.attr'. It might make sense
to search for 'object' & 'attr' or for 'objectattr' but we should
additionally keep the original 'object.attr' in the query.
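
To make the rewriting concrete, here is a standalone sketch that runs only
the two substitutions quoted above on some of the search terms from this
bug. The wrapper name rewrite() is made up for illustration; the real ftq()
does more than this (stemming and the final conversion to a tsquery are not
modelled here), so the strings below are intermediate results, not the
final query.

```python
import re

# Same character class as in the ftq() excerpt above.
PUNCTUATION = r"[^\w\s\-\&\|\!\(\)']"


def hyphen_repl(match):
    # 'a-b' becomes '((a&b)|ab)'
    bits = match.group(0).split("-")
    return "((%s)|%s)" % ("&".join(bits), "".join(bits))


def rewrite(query):
    """Apply only the two re.sub() steps quoted from ftq()."""
    # Step 1: punctuation between two word characters becomes a hyphen.
    query = re.sub(
        r"(?u)(\w)%s+(\w)" % (PUNCTUATION,), r"\1-\2", query)
    # Step 2: hyphenated words are expanded by hyphen_repl().
    return re.sub(r"(?u)\b\w+-[\w\-]+\b", hyphen_repl, query)


print(rewrite("sqlobject.select"))
# ((sqlobject&select)|sqlobjectselect)
print(rewrite("community-contributions.py"))
# ((community&contributions&py)|communitycontributionspy)
print(rewrite("2.6.20-12"))
# ((2&6)|26).((20&12)|2012)
```

In every case the original term ('sqlobject.select', '2.6.20', ...) is gone
from the rewritten query, although that is what the FTI stores.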

Abel Deuring (adeuring) on 2012-06-12
Changed in launchpad:
assignee: nobody → Abel Deuring (adeuring)
status: Triaged → In Progress

Note that this is some of our oldest code, based on tsearch2 with
PostgreSQL 8.0. Since then, tsearch has been moved into PostgreSQL
core, improved and documented (Chapter 12 of the PostgreSQL 9.1
manual, Full Text Search). Many of the issues may be fixable with
proper configuration (stop word lists etc.), and new facilities may
make it possible to simplify this old code (pluggable parsers etc.).

--
Stuart Bishop <email address hidden>

Abel Deuring (adeuring) wrote :
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
Changed in launchpad:
status: In Progress → Fix Committed
William Grant (wgrant) on 2012-06-24
tags: added: bad-commit-15464
William Grant (wgrant) on 2012-06-25
Changed in launchpad:
status: Fix Committed → In Progress
Abel Deuring (adeuring) on 2012-06-26
Changed in launchpad:
status: In Progress → Fix Committed
Abel Deuring (adeuring) wrote :

Tagged as qa-ok. The issues described here and in bug 1015511 and bug 1015519 are not fixed though.

tags: added: qa-ok
removed: qa-needstesting
William Grant (wgrant) on 2012-06-28
Changed in launchpad:
status: Fix Committed → Fix Released

Thank you so much, for fixing such a longstanding bug. (Excuse me if you consider this confirmation just bug spam.) It affected me recently, when LP did not find existing reports with this subject:

  package manpages 3.35-0.1ubuntu1 failed to install/upgrade: trying to overwrite '/usr/share/man/man1/getent.1.gz', which is also in package libc-bin 2.15-0ubuntu15

(That was bug 1017289.) I had to browse the list of recent bugs for that package, to find the report, and mark duplicates. I tested an upload of the same crashfile, today, and LP directed me to the existing bug. To me this is the difference between the bug reporter feeling lost in a maze of twisty little passages, or, feeling well-guided, and that's the difference between participating or giving up.

Seeing a bug that's a million numbers old, getting fixed, is a morale booster in itself, too. A million thanks. :)

Abel Deuring (adeuring) wrote :

On 29.06.2012 21:31, Edward Donovan wrote:
> Thank you so much, for fixing such a longstanding bug. (Excuse me
> you consider this confirmation just bug spam.) It affected me
> recently,

Bug reports are for conversations about bugs, and conversations have
also some social aspects. So your comment is definitely not spam.

> when LP did not find existing reports with this subject:
>
> package manpages 3.35-0.1ubuntu1 failed to install/upgrade: trying
> to overwrite '/usr/share/man/man1/getent.1.gz', which is also in
> package libc-bin 2.15-0ubuntu15
>
> (That was bug 1017289.) I had to browse the list of recent bugs
> for that package, to find the report, and mark duplicates. I
> tested an upload of the same crashfile, today, and LP directed me
> to the existing bug. To me this is the difference between the bug
> reporter feeling lost in a maze of twisty little passages, or,
> feeling well-guided, and that's the difference between
> participating or giving up.
>
> Seeing a bug that's a million numbers old, getting fixed, is a
> morale booster in itself, too. A million thanks. :)

welcome :)
