Bug #384096 “Near-term steps to improve relevance ranking” : Bugs : KARL3

Tres Seaver (tseaver) wrote on 2009-06-08: Re: [Bug 384096] [NEW] Near-term steps to improve relevance ranking

#1

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Launchpad Bug Tracker wrote:
> You have been subscribed to a public bug by Paul Everitt (paul-agendaless):
> 
> OSI users have given a verdict: they miss the quality of the search
> results from Xapian.  No big surprise, this was something we mentioned
> as a downside in the move away from Crapian....ahem...Xapian.
> 
> We hope to schedule a call with one of the smart textindex folks to see
> what R&D next week to talk about more substantive options on the ranking
> system.
> 
> This ticket is focused on what we can do immediately.  At the time of
> the de-Xapian decision, we mentioned that we could make words in the
> title score higher than words in the body via brute force: repeat the
> title words by a factor, say 10, when extracting the searchable text.
> 
> We can do this, test it out on the staging server, and gauge the impact.
> 
> If there are other ideas, this ticket is a good place for them.
> 
> ** Affects: karl3
>      Importance: Medium
>      Assignee: Shane Hathaway (shane-hathawaymix)
>          Status: New
>

Here is a sketch from a client project which does this (note that I
didn't write this, and now that I look at it, the
quality-of-implementation is pretty low:  I'm gonna have to fix it :()::

    def SearchableText(self):
        """
        Override searchable text taking field search weights into
        account, as well as possible extra search tuning information.

        The goal is to change word occurrences in order to manipulate
        the relevance ranking of searches on the SearchableText full
        text index. This way we won't have to do sorting of search
        results -- they should already be in the right order.
        """
        text = Content.SearchableText(self)
        for field_id, weight in self._field_search_weights.items():
            field = getattr(self, field_id, '')
            if callable(field):
                field_text = field()
            else:
                field_text = field
            if field_text is not None:
                try:
                    foo = field_text + ''
                except:
                    # this is a problem, skip this field
                    continue
                field_text = (field_text + ' ') * weight
                text.extend(field_text.split())

        if not ISearchTunable.isImplementedBy(self):
            return text

        # do extra search tuning
        if self.very_relevant_terms:
            term_text = (self.very_relevant_terms + ' ') * \
                        self.VERY_IMPORTANT_TERM_FACTOR
            text.extend(term_text.split())

        if self.relevant_terms:
            term_text = (self.relevant_terms + ' ') * \
                        self.IMPORTANT_TERM_FACTOR
            text.extend(term_text.split())

I think the ITextIndexData adapter implementations could easily pick
this strategy up (in karl.content.models.adapters).

Perhaps we should consider applying this policy in the OSI package?

Tres.
- --
===================================================================
Tres Seaver          +1 540-429-0999          tseaver@palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFKLQ6w+gerLs4ltQ4RArgCAJ9pr9pr57MqODi+0mXHUkRlQDmW6QCfTZyQ
SupIfo/Pdl0Ip1CGQfIF4Sw=
=qgRM
-----END PGP SIGNATURE-----
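
For readers who want the brute-force idea from the bug description in isolation, here is a minimal sketch in modern Python. It is not the KARL implementation and not the client code quoted above: the function name, its signature, and the sample strings are all hypothetical, and the factor of 10 is only the "say 10" mentioned in the ticket.

```python
def weighted_searchable_text(title, body, title_weight=10):
    """Return the words to index, with each title word repeated.

    Hypothetical helper: the name, signature, and default weight are
    illustrative and are not taken from the KARL code base.
    """
    words = body.split()
    # Repeating the title words raises their occurrence count in the
    # indexed text, which is the only lever this approach uses.
    words.extend(title.split() * title_weight)
    return words


# Made-up sample document.
words = weighted_searchable_text(
    title="Budget Report",
    body="Quarterly figures for the finance team",
)
print(words.count("Budget"), words.count("Quarterly"))
```

Whether the extra occurrences change what users see depends on how the index scores them and on how the results are sorted afterwards.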

Paul Everitt (paul-agendaless) wrote on 2009-06-08:

#2

On Jun 8, 2009, at 9:14 AM, Tres Seaver wrote:

[snip]

> I think the ITextIndexData adapter implementations could easily pick
> this strategy up (in karl.content.models.adapters).
>
> Perhaps we should consider applying this policy in the OSI package?

Yep, this is precisely the kind of thing I had hoped to look at.
Tres, if you have bandwidth and interest, we can re-assign this one to
you.

--Paul

Tres Seaver (tseaver) wrote on 2009-06-08:

#3

I'll take this on.

Changed in karl3:
assignee:	Shane Hathaway (shane-hathawaymix) → Tres Seaver (tseaver)
status:	New → In Progress

Tres Seaver (tseaver) wrote on 2009-06-08:

#4

Committed in r3092. Please verify and close.

Changed in karl3:
assignee:	Tres Seaver (tseaver) → Paul Everitt (paul-agendaless)
status:	In Progress → Fix Committed

Paul Everitt (paul-agendaless) wrote on 2009-06-08:

#5

Alas, the change didn't have much of an impact on the use case originally described. I'll leave this closed and explain that we might not have an immediate/easy answer.

Paul Everitt (paul-agendaless) wrote on 2009-06-09:

#6

Tres said he would do this:

My only contribution was to change the indexing-time behavior. At query
time, there might need to be adjustments to either the application or to
the index code itself to allow for the greater weight of the term in the
title to affect the ordering of the search results.

Changed in karl3:
assignee:	Paul Everitt (paul-agendaless) → Tres Seaver (tseaver)
status:	Fix Committed → In Progress
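
The split between indexing time and query time described above can be illustrated with a toy model. This is only a sketch under loud assumptions: the score below is a plain term-frequency ratio, not the scoring the KARL text index actually uses, and every name and sample document is hypothetical. It shows that repeating title words changes the scores, but the change only reaches the user if the results are ordered by score.

```python
from collections import Counter


def index_words(title, body, title_weight):
    """Indexing time: body words plus title words repeated title_weight times."""
    return body.split() + title.split() * title_weight


def toy_score(words, term):
    """Toy relevance score: share of indexed words equal to the term.

    An assumption for illustration, not the real index's formula.
    """
    counts = Counter(word.lower() for word in words)
    return counts[term.lower()] / len(words) if words else 0.0


# Made-up documents: name -> (title, body).
DOCS = {
    "guide": ("Search Tips", "How to find documents in the intranet quickly"),
    "memo": ("Notes", "search team search plan"),
}


def rank(term, title_weight):
    """Query time: order document names by descending toy score."""
    scores = {
        name: toy_score(index_words(title, body, title_weight), term)
        for name, (title, body) in DOCS.items()
    }
    return sorted(scores, key=lambda name: scores[name], reverse=True)


print(rank("search", title_weight=1))
print(rank("search", title_weight=10))
```

With a weight of 1 the short memo that mentions the term twice comes first; with a weight of 10 the document with the term in its title does. Sorting the same results by any key other than the score would discard that difference.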

Tres Seaver (tseaver) wrote on 2009-06-10:

#7

I have just checked in a change which tries to get the "natural" order of the text
index used when returning otherwise unsorted results.

We should try this out on the staging server to see if it improves the reported
bad results order.

Changed in karl3:
assignee:	Tres Seaver (tseaver) → Paul Everitt (paul-agendaless)
status:	In Progress → Fix Committed

Chris McDonough (chrism-plope) wrote on 2009-06-11:

#8

I had to revert the latest change that used text index weighting because it invalidated catalog query security. I've reopened this bug as a result.

Changed in karl3:
assignee:	Paul Everitt (paul-agendaless) → nobody
status:	Fix Committed → Confirmed

Paul Everitt (paul-agendaless) wrote on 2009-06-11:

#9

Tres, should we try a next step?

Before it got reverted (e.g. you can test on kdi-dev as it hasn't been reverted there), I think we didn't see much improvement in search scoring.

Changed in karl3:
assignee:	nobody → Tres Seaver (tseaver)

Chris McDonough (chrism-plope) wrote on 2009-06-11:

#10

I've committed an alternate fix that doesn't ignore security.

Changed in karl3:
status:	Confirmed → Fix Committed
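
The thread does not say how the reverted change bypassed security or how the alternate fix works. As a general illustration only, one way to order results by text-index score without dropping a security filter is to intersect the scored hits with the set of documents the user may view before sorting. All names and data below are hypothetical and do not reflect the KARL or catalog APIs.

```python
def ranked_visible_docids(text_scores, allowed_docids):
    """Return docids the user may see, best text score first.

    text_scores: mapping of docid -> relevance score from a text query.
    allowed_docids: set of docids permitted by a separate security check.
    Both inputs are assumptions made for this sketch.
    """
    # Apply the security filter first so that ranking never exposes
    # a document the user is not allowed to view.
    visible = {
        docid: score
        for docid, score in text_scores.items()
        if docid in allowed_docids
    }
    # Sort by descending score; break ties by docid for a stable order.
    return sorted(visible, key=lambda docid: (-visible[docid], docid))


# Made-up query result and permission set.
text_scores = {101: 0.9, 102: 0.5, 103: 0.7}
allowed = {102, 103}
result = ranked_visible_docids(text_scores, allowed)
print(result)
```

Here document 101 has the best score but is filtered out, so the ranked list starts with 103.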

Paul Everitt (paul-agendaless) wrote on 2009-06-15:

#11

Let the record show that results now rock. According to Chris, we were previously returning search results by modified date, and thus, were getting no help from the machinery.

The search for "Eng" now shows Ellen Eng as the first result.

Changed in karl3:
status:	Fix Committed → Fix Released
