Near-term steps to improve relevance ranking

Bug #384096 reported by Paul Everitt
This bug affects 1 person
Affects: KARL3
Status: Fix Released
Importance: Medium
Assigned to: Tres Seaver
Milestone: none listed

Bug Description

OSI users have given a verdict: they miss the quality of the search results from Xapian. No big surprise, this was something we mentioned as a downside in the move away from Crapian....ahem...Xapian.

We hope to schedule a call next week with one of the smart textindex folks to see what R&D is possible and to talk about more substantive options on the ranking system.

This ticket is focused on what we can do immediately. At the time of the de-Xapian decision, we mentioned that we could make words in the title score higher than words in the body via brute force: repeat the title words by a factor, say 10, when extracting the searchable text.
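
A minimal sketch of the brute-force idea described above, assuming the searchable text is assembled as one string from a title and a body. The function name `weighted_searchable_text` and the `title_weight` parameter are hypothetical and are not taken from the KARL code base.

```python
def weighted_searchable_text(title, body, title_weight=10):
    """Return text to index, with the title repeated title_weight times.

    Hypothetical helper: repeating the title inflates the term frequency
    of its words, so a frequency-based text index should score title
    matches higher than body-only matches.
    """
    parts = [title] * title_weight  # brute-force boost for title words
    parts.append(body)
    # Drop empty pieces so a missing title or body adds no stray spaces.
    return " ".join(part for part in parts if part)


text = weighted_searchable_text("Ellen Eng", "Profile page for a staff member.")
print(text.split().count("Eng"))  # occurrences of a title word in the indexed text
```

The factor trades index size for ranking influence, so it is worth trying a few values on staging rather than assuming 10 is right.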

We can do this, test it out on the staging server, and gauge the impact.

If there are other ideas, this ticket is a good place for them.

Tags: search
Tres Seaver (tseaver) wrote : Re: [Bug 384096] [NEW] Near-term steps to improve relevance ranking

Launchpad Bug Tracker wrote:
> You have been subscribed to a public bug by Paul Everitt (paul-agendaless):
>
> OSI users have given a verdict: they miss the quality of the search
> results from Xapian. No big surprise, this was something we mentioned
> as a downside in the move away from Crapian....ahem...Xapian.
>
> We hope to schedule a call next week with one of the smart textindex
> folks to see what R&D is possible and to talk about more substantive
> options on the ranking system.
>
> This ticket is focused on what we can do immediately. At the time of
> the de-Xapian decision, we mentioned that we could make words in the
> title score higher than words in the body via brute force: repeat the
> title words by a factor, say 10, when extracting the searchable text.
>
> We can do this, test it out on the staging server, and gauge the impact.
>
> If there are other ideas, this ticket is a good place for them.
>
> ** Affects: karl3
> Importance: Medium
> Assignee: Shane Hathaway (shane-hathawaymix)
> Status: New
>

Here is a sketch from a client project which does this (note that I
didn't write this, and now that I look at it, the
quality-of-implementation is pretty low: I'm gonna have to fix it :()::

    def SearchableText(self):
        """
        Override searchable text taking field search weights into
        account, as well as possible extra search tuning information.

        The goal is to change word occurrences in order to manipulate
        the relevance ranking of searches on the SearchableText full
        text index. This way we won't have to do sorting of search
        results -- they should already be in the right order.
        """
        # Assumed to return a list of words, since it is extended below.
        text = Content.SearchableText(self)
        for field_id, weight in self._field_search_weights.items():
            field = getattr(self, field_id, '')
            # A field may be a plain attribute or an accessor method.
            field_text = field() if callable(field) else field
            if field_text is None:
                continue
            try:
                field_text + ''  # probe: only string-like values are usable
            except Exception:
                # this is a problem, skip this field
                continue
            # Repeat the field's words `weight` times to boost their score.
            text.extend(((field_text + ' ') * weight).split())

        if not ISearchTunable.isImplementedBy(self):
            return text

        # do extra search tuning
        if self.very_relevant_terms:
            term_text = (self.very_relevant_terms + ' ') * \
                        self.VERY_IMPORTANT_TERM_FACTOR
            text.extend(term_text.split())

        if self.relevant_terms:
            term_text = (self.relevant_terms + ' ') * \
                        self.IMPORTANT_TERM_FACTOR
            text.extend(term_text.split())

        return text

I think the ITextIndexData adapter implementations could easily pick
this strategy up (in karl.content.models.adapters).

Perhaps we should consider applying this policy in the OSI package?

Tres.

Paul Everitt (paul-agendaless) wrote :

On Jun 8, 2009, at 9:14 AM, Tres Seaver wrote:

[snip]

> I think the ITextIndexData adapter implementations could easily pick
> this strategy up (in karl.content.models.adapters).
>
> Perhaps we should consider applying this policy in the OSI package?

Yep, this is precisely the kind of thing I had hoped to look at.
Tres, if you have bandwidth and interest, we can re-assign this one to
you.

--Paul

Tres Seaver (tseaver) wrote :

I'll take this on.

Changed in karl3:
assignee: Shane Hathaway (shane-hathawaymix) → Tres Seaver (tseaver)
status: New → In Progress
Tres Seaver (tseaver) wrote :

Committed in r3092. Please verify and close.

Changed in karl3:
assignee: Tres Seaver (tseaver) → Paul Everitt (paul-agendaless)
status: In Progress → Fix Committed
Paul Everitt (paul-agendaless) wrote :

Alas, the change didn't have much of an impact on the use case originally described. I'll leave this closed and explain that we might not have an immediate/easy answer.

Paul Everitt (paul-agendaless) wrote :

Tres said he would do this:

My only contribution was to change the indexing-time behavior. At query
time, there might need to be adjustments to either the application or to
the index code itself to allow for the greater weight of the term in the
title to affect the ordering of the search results.

Changed in karl3:
assignee: Paul Everitt (paul-agendaless) → Tres Seaver (tseaver)
status: Fix Committed → In Progress
Tres Seaver (tseaver) wrote :

I have just checked in a change which tries to get the "natural" order of the texts
index used when returning otherwise unsorted results.

We should try this out on the staging server to see if it improves the reported
bad results order.

Changed in karl3:
assignee: Tres Seaver (tseaver) → Paul Everitt (paul-agendaless)
status: In Progress → Fix Committed
Chris McDonough (chrism-plope) wrote :

I had to revert the latest change that used text index weighting because it invalidated catalog query security. I've reopened this bug as a result.

Changed in karl3:
assignee: Paul Everitt (paul-agendaless) → nobody
status: Fix Committed → Confirmed
Paul Everitt (paul-agendaless) wrote :

Tres, should we try a next step?

Before it got reverted (you can still test on kdi-dev, as it hasn't been reverted there), I think we didn't see much improvement in search scoring.

Changed in karl3:
assignee: nobody → Tres Seaver (tseaver)
Chris McDonough (chrism-plope) wrote :

I've committed an alternate fix that doesn't ignore security.
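
The thread does not show the committed fix. As a hedged illustration only, one way to keep relevance ordering without bypassing security is to rank just the documents the user is allowed to see. The function `secure_ranked_docids` and both of its arguments are hypothetical and are not KARL or catalog APIs.

```python
def secure_ranked_docids(text_scores, allowed_docids):
    """Return permitted docids, best text-index score first.

    text_scores: dict mapping docid -> relevance score (assumed to come
        from the text index).
    allowed_docids: set of docids the current user may view (assumed to
        come from the security filter).
    """
    permitted = {docid: score for docid, score in text_scores.items()
                 if docid in allowed_docids}
    # Sort by descending score; break ties by docid for a stable order.
    return sorted(permitted, key=lambda docid: (-permitted[docid], docid))


print(secure_ranked_docids({1: 0.2, 2: 0.9, 3: 0.5}, {1, 3}))
```

The point of the sketch is the ordering of the two steps: documents are filtered by permission first, and relevance only decides the order of what remains.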

Changed in karl3:
status: Confirmed → Fix Committed
Paul Everitt (paul-agendaless) wrote :

Let the record show that results now rock. According to Chris, we were previously returning search results by modified date, and thus, were getting no help from the machinery.

The search for "Eng" now shows Ellen Eng as the first result.
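
To illustrate why index-time boosting had no visible effect while results were ordered by modified date, here is a small sketch. The rows, scores, and dates are invented for illustration, and `by_modified` and `by_relevance` are hypothetical helpers, not KARL functions.

```python
from datetime import date

# Hypothetical result rows: (name, relevance score, last-modified date).
results = [
    ("Engineering notes", 0.4, date(2009, 6, 20)),
    ("Ellen Eng", 0.9, date(2009, 1, 5)),
    ("Meeting minutes", 0.1, date(2009, 6, 25)),
]


def by_modified(rows):
    """Newest first; the relevance score plays no part in the order."""
    ordered = sorted(rows, key=lambda row: row[2], reverse=True)
    return [name for name, _score, _modified in ordered]


def by_relevance(rows):
    """Highest score first, which is where index-time boosting shows up."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [name for name, _score, _modified in ordered]


print(by_modified(results))
print(by_relevance(results))
```

Under date ordering the best-scoring row can land last no matter how heavily its title was weighted, which matches the behavior reported earlier in this ticket.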

Changed in karl3:
status: Fix Committed → Fix Released