Search could find alternate spellings

Bug #688785 reported by Blastermind
22
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Launchpad itself
Triaged
Low
Unassigned

Bug Description

When doing web searches in Launchpad, the system is not tolerant for errors.

A single wrong letter means total failure. For example searching for "inkscaep" on Ubuntu package list returns no results. This is not usable.

A simple way to get around typos and such is to calculate the levenshtein distance and select the minimum.

I have attached a sample code to demonstrate. It contains a list of all Ubuntu packages and a simple matcher. On my machines all queries are instant.

Some examples.

"inchcape" matches to "inkscape"

"openorifice.org" matches to "openoffice.org"

"pthon" matches to "python"

All of Launchpad's queries, like packages, users, projects and so on, should have this kind of error correction.

Revision history for this message
Blastermind (blastermind) wrote :
Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 688785] Re: Improvements to Launchpad web searches

This is an interesting proposal and part of our overall need to fix
our search story.

A few thoughts:
 - 'instant' isn't really all that precise a metric. Here are some stats:
   - we have 60K unique product/package names
   - bringing them all back in just psql is about 200ms - it will be
substantially more on an appserver due to networking and serialisation
overheasds
   - that overhead will be per request, + the calculation time per term.

So to make an acceptable overhead - say 500ms - we'd need the total
time to select an appropriate term in a 5 term search to be
(ballparking) under 10ms on a 60K corpus. Actual tests with an
appserver would be needed to be confident that this works well
enough.

This doesn't mean we can't do it, but it may mean that rather than a
simple approach we need some indexed.

-Rob

affects: launchpad → launchpad-registry
Curtis Hovey (sinzui)
affects: launchpad-registry → launchpad-foundations
tags: added: search
summary: - Improvements to Launchpad web searches
+ Search could find alternate spellings
Revision history for this message
Gary Poster (gary) wrote :

Triaging as low because of Robert's two main points: fast-enough locally is not equivalent to fast-enough for Launchpad's use, and we're well aware of our searching needing a revamp.

Changed in launchpad-foundations:
status: New → Triaged
importance: Undecided → Medium
importance: Medium → Low
Revision history for this message
Blastermind (blastermind) wrote :

To be precise that script searches a list of 37k unique names in 0.179 seconds and this includes starting Python.

That said, some googling turned up this page showing fast levenshtein from a list of words with a trie:

http://efreedom.com/Question/1-2918771/Optimizing-Levenshtein-Distance-Algorithm

Revision history for this message
ijk (ijk) wrote :

This has been implemented in postgresql -
http://www.postgresql.org/docs/current/static/fuzzystrmatch.html
But there's a warning that it doesn't work with multibyte encodings.

Revision history for this message
ijk (ijk) wrote :

An alternative approach is to copy Google's "Did you Mean" feature.
http://stackoverflow.com/questions/307291/how-does-the-google-did-you-mean-algorithm-work#307344

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.