Find accented forms when searching (e.g. Carlos Perelló Marín with "perello")

Bug #5417 reported by Matthew Paul Thomas
This bug affects 2 people
Affects: Launchpad itself
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

At <https://launchpad.net/people>, searching for "perello" finds David Perello, but not Carlos Perelló Marín.

In contrast, searching Google for "carlos perello" returns Carlos Perelló Marín's home page as the first result, with "Carlos Perelló" being highlighted as the matching text on the results page, and "perello" occurring nowhere in the text.

When searching for people Launchpad should, like Google, return results regardless of accented letters (since people are likely to enter them incorrectly or not at all). This probably means having a table of which letters are variants of each ASCII character (e.g. o = {OoÒÓÔÕÖØòóôõöøŌōŎŏŐőƟɵƠơǑǒǪǫǬǭȌȍȎȏȪȫȬȭȮȯȰȱṌṍṎṏṐṑṒṓỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợ}) for use in simplifying both names and search strings, and caching the simplified name for each person for searching purposes (e.g. caching "carlos perello marin" as the simplified version of "Carlos Perelló Marín").
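A minimal sketch of the simplify-and-cache idea, in Python. It assumes Unicode decomposition can stand in for most of the hand-written variant table; the names `simplify`, `EXTRA_VARIANTS`, `people` and `search` are made up for illustration and are not Launchpad code. Letters such as "ø" have no decomposition, so they still need explicit entries.

```python
import unicodedata

# Hypothetical extras for letters that have no Unicode decomposition
# (the stroke in "ø" is part of the letter, not a combining mark).
EXTRA_VARIANTS = {"ø": "o", "Ø": "O", "ɵ": "o", "Ɵ": "O"}


def simplify(text):
    """Return a lowercase, accent-free form of text for matching."""
    out = []
    for ch in text:
        if ch in EXTRA_VARIANTS:
            out.append(EXTRA_VARIANTS[ch])
            continue
        # NFD splits "ó" into "o" plus a combining acute accent;
        # the combining marks are then dropped.
        for part in unicodedata.normalize("NFD", ch):
            if not unicodedata.combining(part):
                out.append(part)
    return "".join(out).lower()


# Cache the simplified name next to each display name.
people = ["David Perello", "Carlos Perelló Marín"]
simplified = {name: simplify(name) for name in people}


def search(query):
    """Simplify the query the same way, then do a substring match."""
    needle = simplify(query)
    return [name for name in people if needle in simplified[name]]
```

With this sketch, `search("perello")` and `search("Perelló")` both return both people, because the query and the cached names go through the same simplification.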

Christian Reis (kiko) wrote :

I think we might have an easier way out by just smashing the fti indexes to contain only non-accented versions of the characters, and then converting the query strings provided to the fti helper. I'm guessing, though, and Stuart as usual will have a better idea.

Changed in launchpad:
assignee: nobody → stub
Stuart Bishop (stub) wrote :

Christian Reis wrote:
> I think we might have an easier way out just smashing the fti indexes to
> contain only non-accented versions of the characters, and then
> converting the query strings provided to the fti helper. I'm guessing,
> though, and Stuart as usual will have a better idea.

I think we would need to smash the values going into the indexes - tsearch2
is designed to only work in a single locale and encoding so it isn't going
to be any help to us here.

We already have code to do the deaccentification -
canonical.encoding.ascii_smash() handles the European Latin-based character
sets. You're still stuffed with character sets that don't have an ASCII
equivalent, such as Coptic, Greek or most of the Asian languages.

If we want to proceed, canonical.encoding.ascii_smash() needs to be brought
into the database environment by embedding the logic into the ftq() method.
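A rough sketch of the point about smashing on both sides, in Python. This is a toy in-memory index, not tsearch2 and not the real ftq(); `smash` is only a stand-in for ascii_smash() (it just drops combining marks), and `SmashedIndex` is a hypothetical name. What matters is that indexed values and query strings pass through the same function.

```python
import unicodedata
from collections import defaultdict


def smash(text):
    """Stand-in for ascii_smash(): drop combining marks, lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        c for c in decomposed if not unicodedata.combining(c)
    ).lower()


class SmashedIndex:
    """Toy inverted index that stores only smashed tokens."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # Values are smashed before they go into the index.
        for token in smash(text).split():
            self._postings[token].add(doc_id)

    def query(self, text):
        # The query goes through the same smash() as the indexed values.
        tokens = smash(text).split()
        if not tokens:
            return set()
        results = [self._postings.get(t, set()) for t in tokens]
        return set.intersection(*results)
```

If only one side were smashed, "perello" and "Perelló" would end up as different tokens and the lookup would miss.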

--
Stuart Bishop <email address hidden> http://www.canonical.com/
Canonical Ltd. http://www.ubuntu.com/

Björn Tillenius (bjornt) wrote :

On Wed, Dec 07, 2005 at 02:03:07AM -0000, Stuart Bishop wrote:
> We already have code to do the deaccentification -
> canonical.encoding.ascii_smash() handles the European Latin-based character
> sets. You're still stuffed with character sets that don't have an ASCII
> equivalent, such as Coptic, Greek or most of the Asian languages.

ascii_smash() doesn't do exactly what I would expect, though. For
example, it transforms my name, 'Björn', into 'Bjoern' instead of
'Bjorn'. If people would try to find me, they would most likely search
for either 'Björn' or 'Bjorn'.
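To illustrate the difference, here is a small Python sketch with two made-up tables. Neither is the actual ascii_smash() mapping; `EXPANDING`, `STRIPPING`, `transliterate` and `matches` are hypothetical names.

```python
# Two illustrative tables; neither is the actual ascii_smash() mapping.
EXPANDING = {"ö": "oe", "ä": "ae", "ü": "ue"}  # 'Björn' -> 'Bjoern' style
STRIPPING = {"ö": "o", "ä": "a", "ü": "u"}     # 'Björn' -> 'Bjorn' style


def transliterate(text, table):
    """Replace each character found in table, leave the rest alone."""
    return "".join(table.get(ch, ch) for ch in text)


def matches(query, name, table):
    """Compare query and name after applying the same table to both."""
    return (transliterate(query.lower(), table)
            == transliterate(name.lower(), table))
```

With the expanding table, `matches("Bjorn", "Björn", EXPANDING)` is false, because the name becomes "bjoern" while the query stays "bjorn". With the stripping table both sides become "bjorn" and the search succeeds. A query of "Björn" matches under either table.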

Stuart Bishop (stub) wrote :

Björn Tillenius wrote:
> ascii_smash() doesn't do exactly what I would expect, though. For
> example, it transforms my name, 'Björn', into 'Bjoern' instead of
> 'Bjorn'. If people would try to find me, they would most likely search
> for either 'Björn' or 'Bjorn'.

It is supposed to be doing a fairly 'standard' transliteration, although I'm
sure that not all European languages do this the same way. I don't know if
it would be a good idea to tweak the mapping to do some sort of a hybrid,
where Björn maps to Bjorn and Åiste maps to Aiste, but Ægean maps to AEgean
and Straße maps to Strasse.

I don't know what the 'correct' mapping would be, but we can tweak it easily
enough.
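One way such a hybrid could look, sketched in Python. The `EXPANSIONS` table and `hybrid_smash` are hypothetical and do not reflect the existing ascii_smash() code; the sketch assumes that stripping combining marks is acceptable for everything not listed as an exception.

```python
import unicodedata

# Hypothetical exceptions: letters that expand to more than one
# ASCII letter instead of simply losing an accent.
EXPANSIONS = {"Æ": "AE", "æ": "ae", "ß": "ss"}


def hybrid_smash(text):
    """Expand listed exceptions; strip accents from everything else."""
    out = []
    for ch in text:
        if ch in EXPANSIONS:
            out.append(EXPANSIONS[ch])
        else:
            # NFD turns "ö" into "o" + diaeresis and "Å" into "A" + ring;
            # the combining marks are then dropped.
            out.extend(
                part
                for part in unicodedata.normalize("NFD", ch)
                if not unicodedata.combining(part)
            )
    return "".join(out)
```

Tweaking the mapping then means adding or removing entries in the exceptions table, while the default for every other accented letter stays "drop the accent".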

--
Stuart Bishop <email address hidden> http://www.canonical.com/
Canonical Ltd. http://www.ubuntu.com/

Björn Tillenius (bjornt) wrote :

On Wed, Dec 07, 2005 at 08:48:08AM -0000, Stuart Bishop wrote:
> It is supposed to be doing a fairly 'standard' transliteration, although I'm
> sure that not all European languages do this the same way. I don't know if
> it would be a good idea to tweak the mapping to do some sort of a hybrid,
> where Björn maps to Bjorn and Åiste maps to Aiste, but Ægean maps to AEgean
> and Straße maps to Strasse.

I think we need to tweak the mapping a bit. The current mapping is the kind
used mostly for official purposes, for example by shipping companies, banks,
and in passports (although the current mapping doesn't transform 'å' to 'aa',
which is also common for this kind of ASCII smash). As a comparison, Google
seems to map 'oe' only to 'ø', not to 'ö'.

> I don't know what the 'correct' mapping would be, but we can tweak it easily
> enough.

Yeah, there probably isn't a 'correct' mapping that fits all.

Christian Reis (kiko) wrote :

On Wed, Dec 07, 2005 at 02:03:07AM -0000, Stuart Bishop wrote:
> I think we would need to smash the values going into the indexes - tsearch2
> is designed to only work in a single locale and encoding so it isn't going
> to be any help to us here.

Yes, that's what I had intended to say.

The Portuguese use cases are simple -- all transliterations of accented
characters are simply conversions to the unaccented version. So àéíõü
would become aeiou. This makes searches work a lot better in the face of
the fact that people often omit them, for various reasons.

Looking at the ascii_smash code, it is pretty easy to fix specific cases
where it gets things wrong -- just add an exception to the mapping. I
suspect it does the right thing for most of the cases, so perhaps we
could proceed with using this, and have people tell us when we get it
wrong so we can adjust.

> If we want to proceed, canonical.encoding.ascii_smash() needs to be brought
> into the database environment by embedding the logic into the ftq() method.

That sounds like a plan. Would it involve copying the code or could we
still work from a single codebase?
--
Christian Robottom Reis | http://async.com.br/~kiko/ | [+55 16] 3376 0125

Stuart Bishop (stub) wrote :

Christian Reis wrote:

> That sounds like a plan. Would it involve copying the code or could we
> still work from a single codebase?

At this stage it involves copying the code (into trusted.sql).

--
Stuart Bishop <email address hidden> http://www.canonical.com/
Canonical Ltd. http://www.ubuntu.com/

Christian Reis (kiko) wrote :

AFAICT nobody uses ascii_smash right now, so you could probably just move it and avoid the duplication. I'm okay with it either way, at any rate; I would just like to avoid duplicated code, since the two implementations could drift into inconsistent behaviour.

Is this good to go?

Dafydd Harries (daf)
Changed in launchpad:
status: New → Accepted
Martin Pool (mbp) wrote :

I think umlauts are commonly mapped to e.g. 'ue' in German, but it seems generally safer to just drop them. The ideal behaviour might have to allow for several transformations.
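A small Python sketch of what allowing several transformations might mean: each name gets a set of search keys, and a query matches if any of its keys is among them. All names here (`UMLAUT_EXPANSIONS`, `search_keys`, `name_matches`) are hypothetical, and only the German umlaut case is covered.

```python
import unicodedata

# Illustrative German-style expansions only.
UMLAUT_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue"}


def strip_marks(text):
    """Drop combining marks, e.g. 'ü' -> 'u'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def expand_umlauts(text):
    """Expand umlauts ('ü' -> 'ue'), then strip any remaining marks."""
    expanded = "".join(UMLAUT_EXPANSIONS.get(c, c) for c in text)
    return strip_marks(expanded)


def search_keys(text):
    """Return every simplified spelling we are willing to match on."""
    # NFC first, so that 'u' + combining diaeresis is seen as 'ü'.
    lowered = unicodedata.normalize("NFC", text).lower()
    return {strip_marks(lowered), expand_umlauts(lowered)}


def name_matches(query, name):
    """A query matches if it shares at least one key with the name."""
    return bool(search_keys(query) & search_keys(name))
```

Here "Müller" is findable as both "Muller" and "Mueller", at the cost of storing more than one key per name.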

Curtis Hovey (sinzui)
Changed in launchpad-foundations:
assignee: Stuart Bishop (stub) → nobody
importance: Medium → Low