Patron name search should be diacritic-insensitive

Bug #1501781 reported by Dan Pearl on 2015-10-01
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Evergreen
Wishlist
Unassigned

Bug Description

This is an enhancement to allow patron name lookups to be more tolerant of spacing and special characters. This addresses problems related to finding names with diacritics, spaces, and the presence or absence of other punctuation marks. (Example, being able to find José by searching Jose, or patrons with multiple first and last names like "de la Croix" -- or is it "Delacroix"?)

Dan Pearl (dpearl) wrote :

I have working code to address this. I will check-in after local review.

Mike Rylander (mrylander) wrote :

Dan, FWIW, we use the following idiom elsewhere to remove diacritical marks:

$value = s/\pM//g

Perhaps that just needs to be added to the patron search method in the perl code?

Mike Rylander (mrylander) wrote :

Disregard my above comment ... I was thinking backwards. Sorry for the noise.

Dan Pearl (dpearl) wrote :
tags: added: pullrequest
Kathy Lussier (klussier) on 2015-12-21
Changed in evergreen:
status: New → Triaged
importance: Undecided → Medium
milestone: none → 2.next
Dan Scott (denials) wrote :

Hi Dan, while I haven't tested out the code yet, I would point out that the commit should also add the relevant install bits to Open-ILS/src/extras/install/Makefile* -- probably adding Text::Unaccent::PurePerl to the CPAN_MODULES list for each distribution, as I believe it is not packaged on Debian, Ubuntu, or Fedora.

Dan Pearl (dpearl) wrote :

I agree with Dan Scott's comment, and have pushed this revised branch

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dpearl/accent3

Kathy Lussier (klussier) wrote :

Hi Dan,

We loaded this on a MassLNC VM but get an Internal Server error when trying to navigate to the public catalog.

Kathy

Kathy Lussier (klussier) on 2016-02-26
tags: added: needsrepatch
removed: pullrequest
Dan Pearl (dpearl) wrote :

I reinstalled the code on training.cwmars.org and there is no problem getting to the public catalog. The feature works as expected. I am suspecting an environmental difference.

To install for testing, I did the following steps:
As root:
   cpan Text::Unaccent::PurePerl

As postgres:
   With psql run the XXXX.schema.patron_unaccent.sql upgrade
   script. You will need to delete the placeholder SELECT
   which is used to detect (and suppress) multiple invocations
   of the upgrade.

As root:
    Drop the new actor.pm in /usr/local/share/perl/5.14.2/OpenILS/Application/Storage/Publisher/

I restarted services and Apache, and the staff client.

I may need some guidance on how to accomplish a proper upgrade. Several of the files in the mod specify the dependency on Text::Unaccent::PurePerl, but these dependencies are, from what I can tell, effective on a full install and maybe an upgrade. I am wondering if the dependencies are looked at during upgrades. If one is simply trying to install one bug fix for testing, is there a way to express that "cpan Text::Unaccent::PurePerl" needs to be done?

Jason Stephenson (jstephenson) wrote :

Upgrades are done pretty much by doing a full install of the new version.

You should installl the prerequisites as normal during an upgrade.

Kathy Lussier (klussier) wrote :

Hi Dan,

We're looking at it a little closer here to see if it is indeed a problem with the new dependency not being installed, but, to provide more details, this is happening on Ubuntu Trusty.

We are not upgrading a current system, but use a script from tsbere that merges the branch and installs a new Evergreen system. He says the script should be installing any dependencies in the Makefile.install files.

I'll let you know if we find out anything else.

Jake Litrell (jake-3) wrote :

Just a quick FYI:
  Open-ILS/src/sql/Pg/create_database_extensions.sql
has a conflict now with pgcrypto added.

Kathy Lussier (klussier) wrote :

I resolved the conflict and re-pushed the branch to http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/kmlussier/lp1501781-accent3 so that it can be available for testing.

Blake said he would load the branch on a MOBIUS server for Bug Squashing Day so that we can see if the problem can be replicated elsewhere. However, I think I may have zeroed in on what the problem is.

As I mentioned above, we were working off a clean build whereas DPearl did his tests using an upgrade script.

If I'm reading it correctly, it looks like the upgrade script is creating a function called evergreen.unaccent_and_squash.

However, I don't see anything in the sql files used for new builds that create that function, which means the function would not have been created on our VM. Is that something that should be added to the 005.schema.actors.sql file?

Blake GH (bmagic) wrote :

Loaded this code and there is a problem setting up the db:
Results of this command:
#
perl Open-ILS/src/support-scripts/eg_db_config --update-config \
       --service all --create-database --create-schema --create-offline \
       --user evergreen --password dapassword --hostname ipaddress --port 5432 \
       --database evergreen --admin-user gadmin --admin-pass admin --load-all-sample
....
...
CREATE INDEX
psql:005.schema.actors.sql:85: ERROR: function evergreen.unaccent_and_squash(text) does not exist
LINE 1: ...r_usr_first_given_name_unaccent_idx ON actor.usr (evergreen....
                                                             ^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
psql:005.schema.actors.sql:86: ERROR: current transaction is aborted, commands ignored until end of transaction block

It looks like those index calls need to happen after that function is created:

CREATE INDEX actor_usr_first_given_name_unaccent_idx ON actor.usr (evergreen.unaccent_and_squash(first_given_name));
CREATE INDEX actor_usr_second_given_name_unaccent_idx ON actor.usr (evergreen.unaccent_and_squash(second_given_name));
CREATE INDEX actor_usr_family_name_unaccent_idx ON actor.usr (evergreen.unaccent_and_squash(family_name));

I will go ahead and remove those lines from 005.schema.actors.sql and manually run those after the database is setup.

Now I have issues adding the index:

CREATE INDEX actor_usr_first_given_name_unaccent_idx ON actor.usr (evergreen.unaccent_and_squash(first_given_name::text));
ERROR: function evergreen.unaccent(text) does not exist
LINE 1: SELECT evergreen.lowercase(evergreen.unaccent(regexp_replace...
                                   ^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.

My DB doesn't have a function called "unaccent" in the evergreen schema. Where is the code for that function?
I guess this bug wont be loaded for bug squashing day

Dan Pearl (dpearl) wrote :

The unaccent should be loaded with the create extension; there could be an ordering issue.

In the meantime, Kathy spotted the missing function, and I have put it in
http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dpearl/accent4

Thanks for everyone's help on this.

Jason Stephenson (jstephenson) wrote :

Dan, could you add a "IF NOT EXISTS" after the "CREATE EXTENSION unaccent" in the upgrade script, please?

The whole thing should then read:

CREATE EXTENSION unaccent IF NOT EXISTS;

That will help people like me who will have already created the extension because of how they have their databases set up.

Thanks!

Jason Stephenson (jstephenson) wrote :

Oh, and I checked that creating the extension does add the missing function, so that definitely needs to be done in the upgrade script.

Mike Rylander (mrylander) wrote :

This is looking good!

One comment/issue: historically, we try not to schema-qualify stuff coming from extensions into any of our own. It's happening here due to a side effect of us setting the search_path to (at least) evergreen, but I would be more comfortable if the extension (and all others we use in the future, as well) was explicitly loaded in the public schema and then we just didn't qualify the unaccent() call. That would fit with how we deal with things like ts_rankcd and friends, avoids mixing "our" code with "their" code, and helps avoid a situation where more than one version of an extension is loaded and the wrong one is used due to differing use of schema qualification.

Thoughts? Or am I just being paranoid.

Jason Stephenson (jstephenson) wrote :

I can confirm what Blake reports for a fresh install in comment #13.

The problem is that with a fresh install, the extension functions are added to schema public.

When the extension is created in an existing database, the functions are added to the first schema in the search path. In our case, this is schema evergreen.

I recommend changing the upgrade script to create the extension in schema public and then changing the paths.

So, the create extension becomes:

CREATE EXTENSION unaccent SCHEMA public IF NOT EXISTS;

Then the calls to evergreen.unaccent() should be changed to public.unaccent().

Thanks for the effort Dan and for putting up with my suggestions!

Blake GH (bmagic) wrote :

Dan,

I'm still missing function evergreen.unaccent(text)

Jason Stephenson (jstephenson) wrote :

Following up on Mike's comment #18, if the extension is created in the public schema, then we probably do not want to schema-qualify the function calls. So, evergreen.unaccent(text) becomes just unaccent(text).

Hope that helps.

Dan Pearl (dpearl) wrote :

Maybe this will do the trick. There are no more references to evergreen.unaccent.
If you are testing, I would recommend doing a DROP EXTENSION unaccent; to avoid leftovers of earlier tries.

It will now go in the public schema, and not have a prefix per Comment #22.

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dpearl/accent7

Blake GH (bmagic) wrote :

Syntax issue:

 CREATE EXTENSION unaccent SCHEMA public IF NOT EXISTS;

ERROR: syntax error at or near "IF"
LINE 1: CREATE EXTENSION unaccent SCHEMA public IF NOT EXISTS;

I took care of it on the test machine, but probably should be addressed if the code were to merge.

Dan Pearl (dpearl) wrote :
tags: added: pullrequest
removed: needsrepatch
Jason Stephenson (jstephenson) wrote :

Hi, Dan!

Thanks for your perseverance. I have driven this branch around the block, and it works for me, now.

I have added my signoff in collab/dyrcona/lp150178-unaccent-patron-search-signoff in the working repository.

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/dyrcona/lp150178-unaccent-patron-search-signoff

Since there are no tests, this will require another signoff before it can go in.

Also, I do not think that we should backport this to 2.9 and 2.8 because it introduces new dependencies on the unaccent extension and the Text::Unaccent::PurePerl module. I'm not sure about precedent, but I think we should avoid introducing new dependencies in the middle of a released series.

I will leave it targeted at those series for now so we can have a discussion on that mater.

Thanks again, Dan.

tags: added: signedoff

I have tested this code and consent to signing off on it with my name, [Christine Burns] and email address, [<email address hidden>]

Tested on Sandbox provided courtesy of MOBIUS.

Test
search José -> find Jose & José
search Delacroix -> find de la Croix

Kathy Lussier (klussier) wrote :

Adding the needstest tag just so it doesn't get overlooked! Thanks Dan, Jason and Christine!

tags: added: needsreleasenote needstest
Jason Stephenson (jstephenson) wrote :

I added Christine's signoff and pushed to a typo-corrected branch:

collab/dyrcona/lp1501781-unaccent-patron-search-signoff

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/dyrcona/lp1501781-unaccent-patron-search-signoff

I deleted the previous branch.

Not sure if the three signoffs qualify it to go in without the test, but I think it does.

I'm not pushing it now, because I think we need to have the discussion about whether or not to backport it, and release notes would be a good idea.

Galen Charlton (gmc) on 2016-03-04
tags: removed: needsreleasenote
Galen Charlton (gmc) wrote :

Looks like the patch *does* have release notes, so I've removed the needsreleasenote tag.

I'm not in favor of backporting it, as I think we should avoid adding new dependencies unless absolutely necessary.

One thing that bothers me is the use of two different unaccenting libraries; while it's not surprising that they behave the same for Latin alphabets, they don't for Greek:

Compare:

select unaccent('Άά Έέ Ήή Ίί Όό Ύύ Ώώ');
       unaccent
----------------------
 Άά Έέ Ήή Ίί Όό Ύύ Ώώ

with:

perl -Mutf8 -MText::Unaccent::PurePerl -C7 -e 'print unac_string("Άά Έέ Ήή Ίί Όό Ύύ Ώώ"), "\n"'
Αα Εε Ηη Ιι Οο Υυ Ωω

Galen Charlton (gmc) wrote :

https://gist.github.com/jfragoulis/9914900 gives an example of how to customize Pg's unaccent(), but for the sake of consistency, I'm wondering if we might be better off taking the performance hit and and writing evergreen.unaccent_and_squash() in PL/Perl so that it can use Text::Unaccent::PurePerl.

I'm not clear on how great the performance hit would be but few things are
more frustrating to users than inconsistent behavior.

On Fri, Mar 4, 2016 at 3:40 PM, Galen Charlton <email address hidden> wrote:

> https://gist.github.com/jfragoulis/9914900 gives an example of how to
> customize Pg's unaccent(), but for the sake of consistency, I'm
> wondering if we might be better off taking the performance hit and and
> writing evergreen.unaccent_and_squash() in PL/Perl so that it can use
> Text::Unaccent::PurePerl.
>
> --
> You received this bug notification because you are subscribed to
> Evergreen.
> Matching subscriptions: evergreenbugs
> https://bugs.launchpad.net/bugs/1501781
>
> Title:
> Patron name search should be diacritic-insensitive
>
> Status in Evergreen:
> Triaged
> Status in Evergreen 2.8 series:
> Triaged
> Status in Evergreen 2.9 series:
> Triaged
>
> Bug description:
> This is an enhancement to allow patron name lookups to be more
> tolerant of spacing and special characters. This addresses problems
> related to finding names with diacritics, spaces, and the presence or
> absence of other punctuation marks. (Example, being able to find José
> by searching Jose, or patrons with multiple first and last names like
> "de la Croix" -- or is it "Delacroix"?)
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/evergreen/+bug/1501781/+subscriptions
>

Jason Stephenson (jstephenson) wrote :

Keeping track of custom unaccent rules could become tricky if the Pg module changes, or if there are differences for other languages, too.

Besides writing unaccent_and_squash() in PL/Perl, we could add an open-ils.storage call to unaccent a string using the unaccent extension as well as another to call unaccent and squash on a string.

It may be necessary to a more public call in some other service, maybe open-ils.actor, to work on names so it can be used where the private router is not available.

Just a thought.

Dan Pearl (dpearl) on 2016-03-07
tags: added: needsrepatch
removed: pullrequest signedoff
Changed in evergreen:
assignee: nobody → Dan Pearl (dpearl)
Jason Stephenson (jstephenson) wrote :

Out of curiosity, Dan, which approach do you plan to take?

I think wirting evergreen.unaccent() and/or evergreen.unaccent_and_squash() in PL/Perl using Text::Unaccent::PurPerl would be the quickest/easiest route to go. This will add Text::Unaccent::PurePerl as a requirement on the database server.

My suggestion of adding new calls/services is very likely overkill for this.

Anyway, I'm more than happy to help or to test anything you come up with.

Changed in evergreen:
status: Triaged → Confirmed
Jason Stephenson (jstephenson) wrote :

After a brief email exchange with Dan, I've decided to take over finishing this branch.

Our current thinking is that we'll keep the unaccent extension for PostgreSQL, but lose the dependency on Text::Unaccent::PurePerl.

The Open-ILS Perl method to unaccent strings will be implemented to call the database function through a cstore JSON query. This way, we'll have consistent unaccent behavior and will not add a dependency on a new Perl module.

I would have less objection to backporting the branch in this new form, but I still leave that question open to discussion.

Changed in evergreen:
assignee: Dan Pearl (dpearl) → Jason Stephenson (jstephenson)
Jason Stephenson (jstephenson) wrote :

Another quick look over Dan's code reminded me that his Perl changes are all in storage. I should be able to use the database function without requiring cstore.

Jason Stephenson (jstephenson) wrote :

Hi, everyone!

I have pushed a reworked branch that combines much of Dan's original work with some changes to his search code so that only the unaccent Postgres extension is used.

I also added a file of 15 pgtap tests. There are no Perl tests, but I could add some if that is desired.

It's the top 3 commits on the user/dyrcona/lp1501781-unaccent-partron-search branch in working:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dyrcona/lp1501781-unaccent-partron-search

I've restored the pullrequest tag and removed the others.

I leave it an open question if we backport this or not.

tags: added: pullrequest
removed: needsrepatch needstest
Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
Changed in evergreen:
importance: Medium → Wishlist
Jason Stephenson (jstephenson) wrote :

I've been giving this one some thought, and I decided this is a feature, not a bug fix for the following reasons:

1) Patron search works. You can find patrons with diacritics if you add the diacritic to the search. You can also find patrons without diacritics when searching without diacritics. Search being sensitive to diacritics, while inconvenient, is not a bug.

2) This branch adds new translation strings in the form of descriptions and labels for org unit settings. I don't think we should be adding translation strings in backported bug fixes because they will not be translated for the bug fix release.

3) This branch adds new org unit settings. This is not something that I think we should do for bug fixes.

4) This branch introduces a new extension into the database. This is also not something that I think we should do for bug fixes.

I think we need to have a conversation at the conference or the next developers' meeting, whichever comes first, about what constitutes a bug fix and what is an enhancement, or at minimum what will prevent a bug fix from being backported in the future. The consensus from that discussion then should go on the wiki somewhere so that we can reference it later. I, of course, am volunteering to lead this discussion and write the wiki page after the meeting.

no longer affects: evergreen/2.8
no longer affects: evergreen/2.9
Changed in evergreen:
assignee: nobody → Terran McCanna (tmccanna)
Terran McCanna (tmccanna) wrote :

I have tested this code and consent to signing off on it with my name, Terran McCanna, and my email address <email address hidden>.

Bug vs New Feature: I agree it should be considered a new feature - it's definitely one I'll want to promote to my libraries!

Testing Notes: I was not able to run the pgtap tests, but I referenced them and spot-checked a variety of different special characters. Since Concerto doesn't have any patrons that have diacritics in their names, I edited patron names and then searched for them - everything I tried with diacritics and with spaces worked perfectly. I tested in both the xul client and the web client.

Wish List: The only thing I tried that did not work was apostrophes. For instance, in PINES, "O'Neal" is entered as O'Neal, O Neal, and Oneal. We also have a lot of names with apostrophes in the first names such as T'Andre. If this could be added as well before being merged into master, then that would be great, but if not, it is still a wonderful new feature worth adding in even without it.

Changed in evergreen:
assignee: Terran McCanna (tmccanna) → nobody
tags: added: signedoff
Terran McCanna (tmccanna) wrote :

PS: I also tested that it came up with the right number of matches when registering a new patron.

Mike Rylander (mrylander) wrote :

Pushed to master. Thanks, Dan and Jason! (I agree this is a feature.)

Changed in evergreen:
status: Confirmed → Fix Committed
Ben Shum (bshum) on 2016-08-09
Changed in evergreen:
milestone: 2.next → 2.11-beta
Changed in evergreen:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers