ISBN search only finds records with both 10 and 13 digit ISBNs

Bug #1743613 reported by Jason Boyer on 2018-01-16
This bug affects 2 people
Affects Status Importance Assigned to Milestone

Bug Description

Eg 3.0.2

After receiving reports of records not showing up for ISBN searches I noticed that the SQL query generated for ISBN searches becomes a search for the 10 digit and 13 digit ISBNs in a single record because these lines:

to_tsquery('simple', COALESCE(NULLIF( '(' || btrim(regexp_replace(translate_isbn1013(E'9780899978482'),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) AS tsq,
      to_tsquery('simple', COALESCE(NULLIF( '(' || btrim(regexp_replace(translate_isbn1013(E'9780899978482'),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) AS tsq_rank

turn that ISBN into this ts_query:

'9780899978482' & '0899978487'

Which does exactly what it looks like, requires both values to be considered a match.

This can be corrected by changing the regexp_replace call to use a pipe rather than an ampersand:

to_tsquery('simple', COALESCE(NULLIF( '(' || btrim(regexp_replace(translate_isbn1013(E'9780899978482'),E'(?:\\s+|:)','|','g'),'&|') || ')', '()'), '')) AS tsq,
      to_tsquery('simple', COALESCE(NULLIF( '(' || btrim(regexp_replace(translate_isbn1013(E'9780899978482'),E'(?:\\s+|:)','|','g'),'&|') || ')', '()'), '')) AS tsq_rank

I assume this change should be limited to ISBNs because of their unusual nature but I don't have time to look into that today and wanted to get this down while it's fresh in my mind.

Mike Rylander (mrylander) wrote :

I don't think we want to do a blanket change there, as that is a construct used by all normalizers.
 It would be wrong to join the output of, say, the date range normalizer with | instead of &. I think we'll need to teach the normalizer mapping table and the code that constructs this clause about how the output should be joined.

Mike Rylander (mrylander) wrote :

Hrm... hold on a second here. If the normalizer was properly applied to the data coming out of the record, the index_vector column in mife would contain both versions, and the search would match.

Jason, did you perhaps recently add the modifier without performing a reingest after?

Elaine Hardy (ehardy) wrote :

I did an ISBN search for a record in the PINES database with just a 10 digit ISBN and was able to retrieve the record.

Jason Boyer (jboyer) wrote :

Nothing has been changed on our normalizers or the isbn field recently and there was a reingest after the last upgrade that required it but I don't recall ISBN handling changing in ages. (Manually forcing full reingests didn't help with the examples I've been given.) I'll double check some things.

Jason Boyer (jboyer) wrote :

Ah ha.
What has changed recently is our definition of oils_tsearch2. (applies normalizers to metabib.identifier_field_entry to create the index_vector.)

I changed it to this shrimpy version from the seed data (which I *thought* was current at the time but must have been out of date.)
CREATE OR REPLACE FUNCTION public.oils_tsearch2()
 RETURNS trigger
 LANGUAGE plpgsql
AS $function$
NEW.index_vector = to_tsvector((TG_ARGV[0])::regconfig, NEW.value);

Which obviously knows nothing about any normalizers. I'll update our function with a proper definition and start a reingest to mop up.

So this bug is invalid, but the reason I was looking for possible oils_tsearch2 issues is that a table was getting null values from the trigger... So maybe I'll file another bug soon? Have to see. :-/

Changed in evergreen:
status: New → Invalid
Mike Rylander (mrylander) wrote :

Ah, yep. that is a long-dead version that is superseded in the baseline schema by a script that follows the one defining that version.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers