Indexing multiple ISSNs in a single bib results in identifier|issn lookup failure

Bug #932540 reported by Dan Scott
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Evergreen
Fix Released
High
Unassigned
2.1
Fix Released
High
Unassigned

Bug Description

* Evergreen master

The default config.metabib_field_index_norm_map contains a 'replace' entry for the ISSN identifier field that replaces spaces with "". That was meant to handle ISSNs that incorrectly contain a space instead of a hyphen; it normalizes them from: "1234 1234" to "12341234".

However, if you extend the index definition for ISSN to include, say, 780 $x and 785 $x to return hits for the preceeding and succeeding versions of the periodical, or even simply have a print and electronic version of the ISSN in the same record, then the space replacement function wreaks havoc as all of the entries which would have been separated by a space are indexed without spaces.

What we want:

conifer=# SELECT * FROM metabib.identifier_field_entry WHERE source = 603139;
    id | source | field | value | index_vector
 11097172 | 603139 | 19 | 1433-7851 0044-8249 1521-3773 0570-0833 0935-9648 | '00448249':2 '05700833':4 '09359648':5 '14337851':1 '15213773':3

What we currently have:

    id | source | field | value | index_vector
 11097172 | 603139 | 19 | 1433-7851 0044-8249 1521-3773 0570-0833 0935-9648 | '1433785100448249152137730570083309359648':1

To do less harm, I propose that we remove the replace(" ", "") rule from 950.sql in the base schemas. I suppose we'll have to write an upgrade script that removes the rule and reindexes anything with more than one ISSN.

Tags: pullrequest
Revision history for this message
Dan Scott (denials) wrote :

Oh, and in case it's not clear, in what we currently have a search for "1433-7851" in identifier|issn would return nothing; only a search for "1433785100448249152137730570083309359648" returns the desired record in that case. Which is clearly not optimal behaviour :)

Revision history for this message
Dan Scott (denials) wrote :

One rough tool in the upgrade / bug fix toolbox to reingest the field entries once the offending replace(" ", "") has been removed from the index normalization map:

SELECT metabib.reingest_metabib_field_entries(source)
  FROM metabib.identifier_field_entry
  WHERE field = 19 AND char_length(value) > 9;

Revision history for this message
Dan Wells (dbw2) wrote :

Thanks for pointing this out, Dan. I can verify that we are experiencing this same issue.

Perhaps now would be a good opportunity to consider changing the default 'joiner' used in biblio.extract_metabib_field_entry? As things stand, it is hardcoded to a single space via a wrapped function call, and I am thinking we might want to use something a little more intrusive. Not only would it save us in this case, but it could more or less eliminate the potential for false exact matches across field boundaries (i.e. the end of one field and the beginning of the next forming an "exact" match).

In some quick testing I changed the hardcoded space to a hardcoded pipe (|) and reindexed. Pipe is nice because it is already ignored when creating the vector, but I haven't looked into other special characters which might meet that criteria (and we can really pick anything we want since we are already manipulating the vector data). My testing was *very* limited, but the only obvious problem I came upon was that our ISBN normalizer needed to be told to expect a pipe-delimited list rather than space delimited, a simple change. Switching to pipe as the joiner did fix the ISSN issue. I also experimented with using 'tab' rather than space or pipe as the joiner, but right now all whitespace gets squashed to a single space in the *_field_entry tables, so the tabs were not preserved. I do think 'tab' is still a strong candidate, but I didn't pursue it further at this point.

There is some support in place for eventually allowing different joiners for different field configs. That would certainly give much greater flexibility, but we should still ask whether space is the best default joiner in any case.

Revision history for this message
Dan Scott (denials) wrote :

Dan - for the sake of fixing this specific issue in a reasonable time frame (like, the next point releases), would you support the current change?

I think a change as broad as using a different default joiner should be a conversation on open-ils-dev, and should probably have some clear examples of how the space as a default is bad and what would be fixed by changing to a different joiner (so far the false "exact match" seems to be the best example). I fear that people aren't going to see this discussion given that we're talking about ISSNs here.

Revision history for this message
Dan Wells (dbw2) wrote :

Yes, I do support the current proposed change. I looked into this a little more, and maybe our data is a little cleaner than average, but from what I can tell, spaces in ISSNs on the data side (i.e. in the bib records) are rare (I don't think we have *any* formatted that way). It is entirely possible that the space removal rule was meant to normalize bad searches rather than bad data, but I couldn't actually get it to function that way either. If I search for an issn as:

identifier|issn:1234 5678

it seems that the space is eaten by the tokenizer, so we have two tokens and nothing to remove, and the match ends up failing (since the space *is* gone in the index). If I try instead:

identifier|issn:"1234 5678"

it also fails. I am not certain why without digging even deeper, but at any rate, I wasn't able to construct a query where space removal would work from the search direction either.

Revision history for this message
Dan Scott (denials) wrote :

Created two branches with the fixes I had proposed earlier:

  * user/dbs/lp932540_fix_issn_indexing for master
  * user/dbs/lp932540_fix_issn_indexing_rel_2_1 for rel_2_1

tags: added: pullrequest
Changed in evergreen:
milestone: none → 2.2.0beta1
importance: Undecided → High
Revision history for this message
Mike Rylander (mrylander) wrote :

Master and 2.1 version merged. Thank, Dan.

Changed in evergreen:
status: New → Fix Committed
Changed in evergreen:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.