non-filing indicators break title search relevance in non-English titles

Bug #825039 reported by Dan Scott
This bug affects 1 person
Affects     Status         Importance   Assigned to   Milestone
Evergreen   Fix Released   High         Unassigned

Bug Description

* Evergreen 2.0.6 (reproduced in master)
* OpenSRF 2.0.0
* PostgreSQL 9.0

The default xpath expression for title indexing relies on the MODS32 titleInfo element, which grabs the nonSort and title elements along with the intervening whitespace.

This is generally fine for English titles, which have non-filing indicators (245 indicator 2) for titles like "The foobar" or "A foobar", where the space between the non-filing article that goes into nonSort and the filing remainder that goes into title is expected. The resulting value in metabib.title_field_entry is "The foobar" or "A foobar" so exact match relevance bumps, etc, work as expected.

However, in French and other languages, it is typical for non-filing indicators to be used for titles like "l'Histoire" - in which case, nonSort gets "l'" and title gets "Histoire", and the resulting value in metabib.title_field_entry is "l' Histoire" - breaking relevance and search in general.
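To make the difference concrete, here is a minimal Python sketch of the effect, not the actual Evergreen indexing code. It assumes simplified, hypothetical titleInfo fragments and assumes that the element's string value is whitespace-collapsed before it is stored; the function name title_info_string is made up for illustration.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"


def title_info_string(xml_fragment: str) -> str:
    """Return the string value of a titleInfo element, whitespace collapsed.

    This mimics taking the text of titleInfo itself: the whitespace-only
    text nodes between child elements become part of the value.
    """
    element = ET.fromstring(xml_fragment)
    raw = "".join(element.itertext())
    return " ".join(raw.split())


# Hypothetical, simplified fragments (real MODS output carries more elements).
ENGLISH = f"""<titleInfo xmlns="{MODS_NS}">
  <nonSort>The</nonSort>
  <title>foobar</title>
</titleInfo>"""

FRENCH = f"""<titleInfo xmlns="{MODS_NS}">
  <nonSort>l'</nonSort>
  <title>Histoire</title>
</titleInfo>"""

print(title_info_string(ENGLISH))  # The foobar
print(title_info_string(FRENCH))   # l' Histoire
```

The English value is what a searcher would type; the French one has a space that no one would type.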

One step towards a fix is to extract the nodeset for the discrete elements in titleInfo instead of titleInfo itself, avoiding the whitespace-only text nodes:

UPDATE config.metabib_field SET xpath = $$//mods32:mods/mods32:titleInfo/*[local-name()='nonSort' or local-name()='title' and not(@type) or local-name()='subTitle' ]$$ WHERE id = 6;

However, then the default joiner that we pass to biblio.extract_metabib_field_entry() (a space) kicks in, and we're back to the same problem of "l' Histoire" for the indexed title, leading to search misery.

Consequently, we can add a condition to the joiner clause in biblio.extract_metabib_field_entry() to not use a joiner when the field_class is 'title':

            IF raw_text IS NOT NULL AND idx.field_class <> 'title' THEN
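The effect of that condition can be sketched outside the database. The Python below is only an illustration of the joiner logic under stated assumptions: join_field_parts is a hypothetical name, the loop structure is a guess at how the extracted nodes get concatenated, and the input list stands in for the node values the xpath would return.

```python
def join_field_parts(parts, field_class, joiner=" "):
    """Concatenate extracted node values, skipping the joiner for titles.

    Mirrors the condition quoted above: the joiner is only inserted when
    some text has already been collected AND the class is not 'title'.
    """
    raw_text = None
    for part in parts:
        if raw_text is not None and field_class != "title":
            raw_text += joiner
        raw_text = (raw_text or "") + part
    return raw_text


nodes = ["l'", "Histoire"]  # nonSort and title values for the French example
print(join_field_parts(nodes, "title"))    # l'Histoire
print(join_field_parts(nodes, "keyword"))  # l' Histoire
```

The second call shows why keyword entries keep the stray space: only the title class skips the joiner.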

This still isn't perfect for relevance, as the keyword field entries still get "l' Histoire" - but it vastly improves title search for our French titles.

For an example MARC record to test with, see http://laurentian.concat.ca/opac/extras/supercat/retrieve/marcxml/record/599708 (and note the titleInfo elements at http://laurentian.concat.ca/opac/extras/supercat/retrieve/mods32/record/599708).

Tags: pullrequest
Revision history for this message
Dan Scott (denials) wrote :

I went with a different approach: modify MODS32 to include a titleNonfiling element that ignores the non-filing indicators and gives you the title as one unmodified string.

Repo: working
Branch: user/dbs/fix-nonfiling-titles

Note: the upgrade script does not currently include a "reingest titles that have non-filing indicators and apostrophes" upgrade action.

Revision history for this message
Dan Scott (denials) wrote :

Added a reingest UPDATE statement to the upgrade script that checks for apostrophes in the value column of 245 a entries with a non-filing indicator in metabib.full_rec.

It would result in 55,000 / 2,200,000 of our records getting reingested - not terrible. So probably reasonable for most sites to apply. Pushed into the branch.

tags: added: pullrequest
Dan Scott (denials)
Changed in evergreen:
importance: Undecided → High
Revision history for this message
Dan Scott (denials) wrote :

One complication: the naco_normalize() function is stripping apostrophes before we even get to the FTS parser. So:

"l'Histoire" becomes "lhistoire"

before it can be indexed. So with the current patch, users have to search with the matching article: a search on "histoire" returns nothing, whereas "l'histoire" would return the expected results. As it turns out, this problem affects not just non-filing indicators, but any apostrophe anywhere in an indexed field. A big problem.

The agreed-upon solution in IRC was to create a variant of naco_normalize() that is meant for actual text search, rather than for normalizing authority headings. We had leaned heavily on naco_normalize() for our text searching purposes but it clearly goes overboard.
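A rough Python sketch of why the two normalizers behave differently for search. Both functions are stand-ins written for this illustration: naco_like only reproduces the apostrophe stripping described above, and search_like assumes the search variant turns apostrophes into spaces, which is a guess rather than the actual search_normalize() implementation. The matches helper is likewise hypothetical and ignores stemming.

```python
import re


def naco_like(text):
    """Stand-in for heading normalization: lowercase, delete apostrophes."""
    return re.sub(r"\s+", " ", text.lower().replace("'", "")).strip()


def search_like(text):
    """Assumed search variant: lowercase, turn apostrophes into spaces."""
    return re.sub(r"\s+", " ", text.lower().replace("'", " ")).strip()


def matches(query, indexed, normalize):
    """True if every normalized query token is among the indexed tokens."""
    indexed_tokens = set(normalize(indexed).split())
    return all(tok in indexed_tokens for tok in normalize(query).split())


title = "l'Histoire"
print(naco_like(title))                          # lhistoire
print(matches("histoire", title, naco_like))     # False
print(matches("l'histoire", title, naco_like))   # True
print(search_like(title))                        # l histoire
print(matches("histoire", title, search_like))   # True
print(matches("l'histoire", title, search_like)) # True
```

With the heading-style normalizer the article is fused into the word, so only a query with the same article matches; treating the apostrophe as a separator lets either query form find the title.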

Long-term, the best approach might be to try to rely directly on PostgreSQL's parsing and indexing - with the addition of the unaccent contrib module, ts_debug() should return what one would expect.
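As an illustration of the accent-folding step only, here is a Python sketch that uses Unicode decomposition. This is an assumption-laden analogy: PostgreSQL's unaccent module works from its own rules file rather than from Unicode decomposition, and strip_accents is a hypothetical helper name.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("littéraire poésie"))  # litteraire poesie
```

The folded forms line up with the unaccented lexemes one would hope to see from the parser for French titles.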

Revision history for this message
Dan Scott (denials) wrote :

Force-pushed an update to user/dbs/fix-nonfiling-titles that includes the search_normalize() variant and uses it in the appropriate places.

Revision history for this message
Dan Scott (denials) wrote :

Pushed an update to the upgrade script such that the reindexing of records is no longer limited to those titles that have non-filing indicators.

Dan Scott (denials)
Changed in evergreen:
milestone: none → 2.2.0
Changed in evergreen:
status: New → In Progress
assignee: nobody → Dan Scott (denials)
Changed in evergreen:
assignee: Dan Scott (denials) → nobody
Changed in evergreen:
milestone: 2.2.0alpha1 → 2.2.0alpha2
Revision history for this message
Dan Scott (denials) wrote :

Changed status back to "New" as nobody is actually looking at this, to my knowledge.

Changed in evergreen:
status: In Progress → New
Revision history for this message
Jason Etheridge (phasefx) wrote :

Hrmm, I tried this with L' Histoire littéraire immanente dans la poésie latine : huit exposés suivis de discussions (ISBN: 2600007474)

Before the patch:

evergreen2=# select * from metabib.title_field_entry;
-[ RECORD 1 ]+------------------------------------------------------------------------------------------------------------------------------------
id | 1
source | 1
field | 6
value | L' Histoire littéraire immanente dans la poésie latine huit exposés suivis de discussions
index_vector | 'dan':5 'de':12 'discuss':13 'expos':10 'histoir':2 'huit':9 'immanent':4 'l':1 'la':6 'latin':8 'litterair':3 'poesi':7 'suivi':11

And then after the patch and upgrade script, I see no obvious change:

evergreen2=# select * from metabib.title_field_entry;
-[ RECORD 1 ]+------------------------------------------------------------------------------------------------------------------------------------
id | 1
source | 1
field | 6
value | L' Histoire littéraire immanente dans la poésie latine huit exposés suivis de discussions
index_vector | 'dan':5 'de':12 'discuss':13 'expos':10 'histoir':2 'huit':9 'immanent':4 'l':1 'la':6 'latin':8 'litterair':3 'poesi':7 'suivi':11

Trying a new record, L' Histoire littéraire : ses méthodes et ses résultats ; mélanges offerts à Madeleine Bertaud (ISBN: 2600004696 (pbk.)):

evergreen2=# select * from metabib.title_field_entry;
-[ RECORD 1 ]+------------------------------------------------------------------------------------------------------------------------------------
id | 1
source | 1
field | 6
value | L' Histoire littéraire immanente dans la poésie latine huit exposés suivis de discussions
index_vector | 'dan':5 'de':12 'discuss':13 'expos':10 'histoir':2 'huit':9 'immanent':4 'l':1 'la':6 'latin':8 'litterair':3 'poesi':7 'suivi':11
-[ RECORD 2 ]+------------------------------------------------------------------------------------------------------------------------------------
id | 2
source | 2
field | 6
value | L'Histoire littéraire : ses méthodes et ses résultats ; mélanges offerts à Madeleine Bertaud
index_vector | 'a':11 'bertaud':13 'et':6 'histoir':2 'l':1 'litterair':3 'madelein':12 'melang':9 'method':5 'offert':10 'resultat':8 'ses':4,7

Not sure what I should be seeing.

HOWEVER

A title search for l'Histoire does show both records, but pre-patch, I couldn't get the first record to show up in searches (going to redo it just to be sure).

Revision history for this message
Jason Etheridge (phasefx) wrote :

Confirmed. So I don't grok how it works, but it sure seems to. I'd be happy to sign-off and merge unless we want more expert eyes on it.

Revision history for this message
Jason Etheridge (phasefx) wrote :

So to be clear, the second record I played with was post-patch (it may very well break pre-patch).

Revision history for this message
Lebbeous Fogle-Weekley (lebbeous) wrote :

Got the same results as Jason in a new DB (empty except for Dan's test record). I think this is ready for master as soon as I discuss some trivial merge questions with Dan.

Revision history for this message
Lebbeous Fogle-Weekley (lebbeous) wrote :

Merged, signed-off and pushed to master. Thanks Dan.

Changed in evergreen:
status: New → Fix Committed
Changed in evergreen:
status: Fix Committed → Fix Released