Series search in 2.5 does not retrieve 800 |t

Bug #1259665 reported by Elaine Hardy
This bug affects 6 people
Affects     Status         Importance   Assigned to   Milestone
Evergreen   Fix Released   High         Unassigned
2.5         Fix Released   Undecided    Unassigned

Bug Description

Series search in 2.5 does not retrieve values in 800 |t. For example: a series search for Hunger Games retrieves 19 records in the PINES live catalog (gapines.org), which runs EG 2.3. The same search in 2.5 retrieves one record.

In 2.3, records with:
490 1. ‡a[The hunger games trilogy ; ‡vbk. 1]
800 1. ‡aCollins, Suzanne. ‡tHunger Games ; ‡vbk. 1.

and

490 1. ‡aHunger games ; ‡v2
830 .0 ‡aHunger games ; ‡v2.

are retrieved.

In 2.5, only records with 830 are retrieved (in the PINES catalog this is one record):

490 1. ‡aHunger games ; ‡v2
830 .0 ‡aHunger games ; ‡v2.

Revision history for this message
Ben Shum (bshum) wrote :

Confirmed to affect us as well. Setting status and targets.

Changed in evergreen:
status: New → Confirmed
importance: Undecided → Medium
milestone: none → 2.5.2
Revision history for this message
Dan Wells (dbw2) wrote :

This is a side effect of the changes made to series extraction to better support browse (see commit e710ecbe).

Basically, we modified the MODS to add an "nfi" (non-filing) version of the series and subjects so that we could get the browse to sort properly, then changed the index extraction criteria to suit. 800t wasn't given an "nfi" version, so it doesn't make the cut anymore for series index extraction. Of course, there isn't an nfi indicator for 800t at all, so we can't easily add an nfi version, which means I don't think we can make 800t browseable in a reasonable way. Assuming we decide to somehow make 800t at least searchable again, this may be a potential source of confusion.

All that said, if we are alright with the potential confusion, I think we can probably remove the "nfi" restriction from the search_xpath in the relevant config.metabib_field entry, then add it to the browse_xpath. I say "probably" because I believe that the browse_xpath must be a subset of the search_xpath, and I am not 100% positive limiting to this attribute would qualify (though I think it would work).
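
To illustrate the effect described above, here is a minimal Python sketch; it is not Evergreen code. It assumes a heavily simplified MODS shape in which an 830-derived series carries both a plain titleInfo and a type="nfi" copy, while an 800 |t series carries only the plain one. The sample XML, the constant names, and the function series_titles are all invented for illustration.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
NS = {"mods32": MODS_NS}

# Hypothetical, simplified MODS for a record whose series comes from an 830:
# assumed to carry a plain titleInfo plus an extra type="nfi" copy.
RECORD_830 = f"""
<mods xmlns="{MODS_NS}">
  <relatedItem type="series">
    <titleInfo><title>Hunger games</title></titleInfo>
    <titleInfo type="nfi"><title>Hunger games</title></titleInfo>
  </relatedItem>
</mods>
"""

# Hypothetical, simplified MODS for a record whose series comes from 800 |t:
# assumed to have no type="nfi" titleInfo at all.
RECORD_800T = f"""
<mods xmlns="{MODS_NS}">
  <relatedItem type="series">
    <titleInfo><title>Hunger Games</title></titleInfo>
  </relatedItem>
</mods>
"""


def series_titles(mods_xml, nfi_only):
    """Return series titles, optionally restricted to type="nfi" nodes."""
    root = ET.fromstring(mods_xml.strip())
    path = ".//mods32:relatedItem[@type='series']/mods32:titleInfo"
    if nfi_only:
        # Same kind of restriction as the "nfi" attribute filter discussed above.
        path += "[@type='nfi']"
    return [
        node.findtext("mods32:title", namespaces=NS)
        for node in root.findall(path, NS)
    ]


for label, record in (("830", RECORD_830), ("800 |t", RECORD_800T)):
    print(label, "unrestricted:", series_titles(record, nfi_only=False))
    print(label, "nfi only:    ", series_titles(record, nfi_only=True))
```

Under these assumptions the 800 |t record yields its title only when the filter is absent, which matches the reported symptom. The 830 record yields the same title twice without the filter, because both of its titleInfo nodes match.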

Revision history for this message
Elaine Hardy (ehardy) wrote :

Not having 800|t in the series index extraction is a major problem. Probably more than half of our series statements in public libraries are 800|t. While it would be preferable to have 800|t in both browse and standard searching, I argue that, if both are not possible, it is better to have it in standard search than in browse.

Revision history for this message
Mike Rylander (mrylander) wrote :

Dan, you're correct that it would not.

We could add an "nfi" version that doesn't actually attempt "nfi" removal, of course. Another alternative is to simply split the browse and search strings, or have separate entries to cover nfi-capable tags and 800t separately. Browse can be a subset of search, of course, so if we were to go with the former, the db could be primed without a full reingest with something like:

insert into metabib.series_field_entry (field, source, value) select XXX, record, value from metabib.full_rec where tag = '800' and subfield = 't';

Where XXX is the appropriate id from config.metabib_field.

Revision history for this message
Mike Rylander (mrylander) wrote :

Actually, I retract my "would not work". This might, but I'm not sure how the ingest will react to not having data post-filter...

  xpath = $$//mods32:mods/mods32:relatedItem[@type="series"]/mods32:titleInfo$$ -- as before the update
  browse_xpath = $$*[@type="nfi"]$$ -- filter out nodes that don't support nfi from browse, including 800t, unfortunately
  browse_sort_xpath = $$*[@type="nfi"]/*[local-name() != "nonSort"]$$ -- sorting version of the above

My previous priming trick would still work fine.

Dan Wells (dbw2)
Changed in evergreen:
importance: Medium → High
milestone: 2.5.2 → none
Revision history for this message
Sarah Childs (sarahc) wrote :

I second everything Elaine has said. A series search that doesn't return results for the 800 is so misleading that it should be removed if it can't be fixed.

She mentions that more than half the series statements are 800t, and I would add that series traced in the 800t include probably 80-90% of the series that people actually care about searching, since these are all series written by a single author (the vast majority of popular fiction series).

Revision history for this message
Dan Wells (dbw2) wrote :

I am retargeting this bug to get it on the 2.6 radar. I recently worked up a branch along the lines of what was discussed here, and as Mike foresaw, I did need to make one small adjustment to the ingest to get it to not die on null values.

Unfortunately, this direction isn't really going to work out because of the way 'titleInfo' is duplicating data in the MODS. We end up with duplicate data in the search index and duplicate facets as well.

Ultimately, we are going to need to either rethink the MODS additions or separate the search/facet/browse configs for series. The second is easier, and probably the next step, but I think the first would be better long term (rather than deal with an increasingly messy MODS with duplicate data elements).

At any rate, the broken fix attempt is here for anyone wanting to see what I mean:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dbwells/lp1259665_restore_800_t_search

Changed in evergreen:
milestone: none → 2.6.0-rc1
Revision history for this message
Dan Wells (dbw2) wrote :

Okay, I decided to go the simplest route and split series search/faceting config from series browse config. Here is my attempt to do so:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dbwells/lp1259665_series_browse_reconfig

Marking this as a blocker, since this will require significant reingest.

tags: added: 2.6-rc-blocker pullrequest
Revision history for this message
Dan Wells (dbw2) wrote :

Just force pushed an update to prevent the search index duplicates I talked about before. I wasn't being selective enough in my search xpath.

Revision history for this message
Ben Shum (bshum) wrote :

Seems good to me. Thanks Dan. Pushed!

Changed in evergreen:
status: Confirmed → Fix Committed
Revision history for this message
Galen Charlton (gmc) wrote :

For the benefit of the record, this bug also broke search indexing of series titles found in 490s whose first indicator is 0.

Revision history for this message
Ben Shum (bshum) wrote :

Adding a series target for 2.5. Not sure if anyone would like to tackle how this might be backported there, but it may be worthwhile for sites on 2.5 that want a more functional series search.

Revision history for this message
Dan Wells (dbw2) wrote :

I am 100% in favor of getting this into 2.5. The lack of a 2.5 target was just an oversight on my part.

Revision history for this message
Elaine Hardy (ehardy) wrote :

I had noticed the 490 search issue as well but had not had the opportunity to add it here.

It would be very good if this could get backported to 2.5

Thanks!

Revision history for this message
Mike Rylander (mrylander) wrote :

FWIW, for 2.5, we should be able to prime the search side of things based on the content of metabib.full_rec, to avoid a full reingest. It won't be normalization-perfect, but it's much cheaper and 99% correct. The 2.6 upgrade will end up generating a reingest in any case, I think, so that last (estimated) 1% of correctness is only a temporary gap. Something like:

-- For each field id, replace XXX

    INSERT INTO metabib.series_field_entry (field,source,value)
        SELECT XXX,record,value FROM metabib.full_rec WHERE tag = '490' AND subfield = 'a';
    INSERT INTO metabib.series_field_entry (field,source,value)
        SELECT XXX,record,value FROM metabib.full_rec WHERE tag IN ('800','810','811') AND subfield = 't';
    INSERT INTO metabib.series_field_entry (field,source,value)
        SELECT XXX,record,value FROM metabib.full_rec WHERE tag = '830' AND subfield IN ('a','t');

-- Then, afterwards...
    DELETE FROM metabib.combined_series_field_entry;
    INSERT INTO metabib.combined_series_field_entry(record, metabib_field, index_vector)
        SELECT source, field, strip(COALESCE(string_agg(index_vector::TEXT,' '),'')::tsvector)
        FROM metabib.series_field_entry GROUP BY source, field;
    INSERT INTO metabib.combined_series_field_entry(record, index_vector)
        SELECT source, strip(COALESCE(string_agg(index_vector::TEXT,' '),'')::tsvector)
        FROM metabib.series_field_entry GROUP BY source;
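
As a rough illustration of the priming step only, the following Python sketch replays the three INSERT ... SELECT statements against an in-memory SQLite database. It is a toy model under stated assumptions: the unqualified table names full_rec and series_field_entry stand in for the metabib tables, the sample rows are invented, and SERIES_FIELD_ID is a hypothetical placeholder for XXX. The combined_series_field_entry rebuild is not reproduced, since it relies on PostgreSQL tsvector functions.

```python
import sqlite3

SERIES_FIELD_ID = 1  # hypothetical stand-in for XXX (an id from config.metabib_field)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy stand-ins for metabib.full_rec and metabib.series_field_entry.
    CREATE TABLE full_rec (record INTEGER, tag TEXT, subfield TEXT, value TEXT);
    CREATE TABLE series_field_entry (field INTEGER, source INTEGER, value TEXT);
    -- Invented sample rows modelled on the two records in the bug description.
    INSERT INTO full_rec VALUES
        (1, '490', 'a', 'the hunger games trilogy'),
        (1, '800', 'a', 'collins, suzanne'),
        (1, '800', 't', 'hunger games'),
        (2, '245', 'a', 'catching fire'),
        (2, '490', 'a', 'hunger games'),
        (2, '830', 'a', 'hunger games');
""")

# (tags, subfields) pairs mirroring the three priming statements.
PRIMING_RULES = [
    (("490",), ("a",)),
    (("800", "810", "811"), ("t",)),
    (("830",), ("a", "t")),
]


def prime(connection, field_id):
    """Copy matching full_rec values into series_field_entry."""
    for tags, subfields in PRIMING_RULES:
        tag_marks = ",".join("?" * len(tags))
        sub_marks = ",".join("?" * len(subfields))
        connection.execute(
            "INSERT INTO series_field_entry (field, source, value) "
            "SELECT ?, record, value FROM full_rec "
            f"WHERE tag IN ({tag_marks}) AND subfield IN ({sub_marks})",
            (field_id, *tags, *subfields),
        )


prime(conn, SERIES_FIELD_ID)
rows = conn.execute(
    "SELECT source, value FROM series_field_entry ORDER BY source, value"
).fetchall()
for row in rows:
    print(row)
```

In this toy data the 800 |a author name and the 245 |a title are left out, while the 490 |a, 800 |t, and 830 |a values are copied, so record 1 becomes findable by its 800 |t series title again.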

Revision history for this message
Dan Wells (dbw2) wrote :

Backported to rel_2_5 with Mike's quick reingest process added to upgrade file. Thanks, Mike!

Changed in evergreen:
status: Fix Committed → Fix Released