Incorporating part information into biblio fingerprint

Bug #1553287 reported by Kathy Lussier on 2016-03-04

Bug Description

Evergreen release: all

Currently, the title piece of the biblio fingerprint is only looking at subfield a of the 2xx fields and subfield t of the 700 field. To better identify records that are truly the same piece of work, the fingerprint should also include subfields n and p.

Here is an example where the absence of part information leads to not-so-great results.

Looking at the first result for this grouped search:

you would think all of the subsequent records would be for Mockingjay. However, most of those records are actually for the Hunger Games. They are part of the same group because they share the same title in 245a, but, if we looked at subfield p, we would see that these records are distinct.
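
To make the grouping problem concrete, here is a minimal Python sketch. It is not Evergreen's fingerprint code: the `normalize` helper, the `title_fingerprint` function, and the list-of-tuples record layout are assumptions made for illustration, and the subfield values are modeled on the 245 quoted later in this report.

```python
import re

def normalize(text):
    """Lowercase and keep only letters and digits (assumed normalization)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def title_fingerprint(subfields, codes):
    """Join the normalized values of the selected subfield codes, in record order."""
    return "".join(normalize(value) for code, value in subfields if code in codes)

# Hypothetical 245 fields for two different works in the same series.
hunger_games = [("a", "The Hunger Games.")]
mockingjay = [("a", "The Hunger Games."), ("p", "Mockingjay,"), ("n", "Part 2")]
records = (hunger_games, mockingjay)

# Title-only fingerprint: both records fall into the same group.
title_only = {title_fingerprint(r, {"a"}) for r in records}

# Fingerprint that also reads $n and $p: the records separate.
with_parts = {title_fingerprint(r, {"a", "n", "p"}) for r in records}

print(len(title_only), len(with_parts))
```

With only subfield a, both records share one fingerprint; once n and p are included, each record gets its own.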

Also pointing to a related bug at that would also improve the fingerprint. Perhaps the two could be addressed at the same time.

Kathy Lussier (klussier) on 2016-06-22
Changed in evergreen:
assignee: nobody → Kathy Lussier (klussier)
Kathy Lussier (klussier) wrote :

My approach to incorporating part information is available in the working branch at;a=shortlog;h=refs/heads/user/kmlussier/lp1553287-add-parts-to-biblio-fingerprint

The branch works for me with new records added to the system, but I'm unsure of what steps will need to be taken so that existing records are remapped correctly. I tried reingesting records and also tried running quick_metarecord_map.sql, but neither appeared to remap existing records correctly.

Any thoughts on how to proceed from here?

Changed in evergreen:
assignee: Kathy Lussier (klussier) → nobody
milestone: none →
tags: added: needsreleasenote
Kathy Lussier (klussier) wrote :

After asking about this in IRC last week, I tried a new reingest with the ingest.reingest.force_on_same_marc internal flag enabled. I see the same behavior:

* The reingest creates the new fingerprints with the part information as expected.
* The reingest also creates new mappings to a new master record when appropriate.
* However, it does not remove the old map to the previous master record. We then end up with duplicate source record entries in metabib.metarecord_source_map.
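
One way to spot the leftover mappings described above is to look for source records attached to more than one metarecord. The following is a sketch only: it uses an in-memory SQLite table as a stand-in for metabib.metarecord_source_map, the `metarecord` and `source` column names are assumed, and the rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for metabib.metarecord_source_map (column names are assumed).
conn.execute("CREATE TABLE metarecord_source_map (metarecord INTEGER, source INTEGER)")
conn.executemany(
    "INSERT INTO metarecord_source_map (metarecord, source) VALUES (?, ?)",
    [
        (1, 100),  # record whose grouping did not change
        (1, 101),  # stale mapping left behind by the reingest
        (2, 101),  # new mapping created by the reingest
    ],
)

# Sources mapped to more than one metarecord are the entries to investigate.
duplicates = conn.execute(
    """
    SELECT source, COUNT(*) AS mappings
      FROM metarecord_source_map
     GROUP BY source
    HAVING COUNT(*) > 1
     ORDER BY source
    """
).fetchall()

print(duplicates)
```

The same GROUP BY / HAVING query shape should carry over to PostgreSQL with the schema-qualified table name, but that has not been verified against a live Evergreen database here.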

I'm guessing this is related to bug 1488655

If this change to the fingerprint is added, then it's going to cause problems for upgraded databases.

My question is whether we need to wait for 1488655 to be fixed before we can add a change to config.biblio_fingerprint to the code or is there something else that can be done during the upgrade to handle those metabib.metarecord_source_map entries that are not being deleted?

We would love to see this change to the fingerprint introduced since the current fingerprint can lead to some odd groupings when there are series using this subfield.

Kathy Lussier (klussier) wrote :

Thanks to Thomas Berezansky, I think I've found the missing piece. After upgrade, Evergreen sites will need to a) reingest records to get the new fingerprint in biblio.record_entry, b) truncate the metabib.metarecord and metabib.metarecord_source_map tables and then c) run the quick_metarecord_map.sql.

I've created release notes with this information and updated the upgrade script to output these steps. I'm not sure I did this correctly, so please let me know if I need to fix this in any way.

Initially, I was hesitant about doing a reingest for this one feature, but it looks like we're already doing a reingest that touches most records with bug 1307553.

All of these changes have been force pushed to the branch at working/user/kmlussier/lp1553287-add-parts-to-biblio-fingerprint.

tags: added: pullrequest
removed: needsreleasenote
Changed in evergreen:
importance: Medium → Wishlist
Mike Rylander (mrylander) wrote :

Please note that truncating and repopulating metabib.metarecord will destroy any existing metarecord-level holds, along with historical hold data. That's one reason we don't remove old metarecords that have no more constituents, and in some cases avoid removing out-of-date mappings. It's also why we don't recommend wholesale MR remapping in general.

If your site doesn't use M-type holds at all, then it's perfectly safe.

Galen Charlton (gmc) wrote :

I agree with the idea, but I think a couple tweaks are in order.

I tossed a music record at it:

The part component of the fingerprint was "op70", as biblio.extract_fingerprint() considers only the first node returned by the XPath expression. The total list for this record would be:

{"<subfield xmlns=\"\" code=\"n\">op. 70.</subfield>","<subfield xmlns=\"\" code=\"n\">No. 1</subfield>","<subfield xmlns=\"\" code=\"n\">op. 70.</subfield>","<subfield xmlns=\"\" code=\"n\">No. 2.</subfield>"}

Since the $n and $p are repeatable, it would be a good idea to capture all of them.
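
The difference between taking the first node and taking every node can be sketched in a few lines of Python. This is an illustration, not biblio.extract_fingerprint(): the `normalize` helper is an assumed stand-in for the real normalization, and the bare `datafield` wrapper is hypothetical; only the four $n values come from the list above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical wrapper around the four $n subfields listed above.
FIELD = """
<datafield>
  <subfield code="n">op. 70.</subfield>
  <subfield code="n">No. 1</subfield>
  <subfield code="n">op. 70.</subfield>
  <subfield code="n">No. 2.</subfield>
</datafield>
"""

def normalize(text):
    """Lowercase and keep only letters and digits (assumed normalization)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

nodes = ET.fromstring(FIELD).findall("subfield[@code='n']")

# Only the first matching node contributes to the fingerprint.
first_only = normalize(nodes[0].text)

# Every matching node is captured, in document order.
all_nodes = "".join(normalize(node.text) for node in nodes)

print(first_only, all_nodes)
```

Under these assumptions the first-node value is "op70", matching what was observed, while capturing all repeats also retains the "no1" and "no2" pieces.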

Another interesting thing about this particular record is that it effectively contains two works -- no. 1 and no. 2 of opus 70.

Galen Charlton (gmc) wrote :

Using the MODS stylesheet and setting the cbf xpath to '//mods32:titleInfo/mods32:partNumber' and format to 'mods32' helps, as the XSLT takes care of concatenating repeats of $n and $p. It still picks up only the first MODS titleInfo that has a partNumber, but that may be good enough for non-music collections.

Kathy Lussier (klussier) wrote :

I have incorporated the suggested changes and force pushed them to my working branch. It does indeed concatenate repeats of $n in the example record you pointed to. For some records, I think it would be better if we could concatenate subfields n and p when they appear in the same MARC tag, but I don't know if there is a way to easily do so. The Mockingjay example I cited above is one of those cases.

The 245 has:
‡aThe Hunger Games. ‡pMockingjay, ‡nPart 2

This DVD is part 2 of Mockingjay, which, in turn, is the third title in the Hunger Games series. With my current branch, the part is being stored as part2, when mockingjaypart2 would be more accurate. If Catching Fire had been divided into two parts, we would find that the two titles would be pulled together in the same group.
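
The Catching Fire scenario can be sketched as follows. The second record is hypothetical (no such two-part release is claimed), and `normalize` and `part_key` are illustrative helpers rather than anything in the branch.

```python
import re

def normalize(text):
    """Lowercase and keep only letters and digits (assumed normalization)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def part_key(subfields, codes):
    """Build the part component from the selected subfield codes, in field order."""
    return "".join(normalize(value) for code, value in subfields if code in codes)

records = [
    # 245 from the example record above.
    [("a", "The Hunger Games."), ("p", "Mockingjay,"), ("n", "Part 2")],
    # Hypothetical second title in the same series, also split into parts.
    [("a", "The Hunger Games."), ("p", "Catching fire,"), ("n", "Part 2")],
]

# Part number alone: both titles share the key "part2" and would group together.
number_only = {part_key(r, {"n"}) for r in records}

# Part name plus part number from the same tag: the keys differ.
combined = {part_key(r, {"n", "p"}) for r in records}

print(number_only, combined)
```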

Overall, though, I think this branch leads to improvement of the current behavior, where the parts aren't considered at all.

If this is ready for inclusion, I was thinking it would be best that it be included at the same time as the code for bug 1528901 since both require that metarecords be recalculated. It's better that sites get both changes in one recalculation rather than recalculating on two different occasions.

Galen Charlton (gmc) wrote :

Here's another variation to consider:

INSERT INTO config.biblio_fingerprint (name, xpath, format)
    VALUES (
INSERT INTO config.biblio_fingerprint (name, xpath, format)
    VALUES (

By adding two entries to cbf, both part numbers and part names are reliably captured, and using '/mods..' rather than '//mods..' ensures that we grab only from the primary title entry, not included titles.
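
The effect of anchoring the path at the top of the document can be illustrated with Python's xml.etree.ElementTree. This is a sketch under assumptions: the MODS document is hypothetical rather than real stylesheet output, the included title is placed in a relatedItem for illustration, and ElementTree supports only a subset of XPath, so the anchored and descendant forms are approximated with a child path and a './/' path from the root element.

```python
import xml.etree.ElementTree as ET

# MODS v3 namespace bound to the prefix used in the xpath expressions above.
NS = {"mods32": "http://www.loc.gov/mods/v3"}

# Hypothetical MODS record: a primary title plus an included title.
DOC = """
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo>
    <title>Primary title</title>
    <partNumber>Part 2</partNumber>
  </titleInfo>
  <relatedItem type="constituent">
    <titleInfo>
      <title>Included title</title>
      <partNumber>No. 1</partNumber>
    </titleInfo>
  </relatedItem>
</mods>
"""

root = ET.fromstring(DOC)

# Anchored path: only titleInfo elements directly under the root element.
anchored = [e.text for e in root.findall("mods32:titleInfo/mods32:partNumber", NS)]

# Descendant path (the '//' style): titleInfo elements at any depth.
anywhere = [e.text for e in root.findall(".//mods32:titleInfo/mods32:partNumber", NS)]

print(anchored, anywhere)
```

The anchored form returns only the primary title's part number, while the descendant form also picks up the part number of the included title.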

I agree that this one and bug 1528901 should go in at the same time.

Kathy Lussier (klussier) wrote :

Well, that's not complicated after all! I've updated the branch to reflect Galen's suggested changes and force-pushed it to the working repo.

Kathy Lussier (klussier) on 2017-02-03
Changed in evergreen:
milestone: → 2.12-beta
Mike Rylander (mrylander) wrote :

Committed to master. Thanks, Kathy and Galen!

Changed in evergreen:
assignee: nobody → Mike Rylander (mrylander)
assignee: Mike Rylander (mrylander) → nobody
status: New → Fix Committed
Changed in evergreen:
status: Fix Committed → Fix Released